We depend on BigQuery as our data warehouse. Events from different sources are streamed into BigQuery.
We had the following requirements:
- Data can be inserted in a streaming fashion.
- We wanted a serverless mechanism.
- We should be able to scale based on the number of events.
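To make the streaming requirement concrete, here is a minimal sketch of preparing events for BigQuery streaming inserts. The table name, dataset, and event shape are assumptions for illustration, not our actual schema.

```python
# Sketch: format events as rows for BigQuery streaming inserts.
# Field names and the event shape are assumptions for illustration.
import json
from datetime import datetime, timezone

def build_rows(events):
    """Turn raw event dicts into BigQuery-ready JSON rows."""
    rows = []
    for e in events:
        rows.append({
            "event_type": e["type"],
            "payload": json.dumps(e.get("payload", {})),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        })
    return rows

rows = build_rows([{"type": "assessment.started", "payload": {"id": 1}}])

# With google-cloud-bigquery installed, rows stream in roughly like this
# (hypothetical table name):
# from google.cloud import bigquery
# client = bigquery.Client()
# errors = client.insert_rows_json("project.dataset.events", rows)
# assert not errors
```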
What we tried first
We started with Google Cloud Dataflow. Though it worked and is pretty simple, it didn't fit our bill of running serverless: Cloud Dataflow manages the instances for you, but you are still charged for them while they run. Another limitation we ran into is that we wanted all events to flow through a single Pub/Sub topic, which meant additional custom routing logic had to be written.
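Routing everything through a single topic implies each message carries a type tag the subscriber can dispatch on. A minimal sketch of that envelope and router (the event type names and handler shapes are assumptions):

```python
# Sketch of a single-topic event envelope: every message carries a
# type tag so one Pub/Sub subscriber can route it. Names are assumed.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Event:
    event_type: str          # e.g. "db.entity.created"
    payload: dict = field(default_factory=dict)

    def to_message(self) -> bytes:
        # Pub/Sub message data is bytes; we serialize to JSON.
        return json.dumps(asdict(self)).encode("utf-8")

def route(message: bytes, handlers: dict):
    """Subscriber-side dispatch on the type tag."""
    event = json.loads(message.decode("utf-8"))
    handler = handlers.get(event["event_type"], lambda e: None)
    return handler(event)

msg = Event("db.entity.created", {"id": 42}).to_message()
result = route(msg, {"db.entity.created": lambda e: e["payload"]["id"]})
```

The same dispatch table grows with new event types without any topic changes, which is what made the single-topic design attractive to us.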
At the time of writing, our system had three types of events:
- Database entity creation events.
- Website business logic events.
- Archive messages for objects.
Assessment-related events, for example, include:
- Assessment lifecycle (Started, In progress, Completed)
- Candidate events (e.g., giving an assessment for Company XYZ)
- Assessment results
Latency Impact / Failure Handling
To avoid the application taking a latency hit when writing to Pub/Sub, we publish through an async thread. If publishing to Pub/Sub fails, we write the message to RocksDB, which we use as an embedded database, and replay it later. Essentially, think of it as a fast append-only buffer for our use case.
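The publish-with-fallback-and-replay flow can be sketched as below. A plain Python list stands in for the RocksDB append-only buffer, and a stub stands in for the Pub/Sub publisher; in production the whole path runs on a background thread so the request path never blocks.

```python
# Sketch: publish to Pub/Sub, fall back to a local buffer on failure,
# replay later. The list is a stand-in for the RocksDB buffer.
delivered = []         # stand-in for messages accepted by Pub/Sub
fallback = []          # stand-in for the embedded RocksDB buffer

def publish(msg, *, fail=False):
    """Stub for the Pub/Sub publish call; `fail` simulates an outage."""
    if fail:
        raise RuntimeError("pubsub unavailable")
    delivered.append(msg)

def safe_publish(msg, *, fail=False):
    try:
        publish(msg, fail=fail)
    except RuntimeError:
        fallback.append(msg)   # buffer locally instead of dropping

def replay():
    """Drain the buffer once Pub/Sub is healthy again."""
    while fallback:
        publish(fallback.pop(0))

safe_publish("e1")
safe_publish("e2", fail=True)  # lands in the fallback buffer
replay()                       # e2 is delivered on replay
```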
Serverless is very useful for low-scale, event-based systems. When designing such a system, make sure all parts of the application can use it without a lot of changes.