Data pipeline at InterviewParrot

This post describes how the different systems at InterviewParrot publish data to our data warehouse.

We use BigQuery as our data warehouse. Events from all of our sources are streamed into BigQuery.

Our Requirements

We had the following requirements:

  1. Data must be inserted in a streaming fashion.
  2. The mechanism should be serverless.
  3. It should scale with the number of events.

What we tried first

We started with Google Cloud Dataflow. It worked and was fairly simple, but it did not meet our serverless requirement: Cloud Dataflow manages the instances for you, but you are still charged for those instances while they run. We also ran into another limitation: we wanted all events to be processed through a single Pub/Sub topic, which would have meant writing an additional custom workflow.

Our Design

At the time of writing, our system had three types of events:

  1. Database entity creation events.
  2. Website business logic events.
  3. Archive messages for objects.

We created a base event message with some properties that we wanted to go into the data warehouse regardless of event type. We then created a single Pub/Sub topic for all data warehouse events. Each data warehouse event contains some header fields and a payload field, which holds the actual data for the table in the data warehouse. A Google Cloud Function is subscribed to the Pub/Sub topic. Its job is to parse the header information from the message and route the message to the right BigQuery table.
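
To make the routing step concrete, here is a minimal sketch of what such a function could look like in Python. It is not our production code: the header field names (event_type, event_id, emitted_at), the table IDs, and the function names are all hypothetical, and the BigQuery insert call is injectable so the routing logic can be exercised on its own.

```python
import base64
import json

# Hypothetical mapping from an event-type header value to a BigQuery table ID.
TABLE_BY_EVENT_TYPE = {
    "entity_created": "my-project.warehouse.entity_events",
    "business_logic": "my-project.warehouse.website_events",
    "archive": "my-project.warehouse.archived_objects",
}


def route_event(event):
    """Decode a Pub/Sub-triggered event and return (table_id, row)."""
    # Pub/Sub hands the message body to the function base64-encoded under "data".
    envelope = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    header = envelope["header"]
    # An unknown event type raises KeyError instead of being silently dropped.
    table_id = TABLE_BY_EVENT_TYPE[header["event_type"]]
    # Assumed layout: common header fields are stored next to the payload columns.
    row = dict(envelope["payload"])
    row["event_id"] = header["event_id"]
    row["emitted_at"] = header["emitted_at"]
    return table_id, row


def handle_event(event, context=None, insert_rows=None):
    """Entry point shaped like a Pub/Sub background Cloud Function."""
    if insert_rows is None:
        # Imported lazily so the routing logic runs without the client library.
        from google.cloud import bigquery
        insert_rows = bigquery.Client().insert_rows_json
    table_id, row = route_event(event)
    # insert_rows_json returns a list of per-row errors; empty means success.
    errors = insert_rows(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return table_id
```

With this shape, adding a new event type only needs a new entry in the mapping and a matching table; the publishers and the topic stay the same.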

Latency Impact/Failure handling

To keep the application from taking a latency hit when writing to Pub/Sub, we publish from an asynchronous thread. If publishing to Pub/Sub fails, we write the message to RocksDB, which we use as an embedded database. Messages are later replayed from RocksDB. Essentially, think of it as a fast append-only buffer for our use case.
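
The publish-then-buffer pattern can be sketched roughly as follows. This is an illustration rather than our implementation: the BufferedPublisher class and its method names are hypothetical, the publish callable stands in for a real Pub/Sub client, and a plain in-memory list stands in for RocksDB (a real buffer would need to survive process restarts).

```python
import queue
import threading


class BufferedPublisher:
    """Publishes from a background thread and buffers messages that fail."""

    def __init__(self, publish):
        self._publish = publish        # callable(bytes) that raises on failure
        self._buffer = []              # stand-in for the RocksDB append-only buffer
        self._lock = threading.Lock()
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        """Called on the request path; only enqueues, so it never waits on the network."""
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            try:
                self._publish(message)
            except Exception:
                # Publishing failed: keep the message for a later replay.
                with self._lock:
                    self._buffer.append(message)
            finally:
                self._queue.task_done()

    def replay(self):
        """Retry buffered messages in order; returns how many were published."""
        with self._lock:
            pending = list(self._buffer)
            self._buffer.clear()
        sent = 0
        for message in pending:
            try:
                self._publish(message)
                sent += 1
            except Exception:
                with self._lock:
                    self._buffer.append(message)
        return sent

    def pending(self):
        """Number of messages currently waiting in the fallback buffer."""
        with self._lock:
            return len(self._buffer)

    def flush(self):
        """Block until every queued message has been attempted once."""
        self._queue.join()
```

The request thread only pays for an in-memory enqueue; network latency and failures are absorbed by the worker thread and the buffer.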

A few example events

  • Assessment lifecycle (Started, In progress, Completed)
  • Candidate events (for example, taking an assessment for Company XYZ)
  • Questions
  • Assessment Results

Final words

Serverless is very useful for low-scale, event-based systems. When designing such a system, make sure every part of the application can use it without a lot of changes.

Get in touch

If you would like to contribute or have any questions, feel free to contact us and we will get back to you. Write to us at developers@interviewparrot.com, or ask us for a Slack invite to our developer zone.