In recent years there has been a trend away from polling APIs, where you have to constantly request data from a third-party service, and towards webhooks, where the service instead sends a POST request to a URL of your choice whenever a change you subscribed to happens (e.g. a new Github issue being created, a new comment…).
That’s what we are doing at 🌟Gitstart🌟 as well, and here’s how we are building our infrastructure to run webhooks from multiple sources, at scale.
For simplicity, we will be using Github webhooks as an example 👀
Most code examples are in NodeJS & Typescript. We are using Hasura as our backend GraphQL engine to scale our PostgreSQL database, allowing our webhooks solution to be truly scalable!
A key point about Hasura is that it allows us to use subscriptions, each of which is “essentially a query where we receive an update whenever the value of any field changes upstream”.
0. How to Webhooks…?
As we mentioned earlier, webhooks basically allow you to subscribe to events you’re interested in from third-party APIs or services.
And here’s a quick overview on “How to Webhooks”:
- First you create a webhook on your third-party service and point it to an endpoint (e.g. https://your.url/subscription/webhooks/[some_id]).
- Now that the webhook is initialised on the service (Jira/ Github/ Zapier/ Zoom… you name it!), you need to list out all the different webhooks you are going to work with.
- Then you deal with the webhook events coming in to your URL (a Github issue being created, a new Github branch being created, etc.). Documentation of the webhook event payloads is usually available for services that support webhooks (here’s Github’s!).
- You can choose to save all those events to your database temporarily, then address them separately (which is how we are doing it at Gitstart).
- Lastly, you need to decide how you would like to process each event: either upsert each entry into its respective database table according to its action, or trigger a hook or an action…
Why do we save webhook events to the database? Can’t we just process the webhook events directly? We chose to save events to a webhook_events table before processing them because, if any error occurs while processing, we can safely retry without losing any events. It’s also a good way of smoothing the load on our internal systems when a spike of webhook_events would otherwise compromise the performance of other workflows.
An alternative and even more scalable approach would be to add an extra step before saving webhook_events into the database: a message queue. 📩
1. Designing the Infrastructure for Webhooks
- A table for all Webhooks, example fields include but not limited to:
- id // unique id for Github or other sources
- A table for all Webhook Events, example fields include but not limited to:
- (Optional): Check the types for the data you may want to capture from the third-party services. Look at @octokit/types for pre-defined types for Github.
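As a rough illustration, the two tables could be described in TypeScript along these lines. Only `id` and the webhook_events fields named elsewhere in this post (webhookId, eventType, eventSubType, data, updatedAt, isIgnored) come from the article; `source`, `createdAt` and the `isWebhookEvent` helper are assumptions made for the sketch, not our actual schema.

```typescript
// Hypothetical row shapes for the two tables (a sketch, not the real schema).
type WebhookSource = "github" | "jira" | "zapier" | "zoom";

interface Webhook {
  id: string;            // unique id for Github or other sources
  source: WebhookSource; // assumption: which service the webhook belongs to
}

interface WebhookEvent {
  id: string;                  // assumption: primary key of the event row
  webhookId: string;           // points back to Webhook.id
  eventType: string;           // e.g. "push"
  eventSubType: string | null; // e.g. "added"
  data: unknown;               // the useful part of the payload
  createdAt: string;           // assumption: ISO timestamp of insertion
  updatedAt: string | null;    // set once the event has been processed
  isIgnored: boolean;          // true when we decided not to process it
}

// Narrows an unknown value (for example a row fetched over GraphQL)
// to WebhookEvent by checking the fields the pipeline relies on.
function isWebhookEvent(value: unknown): value is WebhookEvent {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return (
    typeof row.id === "string" &&
    typeof row.webhookId === "string" &&
    typeof row.eventType === "string" &&
    (typeof row.eventSubType === "string" || row.eventSubType === null) &&
    typeof row.isIgnored === "boolean"
  );
}
```

Keeping `data` as `unknown` forces each processor to validate the payload it cares about instead of trusting the third party blindly.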
ThirdParty App Webhooks Infrastructure
- Set up a webhook URL in the third-party app’s settings, e.g. https://your.url/subscription/webhooks/[some_id]
- (Optional but highly recommended): Set up a webhook secret token
2. Dealing with Webhooks, 1 at a time
Say you are dealing with webhooks from Github, Jira, Zapier, Zoom, etc. For each webhook_event that comes through, you first need to divert it into the right bucket, so that custom actions can be applied to each bucket in parallel. Before storing the incoming webhook_event in the webhook_events table, for security reasons, you should:
- Setup Github Webhooks Secret Token here 🤫
- Make sure the headers of the incoming request include the secret signature (e.g. x-hub-signature), and check that it matches your secret token
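A minimal sketch of that check, using only Node's built-in crypto module. Github signs the raw request body with your secret and sends the result as `sha1=<hex>` in x-hub-signature (and as `sha256=<hex>` in x-hub-signature-256). The function name `isValidSignature` and its parameters are made up for this example.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true when the signature header matches an HMAC of the raw body
// computed with our secret. Accepts both "sha1=..." and "sha256=..." values.
function isValidSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader) return false;

  const separator = signatureHeader.indexOf("=");
  if (separator === -1) return false;

  const algorithm = signatureHeader.slice(0, separator);
  if (algorithm !== "sha1" && algorithm !== "sha256") return false;

  const digest = createHmac(algorithm, secret).update(rawBody, "utf8").digest("hex");
  const expected = Buffer.from(`${algorithm}=${digest}`, "utf8");
  const received = Buffer.from(signatureHeader, "utf8");

  // timingSafeEqual throws when lengths differ, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The signature has to be computed over the raw body exactly as it was received, so run this check before (or alongside) JSON parsing rather than on a re-serialised object.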
With that in place, you can insert the following into the webhook_events table (Payload of a push event on Github can be found here):
- unique webhookId for Github (as you are designing for multiple datasources, this is important)
- eventType (eg push)
- eventSubType (eg added)
- data (aka the part of the request payload that is useful data)
Why do we do this step? By doing this, we are creating a backlog of webhook_events, all of which are secure events signed with the right secret token. This is useful if your event processing pipeline goes down, or if there is an error processing your realtime webhook_events: all the webhook_events are still stored in the database, and you can pick them up again once you have fixed the processing part.
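To make the mapping concrete, here is one way a verified Github request could be turned into a row for the webhook_events table. Github sends the event name in the x-github-event header, and many (not all) payloads carry an `action` field. Using `action` as the eventSubType, and the names `NewWebhookEvent` and `toWebhookEventRow`, are assumptions for this sketch.

```typescript
// Shape of the row we are about to insert (fields as listed above).
interface NewWebhookEvent {
  webhookId: string;
  eventType: string;
  eventSubType: string | null;
  data: unknown;
}

// Builds the row from the request headers and the parsed JSON payload.
// Node lower-cases incoming header names, hence "x-github-event".
function toWebhookEventRow(
  webhookId: string,
  headers: Record<string, string | undefined>,
  payload: Record<string, unknown>,
): NewWebhookEvent {
  const eventType = headers["x-github-event"];
  if (!eventType) {
    throw new Error("Missing x-github-event header");
  }

  // Assumption: the payload's "action" doubles as our eventSubType.
  // Events without an action (such as push) get null.
  const action = payload["action"];

  return {
    webhookId,
    eventType,
    eventSubType: typeof action === "string" ? action : null,
    data: payload,
  };
}
```

The function is deliberately free of database calls, so the same row can be handed to an insert mutation or pushed onto a message queue.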
3. How to... processThoseWebhookEvents()
Now that all the webhook events are stored in our webhook_events table, before we start processing those webhook events, we need to first query (/listen to) the database for all the events, filter through and choose the webhook events we are going to process, and then write the respective processing code.
Why do we do this step? Oftentimes our webhooks will receive more events than we need, or at different stages of development, we would like to first ignore some less important events, etc.
- At Gitstart we are using Hasura for our backend GraphQL, where we can subscribe to the table and have a stream (Observables) of webhook_events coming in, instead of querying it once.
- We need to set the source to specify the webhook sources we are dealing with. In our example, we only have Github so far, but our infrastructure is extendable with other sources like Gitlab, Bitbucket, Zoom, etc
Note: each source points to its own file where we specify all the webhook subevents that we will process, or ignore.
- When we query (/subscribe to) the database, we filter the webhookEvents by the keys of SOURCE_PROCESSOR, so that only events from known sources come through.
- Of all those events we will first processWebhookFromVariousSources then go to processWebhook (for each source) … see in code snippet below
- We then check whether each webhook_event has already been processed; a helper function can keep track of this with a flag such as timeOfLastEventProcessed.
- Later we will processEvent individually according to the different eventTypes and actions under the Github file.
- Lastly, upsert back to webhook_events table to show that you’re done with this event (eg with fields like updatedAt, mergedAt, isIgnored…)
Here is some pseudocode to help illustrate our design 🎨. As you can see in the code, we are processing different sources in parallel, which speeds up processing.
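The flow could look roughly like the sketch below. The names SOURCE_PROCESSOR, processWebhookFromVariousSources, processWebhook and timeOfLastEventProcessed follow the steps above; everything else is an assumption for illustration: the second source "gitlab", the `handled` log, and a plain array standing in for the Hasura subscription stream.

```typescript
type Source = "github" | "gitlab"; // "gitlab" is a hypothetical second source

interface WebhookEvent {
  id: string;
  source: Source;
  eventType: string;
  data: unknown;
  createdAt: string; // ISO timestamp, assumed unique within a source
}

type ProcessEvent = (event: WebhookEvent) => Promise<void>;

// Stand-in for real side effects, so the flow can be observed.
const handled: string[] = [];

// One processor per source; in a real project each would live in its own file.
const SOURCE_PROCESSOR: Record<Source, ProcessEvent> = {
  github: async (event) => { handled.push(`github:${event.eventType}`); },
  gitlab: async (event) => { handled.push(`gitlab:${event.eventType}`); },
};

// Hypothetical cursor: createdAt of the last event processed, per source.
const timeOfLastEventProcessed = new Map<Source, string>();

// Processes the events of ONE source in chronological order,
// skipping anything at or before the cursor.
async function processWebhook(source: Source, events: WebhookEvent[]): Promise<string[]> {
  const processEvent = SOURCE_PROCESSOR[source];
  const ordered = [...events].sort((a, b) => a.createdAt.localeCompare(b.createdAt));
  const processedIds: string[] = [];

  for (const event of ordered) {
    const last = timeOfLastEventProcessed.get(source);
    if (last !== undefined && event.createdAt <= last) continue; // already processed
    await processEvent(event);
    timeOfLastEventProcessed.set(source, event.createdAt);
    processedIds.push(event.id);
  }
  return processedIds;
}

// Splits a batch by source and runs all sources in parallel.
async function processWebhookFromVariousSources(events: WebhookEvent[]): Promise<string[]> {
  const sources = Object.keys(SOURCE_PROCESSOR) as Source[];
  const results = await Promise.all(
    sources.map((source) =>
      processWebhook(source, events.filter((event) => event.source === source)),
    ),
  );
  return ([] as string[]).concat(...results);
}
```

Events within one source are awaited one by one so that the cursor only moves forward, while Promise.all lets the sources progress independently of each other.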
4. Structure all eventTypes you will be working with in a single file for the source (eg: Github)
Why do we do this step? You can easily manage all the eventTypes you’re dealing with in a single file: link the right processing files, control which types of events are fetched from the database, and note which eventTypes to skip.
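Such a file might look something like this. The event names pull_request, issues and watch are real Github event types, but the map GITHUB_EVENT_TYPES, the "skip" marker, and the helpers `eventTypesToFetch` and `getProcessor` are invented for this sketch; processIssue is a hypothetical sibling of processPR.

```typescript
type GithubEventProcessor = (data: unknown) => Promise<void>;

// Placeholders; in a real project these would be imported from their own files.
const processPR: GithubEventProcessor = async () => {};
const processIssue: GithubEventProcessor = async () => {};

// Every Github eventType we know about: either linked to its processor,
// or explicitly marked as skipped so the decision is documented.
const GITHUB_EVENT_TYPES: Record<string, GithubEventProcessor | "skip"> = {
  pull_request: processPR,
  issues: processIssue,
  watch: "skip", // starring events are not interesting to us (assumption)
};

// The eventTypes to ask the database for: everything that is not skipped.
const eventTypesToFetch: string[] = Object.keys(GITHUB_EVENT_TYPES).filter(
  (eventType) => GITHUB_EVENT_TYPES[eventType] !== "skip",
);

// Looks up the processor for an eventType; null means "ignore this event".
function getProcessor(eventType: string): GithubEventProcessor | null {
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(GITHUB_EVENT_TYPES, eventType)) return null;
  const entry = GITHUB_EVENT_TYPES[eventType];
  return entry === "skip" ? null : entry;
}
```

Listing skipped types explicitly, rather than leaving them out, makes it clear which events were ignored on purpose and which were simply never considered.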
5. Finally we can process the event!
Say we are in the processPR file. Under this eventType there are a few different eventSubTypes, so you can first have a handler that filters each webhook_event and only works on the specified eventSubTypes (e.g. created, closed, etc.).
As mentioned earlier, you can decide how you want to process each event. A common one is to store the data of that event into the respective table you have created in your database for further manipulation. Or you can trigger an internal hook, or an action!
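As an illustration of the "store it in the respective table" option, a processPR handler might be sketched as below. The payload fields (`action`, `pull_request.id`, `title`, `state`, `merged_at`) exist in Github's pull_request payload, where a new PR arrives with the action "opened". The in-memory `pullRequestsTable` stands in for a real database table, and the set of handled sub-types is an assumption.

```typescript
// The parts of Github's pull_request payload this sketch relies on.
interface PullRequestPayload {
  action: string;
  pull_request: {
    id: number;
    title: string;
    state: string;
    merged_at: string | null;
  };
}

interface PullRequestEvent {
  eventType: string;           // expected to be "pull_request"
  eventSubType: string | null; // e.g. "opened", "closed"
  data: PullRequestPayload;
}

interface PullRequestRow {
  id: number;
  title: string;
  state: string;
  mergedAt: string | null;
}

// Assumption: the only sub-types we care about for now.
const HANDLED_SUB_TYPES = new Set<string>(["opened", "closed"]);

// Stand-in for the pull requests table; keyed by id so writes are upserts.
const pullRequestsTable = new Map<number, PullRequestRow>();

// Filters on eventSubType first, then upserts the PR into the table.
function processPR(event: PullRequestEvent): "upserted" | "ignored" {
  if (event.eventSubType === null || !HANDLED_SUB_TYPES.has(event.eventSubType)) {
    return "ignored";
  }

  const pr = event.data.pull_request;
  pullRequestsTable.set(pr.id, {
    id: pr.id,
    title: pr.title,
    state: pr.state,
    mergedAt: pr.merged_at,
  });
  return "upserted";
}
```

Returning "ignored" instead of throwing lets the caller mark the row in webhook_events as ignored and move on to the next event.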
6. Final Notes
- During development you may want to expose your localhost with something like serveo.net or localtunnel, and point the webhook URL at it.
Thank you for reading!