Here at Gitstart, while working with clients on platforms like GitHub or GitLab, we often need to sync codebases across a pair of remote repositories. For each client-assigned task, our developers work on a branch of a private repository (cloned or forked from a client repository), and we need to make sure that the client repository stays up to date with it and vice versa. Over time, syncing repositories becomes a repetitive process, so it makes sense to automate it.
Enter Gitstart Fork: Fork is an internal tool that leverages the power of webhooks to sync code across a pair of repositories seamlessly and in near real time.
But how does it work, you might be asking. To simplify the codebase, we’ve decided to split functionality into two parts:
- Pull: move changes in the client repository into our repository
- Push: move changes in our repository to the client’s repository
This article will mainly focus on Pull, and Push will be discussed in a future article.
Tech Stack: We write most of our code in TypeScript and Node.js. Our database of choice is PostgreSQL, with Hasura as the GraphQL engine (its nifty subscriptions feature makes processing super easy).
For simplicity, we’ll be talking about GitHub repos, but this can be extended to any Git-based remote service like GitLab or Bitbucket.
Fork Pull offers the following features for syncing:
- Branch control: specify which branches to sync.
- Granular file syncing control: specify which folders/files to sync and which to ignore.
- .gitignore support: ignore files mentioned in .gitignore while syncing.
2. The database
We start with a table which keeps track of the pair of repositories we want to sync, which branches to sync in those repositories as well as any files (or folders) which we don’t want to sync. A high-level schema would look like:
- id: integer
- fromRepo: string
- toRepo: string
- fromBranch: string
- toBranch: string
- ignored: string
- folders: string
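As a rough illustration, the schema above could be mirrored in TypeScript as follows. This is only a sketch: the `RepoSlice` name, the `parsePathList` helper, and the assumption that `ignored` and `folders` hold comma-separated paths are all hypothetical, since the storage format of those columns is not specified here.

```typescript
// Hypothetical TypeScript mirror of one row in the repository-pair table.
interface RepoSlice {
  id: number;
  fromRepo: string; // source repository, e.g. "client-org/app"
  toRepo: string; // target repository
  fromBranch: string;
  toBranch: string;
  ignored: string; // ASSUMPTION: comma-separated paths to skip
  folders: string; // ASSUMPTION: comma-separated folders to sync
}

// Split a comma-separated column into a clean list of paths,
// trimming whitespace and dropping empty entries.
function parsePathList(column: string): string[] {
  return column
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Example row with made-up values.
const exampleSlice: RepoSlice = {
  id: 1,
  fromRepo: "client-org/app",
  toRepo: "our-org/app",
  fromBranch: "main",
  toBranch: "main",
  ignored: "src/secrets, .env",
  folders: "src, docs",
};

const foldersToSync: string[] = parsePathList(exampleSlice.folders);
const ignoredPaths: string[] = parsePathList(exampleSlice.ignored);
```

Keeping the parsing in one small helper means the rest of the sync code can work with plain arrays regardless of how the columns end up being stored.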
We also keep a record of all the so-called “pulls” we have made:
- id: integer
- startedAt: timestamp
- finishedAt: timestamp
- error: string
- commitSlice: relation to git_commit_slices
- repoSlice: relation to git_repo_slices
This table acts as a middle ground between starting a pull and ending a pull. We mainly use it to track the progress of a pull event. We can gauge the success or failure of a pull, learn about errors that occurred, calculate the time required for a pull and much more.
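To make the tracking idea concrete, here is a minimal sketch of how a pull record could be summarized. The `PullRecord` type, the nullable fields, and both helper functions are assumptions made for illustration; the two relation columns are reduced to plain ids.

```typescript
// Hypothetical shape of one row in the pulls table.
interface PullRecord {
  id: number;
  startedAt: Date;
  finishedAt: Date | null; // ASSUMPTION: null while the pull is still running
  error: string | null; // ASSUMPTION: null when nothing went wrong
  commitSliceId: number | null; // stand-in for the git_commit_slices relation
  repoSliceId: number; // stand-in for the git_repo_slices relation
}

type PullStatus = "running" | "failed" | "succeeded";

// Derive a coarse status from the recorded fields.
function pullStatus(pull: PullRecord): PullStatus {
  if (pull.error !== null) return "failed";
  return pull.finishedAt === null ? "running" : "succeeded";
}

// Time taken by a finished pull in milliseconds, or null if it has not finished.
function pullDurationMs(pull: PullRecord): number | null {
  if (pull.finishedAt === null) return null;
  return pull.finishedAt.getTime() - pull.startedAt.getTime();
}
```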
A successful pull is logged into a third table which keeps track of a pair of head commits of the branches synced (the importance of this table will become clear in the next section):
- id: integer
- targetCommit: string
- originCommit: string
Note: all the schemas are merely representational and fields can be included and excluded depending on functionality desired.
3. Detecting changes
A database is all well and good, but how do we detect when to pull? After all, this seems to be the most important part of a tool that claims to be “real-time”. This is where the magic of webhooks and Hasura subscriptions comes in! We already have a pretty robust and scalable webhook infrastructure set up at Gitstart, about which you can read in this wonderful article by my equally wonderful colleague, Tania Tan Wu. However, just as a summary:
- We store all our webhooks from third-party services like GitHub or Jira in our database.
- Subsequently these webhooks are processed and the corresponding tables are updated accordingly.
- If processing fails for any webhook payload, we wait 5 minutes before retrying.
Again, for a much more in-depth explanation as to why we do things this way, refer to the article above :)
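The retry rule from the summary can be sketched as a small predicate. The `StoredWebhook` shape and its field names are hypothetical; only the 5-minute delay comes from the description above.

```typescript
// Delay before a failed webhook payload is retried (5 minutes, as described above).
const RETRY_DELAY_MS = 5 * 60 * 1000;

// Hypothetical shape of a stored webhook row.
interface StoredWebhook {
  id: number;
  processedAt: Date | null; // set once processing succeeded
  lastFailedAt: Date | null; // set when the latest attempt failed
}

// Decide whether a stored webhook should be (re)processed at time `now`.
function isReadyForProcessing(hook: StoredWebhook, now: Date): boolean {
  if (hook.processedAt !== null) return false; // already handled
  if (hook.lastFailedAt === null) return true; // never attempted or never failed
  return now.getTime() - hook.lastFailedAt.getTime() >= RETRY_DELAY_MS;
}
```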
For our purposes, we need only focus on one type of GitHub event: push. This event occurs whenever:
One or more commits are pushed to a repository branch or tag.
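As a sketch of what processing such an event might involve, the snippet below reads the branch name and new head commit from a push payload. It relies on the `ref`, `after`, and `repository.full_name` fields of GitHub's push payload; the `BranchUpdate` type and the `toBranchUpdate` helper are hypothetical names.

```typescript
// The subset of GitHub's push payload used in this sketch.
interface PushEvent {
  ref: string; // e.g. "refs/heads/main" or "refs/tags/v1.0.0"
  after: string; // SHA of the head commit after the push
  repository: { full_name: string }; // e.g. "client-org/app"
}

// Hypothetical record describing the new head of a branch.
interface BranchUpdate {
  repo: string;
  branch: string;
  headSha: string;
}

const BRANCH_REF_PREFIX = "refs/heads/";

// Turn a push event into a branch update, or null for non-branch refs such as tags.
function toBranchUpdate(event: PushEvent): BranchUpdate | null {
  if (event.ref.indexOf(BRANCH_REF_PREFIX) !== 0) return null;
  return {
    repo: event.repository.full_name,
    branch: event.ref.slice(BRANCH_REF_PREFIX.length),
    headSha: event.after,
  };
}
```

Tag pushes are skipped here because only branches are synced.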
We combine this webhook event with a little subscription magic to have the following flow for detecting changes:
- Commit(s) are pushed to our source repository (fromRepo).
- Push webhook is logged in our DB and is processed.
- This changes the head commit (SHA-1) of the source branch in our database.
- Once the database is updated with this new commit, our subscription (see GraphQL code below) notices that the sliced branch of the main repository has new commits that have not been pulled yet and thus runs the processor.
- The code then performs a pull, syncing the repos (more on this in the following section).
The main trigger for a pull event is a database update. In our case, this is achieved through webhooks (due to their real-time capabilities). However, the same can be achieved via a number of other options.
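The detection step can be sketched as follows. Everything here is an assumption made for illustration: the subscription document, the `headCommit` and `lastPulledCommit` field names, and the `slicesNeedingPull` helper are hypothetical and do not reproduce the actual Fork query. The idea is simply that a slice needs a pull when the head commit recorded for its source branch differs from the origin commit of the last successful pull.

```typescript
// Hypothetical Hasura subscription document; field names are assumptions.
const WATCH_SLICES_SUBSCRIPTION = `
  subscription WatchRepoSlices {
    git_repo_slices {
      id
      headCommit
      lastPulledCommit
    }
  }
`;

// Hypothetical shape of one row delivered by the subscription.
interface SliceSnapshot {
  id: number;
  headCommit: string; // current head SHA of the source branch
  lastPulledCommit: string | null; // origin commit of the last successful pull, if any
}

// Return the slices whose source branch has commits that were not pulled yet.
function slicesNeedingPull(rows: SliceSnapshot[]): SliceSnapshot[] {
  return rows.filter((row) => row.headCommit !== row.lastPulledCommit);
}
```

Each time the subscription delivers a fresh set of rows, the processor would run a pull for every slice returned by `slicesNeedingPull`.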
Disclaimer: the following steps assume that our database is up to date.
4. Git architecture
Now comes the fun part! The actual “pull” isn’t super complicated. We’re using isomorphic-git as our Node.js Git client. The Git operations can be divided into 5 stages:
4.1 Clone or Pull
We clone the repos to temporary folders. If the folders already exist, we simply pull the latest commits and switch to the branches we are interested in. We don’t delete the folders and reclone every time because pulling is faster than cloning.
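The clone-or-pull decision can be sketched with Node's standard library alone. The folder layout, the `workDirFor` helper, and the `chooseCheckoutAction` helper are hypothetical; the actual cloning and pulling would then be delegated to the Git client.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

type CheckoutAction = "clone" | "pull";

// Hypothetical mapping from a repository name to a reusable temporary folder.
function workDirFor(repo: string, baseDir: string = os.tmpdir()): string {
  // Replace characters that are awkward in folder names, such as "/".
  const safeName = repo.replace(/[^a-zA-Z0-9._-]/g, "_");
  return path.join(baseDir, "fork-sync", safeName);
}

// Pull when a previous clone is still on disk, clone otherwise.
function chooseCheckoutAction(dir: string): CheckoutAction {
  return fs.existsSync(path.join(dir, ".git")) ? "pull" : "clone";
}
```

Checking for the `.git` folder rather than the working directory itself avoids treating an empty or half-created folder as a valid clone.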
4.2 Delete
For the folders to be synced, we delete (and remove from the Git index) all the files and folders that are present in the `toRepo` but not in the `fromRepo`, taking care to exclude any ignored files/folders. We use glob matching for this purpose: we take the difference of the two sets of file paths and delete the paths in that difference.
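The set difference described above can be sketched as a pure function. To keep the example self-contained, it uses simple path-prefix matching as a stand-in for glob matching; the `isUnder` and `pathsToDelete` names are hypothetical.

```typescript
// True when filePath equals one of the prefixes or lives underneath it.
// NOTE: plain prefix matching, used here as a simplified stand-in for globs.
function isUnder(filePath: string, prefixes: string[]): boolean {
  return prefixes.some((prefix) => {
    const normalized = prefix.replace(/\/+$/, ""); // drop trailing slashes
    return filePath === normalized || filePath.indexOf(normalized + "/") === 0;
  });
}

// Paths that exist in the target but not in the source, limited to the synced
// folders and excluding anything ignored.
function pathsToDelete(
  fromPaths: string[],
  toPaths: string[],
  folders: string[],
  ignored: string[]
): string[] {
  const presentInSource = new Set(fromPaths);
  return toPaths.filter(
    (p) => isUnder(p, folders) && !isUnder(p, ignored) && !presentInSource.has(p)
  );
}
```

For example, with `folders = ["src"]` and `ignored = ["src/secret"]`, a file `src/old.ts` that exists only in the target is selected for deletion, while `src/secret/key.txt` and `docs/readme.md` are left alone.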
4.3 Copy
Now we copy all the files from the `fromRepo` to the `toRepo`. We use a lightweight library called copy-dir for this. The library lets us pass a filter to exclude file paths, which is useful in our case.
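To show the idea of a filtered copy without depending on a particular library, here is a minimal sketch built on Node's `fs` module. It is a stand-in for copy-dir, not its API; the `copyTree` function and the `PathFilter` type are hypothetical.

```typescript
import * as fs from "fs";
import * as path from "path";

// Receives a path relative to the source root (with "/" separators) and
// returns true when that file or folder should be copied.
type PathFilter = (relativePath: string) => boolean;

// Recursively copy files from srcRoot to destRoot, skipping filtered paths.
function copyTree(
  srcRoot: string,
  destRoot: string,
  keep: PathFilter,
  relative: string = ""
): void {
  const currentDir = path.join(srcRoot, relative);
  for (const entry of fs.readdirSync(currentDir, { withFileTypes: true })) {
    const rel = path.posix.join(relative, entry.name);
    if (!keep(rel)) continue; // filtered folders are skipped entirely
    if (entry.isDirectory()) {
      copyTree(srcRoot, destRoot, keep, rel);
    } else if (entry.isFile()) {
      const target = path.join(destRoot, rel);
      fs.mkdirSync(path.dirname(target), { recursive: true });
      fs.copyFileSync(path.join(srcRoot, rel), target);
    }
  }
}
```

A typical filter would reject the `.git` folder plus any ignored paths, for example `(rel) => rel !== ".git" && rel !== "secret.env"`.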
4.4 Add and Commit
This is a simple step: we add all the file paths to staging and commit them, receiving the `SHA-1` of the head commit for database purposes.
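One way to decide what to stage is to start from a status matrix. The sketch below assumes rows shaped like the output of isomorphic-git's `statusMatrix`, that is `[filepath, HEAD, WORKDIR, STAGE]`, and turns them into a plan. The `planStaging` helper is hypothetical and makes no Git calls itself; the resulting lists would be fed to the client's add and remove operations before committing.

```typescript
// Row shape assumed to mirror isomorphic-git's statusMatrix: [filepath, HEAD, WORKDIR, STAGE].
type StatusRow = [string, number, number, number];

interface StagingPlan {
  toAdd: string[]; // new or modified files to stage
  toRemove: string[]; // files deleted from the working directory
}

// Build a staging plan equivalent in spirit to "git add -A".
function planStaging(matrix: StatusRow[]): StagingPlan {
  const plan: StagingPlan = { toAdd: [], toRemove: [] };
  for (const [filepath, head, workdir, stage] of matrix) {
    // [1, 1, 1] means the file is unchanged since the last commit.
    if (head === 1 && workdir === 1 && stage === 1) continue;
    if (workdir === 0) {
      plan.toRemove.push(filepath); // absent from the working directory
    } else {
      plan.toAdd.push(filepath);
    }
  }
  return plan;
}
```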
4.5 Push
Finally, we push the commits to our `toBranch`. This push triggers a webhook, which we subsequently process to maintain the integrity of our database.
5. Final thoughts
Our repos should now be synced! It’s almost like magic, but that’s what it feels like when technology is used in subtle and clever ways. I hope everyone found this article a fun read. Next time, we’ll be back to explain how we push our code to keep repositories in sync. Until then, take care!
I would like to thank Arslan Arshad, my mentor over at Gitstart, whose brainchild Fork is; I hope my writing has done it justice. I would also like to thank my friend and colleague Tania Tan Wu for helping me review this article. Finally, I would like to thank Hamza Zia and all the folks over at Gitstart for giving me the opportunity to work on such an amazing technology!
This blog post was originally published on Medium.