GitHub Notifications challenges
In this post, I share some insights into the technical challenges GitHub Notifications faced
Hi there, friends; it took a little while to be back!
Unfortunately, after our trip to Finland/Switzerland, I got COVID. It wasn't too bad, but I just took the time to relax and recover from it; the newsletter can always wait :)
On this week's edition: GitHub Notifications.
Disclaimer: I’m a former Engineer at GitHub, and this post aims to share some insights on the technical challenges with large systems like Notifications. I didn’t build most of this, and credit goes to the super talented engineers there.
Recently I saw this tweet on my timeline.
Not sure if I want to amplify this person's voice as I've already seen some questionable online behavior. Still, since I was part of the Notifications team at GitHub for a little over a year, I can share some insights on the surprising issues plaguing notifications.
So cool, I can watch repos!
If you've used GitHub for a while, this might be familiar to you:
That's the original implementation of the Watch-a-repo button, which has profound implications for notifications.
For starters, GitHub users were pretty liberal with watching repos, but hardly anyone is interested in EVERYTHING that happens in a super busy repo like rails/rails. In practice, users watched repos for no particular reason other than to signal they cared about a project. Since you get a notification for each issue, PR, comment, etc., this meant many notifications, most of which users left unread. Now, to the first problem:
It was hard to surface the right notifications to users. The team wanted to fix that (and had substantial proposals), but there was a technical challenge: Notifications had two of the largest tables at GitHub, with billions of rows each.
In practice, this meant:
It was impossible to change table schemas. GitHub uses gh-ost (https://github.com/github/gh-ost) to run migrations without locking tables, but gh-ost works by building a full (ghost) copy of the table, and with billions of rows it's hard to fit that copy on the same host.
Replication lag: as with any big and sufficiently busy database, replicas start to fall behind.
So what do you think is the best solution here? Generate less data? Split it somehow? Indeed, generating less data was partially implemented: web notifications were disabled by default for all new users for a long time. But besides that, there wasn't an easy way to shrink the data short of hoping millions of users would stop watching repositories.
The other solution is a technical one: sharding. And that's what the team did. The team rolled out Vitess, which lets you shard your database without having to rewrite your entire app. It's not risk-free, and the team went to great lengths to guarantee success. Once implemented, it brought new possibilities: it reduced replication lag issues and opened the door to modifying the tables' schemas.
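To make the idea of sharding concrete, here is a minimal sketch of routing rows to shards by a key. It is not Vitess's API or GitHub's actual setup; the choice of user id as the sharding key, the shard count, and the names shard_for and ShardedNotifications are all assumptions for illustration.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index using a stable hash."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedNotifications:
    """Toy in-memory stand-in for a notifications table split across shards."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        # Each inner list plays the role of one much smaller table.
        self.shards = [[] for _ in range(num_shards)]

    def insert(self, user_id: int, subject: str) -> None:
        shard = self.shards[shard_for(user_id, len(self.shards))]
        shard.append((user_id, subject))

    def for_user(self, user_id: int) -> list:
        # Reading one user's notifications only touches one shard.
        shard = self.shards[shard_for(user_id, len(self.shards))]
        return [subject for uid, subject in shard if uid == user_id]


store = ShardedNotifications()
store.insert(1, "rails/rails#123")
store.insert(2, "github/gh-ost#45")
print(store.for_user(1))
```

The point of the sketch is that each shard holds only a fraction of the rows, so a schema change or a replica has far less data to deal with per host.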
The second issue was that Notifications was tightly coupled to subscribable objects. For example, if you comment on any issue in a repo, that automatically subscribes you to the issue, and you get notifications for subsequent updates. Furthermore, the concept of threading is baked into the system, which makes notifications "naturally" suited to objects with a thread-like shape (an issue with comments, for example). This led to two problems:
There wasn't a straightforward way to send a single notification for things that aren't subscribable objects, such as a failed credit card payment.
It wasn't easy to integrate independently. The complexity around subscribable objects, threading, and templates scattered across Ruby objects made Notifications very tightly coupled code. Any external team that needed to send new notifications needed the support of the Notifications team, and that usually meant multiple days of work across teams.
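A toy model may help show why one-off notifications didn't fit. This is a hypothetical sketch, not GitHub's code (which, as noted, lived in Ruby objects); the class ThreadNotifications and its fields are invented for illustration.

```python
from collections import defaultdict


class ThreadNotifications:
    """Toy model where every notification hangs off a subscribable thread."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # thread id -> user logins
        self.inbox = defaultdict(list)       # user login -> messages

    def comment(self, thread_id: str, author: str, body: str) -> None:
        # Existing subscribers hear about the new comment (not the author).
        for user in sorted(self.subscribers[thread_id] - {author}):
            self.inbox[user].append(f"{thread_id}: {author} commented: {body}")
        # Commenting automatically subscribes the author to the thread.
        self.subscribers[thread_id].add(author)


n = ThreadNotifications()
n.comment("rails/rails#1", "alice", "I can reproduce this")
n.comment("rails/rails#1", "bob", "Same here")
print(n.inbox["alice"])
```

In this model the only way into someone's inbox is through a thread with subscribers, so a message like a failed payment has no natural place to go.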
Ultimately, the whole system had to be rewritten to make it easy for teams to create new notifications for new products at GitHub. By the time I left, the team was busy with that, and the new backend system looked great, something close to an event-driven architecture that made more sense; and I’m sure the team will eventually write about it. But, of course, you must migrate existing data safely first, which might take a while.
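I can't share the real design, but the general shape of an event-driven approach can be sketched as follows. Everything here is an assumption for illustration: the names Event and NotificationBus, the event type string, and the idea of teams registering a renderer per event type.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str
    recipient: str
    payload: dict = field(default_factory=dict)


class NotificationBus:
    """Hypothetical bus: teams register a renderer, then publish events."""

    def __init__(self):
        self._renderers = {}
        self.delivered = []  # (recipient, message) pairs, in publish order

    def register(self, event_type: str, renderer) -> None:
        # A team owns its event type and how it turns into a message.
        self._renderers[event_type] = renderer

    def publish(self, event: Event) -> None:
        renderer = self._renderers.get(event.type)
        if renderer is None:
            raise KeyError(f"no renderer registered for {event.type!r}")
        self.delivered.append((event.recipient, renderer(event)))


bus = NotificationBus()
bus.register(
    "billing.payment_failed",
    lambda e: f"Your payment of {e.payload['amount']} failed",
)
bus.publish(Event("billing.payment_failed", "alice", {"amount": "$7"}))
print(bus.delivered)
```

With something of this shape, a billing team could add a new notification type without touching subscriptions or threads at all.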
There are a lot of other topics I haven't touched on that are equally complex: search, mobile notifications, and the user experience itself.
So indeed, the notifications product almost needs an entire startup to fix it. GitHub has enough engineers to assemble one. Still, Notifications happened to be under the same organization building GitHub Projects (which recently became generally available), and competing with Projects for relevance to users and the business is tricky. I do hope we see a revamp of the product in the next year; open source developers especially would be grateful.
If you're curious about other challenges in this area, let me know! Happy to expand on these topics.