
The Most Complicated System I Ever Worked On

A ludicrously elaborate pipeline of webscraping, parsing, persistence, processing, and polling

Jacob Bartlett

Feb 17

Writing online is a lot like creating content on OnlyFans.

And I’m not just saying that because I’m very sexy.

I’m saying that because it has a strict power-law distribution.

There are several big names at the very top of the leaderboard that pull in millions of dollars a year. On the other side, there’s an incredibly long tail of people who make almost no money, perhaps with a tiny number of paid readers and/or simps.

Wow, TIL Substack was even more top-heavy than OnlyFans (source)

Between the extremes sits a quiet middle class, around #88 on the technology leaderboard, who are probably pulling in something like minimum wage (p.s. that’s exactly where I’m at, hence my recent job-hunt).

This power-law distribution shows up everywhere. In software engineering, it manifests in the complexity of problems that we work on day-to-day. There is a very long tail of everyday “build this screen” or “investigate this bug” that you can largely perform on autopilot.

In the middle are regular mid-level challenges that crop up every month or two: profile this subtle performance problem, or help architect this new feature.

Once in a blue moon, maybe just once in your career, you have to pull off something nobody’s built before. Something with so many moving parts that it needs constant work to stop it overwhelming your team.

This rare peak challenge is what I’m bringing to you today. The hardest engineering challenge I have ever faced. A one-of-a-kind feature that orchestrated an automated, client-side, comprehensive, high-volume data ingestion pipeline.

Building a reliable system on top of an unreliable system is always quite traumatic, so please excuse my occasional abrupt screams as I write this.

The original pipeline allowed you to fetch your Google data, which could reach gigabyte scale for each user and was the root of much of the complexity. For this example, however, I’m doing the same thing with my Spotify data, because open-sourcing my music search history in the sample project is less likely to get me arrested.

I didn’t build this bad boy from scratch. I inherited the system in MVP form from an engineer who is a lot more intelligent than me. I did, however, have the pleasure of spending a year turning it from a neat (but fragile) MVP into a battle-hardened production system.

This article comes complete with the full sample project, demoing the whole pipeline in all its glory (with companion server code!).

Come and see the man-made horrors beyond your comprehension.
