Was hoping for some genuine insights but came away with only generic truisms.
If anyone’s looking for discussion-worthy truths, here are some from my experience:
1. Metrics are important, but you have to find the balance between too many thresholds and too few. The only metric with a set-and-forget threshold I’ve come across is a two-sample Kolmogorov–Smirnov test for changes in distribution (sketch at the end of this comment). Otherwise the only thing that works in practice is monitoring trends of things like fill rates and unique-value counts on a log scale over time.
2. Create a UI for everything: your RBAC, your pipeline overview, your configurator, whatever. The moment you slap a UI over something, engineers no longer need to be in the loop for every change.
3. The moment you have more than 2–3 people touching code and configs, add an RBAC service and force everyone to modify things through a UI, so you get an audit trail and full reversibility. Bonus: when it comes to SOC 2 or whatever cert, you’re already there.
4. Focus on building a robust configuration service as early as possible, and use it everywhere. Nothing should be hardcoded (see the config-service sketch at the end of this comment).
5. Some engineers are just not good with big data and data troubleshooting. That’s reality, and it’s almost futile trying to onboard them onto big-data pipeline or product work. It correlates with an engineer’s ability to hold a very large mental model of both the code and the data schemas, which gets more complicated at data-pipeline companies.
6. Airflow sucks. It’s one of those tools that doesn’t fundamentally solve the problem it’s supposed to solve. If you can use Astronomer or something similar from the beginning, please do. Same for Great Expectations.
Great Expectations: because it’s primarily focused on threshold-based alerting, which never works with messy data. Because it doesn’t actually have any advanced metrics (like the KS test above). Because it gives very minimal actual tooling: it doesn’t graph the metrics in any meaningful way, and it doesn’t offer any tools to deal with alert fatigue. If not used carefully, you end up hardcoding thresholds in a way that’s annoying to modify, so you just start ignoring all the alerts anyway.
Airflow: you’re still left to figure out how to run a single instance of the scheduler in an HA setup, because if you don’t, it’s less reliable than a cron job on a Linux box. And given that everyone is moving to Snowflake-like models, you really don’t need most of what Airflow offers (workers, etc.); you just need something to run queries against a service.
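To make point 1 concrete, here’s a minimal sketch of the set-and-forget check: a two-sample KS test comparing yesterday’s values of a column against today’s. The synthetic data, the column name, and the alpha are placeholders for illustration, not a prescription.

    import numpy as np
    from scipy.stats import ks_2samp

    def distribution_drifted(baseline, current, alpha=0.01):
        # Two-sample Kolmogorov-Smirnov test: alpha is the only knob,
        # no per-column threshold to tune and then forget to retune.
        _stat, p_value = ks_2samp(baseline, current)
        return p_value < alpha

    # Synthetic stand-ins for yesterday's vs. today's values of one column.
    rng = np.random.default_rng(0)
    baseline = rng.normal(100, 15, size=5000)  # e.g. yesterday's order_amount
    current = rng.normal(110, 15, size=5000)   # today's load: mean drifted up
    print(distribution_drifted(baseline, current))  # True: distribution moved

Feed it raw column values from consecutive loads and page only on this; the fill-rate and unique-count trends belong on dashboards, not pagers.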
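And a rough sketch of what points 3 and 4 look like combined: a config service where every change is attributed, timestamped, and reversible. In production this sits behind the UI with an RBAC check in front of set(); the in-memory store and the key names here are made up for illustration.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class ConfigService:
        _values: dict = field(default_factory=dict)   # key -> current value
        _history: list = field(default_factory=list)  # append-only audit trail

        def set(self, key, value, actor):
            # Every write records who changed what, when, and from what value.
            old = self._values.get(key)
            self._history.append((time.time(), actor, key, old, value))
            self._values[key] = value

        def get(self, key):
            return self._values[key]  # pipelines read this; hardcode nothing

        def revert(self, key, actor):
            # Reversibility: roll back to the prior value as a new, audited change.
            previous = next(old for _, _, k, old, _ in reversed(self._history) if k == key)
            self.set(key, previous, actor)

    cfg = ConfigService()
    cfg.set("orders.fill_rate_floor", 0.95, actor="alice")
    cfg.set("orders.fill_rate_floor", 0.90, actor="bob")
    cfg.revert("orders.fill_rate_floor", actor="alice")
    print(cfg.get("orders.fill_rate_floor"))  # 0.95 again, with a full trail

Swap the dicts for database tables and the SOC 2 evidence is a query away.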
This is spam-tier content. There’s almost nothing specific to scaling data-intensive applications. You could rename the article “how to catch the squirrel in your backyard” and it’d be just as relevant.
I wonder if the upvotes were purchased or coordinated. If actual readers upvoted this, I would love to know why.
I would call it LinkedIn-tier content. It makes non-technical people feel like they’ve learned something new. Like their chair just slid an extra inch toward the dev team.
As for the front page, it only takes a handful (<10) of upvotes early on for something to be shoved in front of a critical mass of people.