Was hoping for some genuine insights but came away with only generic truisms.
If anyone’s looking for discussion-worthy truths, here are some from my experience:
1. Metrics are important, but you have to find the balance between too many thresholds and too few. The only metric with a set-and-forget threshold I’ve come across is a two-sample Kolmogorov–Smirnov test for changes in distribution (sketch at the end of this comment). Otherwise the only thing that works in practice is monitoring trends of things like fill rates and unique-value counts on a log scale over time.
2. Create a UI for everything: your RBAC, your pipeline overview, your configurator, whatever. The moment you slap a UI over something, engineers no longer need to be in the loop for every change.
3. The moment you have more than 2–3 people touching code and configs, add an RBAC service and force everyone to modify things through a UI, so you get an audit trail and full reversibility. Bonus: when it comes to SOC 2 or whatever cert, you’re already there.
4. Focus on building a robust configuration service as early as possible, and use it everywhere. Nothing should be hardcoded (see the config-service sketch at the end of this comment).
5. Some engineers are just not good with big data and data troubleshooting. That’s reality, and it’s almost futile trying to onboard them onto big-data pipeline or product work. It correlates with an engineer’s ability to hold a very large mental model of both the code and the data schemas, which gets more complicated at data-pipeline companies.
6. Airflow sucks. It’s one of those tools that doesn’t fundamentally solve the problem it’s supposed to solve. If you can use Astronomer or something similar from the beginning, please do. Same for Great Expectations.
Great Expectations: because it’s primarily focused on threshold-based alerting, which never works with messy data. Because it doesn’t actually have any advanced metrics (like the KS test above). Because it gives very minimal actual tooling: it doesn’t graph the metrics in any meaningful way, and it doesn’t offer any tools to deal with alert fatigue. If not used carefully, you end up hardcoding thresholds in a way that’s annoying to modify, so you just start ignoring all the alerts anyway.
Airflow: you’re still left to figure out how to run a single instance of the scheduler in an HA setup, because if you don’t, it’s less reliable than a cron job on a Linux box. And given that everyone is moving to Snowflake-like models, you really don’t need most of what Airflow offers (workers, etc.); you just need something to run queries against a service.
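To make point 1 concrete, here’s a minimal sketch of the set-and-forget check: a two-sample KS test comparing yesterday’s values of a column against today’s. The synthetic data, the column name, and the alpha are placeholders for illustration, not a prescription.

    import numpy as np
    from scipy.stats import ks_2samp

    def distribution_drifted(baseline, current, alpha=0.01):
        # Two-sample Kolmogorov-Smirnov test: alpha is the only knob,
        # no per-column threshold to tune and then forget to retune.
        _stat, p_value = ks_2samp(baseline, current)
        return p_value < alpha

    # Synthetic stand-ins for yesterday's vs. today's values of one column.
    rng = np.random.default_rng(0)
    baseline = rng.normal(100, 15, size=5000)  # e.g. yesterday's order_amount
    current = rng.normal(110, 15, size=5000)   # today's load: mean drifted up
    print(distribution_drifted(baseline, current))  # True: distribution moved

Feed it raw column values from consecutive loads and page only on this; the fill-rate and unique-count trends belong on dashboards, not pagers.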
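And a rough sketch of what points 3 and 4 look like combined: a config service where every change is attributed, timestamped, and reversible. In production this sits behind the UI with an RBAC check in front of set(); the in-memory store and the key names here are made up for illustration.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class ConfigService:
        _values: dict = field(default_factory=dict)   # key -> current value
        _history: list = field(default_factory=list)  # append-only audit trail

        def set(self, key, value, actor):
            # Every write records who changed what, when, and from what value.
            old = self._values.get(key)
            self._history.append((time.time(), actor, key, old, value))
            self._values[key] = value

        def get(self, key):
            return self._values[key]  # pipelines read this; hardcode nothing

        def revert(self, key, actor):
            # Reversibility: roll back to the prior value as a new, audited change.
            previous = next(old for _, _, k, old, _ in reversed(self._history) if k == key)
            self.set(key, previous, actor)

    cfg = ConfigService()
    cfg.set("orders.fill_rate_floor", 0.95, actor="alice")
    cfg.set("orders.fill_rate_floor", 0.90, actor="bob")
    cfg.revert("orders.fill_rate_floor", actor="alice")
    print(cfg.get("orders.fill_rate_floor"))  # 0.95 again, with a full trail

Swap the dicts for database tables and the SOC 2 evidence is a query away.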
This is spam-tier content. There’s almost nothing specific to scaling data-intensive applications. You could rename the article “how to catch the squirrel in your backyard” and it’d be just as relevant.
I wonder if the upvotes were purchased or coordinated. If actual readers upvoted this, I would love to know why.
I would call it LinkedIn-tier content. It makes non-technical people feel like they’ve learned something new. Like their chair just slid an extra inch toward the dev team.
As for the front page, it only takes a handful (<10) of upvotes early on for something to be shoved in front of a critical mass of people.