The mistake everyone makes is treating the entire space as if it's somehow different from the rest of software engineering, when it is exactly the same.
Well… no, I would disagree. I am a data engineer who relentlessly pushes for more classic SWE best practices in our engineering teams, but there are some fundamental differences you cannot ignore in data engineering that are not present in standard software engineering.
The most significant of these is that version control is not guaranteed to be the primary gate to changes to your system’s state. The size, shape, volume, and skew of your data can change independently of any code. You can have the best unit/integration tests and airtight CI/CD, but it doesn’t matter if your data source suddenly starts populating a column you were not expecting, or the kurtosis of your data changes and now your once-performant jobs are timing out.
Compounding this, there is no one answer for how to handle data changes. Sure, you can monitor your schema and grind everything to a halt the second something is different in your source, but then you’re likely running at a high false-positive rate, and causing backup or data-loss issues for downstream consumers.
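The trade-off above (hard-fail vs. warn on schema drift) can be sketched in a few lines. This is a minimal illustration, not a production validator; the `EXPECTED_SCHEMA` dict and `check_schema` helper are hypothetical names, and it assumes pandas DataFrames as the pipeline's in-memory format:

```python
import pandas as pd

# Hypothetical expected contract: column name -> dtype string.
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64"}

def check_schema(df: pd.DataFrame, strict: bool = False) -> list[str]:
    """Return a list of drift findings; raise only in strict mode."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"dtype changed: {col} is {df[col].dtype}, expected {dtype}")
    # The "source suddenly starts populating a column" case:
    for col in df.columns:
        if col not in EXPECTED_SCHEMA:
            problems.append(f"unexpected column: {col}")
    if strict and problems:
        # Grind-to-a-halt mode: safest, but high false-positive cost downstream.
        raise ValueError("; ".join(problems))
    return problems  # warn mode: log/alert, keep the pipeline running

df = pd.DataFrame({"user_id": [1, 2], "amount": [9.5, 3.0], "surprise": ["a", "b"]})
print(check_schema(df))  # flags the unexpected "surprise" column
```

The whole policy question lives in that `strict` flag: neither setting is right for every consumer, which is exactly why there's no one answer.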
The truth is that change management and scaling are more or less predictable, solved problems in traditional SWE. But the same principles can’t be applied with the same effectiveness to DE. I have seen a number of SWEs-turned-DEs struggle precisely because they do not believe there to be any difference between the two. It would be nice if it were so simple.
You are massively oversimplifying "classic SWE" here. Ultimately it makes no difference.
How you handle contract violations between dependencies and third parties is not an inherently different problem. What should you do if a data source changes schema? How do you detect it, alert, take action? How are users or customers or consumer jobs impacted? This is all just normal software.
In many ways it feels like you're making my case for me.
It may be that those topics are part of the SWE process but not generally applicable to, for example, UI work, and are therefore overlooked.
But it may also be something else: that a process-minded engineer can create Grafana dashboards and processes for API updates like a pro.
But only the person who really cares about the data and knows it inside out can catch the "unknown unknowns". And for that you need to give access (and hand over responsibilities) to analysts too, and in some ways trust them too, even though they may not have processes as rigorous.
>The most significant of these is that version control is not guaranteed to be the primary gate to changes to your system’s state. The size, shape, volume, skew of your data can change independently of any code.
That has been true in SWE since the days of ENIAC.
The training part of MLOps is important to ensure replicability and other desirable properties of the ML artifact, but the rest is clearly good CI/CD and observability.
You can develop a database schema for your application on dummy data, and test your business logic on dummy data. But you can't develop a machine learning model on dummy data.
It is best practice to keep live data out of your development environment for normal software development, but that is impossible for machine learning projects.
Yep. I advise my clients to have a controlled, sandboxed partition of prod. Call it the whiteboard or what have you: things can be erased, or you can cache model version runs. Your data transforms can happen in dev, but training should happen against real data. Synthetic data can be a compromise, but it can sharply bias your model and, since it's a harder lift, often multiplies data prep time.
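A toy illustration of how synthetic data can bias a model, using only the standard library. The assumption here (mine, not the commenter's) is a heavy-tailed "real" feature, such as transaction amounts, naively replaced by a Gaussian fit to its mean and stdev: the synthetic stand-in erases the tail a model would need to learn.

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" feature with a heavy right tail (lognormal).
real = [random.lognormvariate(0, 1) for _ in range(10_000)]

# Naive synthetic stand-in: a normal distribution matching mean/stdev.
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# Fraction of observations beyond mean + 3*sigma: the tail a model
# trained on synthetic data would essentially never see.
def tail_fraction(xs):
    return sum(x > mu + 3 * sigma for x in xs) / len(xs)

print(tail_fraction(real), tail_fraction(synthetic))
```

First and second moments match by construction, yet the extreme events differ by an order of magnitude; any model fit on the synthetic set is blind to exactly the cases that tend to matter in prod.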
Yes, this is what I thought and exactly what I am experiencing right now.
It is somewhat frustrating to see the engineering/platform team spending time on this kind of abstraction that is essentially useless for doing the work we need to do.