The mistake everyone makes is treating the entire space as if it's somehow different from the rest of software engineering, when it is exactly the same.
Well… no, I would disagree. I am a data engineer who relentlessly pushes for more classic SWE best practices in our engineering teams, but there are some fundamental differences you cannot ignore in data engineering that are not present in standard software engineering.
The most significant of these is that version control is not guaranteed to be the primary gate to changes to your system’s state. The size, shape, volume, and skew of your data can change independently of any code. You can have the best unit/integration tests and airtight CI/CD, but it doesn’t matter if your data source suddenly starts populating a column you were not expecting, or the kurtosis of your data changes and now your once-performant jobs are timing out.
Compounding this, there is no one answer for how to handle data changes. Sure, you can monitor your schema and grind everything to a halt the second something is different in your source, but then you’re likely running at a high false-positive rate, and causing backup or data-loss issues for downstream consumers.
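The trade-off above (hard-fail vs. warn on schema drift) can be sketched in a few lines. This is a minimal illustration, not a production validator; the `EXPECTED_SCHEMA` dict and `check_schema` helper are hypothetical names, and it assumes pandas DataFrames as the pipeline's in-memory format:

```python
import pandas as pd

# Hypothetical expected contract: column name -> dtype string.
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64"}

def check_schema(df: pd.DataFrame, strict: bool = False) -> list[str]:
    """Return a list of drift findings; raise only in strict mode."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"dtype changed: {col} is {df[col].dtype}, expected {dtype}")
    # The "source suddenly starts populating a column" case:
    for col in df.columns:
        if col not in EXPECTED_SCHEMA:
            problems.append(f"unexpected column: {col}")
    if strict and problems:
        # Grind-to-a-halt mode: safest, but high false-positive cost downstream.
        raise ValueError("; ".join(problems))
    return problems  # warn mode: log/alert, keep the pipeline running

df = pd.DataFrame({"user_id": [1, 2], "amount": [9.5, 3.0], "surprise": ["a", "b"]})
print(check_schema(df))  # flags the unexpected "surprise" column
```

The whole policy question lives in that `strict` flag: neither setting is right for every consumer, which is exactly why there's no one answer.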
The truth is that change management and scaling are more or less predictable, solved problems in traditional SWE. But the same principles can’t be applied with the same effectiveness to DE. I have seen a number of SWEs-turned-DEs struggle precisely because they do not believe there to be any difference between the two. It would be nice if it were so simple.
You are massively oversimplifying "classic SWE" here. Ultimately it makes no difference.
How you handle contract violations between dependencies and third parties is not an inherently different problem. What should you do if a data source changes schema? How do you detect it, alert, take action? How are users or customers or consumer jobs impacted? This is all just normal software.
In many ways it feels like you're making my case for me.
It may be that those topics are part of the SWE process but not generally applicable to, for example, UI work, and are therefore overlooked.
But it may also be something else: that a process-minded engineer can create Grafana dashboards and processes for API updates like a pro.
But only the person who really cares about the data and knows it inside out can catch the "unknown unknowns". And for that you need to give access (and hand over responsibilities) to analysts too, and in some ways trust them too, even though they may not have processes as rigorous.
>The most significant of these is that version control is not guaranteed to be the primary gate to changes to your system’s state. The size, shape, volume, skew of your data can change independently of any code.
That has been true in SWE since the days of ENIAC.
The training part of MLOps is important to ensure replicability and other desirable properties of the ML artifact, but the rest is clearly good CI/CD and observability.
You can develop a database schema for your application on dummy data, and test your business logic on dummy data. But you can't develop a machine learning model on dummy data.
It is best practice to keep live data out of your development environment for normal software development, but that is impossible for machine learning projects.
Yep. I advise my clients to have a controlled, sandboxed partition of prod. Call it the whiteboard or what have you: things can be erased, or you can cache model version runs. Your data transforms can happen in dev, but training should happen against real data. Synthetic data can be a compromise, but it can sharply bias your model and, since it's a harder lift, often multiplies data prep time.
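A toy illustration of how synthetic data can bias a model, using only the standard library. The assumption here (mine, not the commenter's) is a heavy-tailed "real" feature, such as transaction amounts, naively replaced by a Gaussian fit to its mean and stdev: the synthetic stand-in erases the tail a model would need to learn.

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" feature with a heavy right tail (lognormal).
real = [random.lognormvariate(0, 1) for _ in range(10_000)]

# Naive synthetic stand-in: a normal distribution matching mean/stdev.
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# Fraction of observations beyond mean + 3*sigma: the tail a model
# trained on synthetic data would essentially never see.
def tail_fraction(xs):
    return sum(x > mu + 3 * sigma for x in xs) / len(xs)

print(tail_fraction(real), tail_fraction(synthetic))
```

First and second moments match by construction, yet the extreme events differ by an order of magnitude; any model fit on the synthetic set is blind to exactly the cases that tend to matter in prod.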
Yes, this is what I thought and exactly what I am experiencing right now.
It is somewhat frustrating to see the engineering/platform team spending time on this kind of abstraction that is essentially useless for doing the work we need to do.