First off, don't get too distracted by what software engineers are talking about on the internet. Despite the enormous overlap in what kinds of technologies you use at work, the jobs are not the same, and the details of what constitutes a job well done are often not the same, either.

In particular: data engineers' main task is to produce clean data. Few people care about the cleanliness of the code used to produce it. They care about whether or not the data makes sense, whether or not it's accurate, how well it's documented, and whether or not others can understand the ETL pipelines' dependencies and assumptions so that people can predict how changes to business processes and IT systems will affect their business intelligence and data science efforts.

Cleanliness of code is, at best, a secondary concern. Especially if you're mostly working solo. The code for managing data pipelines is often relatively small and tightly scoped, to the point where even a messy ETL implementation can still be easier to understand than very clean application source code. Or at least, that's been my experience as someone who's worked as both a software engineer and a data engineer.

The biggest thing is, again, just to make sure the documentation is good. In one data engineering role I found a bunch of things in a pipeline that I was pretty sure were junk that we could get rid of to simplify the whole thing. Doing so would have reduced the code for the pipeline in question by ~25%, with comparable reductions in run time and server load. But I couldn't be sure, because the original author left no documentation to indicate why those bits were there, so we ultimately decided to leave them in place for the sake of safety. The original author was a software engineer, not a data engineer, so the code in question was incredibly clean and readable. I had no trouble figuring out how it was doing what it was doing. But what I really needed to know was what it was supposed to accomplish, and why.
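To make that concrete, here's a rough sketch of the kind of note I wish had been sitting on top of each of those mystery steps. Everything in it is made up for illustration (the function name, the `TEST-` prefix, the QA backstory); the point is only that the docstring records what the step is for and when it's safe to delete, which the code alone can't tell you.

```python
def drop_test_accounts(rows):
    """Remove rows that belong to internal test accounts.

    What: drops any row whose customer_id starts with "TEST-".
    Why: (hypothetical) QA creates test accounts in the production system,
         and leaving them in inflates the customer counts in reporting.
    Assumes: the upstream system keeps using the "TEST-" prefix.
    Safe to remove when: QA stops creating accounts in production.
    """
    return [row for row in rows if not row["customer_id"].startswith("TEST-")]
```

The code is one line. The four lines of "what/why/assumes/safe to remove" are what would let the next person cut that ~25% with confidence.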

I'd also encourage caution about adopting a fancy tech stack. It's amazing what you can accomplish with just a SQL database (or Parquet files) and some Python scripts. Data engineering is a space with a whole lot of vendors and open source projects selling solutions to problems you might not even have. And every additional solution you bring on is another piece of technology to understand, which can quickly get overwhelming if you're on a small team. Be especially careful about committing to the open source versions of technologies like Apache Spark. I can tell you from experience that, if you can't name the specific person who will personally be responsible for supporting and operating the technology, that person will be you, you will end up spending a lot of time doing it, and it will probably not be a use of time that the person who does your performance reviews will consider valuable. Especially if "I had to spend a bunch of time diagnosing Spark executor failures" becomes the reason why you couldn't get other things done.
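For a sense of scale, here's roughly what "a SQL database and some Python scripts" can look like. This is only a sketch using SQLite from the Python standard library; the `orders` table, its columns, and the sample rows are all invented for the example, and a real pipeline would read from actual source systems.

```python
import sqlite3


def load_orders(conn, rows):
    """Load (order_id, customer, amount_cents) tuples into an orders table.

    Assumption (hypothetical): amounts arrive as integer cents, and a
    re-delivered order_id should overwrite the earlier row.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def totals_by_customer(conn):
    """Return a list of (customer, total_cents) tuples, sorted by customer."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount_cents) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # swap in a file path for real use
    load_orders(conn, [(1, "alice", 100), (2, "bob", 50), (3, "alice", 200)])
    print(totals_by_customer(conn))
```

No cluster, no scheduler, nothing to operate. You can go a long way with this shape before you hit a problem that actually needs something heavier.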



