Data Engineering Design Patterns (dedp.online)
334 points by sebg 10 months ago | 48 comments



It's interesting to note that when I was first called a DE, it just meant a software engineer in the data domain.

As in writing full software that happens to focus on data.

Just 6 years ago I would be tinkering with PrestoDB code, looking at optimizing the scheduler and building Hadoop extensions.

Between then and today the field swung to people who came from BI, with considerably less software engineering background. To the point that just 2 years ago, when applying for DE roles, I would be confused why the majority of my screening questions came in the form of "how well do you know SQL".

Today I do the same as I did 3-4 years ago, but I am no longer a data engineer.


I had a similar experience at Airbnb.

My title at Airbnb was “Data Engineer” in 2016, then “Software Engineer - Data” 2016-2019, then just “Software Engineer” 2019-2023.

When I joined the DE team we were not in the Engineering Org; our manager reported to the head of Analytics (Chief Data Scientist). The DE perf cycle, job levels and comp were all tied to the Analytics Org levels. There was a Data Infra team (DI) under Engineering > Infrastructure who managed Presto, HBase, Hive, &c. but didn’t touch pipelines; that was DE’s job.

Most of the DEs owned more than pipelines though; many of us also wrote and owned services. Max on our team built Airflow and Caravel/Panoramix/Superset during hackathons, Johnathan built our Data Quality tool, Amit built the Minerva semantic metrics layer (which Nick, James and Paul spun out as Transform), Aaron built our Anomaly Detection platform, John built Dataportal, I built our Customer Support Roster service and a Kafka indexing service.

Our manager was awesome. She saw that we were undervalued in Analytics and lobbied successfully to move the team to the Engineering Infrastructure org. We were all retitled in Workday, our perf structure changed to align with the rest of Engineering, as did our levels.

DE living as a whole org team under Infra lasted less than a year before we were split up and distributed into the respective product teams we supported, as Software Engineers with a focus on building & maintaining pipelines, schemas, logging libraries… and the existing tools we had built. The intention was to be embedded into the product teams (Homes, Trips, Support Tools, &c.), skill up these teammates and share the oncall load. In reality what happened was that (at least) 3 DE teams then grew in the various product orgs.


Maybe this is different at the highest levels of the game, but for engineers in the more mainstream parts of the bell curve, at companies below Google levels of craziness and volume, my experience is that Data Engineers (folks who have come up as former DBAs, data warehouse devs, DB-heavy backend devs, analytics/reporting folks) tend to solve problems in a more straightforward, data-centric, practical sort of way. And in my experience folks who enter a data role from the software side of things tend to come up with rather convoluted solutions to simple things.

Therefore I think the title distinction is warranted. It signals that the company is looking for engineers with skills in the software space (source control mastery, knowledge of a language or two other than SQL) but also experience looking at query plans, designing large scale data systems, dealing with BI tools, etc. A SW engineer from a traditional background CAN do this, but I'd rather have someone who fits the DE role more.


It's the least defined role.

Currently, I am in a funny situation where all teams agree we need an additional data engineer. But basically:

- Sales and finance want more of a business intelligence analyst

- Devs want more of a backend engineer

- ML researchers want a data analyst proficient in ETL to do pipelines on the training dataset

All of those 3 have only one thing in common - they need to know SQL very well. I've worked extensively with various technologies to analyze data - pandas, sql, spark. And still, I find SQL (especially recently BigQuery) getting me what I want the quickest.


I’m puzzled by this. Knowing SQL seems like a minor technical matter. You can learn whatever you need to do (or use gpt4 to quickly do stuff). Why does one need the expertise in syntax rather than just a general strong foundation in software and data stuff?


This lines up with my experience, and I've found it heavily depends on what industry you are in.

> software engineer in the data domain.

I was basically this for the past ten years. Maybe it was because I was working only in startups.

Outside of tech/startup orgs, "data engineers", at least the ones I found, were SQL specialists. About six years ago, I went into healthcare, and discovered there were about 30 people across five teams who were data engineers. "Oh cool. My colleagues," I thought. Imagine my surprise when I found they only knew SQL, knew data modeling theory, and had basically no SDLC experience. At my present job, in a traditionally blue-collar industry, I took over a team with the only data engineer in the whole company. He, too, knew only SQL. I've had to shove Python at him and get him working in SDLC.

I think this group, though, is shrinking. Putting pressure on it from the other side, Python is a common skill among data analysts these days. Software engineers do the heavy lifting and good-enough data modeling, while data analysts do business-specific analysis and good-enough software development like writing DAGs with Dagster. Knowing SQL isn't enough to get by in the job market.


I know plenty of people who only use SQL. There is now a role called analytics engineer that primarily uses SQL, often with transformation tools like dbt.

I personally haven’t met a lot of software devs who would call their data modeling capabilities ‘good enough’. It is a different way of thinking to go from building third normal form (3NF) tables to denormalizing everything. Plus, if a company is merging data from many systems, it makes things more complicated for the software developer.
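To make the 3NF-versus-denormalized contrast concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table and column names (`customers`, `products`, `orders`, `orders_wide`) and the sample rows are made up for illustration; a real warehouse would use its own engine and modeling conventions.

```python
import sqlite3


def build_tables(conn):
    """Create small normalized tables, then flatten them into one wide table."""
    conn.executescript("""
        -- Normalized (3NF-style): each fact lives in exactly one place.
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
        CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, title TEXT, category TEXT);
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(customer_id),
            product_id  INTEGER REFERENCES products(product_id),
            quantity    INTEGER,
            unit_price  REAL
        );
        INSERT INTO customers VALUES (1, 'Ada', 'UK'), (2, 'Linus', 'FI');
        INSERT INTO products  VALUES (10, 'Keyboard', 'Hardware'), (11, 'Editor license', 'Software');
        INSERT INTO orders VALUES
            (100, 1, 10, 2, 50.0),
            (101, 1, 11, 1, 30.0),
            (102, 2, 10, 1, 50.0);

        -- Denormalized: joins are paid once, attributes are repeated per row.
        CREATE TABLE orders_wide AS
        SELECT o.order_id,
               c.name  AS customer_name,
               c.country,
               p.title AS product_title,
               p.category,
               o.quantity,
               o.unit_price,
               o.quantity * o.unit_price AS revenue
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
        JOIN products  p ON p.product_id  = o.product_id;
    """)


def revenue_by_country(conn):
    """Analytical query against the wide table: no joins needed."""
    return conn.execute(
        "SELECT country, SUM(revenue) FROM orders_wide GROUP BY country ORDER BY country"
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_tables(conn)
print(revenue_by_country(conn))  # [('FI', 50.0), ('UK', 130.0)]
```

The trade-off shown here is the usual one: the wide table is simpler to query for reporting, but a change to a customer's country now has to be propagated to every repeated row.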


The moment you have learned SQL is the moment you have forgotten everything about C/C++.


The BI world is honestly kind of weird.

You have people who are at the intersection of "understands databases, the relational model, query optimization etc. at the level of a very senior SWE" ∩ "needs to be told how git works in the year of our lord 2023".


Don't forget the person who thinks they really have "big data" and need all this massive infrastructure. Eventually, one discovers it's a few gigs of CSV files that fit in RAM on a laptop.


I feel like DuckDB has been making waves here, helping people realize you don’t need Snowflake/BigQuery/etc. for all your datasets, but you can still get the nice feature set of those systems.


What is DuckDB?


Embedded column store database. SQLite but much much faster for aggregations.
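For anyone unfamiliar with the "embedded" part, here is a rough sketch of the idea: the database runs inside your process, with no server to set up, and you aggregate a small CSV with plain SQL. This uses Python's built-in `sqlite3` as a stand-in (so it is row-oriented, not columnar like DuckDB); the CSV contents and the `sales` table are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; in practice this would be a file on disk.
CSV_TEXT = """city,amount
Berlin,10
Paris,5
Berlin,7
"""


def load_csv(conn, text):
    """Load CSV rows into an in-memory table and return the row count."""
    rows = list(csv.DictReader(io.StringIO(text)))
    conn.execute("CREATE TABLE sales (city TEXT, amount REAL)")
    # Named placeholders are filled from each row dict.
    conn.executemany("INSERT INTO sales VALUES (:city, :amount)", rows)
    return len(rows)


def totals(conn):
    """Aggregate in SQL, entirely in-process."""
    return conn.execute(
        "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_csv(conn, CSV_TEXT)
print(totals(conn))  # [('Berlin', 17.0), ('Paris', 5.0)]
```

The same pattern (open a connection in-process, point it at local files, run SQL) is what makes this kind of tool attractive for datasets that fit on a laptop.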


I'm in this post and I don't like it lmao.


Yeah I’m thinking of changing my title back to software dev instead of DE - it’s sort of getting a bad rep.


How do you define "bad rep"?


I made the change from SWE to DE a few months ago and several fellow SWEs saw it at first glance as if I was downgrading into a less technical role. They assumed it was some kind of modern DBA.

I’ve always seen it as a Data Oriented Software Engineer but it seems that isn’t always the case, especially recently, as I’ve been offered jobs that were basically analyst roles or BI roles.


A lot of “data engineers” are former DB analysts and such who don’t know much of anything technical outside of SQL, and even that might be something they are only “certified” in rather than actually good at.

It’s basically becoming a title I’d associate with being low-skill. I used to be a “software engineer in data” and never called myself a data engineer because people would think I don’t know how to write/maintain production services, just write ETL pipelines.


OK, I guess you need to find the workplace that suits you.

It's just that I have had some recent experience with multi-threaded Java k8s services reading daily "streams" from Kafka.

When a SQL query would have solved it.


Many such cases in the data space of CV building over problem solving


I guess you have to decide what you want on your CV. "I solve problems" or "I build complex stuff".


Yea same. Also, honestly feels like the way the field is progressing it will just be eaten up by an SWE role. Feel the same for ML engineer and many other specialized roles.


I don't think that SWEs will.

The software and services are going to get advanced enough to just eliminate the need for a dedicated team to build ETL. People with relevant domain knowledge will have an easier time delivering their work product, without the overhead of a building phase.

For a reasonably good data platform, point-and-click ETL services, SaaS offerings and the likes of Metabase are already good enough for medium enterprises... and beat Databricks offerings for speed (setup, delivery and operation) in reporting and operational data access.

I am absolutely sure that there will be a massive contraction in the DS, DE and ML opportunity market in the next few years. The major companies will consolidate and jobs in those domains will only be available at a handful of companies... or extremely specialized startups (much like chip design is now consolidated).

Long story short, for companies - you probably don't need DS, ML and DE departments.


My experience is that many established companies are still struggling to get adequate operational reporting. Data engineers are still helpful to move the data necessary to make that happen. DS and ML become useful later once there's a more mature data culture and infrastructure. Otherwise you have analysts spending most of their time doing data engineering so they have something to analyze.


I think we are aligned; I am just much more sceptical towards no-code solutions. So I think these roles will shrink massively, and the little code you still have to write will then become just another thing to integrate into your SWE role (this is what is happening in my industry at least).


As a Data Engineer, I feel like a fixer. People think their data is incorrect and they ask me to fix it. It's similar to fixing someone's computer when they think it's broken.

Incredibly valuable but also somewhat unfulfilling in that you are rarely innovating/inventing like a SWE. There you are building tools that can change how people act/think. On the other hand, as a SWE/inventor you aren't desperately needed like you are as a DE. The needing can be nice...


Also a DE - the “data as a service” model is broken. Data engineers need to be upstream producing insights because they have the best feel for the warts.

Companies that treat DE as IT (“service model”) do not have the right perspective and will not hire the kind of DE who are difference makers.

Data teams should be partners.


I work in data, but I've seen a lot of comments from SWEs who are tired of building the same CRUD web app over and over again. The grass is always greener?


I work as a DE in a local medium-sized business, and since there's no giant all-in-one ERP, my role is essentially to build solutions that improve (or create) analytics to act upon. It's working, and even if I'm not at SpaceX, I feel like I'm inventing something. But, given the size of it all, I also fix things (software and hardware).

Now, the problem is to convince other businesses that they might need my services.


I used to give a lightning talk at Google called "Life of a Data Engineer" (~6+ years ago), where I tried to explain to Google Software Engineers and Product Managers the concept of a "Data Engineer" in other parts of industry, as well as Google Cloud Customers.

In that talk, I outlined three (3) primary archetypes which still exist today:

- Data Engineering for Business Intelligence (often a rebranding of yesterday's "DBAs")

- Data Engineering for Data Science

- Data Engineering for Machine Learning (precursor for today's "ML Engineers")

Having worn all of these hats in the past, and many others, in different contexts, one of the key points I wanted to convey was that, in practice:

- The Role of a Data Engineer on a Team is Complementary and Often Defined By The Tasks That Others Don’t (Want To) Do Well

There are many guides, tutorials, and courses that like to talk about tools and technologies, and yet very few that get real about organizational structures and the dynamics that cause these archetypes to emerge, along with people's desires to rebrand themselves as "Data Engineers," "ML Engineers," or "(Data) Software Engineers" instead.

There was an interesting survey of ~600 "Data Engineers" by DataKitchen from 2021 where they had some very relatable, yet eye-opening findings:

- 97% report experiencing burnout in their day-to-day jobs.

- 70% say they are likely to leave their current company for another data engineering job in the next 12 months.

- 79% have considered leaving the industry entirely.

- 91% report frequently receiving requests for analytics with unrealistic or unreasonable expectations.

- 87% say they are blamed when things go wrong.

- 69% say their company’s data governance policies make their day-to-day job more difficult.

- 89% report frequent disruptions to work-life balance due to unplanned work.

(source: https://datakitchen.io/white_papers/2021-data-engineering-su...)


https://f.hubspotusercontent10.net/hubfs/2866765/Whitepapers...

So people don't have to give their details for the survey


The "please let me know" <https://www.dedp.online/appendix/feedback> link at the bottom of the page gives an ugly Apache 404

heh, I would send feedback about that but ...

---

update: so, it seems the Feedback in the sidebar <https://www.dedp.online/appendix/feedback.html> works so it was just a missing page extension. While reading that I discovered the GitHub repo for the project is private, which explains why I couldn't find it, either


Author of the book here. Yes, I used mdBook (https://github.com/rust-lang/mdBook), and the links strangely end with `.html`, as you correctly noted in your update. Initially, I wanted to change it but left it as it was.


DAMA-DMBOK2 covers this very comprehensively

https://www.dama.org/cpages/body-of-knowledge


Data engineering is cool and new while data management is old school and enterprise.

Specifically, data engineering in some tech companies is truly a revenue driver, which makes it stand out all the more that data engineering in other organizations is viewed as a cost center, even if it is the same work at most organizations.


This may be nitpicking, but I find describing technologies as "cool" versus "enterprise" or "new" versus "old" meaningless. I don't necessarily want to have the "coolest" or "newest" tech stack; I want to have the tech stack that reasonably and reliably solves my business problems. If that means leveraging "old" or "enterprise" technologies and practices, that could be totally fine.


How do you define the two terms?


Data Engineering is an engineering discipline -- it can involve anything from data ingestion, transformation, storage, enrichment, aggregation up to presentation in operational reports. But it's still a manufacturing process with "data" as an input and "data" as its output.

Data Management is an organizational discipline -- it is about how the enterprise manages data as an asset and how data is embedded in the organization. This includes data governance issues like common data models, and a chain of command (which person/role is responsible for which piece of data), but also second-tier data processes such as quality control and data valuation.


Data engineering versus data management?

Data engineering is nominally more pipeline oriented and less concerned with the governance & people side of things, but good data engineering people end up driving a lot of data management work because that's what makes the data engineering less painful (eliminate root cause of data errors and annoying data requests) and data overall more useful and valuable.


This is a pretty cool resource and lots of effort to put it together. I spent ten minutes reading the first few pages. I have to admit it's a bit hard to read. I would describe it as fairly verbose with lots of preamble before actually saying what you want to say. I also think this is very fixable with some editing and iterations. Just my 2 cents.


Could be interesting once there's more content; in its current state the content is mostly just definitions.


Author of the book here — 100%. The idea is to release early and update on the go. You might wonder if it's worth delving into the book in its current, unfinished form. I'd say it depends.

There's quite a bit of content around general DE knowledge. However, the anticipated design patterns are still in the works. If that's what you're most excited about, you'll find the beginnings of exploring the patterns of `caching` and `ad-hoc querying` in the first Convergent Evolution chapter, but otherwise, you need to wait for more.


Maybe I'm just of a different vintage but I'm just not a fan of these e-mail newsletters that seem to be the trend these days. I'd much rather follow a GitHub repo for this sort of thing.


EmailNewsletters.Age > GitHub.Age ? “Showing my vintage” : “Kids these days”


Does anyone know what tech is used here to create the online book?



Yep, it is. Super happy with it. I added some custom themes and links to my second brain, but other than that, it's all mdBook. I also have more notes about writing a book with Markdown and self-publishing at https://www.ssp.sh/brain/create-a-book-on-markdown/.


Awesome ty!!



