The Need for Data Engineers (thenewstack.io)
174 points by eaguyhn 34 days ago | 59 comments

I feel the term "Data Engineer" gets used for a lot of catch-all "we have problems that need an owner" situations.

There's not much consistency across job postings and interviews for this kind of thing.

I just interviewed for one "Data Engineer" position which consisted of nearly 100% stored procedures. No one knew what else to call it, and they didn't want to advertise for a DBA, because there were no real DBA responsibilities. So "Data Engineer" was chosen.

Another "Data Engineer" position was almost entirely Spark. There was no SQL involved - they expected all applicants to be Spark experts, with a deep knowledge of Scala.

It's hard to know what to expect out of "Data Engineer" positions until you walk into a place and start asking questions in the interview.

For us it's simply the runtime you target. A frontend engineer writes code for the browser. A mobile engineer writes code for iOS or Android. A backend engineer writes code for servers. A data engineer writes code for the analytical/warehouse environment.

It was typically backend engineers doing the data pipelines to expose records from their own services, until the company got bigger and the analysts got more demanding, and that became a dedicated role.

Engineers are professionals in the business of solving problems :)

Scientists with thumbs was one definition I heard

I feel the name "data engineer" is perfectly fine for both positions. It's the job of the job post to clarify the requirements.

I don't see an issue with looking for a "data engineer with deep experience building and maintaining data pipelines based on spark".

What would you call it? "Spark operator"?

> It's the job of the job post to clarify the requirements.

I couldn't agree more. That's the exact thing I'm saying is a problem here.

When stuff is wrong, and we don't have enough people, and don't have a well-defined job, but need someone to just handle it .. "Data Engineer".

You get there, and everybody has a different idea as to what you should be doing, and what skills you should have.

Is this any different from “software engineer”?

SE usually at least knows what language they're expecting you to mainly work in. Not always the case for DE.

Ah gotcha, the problem isn't that it's a vague job title, the problem is that the job description for a particular position may not set expectations clearly enough

No - I'm trying to say that people often don't even know what the DE really needs to do. They know they have back-end problems, and need someone to clean them up.

But often the hiring manager believes it's probably mostly SQL or mostly python, and advertises that. They really don't know - because someone left, and they are trying to replace them.

So you get there, and talk to the hiring manager, and they're like "are you good in SQL?".

Then you talk to someone closer to the problem, and they're like "most of it's in Java - how's your Java?".

Then someone even closer to the position says, "well, most of the work we need done is really in Scala".

It's your job to just take it over, and make it happen. "Fix the data" - using our existing patchwork of tools ...

This isn't always the case - but given the nature of the problem, this happens a hell of a lot more often than with Soft Eng's.

Nope. Just someone who is willing to use tools, in addition to building tools, to deal with data.

Most who consider themselves software engineers only want to code and build new, not use existing. Mostly a mindset thing.

You're seeing the same thing happen with the "Data Engineer" title as what has happened with "Data Scientist."

In the hiring market, many positions that were previously titled "Data Analyst"/etc were renamed "Data Scientist." Similarly "Database Engineer"/"Data Warehouse Engineer"/"DBA"/etc were renamed "Data Engineer."

Why? Various reasons:

One might be the need to give people a sexier title because they don't want to be "just a data analyst."

Another might be hiring managers and recruiting teams changing the title to capture more views from people searching with these terms.

And why are people searching for jobs using these terms? Articles like the ones in the original post, and "Data Scientist: The sexiest job of the 21st century" ;)

I had to stop calling one of my previous roles “data engineering” because people thought I was just doing SQL script ETL... which really was maybe 30% of the job, but apparently that’s too menial to give you programmer cred.

hmm ... I didn't think about that. I wonder if the term is building a negative connotation?

Most programming is just plumbing with data, so I'd say the term data engineer would fit most programmers, most of the time.

> I just interviewed for one "Data Engineer" position which consisted of nearly 100% stored procedures.

How about data or database programmer? I've also heard "Data Analyst", "SQL Developer", etc.

But I agree that "data engineer", "software engineer", "network engineer", etc are too general to really be meaningful.

Doesn't the term software engineer suffer from the same ambiguity?

I would pay in solid gold for a data engineer that knew how to glue <the things the data scientists need> to <the rest of the infrastructure> in a way that fixes the impedance mismatch that seems to exist in the tooling.

In my experience data tools don't mesh well with "cloud"-y IAM, monitoring, or auditing frameworks. Data folks ssh to shared cloud workstations and of course use agent forwarding because that's what the tooling expects. They want to use EFS to share data sets even though NFS on machines where people have sudo is a bad idea / EFS is maybe a poor fit if you're thinking about governance / provenance. There's a mix of "notebooks" running locally (or on the shared workstations) and DAGs running in the cloud with bespoke access control that either doesn't map to IAM or else there's no access control so to get to the dashboard you forward a port with SSH.

It's enough to make me want to wall them off in a separate AWS account, but maybe I'm just being a grumpy old SRE. edit: as I mention downthread, this is a knee-jerk reaction and is not likely to "succeed" for whatever definition of "success" your business has.

Data Engineers Supporting Their Data Science Teammates

"By the way, we are chartered to work separately from ____ infrastructure teams, to better enable our freedom and not hinder our own progress. We’re allowed to run our own computing environments, cluster(s), olap database(s), _____, and we’ve been doing it ourselves.

Can you run/manage/provision those for us too?"

From a talk I've given a few times called, "Life of a Data Engineer"

(Google slides link: https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6x...)

> we'd be more efficient if we could get someone else to do this for us

This is spot on.. I don't mean to vilify the data folks, they were hired to do a job and unless an explicit part of that job description is "make your stuff play nice with the existing stuff" (which it never is), why would they swim against the current and spend their cycles doing things that aren't going to get them promoted / that they're not naturally inclined to care about?

The sentiment of the quoted statement really highlights that they want to be in their own world and do things the way that makes sense to them.

Again, I can't fault them for it. It just turns into an absolutely staggering problem later when the company tries to get some certifications. Maybe engineering leadership needs to see this particular iceberg earlier.

At the large-ish global company I work for, there are many siloed groups, many of which demand full control over their own isolated infrastructure, and not just for data purposes (sometimes for workflow/process automation and control, and many other reasons).

The old, oft-feared “shadow IT” is a good term for what is actually an opportunity: IT can reframe itself as the trusted service/platform/infrastructure provider and advisor, but perhaps not the only group that implements and controls every aspect of using that infrastructure going forward.

The extent to which digital technology has become fused with work, means people need more access and control over systems and tools within their daily work, but solid infrastructure, governance, etc. is still required. Clearly the trick is finding the balance between centralization and federation.

I think directions such as “citizen developer” and “data democracy” are not just fluff, but a pointer in the right direction of how to think about the newer classes of customer for information systems and technology.

I see this problem a lot, and in my experience there are at least two pieces to this puzzle - 1) many data science tools were originally desktop oriented or required specialized, siloed engines to run, causing a parallel universe of data to need to be imported into those tools and environments (for example SPSS) and 2) traditional infrastructure teams need to think of data infrastructure and architecture as a different subdiscipline running at layers 5-7 of the OSI model.

My problem as an infrastructure provider and data architect is how to provide a globally consistent, governed platform and model on top of which different classes of users have different levels of access rights to data in different forms and qualities, through different interfaces.

My 2 cents, I accept silver bars too lol :)

For what it’s worth, I don’t buy the argument that data folks should operate in an isolated infrastructure - we just need to adapt how we serve their needs, which can be quite extensive when you are talking about essentially anything ranging from someone writing highly complex algorithmic code to process large volumes of raw data (high level of support and access may be needed) vs. someone just designing a report or dashboard using just a graphical tool on top of a predefined data model (much lower infrastructure access needed).

I agree especially on these two points:

* it's counter-productive in a number of ways to create a data ghetto where they can do whatever they need to do. It doesn't engender trust or communication between Infra/SRE/DataEng/DataSci teams, and leads to "throw it over the wall" behavior.

* we need to adapt to how we serve their needs, not the other way around. It's a lot more likely to be successful if we are the ones who start bridging the gap. Data engineers are pretty specialized at enabling data scientists to do their job, they don't necessarily share the same skillset as Infra/SRE engineers.

Definitely agree - these folks are a class of customer and they are experiencing “pain”. It’s difficult to get there, as you have said, but I could not agree more, that we need to adapt to ease that pain. It’s actually a great opportunity.

We just have a big Hadoop cluster with a big Airflow instance, and nice frontends for submitting adhoc Spark jobs and SQL queries before they get checked in and productionized under Airflow.

What more do you really need?

"The Role of a Data Engineer on a Team is Complementary and Defined By The Tasks That Others Don’t (Want To) Do (Well)" -self

From a talk I've given a few times called, "Life of a Data Engineer"

(Google slides link: https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6x...)

Reposting a comment I made a few years ago on:

"We’re in the Middle of a Data Engineering Talent Shortage" https://news.ycombinator.com/item?id=12454901



I am a data engineer working on a machine learning team with models actively used as part of our product(s). From my experiences working in various contexts (applied machine learning, analytics, policy research, academics, etc...), there are several factors that contribute to this shortage: (1) "data engineering" often requires a lot of breadth and knowledge, (2) "data engineering" is often (derisively and naively) referred to as the "janitorial work" of data science, (3) the spectrum of roles and requirements within the "data engineering" domain, in terms of job descriptions, can range from database systems administration, to ETL, to data warehousing, curation of data services / APIs, business intelligence, to the design/deployment/operation of pipelines and distributed data processing and storage systems (these aren't mutually exclusive, but often job descriptions fall into one of these stovepipes).

Some of my quick thoughts and anecdata:

Companies have made large investments in creating 'data science' teams, and many of those companies have trouble realizing value from those investments.

A part of this stems from investments and teams with no tangible vision of how that team will generate value. And there are several other contributing factors…

"Dirty work." People haven't learned how to, and more often don't want to do it. There's a vast number of tutorials and boot camps out there that teach newcomers how to "learn data science" with clean datasets -- this is ideal for learning those basics, but the real world usually does not have clean or ideal datasets -- the dataset may not even exist -- and there are a number of non-ideal constraints.

There are people that wish to call themselves “data scientists” that “don’t want to write code” and would “prefer to do the analysis and storytelling”

Engineering as the application of science with real world constraints: there are a number of factors that we take into account, often acquired through painful experience, that aren’t part of these tutorials, bootcamps, or academic environments.

Many “data scientists” I’ve met have a hard time adapting to and working with these constraints (e.g. we believe that the application of data science would solve/address __ problem, but: how do we know and show that it works and is useful? what are the dependencies, and costs of developing and applying that solution? is it a one-time solution, or is it going to be a recurring application? does the solution require people? who will use it? what are the assumptions or expectations of those operators and users? is it suitable? is it maintainable? is it sustainable? how long will it take? what are the risks involved and how do we manage them? is it re-usable, and can we amortize its costs over time? is it worth doing? This is part of a methodology that comes from experience, versus what is taught in data science)

Larger teams with more people/financial/political resources can specialize and take advantage of these divisions of labor, which helps recognize the process aspects of applying data science and address some of the above

Short story: if you view data engineering as "janitorial work" you're missing the big picture

Anyone else notice that the attributes of a 'unicorn' data scientist include the traits of a 'data engineer?'

I run a team of data engineers, and over the years there has been a lot of confusion between what is a data scientist and what is a data engineer.

I draw the divide in that data scientists discover the features and the methodology, while data engineers take these insights to production. One can argue that data scientists themselves could do that, but this is constrained by domain expertise on tools (be that the depth of Spark internals or whatever) and the number of hours in the day. It's hard enough to deal with the variance of the models without also having to deal with the variance of the system.

A good data engineer is a unicorn. I define three central competencies for a data engineer:

* be a good coder: quality, maintainability, efficiency

* know how to explore the data: SQL, R, just eye the damn data feed

* know enough data science to interface with scientists

For a data engineer it's okay not to know probability theory and stats that much, but it's a must for a data scientist (running TensorFlow out of the box with no understanding of the underlying math doesn't make a data scientist, just a common butcher).

I've seen the role you're describing (taking insights to production) move to be described as a "Machine Learning Engineer", whereas Data Engineering is closer to the front end of the process, productionising the _data_ gathering and organisation. I really liked this diagram, it matches well with how I've seen roles advertised lately https://twitter.com/workera_/status/1215081851577962497

When I tried to hire data engineers under that title I got a ton of resumes from people with very poor programming skills. It wasn’t until I swapped the job title to “software engineer” and put the data engineering details in the description that I got resumes from people with appropriate skills.

The main issue with good programmers is that you need to make sure that candidates know what the job entails and are onboard with it. There are definitely complexities involved but by and large it isn’t the type of work that CS programs glorify as “interesting work”.

I was under the impression that the Data Engineer role is just the market reaction to too many Data Scientists being produced without having the necessary Programming skills to self enable their day to day work.

Reading the comments maybe I was naive.

Is Data Engineer to Data Scientist what Fullstack is to Developer, aka more work responsibilities while paying the same?

i think the scientist defines what data they need and how they want to query it and the engineer does what it takes for the data to get there.

data engineers would be like linemen for a utility company, setting up the power lines

In my experience a lot of people have the coding skills to be a data engineer but lack the ability to understand the value they can create.

I'm interviewing for a Data Engineering position right now, and one of the questions I was told to prepare for is "What is data engineering?" I think it's far more than just the data science aspects this article talks about. Data Engineering touches more aspects of your engineering projects than most people think. Curious what this crowd has to say about my idea here. Also, I'm looking for work. If you like my thoughts, hit me up.

I think there are 4 buckets of data engineering problems, each with their own challenges and solutions.

Operational Data Engineering This is the detritus that grows like weeds as parts of other projects and often isn't recognized as a data engineering problem. We need to pull a file off an FTP server or hit an API and do something with it. Next thing you know, there are dozens of these little things that are not individually hard, but having visibility into dependency trees and failure cases becomes difficult because they are spread out everywhere and it's not obvious where to look when things go wrong. Tools like Apache Airflow are a good solution even if you don't use them in other ways because they can centralize monitoring, logging, and graphs. Scaling isn't resource intensive for these tasks because they are discrete. You can fan out. The scaling challenge for this type of data engineering is really about tending your garden and keeping things coherently organized.

Business Logic Data Engineering This is processing where the data is highly structured and sometimes even ordered or sequenced. It's hard to scale because you can't just throw things into a stream and apply multiple workers. You have to have a managed process and likely shared in-memory state that collects the worker results and applies strict rules to a process. This is the opposite problem from big data. It's small data, rigidly organized, and carefully managed.

Data Science Data Engineering This is sort of classic ETL with a twist. ETL systems are typically pretty static once the E, T, and L are known quantities. But working with Data Scientists requires that your pipelines have to be pretty flexible because scientists are doing experiments. But they also have to be repeatable and comparable, which means your pipeline has to maintain version. This is also the area where you are most likely to encounter Big Data, so you have to be prepared to change your mental model and be able to use tools like Hadoop and Spark to bring compute to where your data is.

Analytics Data Engineering This is classic ETL pipelines that move data from point A to data lakes or data warehouses. The key thing to understand here is what you are modeling at the endpoint. If it's a legit data warehouse, you are modeling business processes. If you aren't doing that, you are--by definition--pushing data to a lake. Understanding your endpoint is key to choosing your reporting and analytics tools to lay on top of your data source. Data lakes are a good use case for ad-hoc, SQL-driven reporting tools like MetaBase. But if you are sitting on top of a well-structured fact/dimension type of warehouse, you will want more formal tools like Tableau, Pentaho, or Cognos.
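To make the "Operational Data Engineering" bucket above a bit more concrete, here is a minimal sketch of the idea of pulling many small jobs into one place with shared logging and an explicit dependency tree. It is not Airflow and does not use Airflow's API; the registry, the `task` decorator, and the job names are all hypothetical, and it only uses the Python standard library (`graphlib` needs Python 3.9+).

```python
import logging
from graphlib import TopologicalSorter  # standard library, Python 3.9+

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("small_jobs")

# Hypothetical central registry: task name -> (callable, upstream task names).
TASKS = {}


def task(name, depends_on=()):
    """Register a small job in one place instead of a scattered cron entry."""
    def register(func):
        TASKS[name] = (func, tuple(depends_on))
        return func
    return register


@task("fetch_file")
def fetch_file():
    # Stand-in for pulling a file off an FTP server or hitting an API.
    return ["a,1", "b,2"]


@task("parse_file", depends_on=["fetch_file"])
def parse_file():
    return "parsed"


@task("send_report", depends_on=["parse_file"])
def send_report():
    raise RuntimeError("mail server unreachable")  # simulated failure


@task("archive", depends_on=["send_report"])
def archive():
    return "archived"


def run_all(tasks):
    """Run tasks in dependency order; skip anything downstream of a failure."""
    graph = {name: deps for name, (_, deps) in tasks.items()}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        if any(status[d] != "ok" for d in deps):
            status[name] = "skipped"
            log.warning("%s skipped (upstream did not succeed)", name)
            continue
        try:
            func()
            status[name] = "ok"
            log.info("%s ok", name)
        except Exception:
            status[name] = "failed"
            log.exception("%s failed", name)
    return status


status = run_all(TASKS)
```

The point of the sketch is only that one `status` dictionary and one log stream tell you where to look when something breaks; a real scheduler adds retries, schedules, and a UI on top of the same idea.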

Good description. I've found it easy to explain to people that data scientists are often your explorers or researchers, the folks that go out and deal with raw, uncleaned, poorly modeled information, looking for relationships that are relevant to the business/study.

Data Engineers are the folks that show up once the boss says 'yeah that's good enough we want to see the result of that process/model/algorithm on an ongoing basis'...now what was likely a pile of unsystemized jupyter notebooks and excel needs to get cleaned, systemized and productionalized, preferably in tools designed to handle pipelines and scheduled jobs etc.

Mmm, I disagree. I do a fair amount of mucking around in notebooks/containers, but I also drop them (when they’re refined enough) into a pipeline where I (not another person dubbed a DE) can evaluate the integrity of the results on an ongoing basis. Though I do defer pipeline design dilemmas to our architect, I do some of the implementation / testing work myself. I firmly feel that for a data scientist to be effective in their role they must also be part data engineer (and vice versa). It’s hard to compartmentalize those duties and get results efficiently.

Interesting, I always thought of the role as collecting and cleaning data for data scientists. Are you suggesting the DS does this part, or is the DE responsible for both and the DS ‘purely’ looks for the appropriate algorithm or model?

As a DS, I collect and clean my own data (sometimes literally as they’re coming off the upload line, if I’m not building the upload pipeline in question too), serving the raw data AS WELL AS metrics/models/algorithms generated by notebooks/containers from raw data pulled in from Hive via Spark queries.

do you also instrument your own monitoring, are on call if one of your models breaks and have built the system where you "just drop a container"? If you are, then you are a data engineer too, otherwise you're standing on the shoulders of your data eng team and they make it look easy for you. Go buy them some donuts first time you go back to the office.

>>> do you also instrument your own monitoring,

Our architect wrote some cute Datadog wrappers that I implement in every pipe I roll out (and that is instrumental in diagnosing bugs). He also programmed in a call to keys in a store that times out from too many requests and hangs the whole function the wrapper is decorated on; that took me a while to pinpoint.

>>>are on call if one of your models breaks and have built the system where you "just drop a container"?

Our motto is ‘you wrote it, you fix it!’ If the container pipe works, and you dropped a bomb in it that doesn’t work, why should the DE have to pick up the DS’s garbage?

>>>If you are, then you are a data engineer too,


>>>otherwise you're standing on the shoulders of your data eng team and they make it look easy for you. Go buy them some donuts first time you go back to the office.

I actually support a few data scientists in the manner you described above where they generate some metrics notebooks or container and I have to diagnose their crap in a DE capacity (recent problem involved optimizing their pyspark code to be less memory intensive)

Then they should buy you some donuts!

Data Scientist absolutely will do data engineering as part of their job.

But Data Engineers will be able to make sure that data collection and data cleansing run optimally and successfully every day. And that's a real skill when you have upstream schemas changing, fluctuations in data volume, instabilities in systems etc.
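As one small illustration of the "upstream schemas changing" problem mentioned above, here is a sketch of a drift check that quarantines records that no longer match what a pipeline expects. The schema, column names, and function names are all made up for the example; real setups usually lean on a schema registry or a validation library instead.

```python
# Hypothetical expected schema for an upstream feed: column name -> Python type.
EXPECTED_SCHEMA = {"user_id": int, "email": str, "signup_ts": str}


def schema_drift(record, expected=EXPECTED_SCHEMA):
    """Describe how one incoming record differs from the expected schema."""
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    wrong_type = sorted(
        col for col, typ in expected.items()
        if col in record and not isinstance(record[col], typ)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


def split_batch(records):
    """Separate clean records from ones that show any kind of drift."""
    good, quarantined = [], []
    for record in records:
        drift = schema_drift(record)
        if any(drift.values()):
            quarantined.append((record, drift))
        else:
            good.append(record)
    return good, quarantined


batch = [
    {"user_id": 1, "email": "a@example.com", "signup_ts": "2020-01-01"},
    # Upstream started sending ids as strings, renamed a column, dropped another.
    {"user_id": "2", "email_address": "b@example.com"},
]
good, quarantined = split_batch(batch)
```

Quarantining instead of failing the whole run is just one possible design choice; the daily job keeps going while someone looks at what changed upstream.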

My team is hiring and your thoughts mesh with how we approach the space as well. Let me know how I can get in touch or shoot me a note to the email in bio. Remote friendly gig, the team is based out of SF.

Another helpful distinction I think here is architect != engineer, however you often see data architects that are also data engineers. I do feel there is a clear difference of focus though.

Data Engineer and Information Architect terms have both been watered down and bastardized so they are ambiguous in meaning. I hate putting them on a CV anymore.

Next topic "HTML Programmer".

I am currently available https://expert.pannous.com/

If one wants to become a data engineer, what specific vendors/technologies are increasing in demand? E.g. Databricks, Talend, Cloudera?

Here is a good infographic [1] taken from DataCamp [2].

The infographic and article show what skills and tools are relevant for a job as a web developer (more specifically doing Python Web Development) and compares them with similarly important skills and tools for data science. It includes average salary expectations and links to websites where you can both learn, practice and search for a job.

[1] https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Pyt...

[2] https://www.datacamp.com/community/blog/web-development-data...

I think your diagram is for switching to a data scientist role.

In data scientist job ads I often see companies that want a PhD, advanced stats skills, and depending on the role AI related skills. They want to see a track record in these, but will happily take a fresh PhD student who did a project that involved these. They don't want a software engineer who did some code camp courses, they want an academic.

Conversely, data engineering, I see ads wanting a cross over of big data ETL technologies and devops - i.e. pyspark, kubernetes, and depending on the role experience of scaling and productionising AI, without actually needing a deep knowledge of AI algorithms.

This could be more viable for a software engineer who did some online courses, as they specify tools experience not academic background. However, it would be difficult for an experienced software engineer to switch into an experienced data engineer role, because it is expensive to set up data infrastructure at scale, so you can't switch over with a hobby project in the same way you could e.g. switch from experienced front end to experienced full stack by showing a significant webapp. Actually might be affordable in the silicon valley bubble I guess.

> guessmyname


They are all just distributions of Spark.

You really need to know it inside and out especially how to do ETL at scale using it.

Data is a pretty major component of the programmer's craft, whether it's DBs, I/O, or blobs. Most any experienced programmer is a "Data Engineer".

You couldn't be more wrong.

Data Engineer as a term came out of the Data Science space. Which means that you will be expected to have skills around Spark, Data Lakes, ETL at scale, validation, schema management and syncing, data catalogs etc.

It's not some general skill just like you wouldn't say every programmer is a Network Engineer because they use a HTTP client.

A shorter version of how I described it above is that Data Engineers intentionally decouple logic from data so that ETL processes can be managed at scale (not necessarily talking about Big Data when I say scale. Sometimes lots of small ETL processes are just as difficult to manage as Big Data).
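One possible reading of "decouple logic from data" above, sketched in a few lines: each ETL process is described as a plain spec, and a single generic runner executes all of them, so adding a feed means adding a dict rather than another script. The job specs, transform names, and the in-memory `warehouse` are hypothetical stand-ins, not anything from the comment.

```python
import csv
import io

# Reusable transforms (the "logic"), looked up by name.
TRANSFORMS = {
    "strip": lambda row: {k: v.strip() for k, v in row.items()},
    "lower_email": lambda row: {**row, "email": row["email"].lower()},
}

# Hypothetical job specs (the "data"): inline CSV text stands in for real sources.
JOBS = [
    {
        "name": "crm_users",
        "source": "email,plan\n Ann@Example.com ,pro\n",
        "steps": ["strip", "lower_email"],
        "target": "users",
    },
    {
        "name": "billing_users",
        "source": "email,plan\nBOB@example.com,free\n",
        "steps": ["lower_email"],
        "target": "users",
    },
]


def run_job(job, warehouse):
    """Generic runner: the same code path serves every job spec."""
    rows = csv.DictReader(io.StringIO(job["source"]))  # extract
    for row in rows:
        for step in job["steps"]:  # transform
            row = TRANSFORMS[step](row)
        warehouse.setdefault(job["target"], []).append(row)  # load


warehouse = {}
for job in JOBS:
    run_job(job, warehouse)
```

With dozens of small feeds, the specs become something you can list, diff, and monitor in one place, which is roughly the "lots of small ETL processes" management problem described above.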

Don't think he is that wrong. Putting aside Spark, the remaining terms are pretty broad.

Data Lakes: Is this our new fancy term describing "data"? So what distinguishes "data lakes" from "data"?

ETL: I would guess 95% of application programs take input, parse it (extract), do some data wrangling (transform) and save the result somewhere else (load).

Validation: Again a broad term. Do you mean validation of statistical models? Without validation your predictions are worthless so I guess it is a standard thing to do if you want to do any kind of machine learning.

Schema Management and data catalogs: Standard DB stuff I would say.

We just like to define new job descriptions. It's the same with DevOps, which seems to be the new term for System Administrator.

Is data engineer just a posh title for a system analyst?

Data engineering doesnt that well

So many posts in this thread are spot on. I've heard descriptions of some tech positions being equivalent to 'internet plumbers,' well, having spent a two week rotation shadowing plumbers in my youth, I have come to think of what I do as more akin to being an 'internet garbage man.' I deal with the shit that no one else wants to deal with, or maybe more like an e-waste manager. There is gold in the shit, but no one wants to actually do the dirty work of building the system to move all the nasty sharp PCBs to somewhere that the precious metals can be extracted in a way that delicate workers won't cut themselves to pieces.

No surprise, it is hard to find people who want to do this job and are good at it. I see the demand in the academic world ('scholarly infrastructure' is a very niche place) where it is nearly impossible to hire someone who can do this work, so hearing that it is also impossible in industry means I guess it is time to start training the undergrads :/.

I have an idea for a curriculum that could teach some of the principles for this kind of work (give them the gentoo handbook for a start, and see if they can follow it to get a database up and running from a box of parts), but I suspect that mostly it would act as a way to filter out people who simply don't like the activity, and you also have to have some amount of interpersonal skills in order to understand the use cases of your colleagues ....

Anyone who cracks this problem will have solved a far more general one in the process.
