
The Need for Data Engineers - eaguyhn
https://thenewstack.io/data-scientists-get-the-glamour-but-there-is-also-a-need-for-data-engineers/
======
FZ1
I feel the term "Data Engineer" gets used for a lot of catch-all "we have
problems that need an owner" situations.

There's not much consistency across job postings and interviews for this kind
of thing.

I just interviewed for one "Data Engineer" position which consisted of nearly
100% stored procedures. No one knew what else to call it, and they didn't want
to advertise for a DBA, because there were no real DBA responsibilities. So
"Data Engineer" was chosen.

Another "Data Engineer" position was almost entirely Spark. There was no SQL
involved - they expected all applicants to be Spark experts, with a deep
knowledge of Scala.

It's hard to know what to expect out of "Data Engineer" positions until you
walk into a place and start asking questions in the interview.

~~~
closeparen
For us it's simply the runtime you target. A frontend engineer writes code for
the browser. A mobile engineer writes code for iOS or Android. A backend
engineer writes code for servers. A data engineer writes code for the
analytical/warehouse environment.

It was typically backend engineers doing the data pipelines to expose records
from their own services, until the company got bigger and the analysts got
more demanding, and that became a dedicated role.

~~~
moandcompany
Engineers are professionals in the business of solving problems :)

~~~
walshemj
Scientists with thumbs was one definition I heard

------
philsnow
I would pay in solid gold for a data engineer that knew how to glue <the
things the data scientists need> to <the rest of the infrastructure> in a way
that fixes the impedance mismatch that seems to exist in the tooling.

In my experience data tools don't mesh well with "cloud"-y IAM, monitoring, or
auditing frameworks. Data folks ssh to shared cloud workstations and of course
use agent forwarding because that's what the tooling expects. They want to use
EFS to share data sets even though NFS on machines where people have sudo is a
bad idea / EFS is maybe a poor fit if you're thinking about governance /
provenance. There's a mix of "notebooks" running locally (or on the shared
workstations) and DAGs running in the cloud with bespoke access control that
either doesn't map to IAM, or else there's no access control at all, so to get
to the dashboard you forward a port over SSH.

It's enough to make me want to wall them off in a separate AWS account, but
maybe I'm just being a grumpy old SRE. _edit: as I mention downthread, this is
a knee-jerk reaction and is not likely to "succeed" for whatever definition of
"success" your business has._

~~~
sixdimensional
I see this problem a lot, and in my experience there are at least two pieces
to this puzzle - 1) many data science tools were originally desktop oriented
or required specialized, siloed engines to run, causing a parallel universe of
data to need to be imported into those tools and environments (for example
SPSS) and 2) traditional infrastructure teams need to think of data
infrastructure and architecture as a different subdiscipline running at layers
5-7 of the OSI model.

My problem as an infrastructure provider and data architect is how to provide
a globally consistent, governed platform and model on top of which different
classes of users have different levels of access rights to data in different
forms and qualities, through different interfaces.

My 2 cents, I accept silver bars too lol :)

For what it’s worth, I don’t buy the argument that data folks should operate
in an isolated infrastructure - we just need to adapt how we serve their
needs, which can be quite extensive when you are talking about essentially
anything ranging from someone writing highly complex algorithmic code to
process large volumes of raw data (high level of support and access may be
needed) vs. someone just designing a report or dashboard using just a
graphical tool on top of a predefined data model (much lower infrastructure
access needed).

~~~
philsnow
I agree especially on these two points:

* it's counter-productive in a number of ways to create a data ghetto where they can do whatever they need to do. It doesn't engender trust or communication between Infra/SRE/DataEng/DataSci teams, and leads to "throw it over the wall" behavior.

* we need to adapt to how we serve their needs, not the other way around. It's a lot more likely to be successful if we are the ones who start bridging the gap. Data engineers are pretty specialized at enabling data scientists to do their job, they don't necessarily share the same skillset as Infra/SRE engineers.

~~~
sixdimensional
Definitely agree - these folks are a class of customer and they are
experiencing “pain”. It’s difficult to get there, as you have said, but I
could not agree more, that we need to adapt to ease that pain. It’s actually a
great opportunity.

------
moandcompany
"The Role of a Data Engineer on a Team is Complementary and Defined By The
Tasks That Others Don’t (Want To) Do (Well)" -self

From a talk I've given a few times called, "Life of a Data Engineer"

(Google slides link:
[https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6x...](https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6xlu4ATGEfEsU84xW3UMcIdd8/edit?usp=sharing))

~~~
moandcompany
Reposting a comment I made a few years ago on:

"We’re in the Middle of a Data Engineering Talent Shortage"
[https://news.ycombinator.com/item?id=12454901](https://news.ycombinator.com/item?id=12454901)

----------

(2016)

I am a data engineer working on a machine learning team with models actively
used as part of our product(s). From my experiences working in various
contexts (applied machine learning, analytics, policy research, academics,
etc...), there are several factors that contribute to this shortage: (1)
"data engineering" often requires a lot of breadth of knowledge, (2) "data
engineering" is often (derisively and naively) referred to as the "janitorial
work" of data science, and (3) the spectrum of roles and requirements within
the "data engineering" domain, in terms of job descriptions, can range from
database systems administration, to ETL, to data warehousing, to curation of
data services / APIs, to business intelligence, to the design/deployment/
operation of pipelines and distributed data processing and storage systems
(these aren't mutually exclusive, but job descriptions often fall into one of
these stovepipes).

Some of my quick thoughts and anecdata:

Companies have made large investments in creating 'data science' teams, and
many of those companies have trouble realizing value from those investments.

A part of this stems from investments and teams with no tangible vision of how
that team will generate value. And there are several other contributing
factors…

"Dirty work": people haven't learned how to do it, and more often simply don't
want to. There's a vast number of tutorials and boot camps out there that teach
newcomers how to "learn data science" with clean datasets -- this is ideal for
learning those basics, but the real world usually does not have clean or ideal
datasets -- the dataset may not even exist -- and there are a number of non-
ideal constraints.

There are people who wish to call themselves “data scientists” but “don’t
want to write code” and would “prefer to do the analysis and storytelling.”

Engineering as the application of science with real world constraints: there
are a number of factors that we take into account, often acquired through
painful experience, that aren’t part of these tutorials, bootcamps, or
academic environments.

Many “data scientists” I’ve met have a hard time adapting to and working with
these constraints (e.g. we believe that the application of data science would
solve/address __ problem, but: how do we know and show that it works and is
useful? what are the dependencies, and costs of developing and applying that
solution? is it a one-time solution, or is it going to be a recurring
application? does the solution require people? who will use it? what are the
assumptions or expectations of those operators and users? is it suitable? is
it maintainable? is it sustainable? how long will it take? what are the risks
involved and how do we manage them? is it re-usable, and can we amortize its
costs over time? is it worth doing? This is part of a methodology that comes
from experience, versus what is taught in data science)

Larger teams with more people/financial/political resources can specialize and
take advantage of these divisions of labor, which helps recognize the process
aspects of applying data science and address some of the above

Short story: if you view data engineering as "janitorial work" you're missing
the big picture

Anyone else notice that the attributes of a 'unicorn' data scientist include
the traits of a 'data engineer?'

------
tumanian
I run a team of data engineers, and over the years there has been a lot of
confusion between what is a data scientist and what is a data engineer.

I draw the divide in that data scientists discover the features and the
methodology, while data engineers take these insights to production. One can
argue that data scientists could do that themselves, but this is constrained
by domain expertise on tools (be that the depth of Spark internals or
whatever) and the number of hours in the day. It's hard enough to deal with
the variance of the models, let alone the variance of the system.

A good data engineer is a unicorn. I define three central competencies for a
data engineer: _be a good coder_ (quality, maintainability, efficiency), _know
how to explore the data_ (SQL, R, just eyeballing the damn data feed), and
_know enough data science to interface with scientists_.

For a data engineer it's okay not to know probability theory and stats that
deeply, but it's a must for a data scientist (running TensorFlow out of the
box with no understanding of the underlying math doesn't make a data
scientist, just a common butcher).

~~~
dtjohnnyb
I've seen the role you're describing (taking insights to production) move to
be described as a "Machine Learning Engineer", whereas Data Engineering is
closer to the front end of the process, productionising the _data_ gathering
and organisation. I really liked this diagram, it matches well with how I've
seen roles advertised lately
[https://twitter.com/workera_/status/1215081851577962497](https://twitter.com/workera_/status/1215081851577962497)

------
bradleyjg
When I tried to hire data engineers under that title I got a ton of resumes
from people with very poor programming skills. It wasn’t until I swapped the
job title to “software engineer” and put the data engineering details in the
description that I got resumes from people with appropriate skills.

The main issue with good programmers is that you need to make sure that
candidates know what the job entails and are onboard with it. There are
definitely complexities involved but by and large it isn’t the type of work
that CS programs glorify as “interesting work”.

------
theK
I was under the impression that the Data Engineer role is just the market's
reaction to too many data scientists being produced without the programming
skills needed to self-enable their day-to-day work.

Reading the comments maybe I was naive.

------
ldng
Data Engineer is to Data Scientist what Fullstack is to Developer, aka more
work and responsibilities for the same pay?

~~~
devmunchies
I think the scientist defines what data they need and how they want to query
it, and the engineer does what it takes to get the data there.

data engineers would be like linemen for a utility company, setting up the
power lines

------
ianamartin
I'm interviewing for a Data Engineering position right now, and one of the
questions I was told to prepare for is "What is data engineering?" I think
it's far more than just the data science aspects this article talks about.
Data Engineering touches more aspects of your engineering projects than most
people think. Curious what this crowd has to say about my idea here. Also, I'm
looking for work. If you like my thoughts, hit me up.

I think there are 4 buckets of data engineering problems, each with their own
challenges and solutions.

Operational Data Engineering: This is the detritus that grows like weeds as
parts of other projects and often isn't recognized as a data engineering
problem. We need to pull a file off an FTP server or hit an API and do
something with it. Next thing you know, there are dozens of these little
things that are not individually hard, but having visibility into dependency
trees and failure cases becomes difficult because they are spread out
everywhere and it's not obvious where to look when things go wrong. Tools like
Apache Airflow are a good solution even if you don't use them in other ways
because they can centralize monitoring, logging, and graphs. Scaling isn't
resource intensive for these tasks because they are discrete. You can fan out.
The scaling challenge for this type of data engineering is really about
tending your garden and keeping things coherently organized.
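
The "one DAG, one place to look" idea that Airflow formalizes is, at heart,
topological ordering of jobs. A toy stdlib-only sketch (the job names here are
hypothetical, and this is not Airflow's API):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Map each job to the set of jobs it depends on.
jobs = {
    "pull_sftp_file": set(),
    "call_vendor_api": set(),
    "merge_feeds": {"pull_sftp_file", "call_vendor_api"},
    "load_reporting_table": {"merge_feeds"},
}

# static_order() yields jobs so that every dependency runs first --
# exactly the guarantee a scheduler gives you once all the scattered
# little scripts live in one graph.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

The point isn't the ten lines of code, it's that once every stray FTP pull and
API call is a node in one graph, monitoring and failure visibility come along
for free.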

Business Logic Data Engineering: This is processing where the data is highly
structured and sometimes even ordered or sequenced. It's hard to scale because
you can't just throw things into a stream and apply multiple workers. You have
to have a managed process and likely shared in-memory state that collects the
worker results and applies strict rules to a process. This is the opposite
problem from big data. It's small data, rigidly organized, and carefully
managed.
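
A minimal sketch of that managed, shared-state pattern (assuming results carry
sequence numbers; names are illustrative): you can still fan the work out, but
results are applied strictly in order:

```python
def emit_in_order(results):
    """Workers finish out of order; apply their results strictly by
    sequence number, buffering anything that arrives early."""
    buffer = {}
    next_seq = 0
    ordered = []
    for seq, value in results:     # results arrive in completion order
        buffer[seq] = value
        while next_seq in buffer:  # flush every contiguous prefix
            ordered.append(buffer.pop(next_seq))
            next_seq += 1
    return ordered

# Worker 2 finishes first, but the output is applied in sequence.
print(emit_in_order([(2, "c"), (0, "a"), (1, "b")]))  # ['a', 'b', 'c']
```

That in-memory buffer is the "managed process" part: it is small data, but the
ordering rule means you can't just throw it at a stream of independent workers.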

Data Science Data Engineering: This is sort of classic ETL with a twist. ETL
systems are typically pretty static once the E, T, and L are known quantities.
But working with data scientists requires that your pipelines be pretty
flexible, because scientists are doing experiments. They also have to be
repeatable and comparable, which means your pipeline has to maintain
versioning. This is also the area where you are most likely to encounter Big
Data, so you have to be prepared to change your mental model and be able to
use tools like Hadoop and Spark to bring compute to where your data is.
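
One sketch of what "maintain versioning" can mean in practice (the function
and parameter names here are illustrative, not any particular tool's API): tag
every experiment's output with a hash of the code revision and parameters, so
two runs are comparable exactly when their tags match.

```python
import hashlib
import json

def run_tag(pipeline_code_rev: str, params: dict) -> str:
    """Deterministic tag for one pipeline run: same code revision and
    same parameters always hash to the same tag."""
    canonical = json.dumps({"rev": pipeline_code_rev, "params": params},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

tag_a = run_tag("abc123", {"features": ["age", "spend"], "window_days": 30})
tag_b = run_tag("abc123", {"window_days": 30, "features": ["age", "spend"]})
assert tag_a == tag_b  # same experiment, same tag, regardless of key order
```

Writing outputs under such tags is one simple way to keep a scientist's
experiments both flexible and repeatable without freezing the pipeline.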

Analytics Data Engineering: These are classic ETL pipelines that move data from
point A to data lakes or data warehouses. The key thing to understand here is
what you are modeling at the endpoint. If it's a legit data warehouse, you are
modeling business processes. If you aren't doing that, you are, by definition,
pushing data to a lake. Understanding your endpoint is key to choosing your
reporting and analytics tools to lay on top of your data source. Data lakes
are a good use case for ad-hoc, SQL-driven reporting tools like MetaBase. But
if you are sitting on top of a well-structured fact/dimension type of
warehouse, you will want more formal tools like Tableau, Pentaho, or Cognos.
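
A tiny illustration of the fact/dimension distinction (the table contents are
made up): measures live in the fact table alongside surrogate keys into the
dimensions, and a report is just a join through those dimensions.

```python
# Dimension tables: surrogate key -> descriptive attributes.
dim_date = {1: {"month": "2020-01"}, 2: {"month": "2020-02"}}
dim_product = {10: {"category": "widgets"}, 11: {"category": "gadgets"}}

# Fact table: one row per business event, measures plus dimension keys.
fact_sales = [
    {"date_key": 1, "product_key": 10, "revenue": 100.0},
    {"date_key": 1, "product_key": 11, "revenue": 40.0},
    {"date_key": 2, "product_key": 10, "revenue": 60.0},
]

# "Revenue by month and category" is a join through the dimensions.
report = {}
for row in fact_sales:
    key = (dim_date[row["date_key"]]["month"],
           dim_product[row["product_key"]]["category"])
    report[key] = report.get(key, 0.0) + row["revenue"]

print(report)
# {('2020-01', 'widgets'): 100.0, ('2020-01', 'gadgets'): 40.0,
#  ('2020-02', 'widgets'): 60.0}
```

Tools like Tableau assume you've modeled the business this way; a lake gives
you no such structure, which is why ad-hoc SQL tools fit it better.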

~~~
thedudeabides5
Good description. I've found it easy to explain to people that data scientists
are often your explorers or researchers, the folks that go out and deal with
raw, uncleaned, poorly modeled information, looking for relationships that are
relevant to the business/study.

Data Engineers are the folks that show up once the boss says 'yeah that's good
enough we want to see the result of that process/model/algorithm on an ongoing
basis'...now what was likely a pile of unsystemized Jupyter notebooks and
Excel needs to get cleaned, systemized, and productionalized, preferably in
tools designed to handle pipelines and scheduled jobs, etc.

~~~
exdsq
Interesting, I always thought of the role as collecting and cleaning data for
data scientists. Are you suggesting the DS does this part, or is the DE
responsible for both and the DS ‘purely’ looks for the appropriate algorithm
or model?

~~~
Feyn_man_
As a DS, I collect and clean my own data (sometimes literally as they’re
coming off the upload line, if I’m not building the upload pipeline in
question too), serving the raw data as well as the metrics/models/algorithms
generated by notebooks/containers from raw data pulled from Hive via Spark
queries.

~~~
luckydata
do you also instrument your own monitoring, are on call if one of your models
breaks and have built the system where you "just drop a container"? If you
are, then you are a data engineer too, otherwise you're standing on the
shoulders of your data eng team and they make it look easy for you. Go buy
them some donuts the first time you go back to the office.

~~~
Feyn_man_
>>> do you also instrument your own monitoring,

Our architect wrote some cute Datadog wrappers that I implement in every pipe
I roll out (and they are instrumental in diagnosing bugs). He also programmed
in a call to keys in a store that times out from too many requests and hangs
the whole function the wrapper is decorated on; that took me a while to
diagnose and pinpoint.

>>>are on call if one of your models breaks and have built the system where
you "just drop a container"?

Our motto is ‘you wrote it, you fix it!’ If the container pipe works, and you
dropped a bomb in it that doesn’t work, why should the DE have to pick up the
DS’s garbage?

>>>If you are, then you are a data engineer too,

:D

>>>otherwise you're standing on the shoulders of your data eng team and they
make it look easy for you. Go buy them some donuts first time you go back to
the office.

I actually support a few data scientists in the manner you described above,
where they generate some metrics notebooks or containers and I have to
diagnose their crap in a DE capacity (a recent problem involved optimizing
their PySpark code to be less memory-intensive).

~~~
luckydata
Then they should buy you some donuts!

------
haffi112
In my experience a lot of people have the coding skills to be a data engineer
but lack the ability to understand the value they can create.

------
sixdimensional
Another helpful distinction here, I think, is that architect != engineer;
however, you often see data architects who are also data engineers. I do feel
there is a clear difference of focus, though.

------
slowhand09
Data Engineer and Information Architect terms have both been watered down and
bastardized so they are ambiguous in meaning. I hate putting them on a CV
anymore.

Next topic "HTML Programmer".

------
singularity2001
I am currently available
[https://expert.pannous.com/](https://expert.pannous.com/)

------
AznHisoka
If one wants to become a data engineer, what specific vendors/technologies are
increasing in demand? E.g. Databricks, Talend, Cloudera?

~~~
guessmyname
Here is a good infographic [1] taken from DataCamp [2].

The infographic and article show what skills and tools are relevant for a job
as a web developer _(more specifically doing Python Web Development)_ and
compares them with similarly important skills and tools for data science. It
includes average salary expectations and links to websites where you can
learn, practice, and search for a job.

[1]
[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Pyt...](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Web_Development_Data_Science_Full.png)

[2] [https://www.datacamp.com/community/blog/web-development-
data...](https://www.datacamp.com/community/blog/web-development-data-science)

~~~
shubb
I think your diagram is for switching to a data scientist role.

In data scientist job ads I often see companies that want a PhD, advanced
stats skills, and, depending on the role, AI-related skills. They want to see
a track record in these, but will happily take a fresh PhD graduate who did a
project that involved them. They don't want a software engineer who did some
code camp courses; they want an academic.

Conversely, data engineering, I see ads wanting a cross over of big data ETL
technologies and devops - i.e. pyspark, kubernetes, and depending on the role
experience of scaling and productionising AI, without actually needing a deep
knowledge of AI algorithms.

This could be more viable for a software engineer who did some online courses,
as they specify tools experience not academic background. However, it would be
difficult for an experienced software engineer to switch into an experienced
data engineer role, because it is expensive to set up data infrastructure at
scale, so you can't switch over with a hobby project in the same way you could
e.g. switch from experienced front end to experienced full stack by showing a
significant webapp. Actually, it might be affordable in the Silicon Valley
bubble, I guess.

------
angel_j
Data is a pretty major component of the programmer's craft, whether it's DBs,
I/O, or blobs. Most any experienced programmer is a "Data Engineer".

~~~
threeseed
You couldn't be more wrong.

Data Engineer as a term came out of the Data Science space. Which means that
you will be expected to have skills around Spark, Data Lakes, ETL at scale,
validation, schema management and syncing, data catalogs etc.

It's not some general skill, just like you wouldn't say every programmer is a
Network Engineer because they use an HTTP client.

~~~
ianamartin
A shorter version of how I described it above is that Data Engineers
intentionally decouple logic from data so that ETL processes can be managed at
scale (not necessarily talking about Big Data when I say scale. Sometimes lots
of small ETL processes are just as difficult to manage as Big Data).
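
One way to picture that decoupling, as a sketch with hypothetical feed names:
the logic is a single generic runner, and each of the many small ETL processes
is just a data spec it consumes.

```python
# Each feed is described declaratively; adding a feed means adding a spec,
# not writing another one-off script.
SPECS = [
    {"name": "orders",  "rename": {"amt": "amount"}, "drop": ["tmp_col"]},
    {"name": "refunds", "rename": {"val": "amount"}, "drop": []},
]

def transform(record: dict, spec: dict) -> dict:
    """Generic runner: rename columns per the spec, then drop the unwanted ones."""
    out = {spec["rename"].get(k, k): v for k, v in record.items()}
    for col in spec["drop"]:
        out.pop(col, None)
    return out

print(transform({"amt": 5, "tmp_col": 1}, SPECS[0]))  # {'amount': 5}
```

At scale this is the difference between managing dozens of specs and managing
dozens of hand-written pipelines.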

------
sys_64738
Is data engineer just a posh title for a systems analyst?

------
corporateslave5
Data engineering doesn't pay that well

------
tgbugs
So many posts in this thread are spot on. I've heard descriptions of some tech
positions as equivalent to 'internet plumbers.' Well, having spent a two-week
rotation shadowing plumbers in my youth, I have come to think of what I do as
more akin to being an 'internet garbage man.' I deal with the shit that no one
else wants to deal with, or maybe more like an e-waste manager. There is gold
in the shit, but no one wants to actually do the dirty work of building the
system to move all the nasty sharp PCBs to somewhere the precious metals can
be extracted in a way that delicate workers won't cut themselves to pieces.

No surprise, it is hard to find people who want to do this job and are good at
it. I see the demand in the academic world ('scholarly infrastructure' is a
very niche place) where it is nearly impossible to hire someone who can do
this work, so hearing that it is also impossible in industry means I guess it
is time to start training the undergrads :/.

I have an idea for a curriculum that could teach some of the principles for
this kind of work (give them the gentoo handbook for a start, and see if they
can follow it to get a database up and running from a box of parts), but I
suspect that mostly it would act as a way to filter out people who simply
don't like the activity, and you also have to have some amount of
interpersonal skills in order to understand the use cases of your colleagues
....

Anyone who cracks this problem will have solved a far more general one in the
process.

