
Engineers Shouldn’t Write ETL - _ttg
https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/
======
whack
I've worked in a hedge fund in the past, where my role sounded a lot like what
the author describes as a "Data Engineer". I would have thinkers, ie people
with a lot of financial experience, come up with ideas on which datasets we
want to import from which vendors, and how we should handle the 80 different
types of corporate actions that are contained within this dataset.

I sometimes gave my own suggestions on how to improve upon their ideas, but
for the most part, I was happy to focus on implementing their ideas, in the
most clean, elegant, robust and testable manner possible. I was happy to do
the "plumbing" work of improving upon our tech stack and architecture, in
order to make the entire system better functioning and easier to maintain.

According to the author, I'm supposed to resent the fact that I'm a
"doer/plumber", and not a "thinker". In reality, it was the opposite. Do I
really want to spend my entire day reading the Bloomberg manual and figuring
out which tables/columns will give us the data we want, and the nuances of
what this dataset does and does not cover? Sorry, I have zero interest in
doing that.

I enjoy programming. I enjoy system design. I enjoy building stuff. I have
zero interest in becoming an expert on how to interpret the Bloomberg
symbology file. Besides, if I ever left the financial industry and joined a
tech company, that knowledge would become completely useless.

Did I or anyone consider myself to be a "menial" plumber? I don't think so. I
was getting paid hundreds of thousands of dollars, because the "thinkers"
recognized the value that I brought to the table. They appreciated that I
could quickly and robustly implement the ideas that they had, and keep the
system running smoothly without hiccups. They recognized anyone can do a "good
enough" job, but it's much much harder to find someone who can do a _great_
job. And for my part, I was perfectly happy to be that guy.

If you're someone who wants to expand your breadth and take on more "thinker"
responsibilities, more power to you. But just don't forget that there are
people like me out there too. There's no shame in being an excellent "doer".

~~~
internet555
What’s funny to me is how many incompetent “thinkers” appear in meetings.
Obviously, thought (even removed from implementation entirely) often has
immense value. Eg, many people spent a lot of time thinking about arithmetic,
linear algebra, floating point, compilers, and now I can go run whatever cool
algorithm on my computer. But I continually seem to run into these people who
seem borderline incompetent at anything but spewing out whatever pops into
their head. Half is nonsense, one-quarter would be actively destructive if you
tried to implement it, they always seem to know everything about everything
but whenever it’s something you know really well you can tell that they are
very confused, etc. When I meet these people now I just think “oh, you’re one
of those guys who is good at saying a lot of things” and then move on. Oh well.

~~~
DoreenMichele
You see that a lot with 2e people -- bright people with some weaknesses or a
disability. That could account for how negatively you have experienced this.
Many 2e people have never really been taught good ways to handle the
combination of big strengths and big weaknesses.

I serve as a sounding board a lot for my oldest son and that works well, but
it's not uncommon for such people to just be trying to meet their own need to
process information and/or feed their ego, oblivious to how it impacts other
people and not really welcoming of the feedback they really need for this to
be constructive. A good sounding board doesn't just listen, they ask pertinent
questions and make insightful comments that help move the thought process
along.

Sometimes when I meet people like that, I'm able to direct the conversation to
a more constructive back and forth of that sort. But some people just know
they have this need to talk, they have a lot of baggage that makes them openly
hostile to meaningful feedback and they crave validation. Anything other than
praising their half-baked ideas is met with toxic reactions. In such cases,
the best you may be able to do is basically make a few polite noises and then
disengage as quickly as possible.

~~~
emeraldd
What does "2e people" mean in this context ... That's not a term I've run
across before?

~~~
DoreenMichele
Twice exceptional.

[https://en.m.wikipedia.org/wiki/Twice_exceptional](https://en.m.wikipedia.org/wiki/Twice_exceptional)

~~~
leoc
(mobile link! D; )

------
bacon_waffle
ETL means "Extract, Transform, Load"
[https://en.wikipedia.org/wiki/Extract,_transform,_load](https://en.wikipedia.org/wiki/Extract,_transform,_load)

~~~
kartan
Thank you. I think it is good practice to introduce abbreviations correctly,
even though it is easy to forget when you work with them all the time.

"How do I introduce an abbreviation in the text? The first time you use an
abbreviation in the text, present both the spelled-out version and the short
form."
[https://blog.apastyle.org/apastyle/abbreviations/](https://blog.apastyle.org/apastyle/abbreviations/)

~~~
spc476
Or, you know, actually _use_ HTML with the <abbr> tag ...

------
jimbokun
I think this diagnoses the problem well, but ignores an obvious solution.

A team of one data scientist and one engineer, completely responsible for
building a model, and seeing it through into production, meeting all
applicable SLAs and performance metrics.

Or maybe it's two data scientists and one engineer, or one scientist and two
engineers, whatever is required.

The point is to have a small team you can hold completely accountable for
their output. They sink or swim together, so there is no debating whether the
scientists or engineers get the credit or take the blame. They are assessed by
the effectiveness of the end product they produce.

~~~
disposedtrolley
Small teams are awesome in so many scenarios. I recently wrapped up a 4-week
proof of concept for a client on knowledge management and discovery using NLP.

I was able to work with someone apt at machine learning while I focused on
building out the UI and backend. We delivered a first release about 3 days
after we started, giving ample time to seek feedback and let the users shape
the direction.

~~~
tkyjonathan
In 4 weeks, I was able to create a data mart that had self-healing (we had
issues with Python/events missing data which should have reached the existing
data warehouse), and the physical data models in it cut an existing 6-hour
ETL task down to 0.15 seconds AND cut a production query that took 5
seconds per click down to 0.07 seconds.

No team or proof of concept needed. Actual working data models, up-to-date
tables + ETL code in production.

Background is DBA.

------
evrydayhustling
This is a great read, and this is a critical sentence:

> We are not optimizing the organization for efficiency, we are optimizing for
> autonomy.

Efficiency is for production pipelines where the product is thoroughly defined
and production costs eat deeply into profit margin. Most software
organizations have massive margins - but only if they get to the right
product. Organizing people for ownership and autonomy engages their
creativity, but also ensures that the org can move forward even when one side
or the other falls behind.

------
amarshall
> There is nothing more soul sucking than writing, maintaining, modifying, and
> supporting ETL to produce data that you yourself never get to use or
> consume. Instead, give people end-to-end ownership of the work they produce
> (autonomy).

I think this is more the point than “engineers shouldn’t write ETL”: the
engineering-related department consuming the ETL’s output should likely be the
ones writing/maintaining it. Or, perhaps more generally: don’t delegate
entirely to another team if the team that cares about the result is capable of
meeting their own needs.

~~~
humbleMouse
This quote is ridiculous. Some people enjoy plumbing high-speed,
reliable/transparent data pipes. I couldn't care less what goes through the
pipes I make.

~~~
closeparen
The unglamorous ETL work is the config and query writing to apply
infrastructure building blocks to particular pairs of tables, not the creation
of the generic infrastructure.

~~~
pishpash
Exactly. They are talking about manual, custom, one-use ETL that needs to be
maintained forever. Don't nobody got time for that.

On the other hand, sometimes you can't get away from that because different
orgs/humans generate trash data in idiosyncratic forms. Things will get much
better once we pry all the human hands off of data and let engineers redesign
all of them across the world. Not going to happen soon.

------
unholyguy001
Guess that didn’t last, because look, they are hiring data engineers:

[https://www.stitchfix.com/careers?gh_jid=1252958&gh_jid=1252...](https://www.stitchfix.com/careers?gh_jid=1252958&gh_jid=1252958)

------
closeparen
The author completely lost me. Analysts produce reports. Data scientists
produce models. We don’t ask a data scientist to produce a model unless we
have a serious intention to put it in production. There are significant
engineering challenges in taking a model from the data scientist’s batch-mode
workbooks and Hadoop queries to a reliable near-real-time online service, and
the relationship can get dysfunctional, but it has nothing to do with data
scientists being BI in disguise.

------
iblaine
> March 16, 2016

It's 2018. A lot has changed since 2016. The line between sw engineering &
data engineering is much thinner.

~~~
meritt
I'm curious what innovative tools have emerged in the past two years that
changed the dynamic?

~~~
haney
Not sure what the author was referring to, but Airflow has gotten better and
gained wider adoption during that time, and my team started using DBT, which
saved a bunch of time and was new during that period.

~~~
closeparen
Airflow makes the distinction wider, as creating data pipelines requires even
less software engineering.

------
craig_asp
I've worked in BI (end-to-end: data modelling, reporting, ETL, etc.) for more
than 10 years now across various organisations, and since "data science" became
all the rage I've had the pleasure of working with a few data scientists. From
what I've seen so far, they are very good as statisticians (some of them
university lecturers), but when it comes to building ETL pipelines, I don't
think any of them could actually do it properly. Properly as in an ETL process
which connects to various data sources, writes to logs, is repeatable,
restartable and so on. It is not easy to learn how to build a proper ETL
process, and it is not easy to learn how to "do data science" correctly either.

I see it as more productive (from my personal experience) to let the "data
engineers" do the "data engineering" work (build data models, ETLs, etc.) and
let the "data scientists" do the "data science" work (build and fiddle with
statistical models). Just as with "full stack" developers and the separation
of work between "back end" and "front end" developers, it might be better to
let each do what they do best, unless you have people who can do both properly
(but often it's hard to find them, and they would actually be better in one
area or the other).

The frustration between the two camps, data "engineers" and "scientists", is
usually due to mismanagement (distinct teams doing each bit separately,
coordinated by one to many management layers) rather than suboptimal division
and allocation of labour. Small teams of two to four people which contain the
correct mix of experts would benefit from the strengths of both data
professional types, and would avoid the problems around syncing the effort.
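
To make "writes to logs, is repeatable, restartable" concrete, here is a
minimal sketch of one way to get those properties. It is an illustration only,
not anything from the comment above: the SQLite tables `target` and
`etl_checkpoint`, the function `run_etl`, and the shape of `batches` are all
hypothetical. Each batch is loaded in a single transaction together with a
checkpoint row, so a re-run skips finished batches and a failed batch leaves
nothing half-written.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_sketch")


def run_etl(conn, batches):
    """Load each batch at most once; safe to re-run after a failure.

    `batches` maps a batch id to a list of (key, value) rows.
    Table and column names are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target (k TEXT PRIMARY KEY, v INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_checkpoint (batch_id TEXT PRIMARY KEY)"
    )
    loaded = []
    for batch_id in sorted(batches):
        done = conn.execute(
            "SELECT 1 FROM etl_checkpoint WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if done:
            # Restartable: batches finished by an earlier run are skipped.
            log.info("skipping %s (already loaded)", batch_id)
            continue
        # One transaction per batch: the rows and the checkpoint commit
        # together, or (on an exception) neither does.
        with conn:
            # Repeatable: re-loading a key overwrites instead of duplicating.
            conn.executemany(
                "INSERT OR REPLACE INTO target (k, v) VALUES (?, ?)",
                batches[batch_id],
            )
            conn.execute(
                "INSERT INTO etl_checkpoint (batch_id) VALUES (?)", (batch_id,)
            )
        log.info("loaded %s (%d rows)", batch_id, len(batches[batch_id]))
        loaded.append(batch_id)
    return loaded
```

Calling `run_etl` twice with the same input loads everything the first time
and nothing the second, which is the property that makes a nightly job safe to
restart by hand.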

------
EToS
Lots of people want their key discipline to be the centre of the universe; you
see it across designers, content creators, engineers, testers, etc. The key to
any team, in my experience, is to have a healthy mixture of specialists (narrow
scope, high resolution) and polyglots (wide scope, lower resolution), and to
promote collaboration as much as possible.

------
solatic
Can't this whole thing be boiled down to "DevOps for Data
Science/Engineering"?

Different parts of the org with different skillsets and cultures practicing
empathy for each other by communicating interests in version-controlled code,
allowing for guard-railed autonomy, which leads to business agility.

Yep. Sounds about right.

> Optimize for autonomy not efficiency

Optimizing for efficiency without considering the cost of work in progress
(WIP) (irrelevant ETL models), rework (unscalable models), or unplanned work
(unscalable models that make it to production) results in company silos (data
engineering, infrastructure engineering) cheering local maxima while covering
their ass in the face of a business that's suffering from a long lead time.
Two teams with two backlogs will accomplish work exponentially faster compared
to three teams with three backlogs.

It boggles my mind how books like The Phoenix Project are not required
reading.

------
ghc
It’s a bad situation in your typical enterprise, but it’s even worse where
I’ve spent my career: working with realtime industrial data. I became
convinced that building time series data pipelines was a bad idea after many
late nights in the office fixing fragile systems that couldn’t handle real-
world complexity.

As fun as it is to build with and learn new technologies, it’s a bad idea to
build data pipelines unless you have a lot of resources and good leadership
that can make peace between all the different people who touch the data.

Unfortunately in the world of sensors and equipment there aren’t many
solutions, so I started a company (at
[https://sentenai.com](https://sentenai.com) ) to save others from my years of
struggle. It turns out it’s even harder to build a _general_ time series data
pipeline solution, but we’re making progress.

------
msencenb
Does anyone have experience with ETL as a service like StitchData (not related
to stitchfix)?

The startup I'm employed at needs some data analysis, but it is not big data,
simply a way to unify analytics into a queryable database. I'm not looking
forward to writing any ETL code, and was hoping someone here had a tool to
help.

~~~
mycelium
I would highly highly recommend ETL as service, after adopting it recently. It
substantially changes your relationship with your data sources in a really
positive way. And frankly, ETL for common data sources is code that you just
don't need to write.

I would say that you should pilot with a few ETL vendors. We currently use
Fivetran, they're fine but we've had enough burps that I cannot cold recommend
them over other vendors. I cannot for the life of me remember the details, but
I think we went with them over Stitch for pricing reasons.

~~~
georgewfraser
I'm Fivetran's CEO and I just want you to know, whatever "burps" you
experienced, these things keep me up at night and the whole team is always
striving to make the pipeline "just work". The whole vision of our product is
that you should be able to plug in and get a perfect mirror image of all your
data sources in your data warehouse. Anytime we fall short of that it drives
us crazy.

~~~
Petefine
Do you have a forum or suggestions tool at all? Fivetran has been amazing for
our new data warehouse and we're very pleased with the service, but there are a
few little (non-bug) things that would have made it even easier.

~~~
fraserharris
You can email me: fraser@fivetran.com

------
kevincennis
> Autonomy means the data scientists own that code as well. All the way into
> production.

This does not strike me as a great idea.

~~~
EdwardDiego
Yep, data science and software engineering are two very different disciplines.

------
datademon
As an undergraduate who is about to graduate with a degree in "Data Science"
this post encapsulates a lot of my worries as I move into the work world.
Should I focus on being a "thinker" a "doer" or a "plumber"? For the first
three years I was planning on being a CS major until I was denied from the
department: now the data science major is my only hope to graduate. I feel as
though my programming skills are solid: but not good enough to be on any sort
of fast paced infrastructure/devops team. On the flipside: I feel as though I
am so far behind on stats/math knowledge that it's pointless to try and become
a data scientist/analyst. I've thought about data engineering (the 'doer') as
a happy compromise between the two. However there are barely any intern or
entry level data engineering positions that I can find. The ones I do find
require knowledge of so many frameworks that I don't know where to start.
Additionally, I'm not even sure if data engineering even is a happy
compromise, especially after reading the post. Time is ticking, and sooner or
later I'm going to have to figure out what route to take, and how I want to
specialize. I go to a hyper competitive university in a hyper competitive
region of the country and I'm starting to feel like I'm falling behind and
getting lost.

If any of you older/more experienced engineers and scientists have advice or
wisdom for me, I would very much appreciate it.

~~~
nitrogen
A bit OT, but as a more experienced engineer who dropped out of school to
start a company, I'm curious: why weren't you able to get into your school's
CS program?

Don't worry too much about "falling behind". There will always be time to
learn more math or a new framework. Worry more about finding that first job,
any job, then you can branch out once inside the industry. Networking beats
recruiters beats sending a resume, so try to find a friend who already works
where you want to be.

~~~
datademon
I did poorly on a math class that was required to declare the major. It's
ironic since now that I'm in the data science major, I have to do even more
math classes and fewer programming classes.

I would love to do my own startup. I have a few ideas floating around. But I
feel like I lack the discipline to sit down every day and force myself to work
on them without external deadlines/pressure.

In terms of jumping into the tech industry: I understand the advice about
looking for any job when starting out. It just seems that even a lot of the
entry level jobs are very specialized.

~~~
nitrogen
I'd recommend against doing a startup straight out of school unless you get
accepted into a notable accelerator with a solid cofounder. Apply for the
seemingly specialized jobs anyway, the worst they can do is say no.

------
yannis7
arrogant and pompous uses of terms "mediocre, soul-sucking, etc etc", while
the fundamental ideas of the article range from trivial to minimal-value

------
polm23
Previously.

[https://news.ycombinator.com/item?id=11312243](https://news.ycombinator.com/item?id=11312243)

------
Roritharr
My father-in-law is such a "thinker". He has been since the 70s and worked on
all kinds of projects, from IBM mainframes up to Hadoop and Kafka, for
insurance companies and telcos.

It's ridiculous to me how hard it is for him to find a new job at 60. He
financially doesn't have to, but he wants to train younger guys on how to deal
with all the weirdness one encounters in ETL Jobs.

------
protomyth
_Report Developers, on the other hand, are folks who have made a career around
designing reports in a specific tool (e.g. Microstrategy, et al). They are
specialists._

Is this the common perception, because it really doesn't line up with my
experience?

~~~
bigger_cheese
At least in my org, reports are pretty much an afterthought, left to the data
engineers (like me) to "take this metric I've developed" and display it on the
morning report.

Writing/updating a report is the easiest part of my job. It's the data that
goes into building it that is hard: translating the "simple metric I've
developed" and getting it to run in a robust, automated and sane fashion is the
difficult part.

The complexities in my org are twofold.

Firstly, the infrastructure people don't get data at all. They speak PLCs and
HMIs; to them it's all OPC, and magic A2A messaging takes care of everything.
All data is time series to them and it all goes into a historian (which is
basically a giant ring buffer, i.e. it gets flushed periodically). Anything
beyond that is past their level of expertise.

The data needs to be batched together: the time series information has to be
processed into "event frames", e.g. this data was all part of this sequence of
conveyor belt movements. Then you need to link it to related events etc. and
archive it in some kind of sane fashion, so that in six months' time, if there
is a product defect or something like that, you can trace the entire series of
event frames for that particular production batch.
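
As a rough illustration of the "event frame" step described above, the sketch
below collapses time-ordered samples into frames and then traces one
production batch. Everything in it is assumed for the example: the
`(timestamp, batch_id, state)` sample layout, the `EventFrame` fields, and the
rule that a new frame starts whenever the batch or the state changes. A real
historian export will look different.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class EventFrame:
    batch_id: str
    state: str
    start: float
    end: float
    samples: int


def to_event_frames(readings):
    """Collapse time-ordered (timestamp, batch_id, state) samples into frames.

    A new frame starts whenever the batch id or the state changes.
    """
    frames = []
    # groupby only merges *consecutive* samples with the same key, which is
    # what we want for time-ordered data.
    for (batch_id, state), group in groupby(readings, key=lambda r: (r[1], r[2])):
        rows = list(group)
        frames.append(
            EventFrame(batch_id, state, rows[0][0], rows[-1][0], len(rows))
        )
    return frames


def trace_batch(frames, batch_id):
    """Return every frame recorded for one production batch, in time order."""
    return [f for f in frames if f.batch_id == batch_id]
```

Once the frames are archived somewhere durable, `trace_batch` is the kind of
lookup you would run months later when a defect is reported against a batch.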

Secondly, the people the article calls "data scientists" (in my org these are
Engineers - real ones of the Chem and Mech variety) don't know anything about
databases or handling data. They prototype their metrics in Matlab, Fortran,
Excel and the like.

You really need someone to translate their code into something sane that can
be automated. Engineers are not taught to code at all. I know: I studied
engineering at university, where Fortran is the lingua franca. Code is just a
way of representing mathematics. Asking these people to do all the data
processing pipeline is just not going to happen. It's not their job. They write
the simulations and models; they have the domain knowledge, and that's what's
important for them to be worrying about.

~~~
protomyth
Ok, I think I'm getting the specifics of this situation. So, we are talking
about internal reports, not something that could actually get in the external
customer's hands.

~~~
bigger_cheese
Yes this is internal stuff. I work at a large industrial manufacturing plant.

Reports that go externally are done by certified people. (Laboratory
technicians for product specifications and finance analysts for stock market
stuff).

~~~
protomyth
I’ve done external reports for clinical trials and agriculture, and I guess
they weren’t as up on getting certifications. Thanks for the very detailed
replies.

------
didibus
> We strive to lead the business with our output rather than to inform it

I think the business hires data scientists to be informed. Not to make business
decisions on their behalf.

> Data scientists love working on problems that are vertically aligned with
> the business and make a big impact on the success of projects/organization
> through their efforts. They set out to optimize a certain thing or process
> or create something from scratch. These are point-oriented problems and
> their solutions tend to be as well. They usually involve a heavy mix of
> business logic, reimagining of how things are done, and a healthy dose of
> creativity

Again, I'm confused? That sounds like the data scientists should have majored
in business then. If data scientists start doing that, what will all the other
business folk do then?

Data scientists should just build out reports that provide valuable insights
and potential patterns that can help make business decisions. The difference
from the earlier report engineers, data analysts or whatever is that a data
scientist is assumed to be able to run statistical and/or pattern analysis
over the data, whereas a data analyst previously only needed to perform basic
versions of that, which did not go beyond what SQL could do.

The data engineer should enable the data scientist to perform this analysis,
both by working with the software engineers to acquire the data safely,
securely, reliably and at scale, and by working with the data scientist to
apply his statistical analysis efficiently and at scale to a possibly very
large data set. Finally, he might need to work with both software engineer and
data scientist to set up real-time or close to real-time versions of the
analysis.

All results from the analysis should be presented (aka reported) to the
business. The data scientist can suggest interpretations or ideas to address
findings, but it's the business's role to make tactical and strategic decisions
about business processes and products.

And if you're doing ML as part of a process, then you need ML scientists.
Say you need to build out voice recognition, or the like. Basically comp sci
or math majors with ML master's degrees or PhDs.

------
motymichaely
Different parts of engineering require different skill sets. Someone has to do
the data engineering part (be it the data scientist, data engineer, ops,
whatever). This requirement hasn't changed since 2016: 50 to 90 percent of
time is spent "cleaning" data for analytics. You just need engineers with the
right skills and tools to help reduce this time and get things done.

[https://s3.amazonaws.com/xplenty-assets/infographics/raw_dat...](https://s3.amazonaws.com/xplenty-assets/infographics/raw_data_cleaning_is_killing_bi.pdf)

------
technofiend
It's _not_ in my experience performant but Pentaho is definitely ETL-for-
dummies easy to use. Similar to your average user pivoting data in Excel
rather than learning Python or R, sometimes having a tool with suboptimal
performance is better than optimizing an adhoc or short term process.

~~~
dgudkov
If you're looking for a _real_ ETL-for-dummies, take a look at my EasyMorph
([https://easymorph.com](https://easymorph.com)). We've made a number of
simplifications that specifically target "dummy" users, e.g. columns may mix
values of different types (text, numbers, etc.).

~~~
v4n4d1s
Thanks for developing easymorph! Free version helped me through my bachelors
degree. It's my go-to tool to introduce people to ETL and similar concepts.

~~~
dgudkov
You're welcome! Great to hear it happened to be of help :)

------
ArchTypical
I wrote plenty of ETL. Maintained high throughput using whatever I could find.
Then I got another job and had to write ETL for AdTech, where the volume is
unlimited. Nothing about it is surprising or hard. Engineers are great at
handling known data and transforms, then adapting to unknown data.

------
Annatar
I see this every day in my job. He so nailed the problems. Data scientists
must be made responsible and accountable end-to-end for their solutions. And
they must be grilled on operational deployability and maintainability before,
during and after deployment. They have to become accountable.

------
dagw
I often really enjoy it when I get a chance to do ETL work. The 'T' in ETL can
many times involve some pretty fun and creative challenges. And even in the
general case there is something really satisfying about putting together a
clever and well constructed ETL pipeline.

------
Plough_Jogger
(2016) tag.

------
jgalt212
The counterpoint to this, and the economic value therein, is PG's Schlep
Blindness piece.

[http://www.paulgraham.com/schlep.html](http://www.paulgraham.com/schlep.html)

------
tkyjonathan
ETLs, physical data modelling and data marts/warehouses used to be part of the
database admin's job in small to medium-sized companies, handled largely with
ETL tools or just SQL.

~~~
just_myles
Yup. That's been my experience. The DBA used to handle all these tasks and, as
of I don't know, about 5 years ago, it's been split off into data engineering.
I think in this case it's a good thing. I always considered those
non-administrative tasks.

------
dwaltrip
> Enable Everyone to be Best in the World

This particular line really rubs me the wrong way.

Do the best you can... You won't be the best in the world, but you can still
have a positive impact.

------
marcell
Article from 2016

------
finnley
There's a huge difference between writing ETL to apply business logic vs.
grabbing the data from a common API like Google Analytics. There's no tool in
the world
that can write all the logic you need to transform data the way your
organization uses it unless you have an extremely simple, common use case.

What this article is really saying is that replicating your data from source
apps shouldn't be manually coded. The harder part still needs someone to write
code so business users don't need to.
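
The distinction above can be sketched in a few lines. This is only an
illustration under made-up assumptions: `replicate`, `apply_business_logic`,
the `email` field and the "internal domains are not customers" rule are all
hypothetical. The point is that the first function is generic enough for a
tool or vendor to own, while the second encodes a rule that only someone who
knows the organization can write.

```python
def replicate(source_rows):
    """Generic step: copy rows from a source as-is.

    Nothing here depends on what the rows mean, so a tool can own it.
    """
    return [dict(row) for row in source_rows]


def apply_business_logic(rows, internal_domains):
    """Organization-specific step: someone who knows the business writes this.

    Hypothetical rule: rows whose email belongs to an internal domain
    do not count as customer traffic.
    """
    return [
        {**row, "is_customer": row["email"].split("@")[-1] not in internal_domains}
        for row in rows
    ]
```

Swapping the data source only touches `replicate`; changing what counts as a
customer only touches `apply_business_logic`.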

------
corporateguy5
This article matches my experience exactly. Some companies will hire a “data
scientist” on pedigree. They will be low on skills and high on charisma. The
engineers are burdened with implementing the ideas as well as shouldering the
failure of the algorithms: “You spent the last few months implementing
algorithms and none worked?” Very little blame will go to the data scientist.
In tons of cases data scientists are more like product managers.

