
Engineers Shouldn’t Write ETL - mjohn
http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl
======
mikestew
_... a highly specialized team of dedicated engineers...If they are not bored,
chances are they are pretty mediocre. Mediocre engineers really excel at
building enormously over complicated, awful-to-work-with messes they call
“solutions”._

OMG, the author just described the last place I was at. Processed a few Tb of
data and suddenly there's this R. Goldbergesque system of MongoDb getting
transformed into PostGres...oh, wait, I need Cassandra on my resume, so Mongo
sux0r now...displayed with some of the worst bowl of spaghetti Android code
I've witnessed. The technical debt hole was dug so deep you could hide an
Abrams tank in there. To this day I could not tell you the confused thinking
that led them to believe this was necessary rather than just slapping it all
into PostGres and calling it a day.

All because they were processing data sets sooooo huge, that they would fit on
my laptop.

I quit reading about the time the article turned into a pitch for Stitch Fix,
but leading up to that point it made a good case for what happens when
companies think they have "big data" when they really don't. In summary,
either a company hires skills they don't really need and the hires end up
bored, or you hire mediocre people that make the convoluted mess I worked
with.

~~~
darksaints
This is so true. I do business intelligence at Amazon, and I've seen this play
out millions of times over. The fetishization of big data ends up meaning that
everybody thinks their problem needs big data. After 4 years in a role where I
am expected to use big data clusters regularly, I've really only needed it
twice. To be fair, in a complex environment with multiple data sources
(databases, flat files, excel docs, service logs), ETL can get really absurdly
complicated. But that is still no excuse to introduce big data if your data
isn't actually big.

I really hate pat-myself-on-the-back stories, but I'm really proud of this
moment, so I'm gonna share. One time a principal engineer came to me with a
data analysis request and told me that the data would be available to me soon,
only to come to me an hour later with the bad news that the data was 2
terabytes and I'd probably have to spin up an EMR cluster. I borrowed a
spinning disk USB drive, loaded all the data into a SQLite database, and had
his analysis done before he could even set up a cluster with Spark. The proud
moment comes when he tells his boss that we already had the analysis done
despite his warning that it might take a few days because "big data". It was
then that I got to tell him about this phenomenal new technology called SQLite
and he set up a seminar where I got to teach big data engineers how to use it
:)

P.S. If you do any of this sort of large dataset analysis in SQLite, upgrade
to the latest version with every release, even if it means you have to `make;
make install;` Seemingly every new release since about 3.8.0 has given me
usable new features and noticeable query optimizations that are relevant for
large query data analysis.
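
A minimal sketch of the kind of workflow described above, not the commenter's actual code: it streams a CSV export into SQLite with Python's standard `sqlite3` module and runs an aggregate query. The `load_csv` helper, the table and column names, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file, batch_size=50_000):
    """Stream rows from a CSV file object into a SQLite table in batches."""
    reader = csv.reader(csv_file)
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    insert = f'INSERT INTO "{table}" VALUES ({marks})'
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            # Insert in batches so the whole file never sits in memory.
            conn.executemany(insert, batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
    conn.commit()


# Hypothetical sample standing in for a much larger export.
sample = io.StringIO("region,amount\neast,10\nwest,5\neast,7\n")

# ":memory:" keeps the demo self-contained; for real data, pass a file path
# (for example on an external drive) so the database lives on disk.
conn = sqlite3.connect(":memory:")
load_csv(conn, "orders", sample)

totals = conn.execute(
    "SELECT region, SUM(CAST(amount AS INTEGER)) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The point of the design is that nothing here needs a cluster: the data is read once, written to a single file, and queried with plain SQL.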

~~~
ignoramous
Fellow amazonian here. We switched from a massively distributed datastore (not
to be named) to rodb for storage and found 10x improvement, not to mention
eliminating cost and other head-aches; kind of expected since rodb is an
embedded db...

~~~
pneumatics
What is rodb?

~~~
csears
I'm guessing he means a "Real-time Operational Database". This seems to be a
generic term for a system like a data warehouse that contains current data,
instead of just historical data. If you are taking the output of a Spark flow
and storing it in Postgres or MongoDB or HBase for applications to query, then
those could be considered RODBs.

Since this is Amazon, I suspect he is referring to SPICE (or their internal
version), which was released last fall as part of AWS's QuickSight BI
offering...

"SPICE: One of the key ingredients that make QuickSight so powerful is the
Super-fast, Parallel, In-memory Calculation Engine (SPICE). SPICE is a new
technology built from the ground up by the same team that has also built
technologies such as DynamoDB, Amazon Redshift, and Amazon Aurora. SPICE
enables QuickSight to scale to many terabytes of analytical data and deliver
response time for most visualization queries in milliseconds. When you point
QuickSight to a data source, data is automatically ingested into SPICE for
optimal analytical query performance. SPICE uses a combination of columnar
storage, in-memory technologies enabled through the latest hardware
innovations, machine code generation, and data compression to allow users to
run interactive queries on large datasets and get rapid responses."

[http://www.allthingsdistributed.com/2015/10/amazon-quicksigh...](http://www.allthingsdistributed.com/2015/10/amazon-quicksight.html)

------
dizzystar
_Nobody enjoys writing and maintaining data pipelines or ETL. It’s the
industry’s ultimate hot potato. It really shouldn’t come as a surprise then
that ETL engineering roles are the archetypal breeding ground of mediocrity.

There is nothing more soul sucking than writing, maintaining, modifying, and
supporting ETL to produce data that you yourself never get to use or consume._

This is like... your opinion. Some people find pushing around HTML / JS / CSS
absolutely soul-crushing. Considering the lion's share of websites are ugly,
unusable, and slow, does this mean that front-end engineering is a breeding
ground of mediocrity, so server-side devs and CFOs should all be sharing in
the pain?

Some people actually enjoy working with data, and don't find ETL and
pipelining horrible to do at all. It is a different set of challenges, but
calling people mediocre because of ETL is a non-sequitur.

~~~
bazqux2
I love ETL. I've worked with real Big Data (5PB+) and working on the data
pipeline was my favorite part. The feeling you get when you rewrite a job to
run 1000x faster so the company can make way more money.

~~~
Terr_
That implies the work is in a revenue-center or that you're at the rare
company that isn't myopically focused on sales.

~~~
TheLogothete
No, it doesn't. If you make an M/R job work 1000 times more efficiently you
save money for the company. It doesn't matter if you are a profit center or a
cost center.

------
btilly
The thinker/doer problem goes way back. In most organizations the person who
thinks of something gets the lion's share of the credit, and the person who
implements it does the lion's share of the work. And if it turns out to be a
bad idea, the thinker can always blame a bad implementation, thereby passing
the lion's share of the blame to the doer.

I've seen careers made and broken based on whether people got to play thinker
or doer.

This makes rewards for thinking very lopsided. However the problem is that
actual credit for success REALLY belongs with the people who did the work.

This problem shows up at every scale in every organization. For example there
are a hundred people who want to be the business side of a startup for every
person who wants to build the tech. Why? The business person gets to be the
thinker, the developer does the work. And then the business person expects to
become the CEO and get the bulk of the payout!

~~~
andrewflnr
But you can't say that the doer always deserves the credit either. Sometimes
the idea is the hard part. Similar for blame. It doesn't work to make
generalizations. You have to make a judgment call every time, and usually the
answer will be a complicated mixture.

~~~
cookiecaper
"Ideas" alone are almost never worth anything. You have to do the work to back
it up. Everyone I know has about a dozen ideas (you hear them all the time as
someone who makes ideas real).

What matters is the technical skill to make the idea go from a fantasy to a
reality semi-reminiscent of the idealized fantastic version, whether that
skill is in business, accounting, programming, marketing, or whatever.

~~~
rukuu001
Sure, an unrealised idea is next to worthless.

The real pain is making a decision and expending resources on your
challenging/risky idea. There's very little appetite for the responsibility
and risk that come with big ideas (in a BigCo).

Got the ability to think up new ideas, sell them within an organisation, and
get them executed (hello 'doer') in a way that provides value to that
organisation? You're gold, and worth way more than the 'doer'.

~~~
cookiecaper
>Got the ability to think up new ideas, sell them within an organisation, and
get them executed (hello 'doer') in a way that provides value to that
organisation?

No one can "have" this ability because it's transient. Unless you control the
entire corporation (in which case you don't need to influence anyone else
anyway), there is always someone who can come in and break your previously-
perfect ability to "sell" your ideas inside the org. You're claiming that
artful politicians (or, more blatantly, "good bullshit artists") are more
valuable than skilled engineers. I don't believe that.

------
lordnacho
He hits upon a quite interesting division of labor. Where I've worked in
finance, there's been "strategists" and there's been "developers". You can
guess which one is seen as high prestige.

The problem arises when someone gets into a position where they can think big
thoughts without having to do any nitty gritty. Effectively, they end up
jumping in right when the real producers have finished the actual work, and
then coming up with some polish that makes it look like they came up with some
interesting result.

This is not actually a way to get work done. It's a way to play politics.

Worse yet, it's actually completely detrimental to getting things done. When
you have things split up between thinkers and doers, what do the incentives
look like? It's quite simple. I may order some analysis, and I may not fully
understand the nuances. But whatever happens, as a thinker I'll have to have
something grandiose to say, and I'll need to keep the doers busy. That way if
I don't find a real conclusion, it's everyone's fault. If I do find something,
it's thanks to me.

Where I worked the people with the big plans couldn't code their way out of a
paper bag. Ask them what Big-O is, they draw a blank. Ask them how their
trading strategy will actually send orders to the exchange, they draw a blank.
But ask them something that sounds like strategy, and they will feed you
plenty of unsubstantiated BS.

My new venture is coders all the way down. Strategists who can actually use
git without asking what it is, understand that algorithmic complexity actually
matters, and so on. Coders who understand what the market is.

~~~
BenoitP
The more I think about making credit match with work in an organisation, the
more it looks like a neural network where the credit/revenue/profit gradient
is having problems being back-propagated.

In the horizontal setup you describe (layer of thinkers on top of layer of
doers), credit hits a barrier at the thinkers. The gradient isn't propagated.

In the vertical setup in the article (layer of thinker-doers), of course the
backprop will be good because it is only one layer thick. You gain proper
incentives, proper treatment of data on the whole pipeline. And the engineers
can also concentrate on a purely orthogonal thing: writing tools.

But you lose the benefit of having the layer being able to focus on one thing.
The author acknowledges those efficiencies (his word). It is hard to find
people with a wide set of skills. Although in this case it is balanced, as now
the engineers have gained specialization.

But I digress. _My point was: humans in orgs are bad at backprop. Why share
the credit at all?_ Organisations can be seen as neural networks/graphs, and
they can lack proper backprop.

I'd _love_ to see the results of some pagerank-like backprop. Every employee
gets one base point. Every week, he is asked: "who helped you the most in
doing your job this week?". Sales would credit analysts who would credit
engineers, etc. Or Sales would credit analysts-engineers who would credit
tool-writers, etc. It could go both ways: engineers could credit sales or
analysts for writing well thought-out problem descriptions.

Then you would run pagerank on it, and base every promotion, every salary
increase on it. Information would flow well, and everybody has a clear
direction (his gradient) of what he can do to shine.

Also, by injecting revenue at the sales layer in a certain period of time, you
could identify who contributed the most to an increase in revenue.

Also, I posit that managers have a tiny view of what happens in a firm. They
only get to see a fraction of interactions, while the brunt of what matters
happens in the long tail of one-to-one interactions. Should you choose to
promote people with the highest PR, you would have a true result-based bottom-
up org.
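
A rough sketch of the "pagerank-like backprop" idea above, under stated assumptions: the weekly answers are collected as a mapping from each employee to the colleagues they credited, and a plain power-iteration PageRank is run over that graph. The `credit_rank` function, the damping factor of 0.85, and the four example roles are illustrative choices, not anything specified in the comment.

```python
def credit_rank(votes, damping=0.85, iterations=50):
    """Power-iteration PageRank over a 'who helped me' graph.

    votes maps each employee to the list of colleagues they credited.
    """
    people = sorted(set(votes) | {p for named in votes.values() for p in named})
    n = len(people)
    rank = {p: 1.0 / n for p in people}
    for _ in range(iterations):
        # Everyone keeps a small base share, echoing the "one base point" idea.
        new_rank = {p: (1.0 - damping) / n for p in people}
        for voter in people:
            named = votes.get(voter, [])
            if named:
                # A voter's current weight is split among the people they credit.
                share = damping * rank[voter] / len(named)
                for helper in named:
                    new_rank[helper] += share
            else:
                # Someone who credits nobody spreads their weight evenly.
                share = damping * rank[voter] / n
                for p in people:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical answers to "who helped you the most this week?"
votes = {
    "sales": ["analyst"],
    "analyst": ["engineer"],
    "engineer": ["toolsmith", "analyst"],
    "toolsmith": [],
}
ranks = credit_rank(votes)
for person, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{person:10s} {score:.3f}")
```

In this toy graph nobody credits sales, so sales ends up with the lowest score; injecting revenue at the sales layer, as suggested above, would correspond to giving some nodes a larger base share than others.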

------
throwawy31816
This author seriously needs to expand all of his TLIs (three-letter
initialisms) the first time he uses them, as any writer worth his or her salt
would do. There are those who may be interested in what he has to say, but
can't follow because of assuming abbreviations.

~~~
mikestew
Though I agree with you on expanding TLIs, if you have to have "ETL" defined
for you, you probably won't get the "joke". And though this will come out more
cynical than I intend, if you don't know the acronyms, then you probably won't
be buying what Stitch Fix is selling. Filtering their funnel, maybe?

~~~
ams6110
While possibly true, it's simply a courtesy to the reader to parenthetically
define any acronym the first time it's used in a published piece of writing
(of course this would not apply to internal emails, casual comments such as
discussion forums here, etc.)

~~~
quadstick
The original post may have been intended for a select audience that would be
familiar with the context, so that author may be forgiven, but the person that
submitted the post should have kept in mind the much wider audience here.

As a hardware engineer, ETL is a NRTL that competes with UL and CSA. Oh,
excuse me, Thomas Edison's Electrical Testing Labs is a Nationally Recognized
Testing Laboratory that competes with Underwriters Laboratories and the
Canadian Standards Association.

------
kafkaesq
_There is nothing more soul sucking than writing, maintaining, modifying, and
supporting ETL to produce data that you yourself never get to use or consume._

I'm not sure I get why writing ETL code for data you'll never consume is any
more soul-sucking than, say, refactoring JS code for a website you couldn't
begin to care about (and which will never be properly re-designed anyway); or
even doing "thinker"-level work but for an industry you couldn't begin to care
about (advertising), etc.

In other words, what most developers of whatever technical stripe do for a
living.

~~~
kaspm
And I also fundamentally disagree with the notion that moving a large amount
of realtime data reliably and with accuracy, monitored and consistent with
relatively little failure is not an interesting engineering challenge in
itself. I find that for all the talk about data-driven organizations, most
don't use a tenth of what is available but that when the tenth is needed, it's
hugely satisfying to be able to provide it.

~~~
dspillett
_> And I also fundamentally disagree with the notion that [ETL work] is not an
interesting engineering challenge in itself._

A lot of people think that certain DBA/ETL/BI/similar work is boring and
simply don't want to do it and so don't learn to do it well. Which is fine by
me: it means those of us who can do it well can get paid good money when
someone needs it.

The only problem with this theory in practise is that many also think such
work is easy and free of complications; so they baulk at paying for people
who genuinely can do it well, get people less experienced who say they can do it
well but do it badly, and judge the rest of us by that standard and assume
database people are thick and can't do easy jobs properly...

------
v64
> The fundamental flaw that prevents the Thinker and Doer model from living up
> to its recruiting hype is the assumption that there exists an army of
> soulless non-mediocre Doer engineers who eagerly implement the ideas and
> vision of data scientists.

There's a large, active community of engineers who specialize in data, whose
job is to technologically enable data scientists the means to perform their
analyses. I know these people exist because I'm one of them, and I work with
them, and I've met them at meetups and conferences. I don't know why the
author doesn't think these types of engineers exist. Not all of us who code
want to work with the web.

> If you read the recruiting propaganda of data science and algorithm
> development departments in the valley, you might be convinced that the
> relationship between data scientists and engineers is highly collaborative,
> organic, and creative. Just like peas and carrots.

Almost every data team I've worked with is structured this way. I work daily
with data scientists. I have a data scientist sitting to my right, two data
scientists sitting across from me. Our teams are highly integrated and I can't
imagine it working any other way. If the teams the author is familiar with
don't operate in this manner, then I can see why he'd think the endeavor is
hopeless.

I also disagree with the author's conclusion. The data scientist's job is to
analyze and interpret data. They should not be spending any time thinking
about how to get that data. They should not be concerned about where the data
is coming from. The more time scientists have to spend thinking about ETL, the
less time they have to do what their training is in, statistical analysis.

~~~
nmkridler
I completely disagree, data scientists who can not create the data they need
are at a significant disadvantage to those who can. Our job is more than being
able to analyze and interpret data. If you have someone in your organization
that spends no time thinking about how they get the data, you need to fire
them or reduce their salary.

~~~
v64
The data scientists I work with are statistics PhDs. The extent of their
programming knowledge is R and SQL. What are they supposed to do if the data
they need to analyze is only available through a SOAP API you log into with
OAuth, and they need to log in once a day to retrieve the latest day of data?
Unless you're a software engineer, you probably don't have the skillset
necessary to easily get that data.

The data we use comes from relational databases and document stores operated
by different departments, external APIs and third party services, SalesForce,
server log files, etc. A stats PhD does not have the training to gather this
data themselves.

In terms of a hybrid scientist/engineer role, I don't know many software
engineers who are also good at stochastic calculus or ensemble learning.
Likewise, I don't know many data scientists who are also comfortable writing
cronjobs to retrieve external API data or have the ability to diagnose server
problems.

~~~
nmkridler
What you are describing is a statistician and that's perfectly fine, but
lumping them in with data scientists devalues the role for those of us doing
more.

~~~
v64
How would you differentiate the roles of statistician, data scientist, and
data engineer? I've used and heard the titles "statistician" and "data
scientist" used interchangeably, and the Wikipedia entry for data science [1]
gives evidence to support that usage since the late 90s:

"In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled
"Statistics = Data Science?" for his appointment to the H. C. Carver
Professorship at the University of Michigan. In this lecture, he characterized
statistical work as a trilogy of data collection, data modeling and analysis,
and decision making. In his conclusion, he initiated the modern, non-computer
science, usage of the term "data science" and advocated that statistics be
renamed data science and statisticians data scientists."

From the same article, a quote from Nate Silver:

"I think data-scientist is a sexed up term for a statistician....Statistics is
a branch of science. Data scientist is slightly redundant in some way and
people shouldn’t berate the term statistician."

If your skillset differs from a statistician, then calling yourself a data
scientist is not going to be a differentiating title in common parlance.

[1]
[https://en.wikipedia.org/wiki/Data_science#History](https://en.wikipedia.org/wiki/Data_science#History)

~~~
nmkridler
I think the quote and definition from the blog is a good one: “better
engineers than statisticians and better statisticians than engineers”. Perhaps
that 1997 quote was influential in the decision to use the term Data Science,
but I think the current usage encompasses much more than statistics. When I
started it required the ability to push production code, build statistical
models, and communicate results effectively. Maybe I'm wrong and maybe the
tools got better, but for a while, you couldn't provide value if you couldn't
get to the data or create the data you needed.

------
jaz46
Most of the comments in this thread are focusing on the author calling ETL
boring -- that is the title after all. But I found the greater point of the
article to be about empowering data scientists and giving them autonomy. This
post reminds me of Jerry Chen's DDI post [1], except it's about data science.

The notion that a data scientist's only job is to "write a statistical model"
and then it's someone else's problem to run it in a distributed environment
only exacerbates the problem and lowers DS code quality.

Full disclosure: my company Pachyderm [2] is trying to solve exactly the
problem Jeff is talking about in the post. We've built a data processing
platform on top of the container ecosystem. Basically, the data scientist has
complete control over the runtime environment for their analysis since
everything is bundled into a container. It scales to work for actual "big"
data, but it is also great for small teams that don't have massive infrastructure
resources.

[1] [http://venturebeat.com/2015/04/01/the-geek-shall-inherit-the...](http://venturebeat.com/2015/04/01/the-geek-shall-inherit-the-earth-the-age-of-developer-defined-infrastructure/)

[2] github.com/pachyderm/pachyderm

------
brown9-2
_If you manage to hire them, they will be bored. If they are bored, they will
leave you for Google, Facebook, LinkedIn, Twitter, … – places where their
expertise is actually needed. If they are not bored, chances are they are
pretty mediocre._

Granted that yes, lots of solutions don't exactly require a Hadoop cluster
with thousands of nodes, this is a pretty gross and mean-spirited dig at
"mediocre engineers" a number of times. It would be nice if we didn't treat
people that don't work at Amazon/Google/Twitter/LinkedIn as lesser beings
because they find their jobs at a probably-doesn't-have-Big-Data company.

(Does StitchFix have Big Data? If the answer is no, are their "Data platform
engineers" mediocre?)

------
kod
The idea that engineers should build lego blocks without knowing what they're
going to be used for is questionable at best.

A better idea imho is to have small crossfunctional teams where scientists and
engineers work together to build only what they need with short iteration
cycles.

If everyone involved doesn't have at least a broad perspective on the end-to-
end purpose of what they're working on, they're probably going to build the
wrong thing.

~~~
0xdeadbeefbabe
Although, what you say about lego blocks applies to iterations, which are lego
blocks in time.

------
MrFoof
>You Probably Don’t Have Big Data

"Big Data" is like sex in high school. Everyone talks about it but few people
really have lots of it and some just don't have any.

~~~
strictfp
The thing is that everybody has big data, it's only a question of how much of
your data you save.

~~~
tyre
This is not true.

For many startups, even if they audit everything they won't have petabytes of
data.

------
fizixer
Data-driven decision-making to change the course of a business, is so
internally disruptive it's unlikely to happen in an org-chart culture full of
management layers.

Because that's what it is:

\- It is attempting to question, critique, override, everyday decisions made
by the management (including the CEO) based on available data.

\- It is doing that with maximal knowledge of the whole organization. That
means all the records, finances, secrets, what not, have to be divulged to the
data-science team. (which in itself is an unsurmountable challenge, i.e., to
convince the management to allow full data access; think emails, chat logs,
meetings minutes of CEO's, VP's, etc, etc).

This will make the management go, "so let me get this straight, I authorize
you access to data of the whole organization, and you come up with a
conclusion (some of the times at least) that I'm full of it?"

I highly doubt any organization would be up for this kind of internal
disruption, even if that means more success for the company.

~~~
martini159
Your comment so perfectly summarizes what I've long felt is the deep dark
dirty secret of data science (at least at a company that's not
Google/Facebook/etc). And it flies in the face of all the "top job" lists
which are always littered with data-related jobs.

Very few people are interested in making data-driven decisions. They want an
employee (subordinate) who will prove that the decision they've made or are
planning to make, is correct. Anything else is, as you say, very internally
disruptive.

Being a data scientist or data analyst at a startup is (for the most part) a
completely miserable existence. You are relegated to doing interesting things
that are usually discarded. It can make you feel like your job is pointless.

In the end, one either makes the decision to be (at best) useless, or (at
worst) a puppet. That, or you quit.

Thank you so much for your comment - it's refreshing to see I'm not alone in
feeling like this.

------
ajnin
This is not very convincing. He starts off by saying that the "traditional"
model where the data scientists do the thinking while the engineers do the
doing is unsuccessful because the engineers need to get invested in other
people's ideas, need to maintain them and get blamed if they fail while the
data scientists get all the praise. So he suggests replacing this with a new
model where the engineers work horizontally aka in the shadows, have to be
"Tony Stark tailors" and get out of the way while the data scientists get to
be Tony Stark. Which is basically the same thing.

------
nkurz
ETL (Extract, Transform and Load) is a process in data warehousing responsible
for pulling data out of the source systems and placing it into a data
warehouse.

[http://datawarehouse4u.info/ETL-process.html](http://datawarehouse4u.info/ETL-process.html)
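
To make the three stages concrete, here is a toy sketch, assuming the source system emits JSON lines and the "warehouse" is a SQLite table. The `extract`, `transform`, and `load` functions, the `purchases` table, and the sample events are all hypothetical.

```python
import json
import sqlite3


def extract(lines):
    """Extract: pull raw, non-empty records out of the source."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def transform(raw_lines):
    """Transform: parse, normalise names, convert types, skip bad records."""
    for raw in raw_lines:
        try:
            event = json.loads(raw)
            yield (str(event["customer"]).strip().lower(), float(event["amount"]))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing fields, or non-numeric amounts are dropped.
            continue


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical source data; the second line is deliberately malformed.
source = [
    '{"customer": " Alice ", "amount": "10.5"}',
    "not json",
    '{"customer": "BOB", "amount": 4}',
]

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source)))
rows = conn.execute(
    "SELECT customer, amount FROM purchases ORDER BY customer"
).fetchall()
print(rows)
```

Real pipelines add scheduling, monitoring, and retries on top of this shape, which is where most of the complexity discussed in this thread comes from.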

~~~
spriggan3
Thanks, I had no clue what ETL meant, but the article was probably not
directed at my kind.

------
phamilton
> We are not optimizing the organization for efficiency, we are optimizing for
> autonomy.

This is one of the toughest parts of building a scalable organization (with or
without big data). Getting past the idea of efficiency and being OK with
redundancy.

This means allowing two teams to both build a common feature they might need,
rather than establishing a dependency. It means making one team's job broader
even if it overlaps with another team.

I find it interesting that we are perfectly willing to have redundancy on the
software side (load balancing, slaves, etc) but not on the development side.

------
noddingham
I could have done without the first half of the post telling me that I (or
others) are mediocre, then going on to tell me how the author (and his
fellows) are not just because they strive to be the "Best in the World".

This just reads like a puff piece for another valley startup by some guy who's
better than you. Oh, and here's how we do it, you should try doing it this way
too, because we think it's totes the best.

------
gtrubetskoy
I agree with the beginning of the article, which describes the present state
pretty well, the part about "better engineers than statisticians and better
statisticians than engineers", etc. But then I disagree with the rest.

The distinction between "Data Scientists" and "Engineers" is bogus, and the
point about whether your data is "Big" is a red herring.

In reality, there should not be any distinctions between "scientists" and
"engineers", you must strive to be both a "doer" and a "thinker". You can't
think without doing, and can't do without thinking.

If you're in this field, and consider yourself an "engineer" but your math
sucks, go read up on all you can about mathematics and statistics, just like
you did back when you were learning about programming, operating systems and
networking.

If you consider yourself a "data scientist" but don't know anything other than
R and basic Python, go study programming and operating systems and networking,
like you studied math at some point.

Somewhere on youtube I remember Dr. Donald Knuth (who is definitely an
excellent programmer/engineer/computer scientist, arguably one of the best the
world has known) saying that he considers himself primarily a _mathematician_.

Or, if you've read (or at least heard of) "the dragon book", you might find it
interesting and inspiring that one of its main authors Dr. Jeffrey Ullman
(whom I'd place in the same league as Knuth) went on to write another
excellent (and available freely online, BTW) book "Mining of Massive
Datasets", which IMHO is _the_ one fundamental "big data" book out there.

So Data Scientists - go learn some programming languages like C and study UNIX
and maybe read "The Art of Computer Programming" and Engineers go read
[http://www.mmds.org/](http://www.mmds.org/).

Then you'll all get along.

~~~
dxbydt
This is a supremely ridiculous set of suggestions that has no merit
whatsoever. Companies aren't libraries. They aren't paying you to sit and read
books. There is an assigned dayjob, a set of tasks you have on your jira that
you have to resolve by your deadlines, and that occupies the 8-hour workday if
you are doing any justice to it. So any reading you do is on the side, on your
own time.

Furthermore, people have these roles precisely because of their talents and
their choices. As a Data Scientist, most of what I do is read ML literature,
build ML models and write technical reports in Tex on what worked and what
didn't. The skills to do this were acquired over many painful years of
graduate work in math, statistics, ML. To suggest somebody can just read their
way through that material is quite laudable, but you are underestimating the
difficulty by orders of magnitude. Essentially, you are suggesting that all of
the graduate study and mentoring and homeworks and assignments and all that
went into the learning process be condensed into a book which one can just
plow through and become a DS. Well, good luck with that. By the same token,
expecting me to have the same level of efficiency and passion as a data
engineer when faced with a Hadoop/Oozie/Presto/Pig/kafka or what have you is
silly. I don't care for these technologies and how to work them. I know it
takes a really long time to get good at them - that's why the engineers get
paid a lot of money and also get yelled at when the ETL job fails. Because
it's a set of seriously valuable skills that were no doubt acquired over lots
of time and practice. It's not like I can buy a book on these things, just
read through them and suddenly I am a DE! I neither have the interest nor the
time to do that.

>>the distinction between data scientists and data engineers is bogus

Not at all! Both DS and DE professionals do distinctly different work and
conflating everything under 1 umbrella buys you nothing.

~~~
leblancfg
>> Companies aren't libraries. They aren't paying you to sit and read books.

That stroke you've brushed is too wide. Smart employers will have some of the
money they're paying an employee going towards learning... and if they're
really smart, they can even measure their ROI. Leads to less turnover, and
better long-term vision for their projects.

I get your point, but give someone passionate enough 6 months in a new work
environment, and with a decent mentor, and you might find they become
surprisingly adept at it. The hard part is hiring for the capability to learn
(fast).

------
nissimk
Title should have been "Engineers Shouldn't only write ETL." I agree with the
author's statement of the problem, but not with the proposed solution.
Succinctly, I think the problem is compartmentalization and specialization.
These are qualities that are sometimes promoted by management so that it is
easier to maintain control over the organization and to hire people who won't
require much training to do their jobs. Unfortunately, compartmentalization
and specialization both lead to unhappiness in the workers, and are net
negative for production. I believe the solution is fostering a holistic
approach among the specialists. Data scientists (who should be statisticians
or machine learning experts) should interact regularly with software
development engineers that have to productionize their research and they
should also both interact regularly with systems and database administrators
who make it all work in production. Rather than being separate teams working
on parts of the same goal, they should all be one team. By working together
through the problems faced in each area, they can learn more about each
other's areas of expertise and will create a better solution faster. This
isn't true just for data science, but throughout technology, where operational
software developers should work together with product development, marketing,
testing and operations to break down the divide and get all team members
working towards the same goal.

------
ZenoArrow
> “What is the relationship like between your team and the data scientists?”
> This is, without a doubt, the question I’m most frequently asked when
> conducting interviews for data platform engineers. It’s a fine question –
> one that, given the state of engineering jobs in the data space, is
> essential to ask as part of doing due diligence in evaluating new
> opportunities. I’m always happy to answer. But I wish I didn’t have to,
> because this is a question that is motivated by skepticism and fear.

> "Rather than try to emulate the structure of well-known companies (who made
> the transition from BI to DS), we need to innovate and evolve the model! No
> more trying to design faster horses…

> A couple years ago, I moved to Stitch Fix for just that very reason. At Stitch
> Fix, we strive to be Best in the World at the algorithms and analytics we
> produce. We strive to lead the business with our output rather than to inform
> it."

I find this article rather peculiar. At the start, you'd be forgiven for
thinking this was an article about a company looking to find a solution to a
problem, but as the article progresses it's clearer that they're selling
themselves as the solution to the problem they outlined.

In other words, they start off looking like a customer, but only to set up the
premise required to sell the solution to the problem their company supposedly
has/had. Turned me off from taking the product seriously.

------
zeroecco
This article is so one sided it is painful. It is almost like Sheldon Cooper
wrote this article. As an engineer I am offended and hurt that we are referred
to as “Tony Stark’s tailor”.

~~~
ellimilial
Btw didn't he build the suits himself and just share it with his less able cop
friend?

I'd agree, it feels like it invalidates the claim about getting credit for
'being a thinker'.

------
tyre
The article is about not over-engineering solutions to problems you do not
have. If you don't have interesting problems that require world-class
solutions, then don't hire as if you do.

> If they are not bored, chances are they are pretty mediocre. Mediocre
> engineers really excel at building enormously over complicated, awful-to-
> work-with messes they call “solutions”.

And then comes this line:

> At Stitch Fix, we strive to be Best in the World at the algorithms and
> analytics we produce.

Without further justification, why does StitchFix, a subscription shopping
service, need to be the "Best in the World" at algorithms and analytics? They
have harder problems than Google or the Centers for Disease Control or NASA?

Unless they have justification for that, it seems a bit ironic given the
article's ire for over-engineering.

------
iblaine
I think the conclusion is this. Data Scientists, Data Engineers, and
Infrastructure Engineers exist in their respective roles. Data Engineers
should enable Data Scientists to be better engineers by creating frameworks
for Data Scientists. By doing so, Data Scientists will be less likely to put
stress on everyone else.

Another point I'd like to make is that not everyone hates ETL and pipeline
management. I happen to like it. It's rewarding to stand up reliable self-
healing data pipelines and ETLs.

------
jkestelyn
All the angst about "big" data that "isn't big" is based on a false premise.
"Big data" was NEVER just about scale, but rather intended to be equally
descriptive of diversity as well as velocity. The problem is that whoever
coined the term made the same strategic error as whoever coined "global
warming": the adjectives used are too specific to adequately describe the
full range of qualities involved.

------
mattexx
_The best-case outcome of many efforts of data scientists is an artifact meant
for a machine consumer, not a human one._

I posit this outcome is absolutely necessary for any data science project to
be worthwhile in any organization.

In the case where the project produces a report and goes no further, you still
need the data and code for reproducibility, _one of the main principles of the
scientific method_ [1].

In the case where the project gets handed off to engineers to re-implement,
reproducibility is even more critical, since the engineers' best effort to
reproduce the code will almost certainly not be successful the first time, and
you will need to validate many versions of the production model. Doing this by
hand even once is wasteful; doing so many times is tragically so.

In the case where the data scientists can produce a service worthy of
production use, kudos!! But understand the caveat that in _truly_ big data or
big compute flows, this outcome remains highly unlikely.

[1]
[https://en.wikipedia.org/wiki/Reproducibility](https://en.wikipedia.org/wiki/Reproducibility)
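
The validation step described above (checking each re-implemented version against the original model's outputs) can be automated rather than done by hand. A minimal sketch, assuming both versions can emit predictions keyed by record id; the function name `compare_outputs`, the sample data, and the tolerance are all hypothetical:

```python
import math

def compare_outputs(reference, candidate, rel_tol=1e-6):
    """Return the keys whose candidate prediction is missing or differs from the reference."""
    mismatches = []
    for key, expected in reference.items():
        actual = candidate.get(key)
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches.append(key)
    return mismatches

# Hypothetical predictions keyed by record id.
reference = {"a": 0.25, "b": 0.50, "c": 0.75}     # output of the original research code
candidate_v1 = {"a": 0.25, "b": 0.55, "c": 0.75}  # first re-implementation attempt
candidate_v2 = {"a": 0.25, "b": 0.50, "c": 0.75}  # corrected re-implementation

print(compare_outputs(reference, candidate_v1))
print(compare_outputs(reference, candidate_v2))
```

Run against a saved reference output, a check like this can be repeated for every new version of the production model, which is only possible if the original data and code were kept.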

------
dgudkov
I've done quite a few BI/DWH projects and what I found is that the best
approach is to begin from deep prototyping done by data analysts and ETL
developers together. It may all start with just a few spreadsheets and a
simple dashboard. After many iterations and a lot of brainstorming it grows
into a rather developed working prototype that both sides have equally
contributed to. Then the prototype is productionized by the engineers using
standard ETL/whatever tools. So everybody gets the credit, and everybody is
motivated. This experience made me create EasyMorph [1] -- a tool for quick
ETL prototyping and brainstorming. It's like Excel but for tables, and it's
equally suitable for data scientists and developers.

[1] [http://easymorph.com](http://easymorph.com)

------
georgewfraser
I would take this a step further and say NOBODY should write their own ETL. In
a world where:

1\. SaaS services have APIs

2\. Your database is hosted in the cloud

3\. You use a standard SQL data warehouse that is also hosted in the cloud.

ETL from (1, 2) to (3) is a completely standard problem, and you should be
able to buy a fully-automated solution. My company (Fivetran) does this as a
service. We've replaced lots of homebrew data pipelines built by our
customers, and we always see the same issues:

* Homebrew ETL pipelines use fancy big-data tech like Hadoop and Kafka in places where it has no relevance, like syncing your 20 GB Salesforce instance.

* Homebrew ETL pipelines don't deal with all the dark corners of the data sources, such as: what happens when someone adds a new custom column? What happens when your MySQL read replica fails over and a new binlog starts? Etc.

The lesson being, don't do this yourself.
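
To illustrate the "new custom column" corner case mentioned above, here is a minimal sketch of a loader that widens the destination table instead of failing. It uses Python's built-in `sqlite3` as a stand-in warehouse; the table, the `region__c` column, and the `sync_rows` helper are hypothetical, and this is not a description of how Fivetran or any other product works. Identifiers are interpolated directly into SQL, so it assumes trusted column names.

```python
import sqlite3

def sync_rows(conn, table, rows):
    """Insert dict-shaped rows, adding any column the table has not seen yet."""
    # Columns the destination table currently has (field 1 of table_info is the name).
    existing = {r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')}
    for row in rows:
        for col in row:
            if col not in existing:
                # New column appeared in the source: add it to the destination.
                conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{col}"')
                existing.add(col)
        cols = ", ".join(f'"{c}"' for c in row)
        marks = ", ".join("?" for _ in row)
        conn.execute(
            f'INSERT INTO "{table}" ({cols}) VALUES ({marks})',
            list(row.values()),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
sync_rows(conn, "accounts", [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": "Globex", "region__c": "EMEA"},  # hypothetical custom column
])
result = conn.execute(
    "SELECT id, name, region__c FROM accounts ORDER BY id"
).fetchall()
print(result)
```

Rows loaded before the column existed simply read back as NULL for it. A real pipeline would also have to deal with type changes, renames, and deletions, which this sketch ignores.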

------
scandox
This reminds me of Spolsky's story about MS trying to create a Master Slave
paradigm in coding. The Master would define functions and the slave would
write the actual code in the functions. But of course no one wants to be the
slave. Everyone needs to feel they are a thinker. Naturally.

------
dunkelheit
In other words division of labor does not work quite so well for a data
science department as for a pin factory. The proposed solution (letting data
scientists code more) is not radical enough in my opinion. Why not muddle the
roles even further? Let everybody feel the pains that people in the other
roles experience. Foster empathy and personal connections. Let developers talk
to the users and vice versa.

I worked at a company where distinction between the roles was emphasized by
physical separation, presumably so that they won't interfere with each other's
day-to-day duties. The downside is that each group starts caring about their
particular thing only, feeling that they are the ones who really keep the
place running and other groups are bozos doing their job incredibly poorly.

------
StudyAnimal
ETL is just a small part of it, and engineers should probably have more of a
role than they do. My last project was a data warehouse one, where the thing
was obviously slapped together by PowerCenter users. They thought it was all
about the ETL. They forgot the other 95%: how to engineer large scale,
complex, maintainable software solutions.

Sure, let the PowerCenter users "write the ETL", but then they need to get the
heck out of the way and let the big boys actually build the warehouse.

------
TeMPOraL
Someone with big enough clout should utter a proper rule of thumb at some
prestigious software conference. Something like "if your data can fit on a
single commercially-available hard drive, it's not big data". Maybe then it
has a chance to filter down to university education over the next decade or
so.

(Corollary to that rule of thumb: if your data fits on a hard drive, all "big
data" tools you need are shell scripts and SQLite.)
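
A minimal sketch of that corollary, using Python's standard library in place of a shell script: load a delimited file into SQLite and aggregate it with plain SQL. The sample data and the `events` table are hypothetical; a real file would be opened with `open(path, newline="")` instead of the in-memory string used here.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a CSV file on disk.
raw = io.StringIO("customer,amount\nalice,10\nbob,5\nalice,7\n")

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the database
conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")

# DictReader yields one mapping per row, which matches the named placeholders.
conn.executemany(
    "INSERT INTO events (customer, amount) VALUES (:customer, :amount)",
    csv.DictReader(raw),
)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM events GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The same pattern works unchanged for files of many gigabytes, since rows are streamed from the reader rather than held in memory.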

------
kakoni
How do people do ETL these days? Using spark? Some framework?

Personally for smaller projects I've used kiba[1] or transforms in pgloader
[2]

[1] [http://www.kiba-etl.org/](http://www.kiba-etl.org/) [2]
[https://github.com/dimitri/pgloader](https://github.com/dimitri/pgloader)

------
lafay
Is ETL really even necessary anymore? Why not just run fast ad hoc queries
over the raw data with something like Google BigQuery?

~~~
web007
Yeah - just Extract it from your MySQL / Mongo / Postgres / logfiles /
whatever system it's in right now, Transform it into a CSV or whatever the
input needs to be and Load it into BigQuery. Once it's there, you can do
whatever you need!

~~~
ams6110
On a smaller scale, the "q" utility has been a boon for me in the handling of
ad-hoc delimited data files.

[http://harelba.github.io/q/](http://harelba.github.io/q/)

Really one of the best things I've discovered in the past 5 years. Saves so
much work compared to doing stuff with sed, awk, and the like.

~~~
khc
wow good find! I've been using my own script
[https://github.com/kahing/bin/blob/master/avg](https://github.com/kahing/bin/blob/master/avg)
but q seems a lot more flexible and works with more than number types.

------
hakann
I work at a small startup with only 2 people in analytics. We build the
infrastructure, data pipelines and do the BI analysis and data science. And I
really enjoy knowing how it all comes together and being able to change
anything in the pipeline. Maybe not having enough money for a big data
department is our blessing.

------
chris_wot
It's really amazing that there are businesses who collect a lot of data, spend
most of their time transforming and loading it - then do nothing with it.
Nothing effective at any rate.

I'm genuinely curious - to analyse data effectively, is there a baseline of
statistical understanding you need to have? If so, what is it?

------
eveningcoffee
This is the same pattern as architects, coders and administrators. So yeah,
nobody wants to be a code monkey, the administrator role is also tedious, and
the added value of architects is actually kind of low.

Sorry, I could not get any further than when the sales pitch kicked in, so
sorry if there was anything new after that.

------
systems
i would bet that the author must be thinking that those who disagree with him
are ... mediocre engineers or developers

personally, i think, there is nothing wrong with being average .. people with
average skills built great things

mediocre is just a mean way to say average

------
AndrewUnmuted
> Most companies structure their data science departments into 3 groups:

> Data scientists ... aka “the thinkers”

> Data engineers ... aka "the doers"

> Infrastructure engineers ... aka "the plumbers"

The author is clearly not an infrastructure engineer.

~~~
jxnl
Which is why he spent 4 years managing the data platform at Netflix...

------
trimbo
I recommend it every time this topic comes up:

[https://github.com/google/crush-tools](https://github.com/google/crush-tools)

"Big" data on the command line.

------
kelvin0
What's an ETL? Electronic Transport Layer? Big Data N00b here ...

~~~
reverius42
[https://en.wikipedia.org/wiki/Extract,_transform,_load](https://en.wikipedia.org/wiki/Extract,_transform,_load)

------
hrvbr
Cheers for the beautiful website that doesn't require javascript.

------
kpga
Maybe a bit off topic, but of the 3 roles mentioned (Data Scientist, Data
Engineer and Infrastructure Engineer), which one (if any) is better suited for
working remotely?

------
cowardlydragon
Everybody who writes code does ETL in some form.

That's the fundamental action of computation. Read, compute, write.

This is a stupid article.

~~~
jessaustin
Haha yes that's basically the Church-Turing Thesis right there.

------
joeblau
I'm getting a DNS error - could be me though.

