Engineers Shouldn’t Write ETL (stitchfix.com)
291 points by mjohn on Mar 18, 2016 | 174 comments



> ... a highly specialized team of dedicated engineers...If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”.

OMG, the author just described the last place I was at. Processed a few TB of data and suddenly there's this Rube Goldberg-esque system of MongoDB getting transformed into Postgres...oh, wait, I need Cassandra on my resume, so Mongo sux0r now...displayed with some of the worst bowl-of-spaghetti Android code I've witnessed. The technical debt hole was dug so deep you could hide an Abrams tank in there. To this day I could not tell you what confused thinking led them to believe this was necessary rather than just slapping it all into Postgres and calling it a day.

All because they were processing data sets sooooo huge, that they would fit on my laptop.

I quit reading about the time the article turned into a pitch for Stitch Fix, but leading up to that point it made a good case for what happens when companies think they have "big data" when they really don't. In summary, either a company hires skills it doesn't really need and the hires end up bored, or it hires mediocre people who make the kind of convoluted mess I worked with.


This is so true. I do business intelligence at Amazon, and I've seen this play out millions of times over. The fetishization of big data ends up meaning that everybody thinks their problem needs big data. After 4 years in a role where I am expected to use big data clusters regularly, I've really only needed it twice. To be fair, in a complex environment with multiple data sources (databases, flat files, excel docs, service logs), ETL can get really absurdly complicated. But that is still no excuse to introduce big data if your data isn't actually big.

I really hate pat-myself-on-the-back stories, but I'm really proud of this moment, so I'm gonna share. One time a principal engineer came to me with a data analysis request and told me that the data would be available to me soon, only to come to me an hour later with the bad news that the data was 2 terabytes and I'd probably have to spin up an EMR cluster. I borrowed a spinning disk USB drive, loaded all the data into a SQLite database, and had his analysis done before he could even set up a cluster with Spark. The proud moment comes when he tells his boss that we already had the analysis done despite his warning that it might take a few days because "big data". It was then that I got to tell him about this phenomenal new technology called SQLite and he set up a seminar where I got to teach big data engineers how to use it :)

P.S. If you do any of this sort of large-dataset analysis in SQLite, upgrade to the latest version with every release, even if it means you have to `make; make install;` yourself. Seemingly every release since about 3.8.0 has given me usable new features and noticeable query optimizations that matter for this kind of analysis.
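
If you've never built it from source, it's only a couple of commands. A rough sketch (the version number and URL are just an example; grab whatever is current from sqlite.org):

  # download the autoconf amalgamation (example version; check sqlite.org/download.html)
  wget https://www.sqlite.org/2016/sqlite-autoconf-3110000.tar.gz
  tar xzf sqlite-autoconf-3110000.tar.gz
  cd sqlite-autoconf-3110000
  ./configure --prefix=/usr/local
  make
  sudo make install
  sqlite3 --version   # confirm the shell on your PATH is the new build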


A coworker and I were laughing at the parent comment, and I told him:

"I guarantee that somewhere, sometime, an engineer has been like 'hay guys, I loaded our big data into SQLite on my laptop and it ended up being faster than our fancy cluster'". We then joked that the engineer would be fired a few weeks later for not being a "cultural fit".

A few minutes later you commented with your story. I hope you didn't get fired? :)


There was a HN story in which the author made the same "big data" claim (his intention was to show there is very little big data in the real world, for most people out there), but he only needed to process a couple of GBs. He just used standard Unix commands and the performance was incredibly good.
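
IIRC it was basically pipelines along these lines (the log format and the status-code field here are made up for illustration, but you get the idea):

  # per-status-code request counts over a few GB of gzipped logs, no cluster needed
  # (assumes the status code is field 9, as in Apache's combined log format)
  zcat access-*.log.gz \
    | awk '{ counts[$9]++ } END { for (c in counts) print c, counts[c] }' \
    | sort -k2,2 -rn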

Here is an example for SQLite: https://news.ycombinator.com/item?id=9359568


I think you're talking about this: http://aadrake.com/command-line-tools-can-be-235x-faster-tha...

I quite enjoyed it as well.


I used to work somewhere that did a lot of ETL and we used MS databases - mainly, ahem, Access. At the time I had no idea about *nix. I've often thought since how much easier and quicker it would be to solve lots of the processing jobs with this sort of thing, but never thought about implementation details. This is a great reminder that new is not always best and reminds me that a little knowledge is a dangerous thing!


That guy's problem set allowed him to filter out massive percentages of his data from the get go.


Wow thanks! I couldn't find it in search :-) glad you found it.


The really fun aspect of this for me, personally, is that I've been doing computers for ages (25 yrs or so?) now and 3TiB still intuitively feels like a massive amount of data, even though I think I have something like 6TiB free space on my home server disks... which, in total, didn't even cost as much as a month's grocery shopping. Sometimes it really takes effort to rid yourself of these old intuitions that don't really work any more.

(I'm getting better at it!)

EDIT: And my desktop machine has -- let's see -- about 500000 times the RAM my first computer had. Truly astounding if you think about it.


I can remember going through this with a 10MB file about 15 years ago. It felt like a lot after growing up with floppy disks. But even a modest CPU could iterate over it quickly; I just didn't realise it at the time and assumed I would need to process it in a database!


Exactly!

On a somewhat related note: The original Another World[1] would probably fit into the caches that your CPU has as a matter of course these days.

[1] https://www.youtube.com/watch?v=Zgkf6wooDmw


My first hard drive was only 2MB larger than the caches on my CPU.


That's awesome... I'm working with some 15-20 GB sqlite databases. Though 2 TB sounds kind of big?

Was it 2 TB before compression? Because SQLite does usually blow up the data size relative to the raw data (depending on the original format, obviously). It can be kind of wasteful, and I ended up storing some fields as compressed JSON for this reason (that actually beats the SQLite format).

Also, SQLite's insert speed can be much slower than the disk's sequential write speed (even if you make sure you're not committing/flushing on every row, have no indices to update, etc.)

So I think inserting and loading the data could be nontrivial. But the queries should be fast as long as they are indexed. In theory, SQLite queries should be slower for a lot of use cases because it is row-oriented, but in practice distributed systems usually add 10x overhead themselves anyway...


I worked at a place that shall remain anonymous. An engineer later terminated for cause spent almost 2 months setting up a Hadoop cluster, with all the stuff that goes along with it, to analyze some log data. When a senior engineering manager finally got incredibly tired of waiting for the results, said manager wrote the desired analysis in an afternoon in Ruby, and it ran in a couple of hours, mostly limited by the spindle speed of his laptop drive.

The productivity factor of Ruby or Python in a single memory space vs Hadoop is at least 10x.


Fellow Amazonian here. We switched from a massively distributed datastore (not to be named) to rodb for storage and found a 10x improvement, not to mention eliminating cost and other headaches; kind of expected, since rodb is an embedded db...


I'd like to know what rodb is. Columnar DB? Columnar DBs often beat "big data" - I've beaten decent-sized Spark clusters with one thread of J.


Read only database. It is a hand optimized/compressed database engine that is used for big "once a day" data sets.


I'm Googling and finding a few things called "RODB" that don't quite match your descriptions. What in particular is it?

Or are you talking about just rolling your own format to dump your data into? I've done that, but I'd still appreciate it if I could use something that someone else had put the thought into. (Something like "cdb" by Daniel J. Bernstein, but that's an old 32-bit library that's limited to 4 GB files.)


rodb is in-house, non-FOSS tech @ Amazon.


Thanks, that actually helps, particularly since I sell a competing product (which isn't RO).


What is rodb?


I'm guessing he means a "Real-time Operational Database". This seems to be a generic term for a system like a data warehouse that contains current data, instead of just historical data. If you are taking the output of a Spark flow and storing it in Postgres or MongoDB or HBase for applications to query, then those could be considered RODBs.

Since this is Amazon, I suspect he is referring to SPICE (or their internal version), which was released last fall as part of AWS's QuickSight BI offering...

"SPICE: One of the key ingredients that make QuickSight so powerful is the Super-fast, Parallel, In-memory Calculation Engine (SPICE). SPICE is a new technology built from the ground up by the same team that has also built technologies such as DynamoDB, Amazon Redshift, and Amazon Aurora. SPICE enables QuickSight to scale to many terabytes of analytical data and deliver response time for most visualization queries in milliseconds. When you point QuickSight to a data source, data is automatically ingested into SPICE for optimal analytical query performance. SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, machine code generation, and data compression to allow users to run interactive queries on large datasets and get rapid responses."

http://www.allthingsdistributed.com/2015/10/amazon-quicksigh...


Read-only DB. Think of it as read-only memory mapped key/value store.



Like rocksdb or lmdb but optimized for reads. Read-only DB.


What technologies / architecture does Amazon use for business intelligence? I've just done a business intelligence course, so I'm interested in how more "technology-centered" companies approach BI - whether it's the same thing I learned in the course (put everything into an integrated relational database, if I simplify a lot).


Lots and lots of SQL (a very good thing IMO). This can come in the form of Oracle, Redshift, or any of the commonly available RDS databases (probably not SQL Server though). This is augmented with a lot of Big Data stuff, which used to be pretty diverse (Pig, Hive, raw Hadoop, etc.) but is moving very quickly towards a Spark-centric platform. There is occasionally some commercial software like Tableau/Microstrategy. Apart from that, it is a whole lot of homegrown technology.

As far as architecture is concerned, every team is different. At least on the retail side, we tend to center around a large data warehouse that is currently transitioning from Oracle to Redshift, with daily log-parsing ETL jobs from most services populating tables in that data warehouse. Downstream of the data warehouse is fair game for pretty much anything.


Are you using Re:dash[1]? Seems like a good fit if you're doing lots of SQL and working with many different databases.

[1] http://redash.io/


Do you have some sort of framework for those ETL jobs? How do you handle ETL dependencies/workflow? (Thinking of Spotify's Luigi[1] here.)

[1] https://github.com/spotify/luigi


Would you know why SQL Server generally isn't used?


Probably because:

* SQL Server is proprietary

* SQL Server licenses are expensive

* SQL Server runs only on Windows (or at least used to?)


I chuckled a bit because the GP mentioned Oracle. I'm sure you've heard this one:

Customer: "How much does an Oracle database license cost?"

Oracle Rep: "Well, how much do you have?"


That's actually not a joke. Oracle liked to argue that you should accept server licences as being a fraction of your budget, so that when your budget went up you'd automatically pay them more.


Amazon actually does allow SQL Server. The poster saying it probably isn't used was likely influenced by the fact that Azure, Microsoft's own cloud solution, is an AWS competitor.

To your own points:

* SQL Server is as proprietary as Oracle

* SQL Server is cheaper than Oracle

* SQL Server is being ported to Linux in 2017 :)


Amazon.com seems to strongly encourage the companies it acquires to use AWS. Apparently, woot.com is fixing to rewrite their website in Java or something just so they do not have to use Windows. It evidently makes sense to them.


How long does it take you to load 2 TB into SQLite? How long do queries take? I believe you, but I'm in disbelief that it could be close to as efficient as throwing into ram. I mean, an EMR cluster takes like 5 minutes to spin up.

Where do I learn how to do this? I've tried loading a TiB (one table one index) into SQLite on disk before, and it took forever. Granted this was a couple years ago, but I must be doing something fundamentally wrong.

I want to try this out. I've got 6TiB here of uncompressed CSV, 32 GiB ram. Is this something I could start tonight and complete a few queries before bed?

Actually, out of curiosity, I looked it up on the sqlite site. If I'm reading the docs correctly, with atomic sync turned off, I should expect 50,000 inserts per second. So, with my data set of 50B rows, I should expect to have it all loaded in ... Just 13 days. What am I missing?


There was a little bit of unfair comparison. I didn't have to load the data over a busy network connection, and I didn't have to decrypt the data once it was loaded (I had the benefit of a firewall and the raw data on a USB3 hard drive). I think there was a conversion from CSV to Parquet on the cluster as well. And the engineer who set up the cluster was multitasking, so I'm sure there were some latency issues just from that. But my analysis still only took a few hours (5? Maybe 6?).

There are a handful of things that make a difference. First of all, don't use inserts, use the .import command. This alone is enough to saturate all the available write bandwidth on a 7200rpm drive. It is not transactional, so you don't have to worry about that...it bypasses the query engine entirely; it really is more like a shell command that marshals data directly into the table's on-disk representation. You can also disable journaling and increase page sizes for a tiny boost.
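
To make that concrete, the whole load is roughly this in the sqlite3 shell (the table, paths, and pragma values are just placeholders for my setup):

  sqlite3 analysis.db <<'EOF'
  PRAGMA journal_mode = OFF;     -- no rollback journal during the bulk load
  PRAGMA synchronous = OFF;      -- don't fsync after every write
  PRAGMA page_size = 8192;       -- has to be set before the file has any content
  CREATE TABLE events (ts TEXT, user_id INTEGER, amount REAL);
  .mode csv
  .import /mnt/usb/events.csv events
  CREATE INDEX events_user_idx ON events (user_id);  -- build indexes after loading
  EOF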

Once imported into SQLite you get the benefit of binary representation which (for my use case) really cut down on the dataset size for the read queries. I only had a single join and it was against a dimensional table that fit in memory, so indexes were small and took insignificant time to build. One single table scan with some aggregation, and that was it.


Good tips. I'll give it a try!


thanks for the SQLite tip, I've been meaning to add it to my tool set.

question: is SQLite incrementally helpful when I'm already comfortable with a local pgsql db to handle the use case you suggested? would SQLite be redundant for me in this case?

question: between postgres and unix tools (sed, awk) is there reason to use SQLite?


Reasons to prefer sed/awk: you're in bash

Reasons to prefer sqlite: it's easier to embed in an app, you want sort, join, split-apply-combine, scale, transactions, compression, etc.

Reasons to prefer pgsql: SQLite's perf tools suck compared to pgsql (last time I got stuck, anyway), and I'm sure there are lots of SQL-isms that SQLite doesn't handle, if that's your jam. EDIT: forgot everything-is-a-string in SQLite, just wanted to add that it has bitten me before.


Another reason to use Unix tools like sed and awk (and grep and sort and...) is that, if you just need to iterate over all your data and don't need a random-access index, they are really really fast.


don't forget bdb!


No. If you are already comfortable in PgSQL, there is no need to use SQLite.

You will see that PostgreSQL comes in handy once you get beyond the initial import stage.

PostgreSQL's type system will come to your aid. SQLite essentially treats everything as a string, which can turn nasty when you get serious with your queries, etc.


Hey, would you be willing to go a little into your background/qualifications, as well as maybe what your daily/weekly workload and goals look like? BI is a field I'm curious about because it sounds very interesting, very important, and as far as I can tell, with tons of surplus demand even for entry level. It makes me wonder if the actual day to day must be horrible for those things to be true.


In all fairness, one can run Spark on a single machine. The key insight is that the "single machine" has gotten pretty big these days; the bit-shuffling technology may be secondary.


My favorite is when you jump on a project and do a simple estimation of compute throughput for the highly complex distributed system, and it's something like hundreds of kilobytes per second. You could literally copy all the files to one computer and process them faster than the web of broken parts. It becomes cancerous; to work around system slowness, ever more complex caching mechanisms and ad-hoc workarounds are constructed.

I think part of the problem is that many engineers can debug an application, but surprisingly few learn performance optimization and finding bottlenecks. This leads to an ignorance-is-bliss mindset where engineers assume they are doing things in reasonably performant ways, and so the next step must be to scale, without even a simple estimate of throughput. It turns into bad software architecture and code debt that will cause high maintenance costs.


Would you happen to know a good way to learn performance optimization? I'm working with datasets currently that I am trying to get to run faster, and I cannot tell if the limitation is on my hardware or due to ignorance on my part


Unfortunately I'm not familiar with a single resource for this, which probably contributes to developers being unfamiliar. There is a lot of domain specific knowledge depending upon if you're working with graphics, distributed computing, memory allocators, operating system kernels, network device drivers, etc.

The universal knowledge is learning to run well designed experiments, and this comes from practice. It's like how you would debug code without a debugger. There are profiling tools in some contexts that help you run these experiments, but at the highest level simply calculating the number of bytes that move through components divided by the amount of time it took is very enlightening.

It's valuable to have some rough familiarity of the limits of computer architecture. You can also do this experimentally; for example, you could test disk performance by timing how long it takes to copy a file much larger than RAM. You could try copying from /dev/zero to /dev/null to get a lower bound on RAM bandwidth. You can use netcat to see network throughput.
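
For example, something like this gives you rough numbers in a few minutes (netcat flags vary a bit between flavors; paths and hostnames are placeholders):

  # rough disk write/read throughput (file larger than RAM so the page cache can't lie)
  dd if=/dev/zero of=/data/throughput.test bs=1M count=64000 oflag=direct
  dd if=/data/throughput.test of=/dev/null bs=1M iflag=direct

  # rough lower bound on memory bandwidth
  dd if=/dev/zero of=/dev/null bs=1M count=100000

  # rough network throughput between two hosts
  nc -l 9999 > /dev/null                                     # on the receiver
  dd if=/dev/zero bs=1M count=10000 | nc receiver-host 9999  # on the sender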

Bandwidth is only part of the picture; in some cases latency is important (such as servicing many tiny requests). Rough latency numbers are in [0, 1], but can also be learned experimentally.

Many popular primitives actually don't perform that great per node. For example, something like a single MySQL or Spark node might not move more than ~10MB/s, significantly lower than network bandwidth. You can actually use S3 to move data faster if it has a sequential access pattern :)

[0] http://static.googleusercontent.com/media/research.google.co...

[1] https://gist.github.com/jboner/2841832


Like all things in tech, it depends. I agree with you that most startups will never see data that will not fit in memory. Remember http://yourdatafitsinram.com ?

Now, a data warehouse where you have a horde of Hadoop engineers slaving over 1000-node clusters is probably overkill unless you are a company like Facebook or are in the data processing biz. However, some database management systems can be very complementary.

For example, while you can do a majority of things with Postgres, some are not fun or easy to do.

Have you ever set up postgis? I'd much rather make a geo index in MongoDB, do the geo queries I need and call it a day than spend a day setting up postgis.

We find that MongoDB is great for providing the data our application needs at runtime. That being said, it's not as great when it comes time for us to analyze it. We want SQL and all the business intelligence adapters that come with it.

Yeah you could do a join with the aggregation framework in MongoDB 3.2 but it just isn't as fun.


> Have you ever set up postgis? I'd much rather make a geo index in MongoDB, do the geo queries I need and call it a day than spend a day setting up postgis.

I think you should try it again. I felt the same way when it was in the 1.5 version, but post 2.0 it is really easy. And the documentation has gotten so much better. It really is as simple as `sudo apt-get install postgis` and then `CREATE EXTENSION postgis;` in psql.
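
And the geo queries themselves are only a few lines of SQL. A rough sketch (the `places` table, column names, and coordinates are hypothetical, and the exact package name depends on your Postgres version):

  sudo apt-get install postgis
  psql -d mydb -c "CREATE EXTENSION postgis;"
  psql -d mydb <<'EOF'
  -- spatial index over the geography cast so ST_DWithin can use it
  CREATE INDEX places_geog_gist ON places USING GIST ((geom::geography));
  -- everything within roughly 1 km of a lon/lat point
  SELECT name
  FROM places
  WHERE ST_DWithin(geom::geography, ST_MakePoint(-122.41, 37.77)::geography, 1000);
  EOF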


I think there's a more common reason why companies end up with "awful-to-work-with messes": ETL is deceptively simple.

Moving data from A to B and applying some transformations on the way through seems like a straightforward engineering task. However, creating a system that is fault-tolerant, handles data source changes, surfaces errors in a meaningful way, requires little maintenance, etc. is hard. Getting to a level of abstraction where data scientists can build on top of it in a way that doesn't require development skills is harder.

I don't think most data engineers are mediocre or find their job boring. The expectation from management that ETL doesn't require significant effort is unrealistic, and it leads to a technology gap between developers and scientists that tends to be filled with ad-hoc scripting and poor processes.

Disclosure: I'm the founder of Etleap[1], where we're creating tools to make ETL better for data teams.

[1] http://etleap.com/


I thought it nailed a lot of dynamics at my last gig as well. Of course, I joined for the challenges of scaling and wrote an ETL framework in Rails (that was a mixed bag but very instructive) and then got bored after I realized how small our data really was. Then I left for Google and all my data went up by two prefixes.

I do love reading an article that supports my contention that ETL is the Charlie Work [0] of software engineering.

0 - http://www.avclub.com/tvclub/its-always-sunny-philadelphia-c...


> I do love reading an article that supports my contention that ETL is the Charlie Work [0] of software engineering.

Though I've never formally done work with the "ETL" label, what I've seen of it reminds me of the work I used to do for a client years ago, where I'd take some CSV file (or what have you) and turn it into something that their FoxBase-based accounting system could use. It was boring grunt work (or shall we say, "Charlie work"), but I billed it at my usual rate and it paid the mortgage. I would never, ever wish to make a career of it, however. (And if my assessment of what someone knee-deep in ETL does all day is completely off base, I apologize.)


The thing is, I kinda identify with Charlie. I rather enjoy getting it done and it's the kind of thing that keeps the ship sailing smoothly. Sometimes you can derive a strange appreciation from thwacking rats in the basement, so to speak.


It's all Charlie Work for the VCs, "founders", and (occasional) early hires who take in the lion's share of profits and prestige for the long hours most of us pour into our jobs -- day after day, year after year.

BTW, that may sound polemic, but it's not -- that's really how a lot of business types, academic researchers, and others think of nearly all programming-related work.


Not that polemic indeed, but only if we're willing to stop gazing at our own navels. Most of us here aren't hired with nice paychecks because we're smart and creative, but because we're smart, creative, and yet without the sense to find better work than Charlie Work.

Anyway, this is a shame. I thought "data science" still had nice R&D flavored jobs. Colour me disillusioned.


> data sets sooooo huge, that they would fit on my laptop

Well yes, this is an embarrassing fact for data scientists. A midrange stock MacBook can easily handle a database of everyone on Earth. In RAM. While you play a game without any dropped frames.


This is very, very true - very often it starts when there's more than a gigabyte of data or more than a hundred queries a second. Quite a few inexperienced devs suddenly think it's the big data they've been reading about and FINALLY they can play with it!

With all the hype around 'big data' and all that crap, many people seem to forget how far you can go with plain simple SQL when it's properly configured - and I'm not talking about complicated optimizations, just solid fundamentals. And it's no problem if you can't do it yourself; things like Amazon RDS will help you.


People frequently think that their sub-petabyte dataset is a big data problem.

Time and again, the tools prove that it's not.

From recollection, I can count the number of publicly-known companies dealing with such datasets on two hands, if I'm being generous.


Indeed. Whenever I give a talk about "big data" I make sure to include the phrase, "I'm pretty sure most of the big data market exists on the hubris that developers desperately want to believe their problems are bigger than they really are."


It depends a lot on what you're doing, but 2 TB in general seems like a lot. If you have to perform an out-of-disk sort, you probably need a distributed setup. The funny thing is that most setups that use EMR defeat the data locality principle, and I think that's where the speedup comes from when people run it on a single laptop, for example. Reading the original Google paper helped me a lot in understanding this.


If you have to perform an out-of-disk sort, buy another hard disk and now it's not out-of-disk anymore. This will suffice for nearly any data set you would ever need to sort.

A 4 TB drive costs about $120, and you'll spend way more than that on software development and extra computers if you do distributed computing when you don't need to.
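
Plain GNU sort already does an external merge sort and will happily spill to whatever disk you point it at. Something like this (paths, sizes, and the sort key are made up) handles a multi-TB file on one box:

  # sort a huge CSV by its third column, spilling temp runs to the cheap 4 TB drive
  sort -t, -k3,3 -S 24G --parallel=8 -T /mnt/bigdisk/tmp huge.csv > huge.sorted.csv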


Just to clarify, 2 TB every day, right?


> Nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering roles are the archetypal breeding ground of mediocrity.

> There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.

This is like... your opinion. Some people find pushing around HTML / JS / CSS absolutely soul-crushing. Considering the lion's share of websites are ugly, unusable, and slow, does this mean that front-end engineering is a breeding ground of mediocrity, so server-side devs and CFOs should all be sharing in the pain?

Some people actually enjoy working with data, and don't find ETL and pipelining horrible to do at all. It is a different set of challenges, but calling people mediocre because of ETL is a non sequitur.


I love ETL. I've worked with real Big Data (5PB+) and working on the data pipeline was my favorite part. The feeling you get when you rewrite a job to run 1000x faster so the company can make way more money.


That implies the work is in a revenue-center or that you're at the rare company that isn't myopically focused on sales.


No, it doesn't. If you make an M/R job work 1000 times more efficiently you save money for the company. It doesn't matter if you are a profit center or a cost center.


I don't think the author would disagree with you on that. His main point is that engineers don't want to be in a role where all they're doing is productionizing someone else's ideas. There should be areas of end-to-end ownership for both the engineering team and the data science team.

It doesn't sound like he's suggesting here that some types of technology are just generally horrible to work with or that it sucks to build ETL jobs in general. Maybe he could have done a better job defining what he meant by ETL engineer here, but he does qualify it in that quote with "ETL to produce data that you yourself never get to use or consume."

In general, roles that offer little areas of ownership do draw mediocre engineers.


Could not agree more. With all this talk of data and ideas, you would think he could include some data to back up his ideas. He can't, because this is all just opinion.


this.


The thinker/doer problem goes way back. In most organizations the person who thinks of something gets the lion's share of the credit, and the person who implements it does the lion's share of the work. And if it turns out to be a bad idea, the thinker can always blame a bad implementation, thereby passing the lion's share of the blame to the doer.

I've seen careers made and broken based on whether people got to play thinker or doer.

This makes rewards for thinking very lopsided. However, the problem is that actual credit for success REALLY belongs with the people who did the work.

This problem shows up at every scale in every organization. For example, there are a hundred people who want to be the business side of a startup for every person who wants to build the tech. Why? The business person gets to be the thinker, the developer does the work. And then the business person expects to become the CEO and get the bulk of the payout!


But you can't say that the doer always deserves the credit either. Sometimes the idea is the hard part. Similar for blame. It doesn't work to make generalizations. You have to make a judgment call every time, and usually the answer will be a complicated mixture.


And I didn't say that the doer always deserves the credit. As an example, who deserves more credit for the success of Apple, Steve Wozniak or Steve Jobs? Wozniak created the Apple I and most of the Apple II. But clearly Jobs' ideas built the current company.

However these cases are the exception, not the rule. As a rule ideas are cheap, implementations are hard. And success has more to do with iterating on the implementation than the starting ideas.


The Jobs/Wozniak example is a great one. I might suggest that Wozniak's execution created the company, and Jobs' vision transformed the company into what it is today.

But that transformation involved nearly destroying the company first and getting himself ousted. Then recreating MacOS as a UNIX platform for a market that did not want it, only for it to be finally integrated by the dying Apple as a Hail Mary play by both Jobs and Apple.

Jobs gets a lot of credit for the vision, and even the execution. But it could have turned out lots of different ways. If Pixar had not been successful (in no small part because of Jobs dumping millions and millions of his own money into it), one could imagine NeXT not being bought by Apple.


  > However the problem is that actual credit for
  > success REALLY belongs with the
  > people who did the work.

  > And I didn't say that the doer always
  > deserves the credit.
You kind of did.


No, I really didn't. Coming up with ideas, then iterating on them, is itself a form of work. And when done well, it is a more productive form of work than figuring out technical challenges.

Take my Steve Jobs example. Do you really think that he didn't work hard?


"Ideas" alone are almost never worth anything. You have to do the work to back it up. Everyone I know has about a dozen ideas (you hear them all the time as someone who makes ideas real).

What matters is the technical skill to make the idea go from a fantasy to a reality semi-reminiscent of the idealized fantastic version, whether that skill is in business, accounting, programming, marketing, or whatever.


Yes, I've heard the "ideas are worthless" meme, I just don't entirely believe it. Ideas are easier to have, so a bigger fraction of them are worthless, but that's not actually a good measure of the "relative worth" of ideas and implementations, precisely because the mechanics of producing them is different. Good ideas, e.g. actual market opportunities, problem-solving breakthroughs, are rare and valuable. Insight is valuable.

You can execute as hard as you want, but if you're headed in the wrong direction, the only good thing that's going to happen is that you're going to learn when ideas really are important.


Sure, an unrealised idea is next to worthless.

The real pain is making a decision and expending resources on your challenging/risky idea. There's very little appetite for the responsibility and risk that come with big ideas (in a BigCo).

Got the ability to think up new ideas, sell them within an organisation, and get them executed (hello 'doer') in a way that provides value to that organisation? You're gold, and worth way more than the 'doer'.


>Got the ability to think up new ideas, sell them within an organisation, and get them executed (hello 'doer') in a way that provides value to that organisation?

No one can "have" this ability because it's transient. Unless you control the entire corporation (in which case you don't need to influence anyone else anyway), there is always someone who can come in and break your previously-perfect ability to "sell" your ideas inside the org. You're claiming that artful politicians (or, more blatantly, "good bullshit artists") are more valuable than skilled engineers. I don't believe that.


Eh, this is true of start up business ideas, but not necessarily true in an enterprise that has hired a bunch of doers. Vision is rare and important.


I actually don't really agree. From all the medium-to-large sized businesses I've been on the inside of, as both a consultant and an employee, it seems that a lot of people have a lot of pretty OK ideas about how to fix things within the org and where to take the products. In general, excluding a few noobs and whack-jobs, these ideas seem to be about as viable and plausible as the ideas that actually do get pushed down.

The valuable thing in a big company isn't the idea; it's the way the idea is executed. You'd be amazed what kind of nonsensical ideas will work if your team gets it just right.


He hits upon a quite interesting division of labor. Where I've worked in finance, there's been "strategists" and there's been "developers". You can guess which one is seen as high prestige.

The problem arises when someone gets into a position where they can think big thoughts without having to do any nitty gritty. Effectively, they end up jumping in right when the real producers have finished the actual work, and then coming up with some polish that makes it look like they came up with some interesting result.

This is not actually a way to get work done. It's a way to play politics.

Worse yet, it's actually completely detrimental to getting things done. When you have things split up between thinkers and doers, what do the incentives look like? It's quite simple. I may order some analysis, and I may not fully understand the nuances. But whatever happens, as a thinker I'll have to have something grandiose to say, and I'll need to keep the doers busy. That way if I don't find a real conclusion, it's everyone's fault. If I do find something, it's thanks to me.

Where I worked the people with the big plans couldn't code their way out of a paper bag. Ask them what Big-O is, they draw a blank. Ask them how their trading strategy will actually send orders to the exchange, they draw a blank. But ask them something that sounds like strategy, and they will feed you plenty of unsubstantiated BS.

My new venture is coders all the way down. Strategists who can actually use git without asking what it is, understand that algorithmic complexity actually matters, and so on. Coders who understand what the market is.


The more I think about making credit match with work in an organisation, the more it looks like a neural network where the credit/revenue/profit gradient is having problems being back-propagated.

In the horizontal setup you describe (layer of thinkers on top of layer of doers), credit hits a barrier at the thinkers. The gradient isn't propagated.

In the vertical setup in the article (layer of thinker-doers), of course the backprop will be good because it is only one layer thick. You gain proper incentives, proper treatment of data on the whole pipeline. And the engineers can also concentrate on a purely orthogonal thing: writing tools.

But you lose the benefit of having the layer being able to focus on one thing. The author acknowledges those efficiencies (his word). It is hard to find people with a wide set of skills. Although in this case it is balanced, as now the engineers have gained specialization.

But I digress. My point was: humans in orgs are bad at backprop. Why share the credit at all? Organisations can be seen as neural networks/graphs, and they can lack proper backprop.

I'd love to see the results of some pagerank-like backprop. Every employee gets one base point. Every week, he is asked: "who helped you the most in doing your job this week?". Sales would credit analysts who would credit engineers, etc. Or Sales would credit analysts-engineers who would credit tool-writers, etc. It could go both ways: engineers could credit sales or analysts for writings well thought-out problem descriptions.

Then you would run pagerank on it, and base every promotion, every salary increase on it. Information would flow well, and everybody has a clear direction (his gradient) of what he can do to shine.

Also, by injecting revenue at the sales layer in a certain period of time, you could identify who contributed the most to an increase in revenue.

Also, I posit that managers have a tiny view of what happens in a firm. They only get to see a fraction of interactions, while the brunt of what matters happens in the long tail of one-to-one interactions. Should you choose to promote people with the highest PR, you would have a true results-based, bottom-up org.


This author seriously needs to expand all of his TLIs (three-letter initialisms) the first time he uses them, as any writer worth his or her salt would do. There are those who may be interested in what he has to say, but can't follow because of the unexplained abbreviations.



Though I agree with you on expanding TLIs, if you have to have "ETL" defined for you, you probably won't get the "joke". And though this will come out more cynical than I intend, if you don't know the acronyms, then you probably won't be buying what Stitch Fix is selling. Filtering their funnel, maybe?


While possibly true, it's simply a courtesy to the reader to parenthetically define any acronym the first time it's used in a published piece of writing (of course this would not apply to internal emails, casual comments such as discussion forums here, etc.)


The original post may have been intended for a select audience that would be familiar with the context, so that author may be forgiven, but the person that submitted the post should have kept in mind the much wider audience here.

To a hardware engineer, ETL is an NRTL that competes with UL and CSA. Oh, excuse me: Thomas Edison's Electrical Testing Labs is a Nationally Recognized Testing Laboratory that competes with Underwriters Laboratories and the Canadian Standards Association.


Hm? Stitch Fix is selling clothes, not software. This is just their engineering blog.


TLA is the common abbreviation for what you call TLI, that is Three Letter Abbreviation.

The common abbreviation for a four letter abbreviation is ETLA. That is, Extended Three Letter Abbreviation.


ETL, DBA, API - aren't they all incredibly standard acronyms?


Not to me. I know what DBA and API are, but I've got no idea what ETL is.


Standard for you means that we live under rocks? Obviously if it's standard we must be clueless. Nice way of phrasing it.


100% agreed. I was clueless about ETL too. While the acronym itself may be old, people forget that this field has only been hip and sexy for maybe the last 3 years or so.



Google fixes your ignorance instantly. This TLI is common enough to assume.


It's a pretty common acronym in the field, actually.


There are many of us here who are not in the field.


Maybe in the data science field, but not in the tech field more generally. Probably doesn't help that the thread title didn't specify it was data science-related.


Might end up sounding like this: https://www.youtube.com/watch?v=BpMwZDfSLBk


> There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.

I'm not sure I get why writing ETL code for data you'll never consume is any more soul-sucking than, say, refactoring JS code for a website you couldn't begin to care about (and which will never be properly re-designed anyway); or even doing "thinker"-level work but for an industry you couldn't begin to care about (advertising), etc.

In other words, what most developers of whatever technical stripe do for a living.


And I also fundamentally disagree with the notion that moving a large amount of realtime data reliably and accurately, monitored and consistent, with relatively little failure, is not an interesting engineering challenge in itself. I find that for all the talk about data-driven organizations, most don't use a tenth of what is available, but when that tenth is needed, it's hugely satisfying to be able to provide it.


> And I also fundamentally disagree with the notion that [ETL work] is not an interesting engineering challenge in itself.

A lot of people think that certain DBA/ETL/BI/similar work is boring and simply don't want to do it, and so don't learn to do it well. Which is fine by me: it means those of us who can do it well get paid good money when someone needs it.

The only problem with this theory in practice is that many also think such work is easy and free of complications; so they baulk at paying for people who genuinely can do it well, get less experienced people who say they can do it well but do it badly, and then judge the rest of us by that standard and assume database people are thick and can't do easy jobs properly...


I'm with you. I hear people reflexively dissing ETL (and other aspects of front-line data engineering) all the time, but I've come to suspect they don't really know what these systems are actually about.


> The fundamental flaw that prevents the Thinker and Doer model from living up to its recruiting hype is the assumption that there exists an army of soulless non-mediocre Doer engineers who eagerly implement the ideas and vision of data scientists.

There's a large, active community of engineers who specialize in data, whose job is to give data scientists the technological means to perform their analyses. I know these people exist because I'm one of them, I work with them, and I've met them at meetups and conferences. I don't know why the author doesn't think these types of engineers exist. Not all of us who code want to work with the web.

> If you read the recruiting propaganda of data science and algorithm development departments in the valley, you might be convinced that the relationship between data scientists and engineers is highly collaborative, organic, and creative. Just like peas and carrots.

Almost every data team I've worked with is structured this way. I work daily with data scientists. I have a data scientist sitting to my right, two data scientists sitting across from me. Our teams are highly integrated and I can't imagine it working any other way. If the teams the author is familiar with don't operate in this manner, then I can see why he'd think the endeavor is hopeless.

I also disagree with the author's conclusion. The data scientist's job is to analyze and interpret data. They should not be spending any time thinking about how to get that data. They should not be concerned about where the data is coming from. The more time scientists have to spend thinking about ETL, the less time they have to do what their training is in, statistical analysis.


I completely disagree: data scientists who cannot create the data they need are at a significant disadvantage to those who can. Our job is more than being able to analyze and interpret data. If you have someone in your organization who spends no time thinking about how they get the data, you need to fire them or reduce their salary.


The data scientists I work with are statistics PhDs. The extent of their programming knowledge is R and SQL. What are they supposed to do if the data they need to analyze is only available through a SOAP API you log into with OAuth, and they need to log in once a day to retrieve the latest day of data? Unless you're a software engineer, you probably don't have the skillset necessary to easily get that data.

The data we use comes from relational databases and document stores operated by different departments, external APIs and third party services, SalesForce, server log files, etc. A stats PhD does not have the training to gather this data themselves.

In terms of a hybrid scientist/engineer role, I don't know many software engineers who are also good at stochastic calculus or ensemble learning. Likewise, I don't know many data scientists who are also comfortable writing cronjobs to retrieve external API data or have the ability to diagnose server problems.


What you are describing is a statistician and that's perfectly fine, but lumping them in with data scientists devalues the role for those of us doing more.


How would you differentiate the roles of statistician, data scientist, and data engineer? I've used and heard the titles "statistician" and "data scientist" used interchangeably, and the Wikipedia entry for data science [1] gives evidence to support that usage since the late 90s:

"In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?" for his appointment to the H. C. Carver Professorship at the University of Michigan. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer science, usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists."

From the same article, a quote from Nate Silver:

"I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician."

If your skillset differs from a statistician, then calling yourself a data scientist is not going to be a differentiating title in common parlance.

[1] https://en.wikipedia.org/wiki/Data_science#History


I think the quote and definition from the blog is a good one: “better engineers than statisticians and better statisticians than engineers”. Perhaps that 1997 quote was influential in the decision to use the term Data Science, but I think the current usage encompasses much more than statistics. When I started, it required the ability to push production code, build statistical models, and communicate results effectively. Maybe I'm wrong and maybe the tools got better, but for a while you couldn't provide value if you couldn't get to the data or create the data you needed.


SOAP + OAuth is a weird combination, but you could definitely work with it in R.


I just randomly picked two of the most painful protocols I could think of :) It doesn't surprise me though, I feel like I can't go a workday without hearing the phrase "Oh, actually, I can do that in R"


I disagree with you.


Most of the comments in this thread are focusing on the author calling ETL boring -- that is the title after all. But I found the greater point of the article to be about empowering data scientists and giving them autonomy. This post reminds me of Jerry Chen's DDI post [1], except it's about data science.

The notion that a data scientist's only job is to "write a statistical model" and then it's someone else's problem to run it in a distributed environment only exacerbates the problem and lowers DS code quality.

Full disclosure: my company Pachyderm [2] is trying to solve exactly the problem Jeff is talking about in the post. We've built a data processing platform on top of the container ecosystem. Basically, the data scientist has complete control over the runtime environment for their analysis, since everything is bundled into a container. It scales to work for actual "big" data, but it's also great for small teams that don't have massive infrastructure resources.

[1] http://venturebeat.com/2015/04/01/the-geek-shall-inherit-the...

[2] github.com/pachyderm/pachyderm


> If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre.

Granted, yes, lots of solutions don't exactly require a Hadoop cluster with thousands of nodes, but this is a pretty gross and mean-spirited dig at "mediocre engineers", made a number of times. It would be nice if we didn't treat people who don't work at Amazon/Google/Twitter/LinkedIn as lesser beings because they find their jobs at a probably-doesn't-have-Big-Data company.

(Does Stitch Fix have Big Data? If the answer is no, are their "Data platform engineers" mediocre?)


The idea that engineers should build lego blocks without knowing what they're going to be used for is questionable at best.

A better idea imho is to have small crossfunctional teams where scientists and engineers work together to build only what they need with short iteration cycles.

If everyone involved doesn't have at least a broad perspective on the end-to-end purpose of what they're working on, they're probably going to build the wrong thing.


Although, what you say about lego blocks applies to iterations, which are lego blocks in time.


> You Probably Don’t Have Big Data

"Big Data" is like sex in high school. Everyone talks about it but few people really have lots of it and some just don't have any.


The thing is that everybody has big data, it's only a question of how much of your data you save.


This is not true.

For many startups, even if they audit everything they won't have petabytes of data.


"Big Data" is like good analogies. Few people really have lots and some just don't have any. ;)


And among those that do, fewer still know how to do it particularly well.


Data-driven decision-making to change the course of a business is so internally disruptive that it's unlikely to happen in an org-chart culture full of management layers.

Because that's what it is:

- It is attempting to question, critique, and override everyday decisions made by the management (including the CEO) based on available data.

- It is doing that with maximal knowledge of the whole organization. That means all the records, finances, secrets, and what not have to be divulged to the data science team (which is in itself an insurmountable challenge, i.e., convincing the management to allow full data access; think emails, chat logs, meeting minutes of CEOs, VPs, etc., etc.).

This will make the management go, "so let me get this straight, I authorize you access to data of the whole organization, and you come up with a conclusion (some of the times at least) that I'm full of it?"

I highly doubt any organization would be up for this kind of internal disruption, even if that means more success for the company.


Your comment so perfectly summarizes what I've long felt is the deep dark dirty secret of data science (at least at a company that's not Google/Facebook/etc). And it flies in the face of all the "top job" lists, which are always littered with data-related jobs.

Very few people are interested in making data-driven decisions. They want an employee (subordinate) who will prove that the decision they've made or are planning to make, is correct. Anything else is, as you say, very internally disruptive.

Being a data scientist or data analyst at a startup is (for the most part) a completely miserable existence. You are relegated to doing interesting things that are usually discarded. It can make you feel like your job is pointless.

In the end, one either makes the decision to be (at best) useless, or (at worst) a puppet. That, or you quit.

Thank you so much for your comment - it's refreshing to see I'm not alone in feeling like this.


Yeah, I definitely need the CEO's emails with cat pictures for my predictive maintenance models.


This is not very convincing. He starts off by saying that the "traditional" model, where the data scientists do the thinking while the engineers do the doing, is unsuccessful because the engineers need to get invested in other people's ideas, need to maintain them, and get blamed if they fail while the data scientists get all the praise. So he suggests replacing this with a new model where the engineers work horizontally, aka in the shadows, have to be "Tony Stark tailors", and get out of the way while the data scientists get to be Tony Stark. Which is basically the same thing.


ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.

http://datawarehouse4u.info/ETL-process.html


Thanks, I had no clue what ETL meant, but the article was probably not directed at my kind.


> We are not optimizing the organization for efficiency, we are optimizing for autonomy.

This is one of the toughest parts of building a scalable organization (with or without big data). Getting past the idea of efficiency and being OK with redundancy.

This means allowing two teams to both build a common feature they might need, rather than establishing a dependency. It means making one team's job broader even if it overlaps with another team's.

I find it interesting that we are perfectly willing to have redundancy on the software side (load balancing, slaves, etc) but not on the development side.


I could have done without the first half of the post telling me that I (or others) am mediocre, then going on to tell me how the author (and his fellows) are not, just because they strive to be the "Best in the World".

This just reads like a puff piece for another valley startup by some guy who's better than you. Oh, and here's how we do it, you should try doing it this way too, because we think it's totes the best.


I agree with the beginning of the article, which describes the present state pretty well, the part about "better engineers than statisticians and better statisticians than engineers", etc. But then I disagree with the rest.

The distinction between "Data Scientists" and "Engineers" is bogus, and the point about whether your data is "Big" is a red herring.

In reality, there should not be any distinctions between "scientists" and "engineers", you must strive to be both a "doer" and a "thinker". You can't think without doing, and can't do without thinking.

If you're in this field, and consider yourself an "engineer" but your math sucks, go read up on all you can about mathematics and statistics, just like you did back when you were learning about programming, operating systems and networking.

If you consider yourself a "data scientist" but don't know anything other than R and basic Python, go study programming and operating systems and networking, like you studied math at some point.

Somewhere on youtube I remember Dr. Donald Knuth (who is definitely an excellent programmer/engineer/computer scientist, arguably one of the best the world has known) saying that he considers himself primarily a mathematician.

Or, if you've read (or at least heard of) "the dragon book", you might find it interesting and inspiring that one of its main authors, Dr. Jeffrey Ullman (whom I'd place in the same league as Knuth), went on to co-author another excellent (and freely available online, BTW) book, "Mining of Massive Datasets", which IMHO is the one fundamental "big data" book out there.

So Data Scientists - go learn some programming languages like C and study UNIX and may be read "The Art of Computer Programming" and Engineers go read http://www.mmds.org/.

Then you'll all get along.


This is a supremely ridiculous set of suggestions that has no merit whatsoever. Companies aren't libraries. They aren't paying you to sit and read books. There is an assigned dayjob, a set of tasks you have on your Jira that you have to resolve by your deadlines, and that occupies the 8-hour workday if you are doing any justice to it. So any reading you do is on the side, on your own time.

Furthermore, people have these roles precisely because of their talents and their choices. As a Data Scientist, most of what I do is read ML literature, build ML models and write technical reports in Tex on what worked and what didn't. The skills to do this were acquired over many painful years of graduate work in math, statistics, ML. To suggest somebody can just read their way through that material is quite laudable, but you are underestimating the difficulty by orders of magnitude. Essentially, you are suggesting that all of the graduate study and mentoring and homeworks and assignments and all that went into the learning process be condensed into a book which one can just plow through and become a DS. Well, good luck with that. By the same token, expecting me to have the same level of efficiency and passion as a data engineer when faced with a Hadoop/Oozie/Presto/Pig/kafka or what have you is silly. I don't care for these technologies and how to work them. I know it takes a really long time to get good at them - that's why the engineers get paid a lot of money and also get yelled at when the ETL job fails. Because it's a set of seriously valuable skills that were no doubt acquired over lots of time and practice. It's not like I can buy a book on these things, just read through them and suddenly I am a DE! I neither have the interest nor the time to do that.

>>the distinction between data scientists and data engineers is bogus

Not at all! Both DS and DE professionals do distinctly different work and conflating everything under 1 umbrella buys you nothing.


>> Companies aren't libraries. They aren't paying you to sit and read books.

That's painting with too broad a brush. Smart employers will have some of the money they're paying an employee going towards learning... and if they're really smart, they can even measure their ROI. It leads to less turnover and better long-term vision for their projects.

I get your point, but give someone passionate enough six months in a new work environment with a decent mentor, and you might find they become surprisingly adept at it. The hard part is hiring for the capability to learn (fast).


>> I don't care for these technologies and how to work them.

And I'd argue that this is precisely the problem. I personally care about all of it.


> If you're in this field, and consider yourself an "engineer" but your math sucks, go read up on all you can about mathematics and statistics, just like you did back when you were learning about programming, operating systems and networking.

This assumes availability of time. Obviously, given enough time, people could develop both top-tier engineering and DS skillsets!

Of course, if lots of free time were common we'd all be full-stack developers who also field sales calls and work on product strategy etc. etc.


I don't know... Are you saying that someone like Donald Knuth had a lot of free time? Data problems are arguably the hardest problems out there, and they require deep understanding of mathematics as well as computers; that's just the way it is. It does take time and effort, and maybe even a bit of talent as well - not everyone is cut out for it.


No, I'm saying that someone like Donald Knuth is obligated, by the terms of his employment as a premier academic, to sit on the cutting edge of both mathematics and computer science.

/u/dxbydt commented on this very well, so I'll only reinforce the point that not everyone is Donald Knuth. If the distinction between DS and Engineer is blurred in your specific instance, and you're capable of Knuth-ian levels of work in both, you are almost certainly underpaid and need to lead a team or start a company yourself, stat, since a top-tier combination of those skills is exceedingly rare.


Title should have been "Engineers Shouldn't only write ETL." I agree with the author's statement of the problem, but not with the proposed solution. Succinctly, I think the problem is compartmentalization and specialization. These are qualities that are sometimes promoted by management so that it is easier to maintain control over the organization and to hire people who won't require much training to do their jobs. Unfortunately, compartmentalization and specialization both lead to unhappiness in the workers, and are net negative for production. I believe the solution is fostering a holistic approach among the specialists. Data scientists (who should be statisticians or machine learning experts) should interact regularly with software development engineers who have to productionize their research, and both should also interact regularly with the systems and database administrators who make it all work in production. Rather than being separate teams working on parts of the same goal, they should all be one team. By working together through the problems faced in each area, they can learn more about each other's areas of expertise and will create a better solution faster. This isn't true just for data science, but throughout technology, where operational software developers should work together with product development, marketing, testing and operations to break down the divide and get all team members working towards the same goal.


> “What is the relationship like between your team and the data scientists?” This is, without a doubt, the question I’m most frequently asked when conducting interviews for data platform engineers. It’s a fine question – one that, given the state of engineering jobs in the data space, is essential to ask as part of doing due diligence in evaluating new opportunities. I’m always happy to answer. But I wish I didn’t have to, because this is a question that is motivated by skepticism and fear.

> "Rather than try to emulate the structure of well-known companies (who made the transition from BI to DS), we need to innovate and evolve the model! No more trying to design faster horses…

> A couple years ago, I moved to Stitch Fix for just that very reason. At Stitch Fix, we strive to be Best in the World at the algorithms and analytics we produce. We strive to lead the business with our output rather than to inform it."

I find this article rather peculiar. At the start, you'd be forgiven for thinking this was an article about a company looking to find a solution to a problem, but as the article progresses it's clearer that they're selling themselves as the solution to the problem they outlined.

In other words, they start off looking like a customer, but only to set up the premise required to sell the solution to the problem their company supposedly has/had. Turned me off from taking the product seriously.


This article is so one-sided it is painful. It is almost like Sheldon Cooper wrote it. As an engineer I am offended and hurt that we are referred to as “Tony Stark’s tailor”.


Btw, didn't he build the suits himself and just share them with his less able cop friend?

I'd agree, it feels like it invalidates the claim about getting credit for 'being a thinker'.


The article is about not over-engineering solutions to problems you do not have. If you don't have interesting problems that require world-class solutions, then don't hire as if you do.

> If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”.

And then comes this line:

>At Stitch Fix, we strive to be Best in the World at the algorithms and analytics we produce.

Without further justification, why does StitchFix, a subscription shopping service, need to be the "Best in the World" at algorithms and analytics? Do they have harder problems than Google or the Centers for Disease Control or NASA?

Unless they have justification for that, it seems a bit ironic given the article's ire for over-engineering.


I think the conclusion is this: Data Scientists, Data Engineers, and Infrastructure Engineers exist in their respective roles. Data Engineers should enable Data Scientists to be better engineers by creating frameworks for them. By doing so, Data Scientists will be less likely to put stress on everyone else.

Another point I'd like to make is that not everyone hates ETL and pipeline management. I happen to like it. It's rewarding to stand up reliable self-healing data pipelines and ETLs.
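
For what it's worth, a minimal sketch of what such a framework could look like (every name here is hypothetical, not anything from the article): the data engineers own the runner, scheduling, and I/O, while the data scientists only register pure transform functions.

    # Hypothetical mini-framework: data engineers own the runner, data
    # scientists only register transform steps. All names are illustrative.
    from typing import Callable, Dict

    _STEPS: Dict[str, Callable] = {}

    def step(name: str):
        """Register a data scientist's transform under a pipeline step name."""
        def decorator(fn: Callable):
            _STEPS[name] = fn
            return fn
        return decorator

    @step("score_customers")
    def score_customers(rows):
        # The data scientist writes only the logic; no scheduling, retries, or I/O.
        return [{**r, "score": r["purchases"] * 0.7 + r["visits"] * 0.3} for r in rows]

    def run(name: str, rows):
        """The data engineer's runner. A real one would add retries,
        logging, and warehouse reads/writes around the registered step."""
        return _STEPS[name](rows)

    if __name__ == "__main__":
        print(run("score_customers", [{"purchases": 4, "visits": 10}]))

The point of the split is exactly the one made above: the scientist's code stays small and testable, and the engineers' plumbing is shared rather than re-invented per model.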


All the angst about "big" data that "isn't big" is based on a false premise. "Big data" was NEVER just about scale; it was intended to describe variety and velocity as much as volume. The problem is that whoever coined the term made the same strategic error as whoever coined "global warming" -- the adjectives used are too specific to adequately describe the full range of qualities involved.


The best-case outcome of many efforts of data scientists is an artifact meant for a machine consumer, not a human one.

I posit this outcome is absolutely necessary for any data science project to be worthwhile in any organization.

In the case where the project produces a report and goes no further, you still need the data and code for reproducibility, one of the main principles of the scientific method [1].

In the case where the project gets handed off to engineers to re-implement, reproducibility is even more critical, since the engineers' best effort to reproduce the code will almost certainly not be successful the first time, and you will need to validate many versions of the production model. Doing this by hand even once is wasteful; doing so many times is tragically so.

In the case where the data scientists can produce a service worthy of production use, kudos!! But understand the caveat that in truly big data or big compute flows, this outcome remains highly unlikely.

[1] https://en.wikipedia.org/wiki/Reproducibility
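
For the hand-off case, a minimal sketch of the kind of scripted check that makes validating a re-implementation repeatable rather than manual (the file names and the tolerance are hypothetical):

    # Compare the production re-implementation against the data scientist's
    # reference predictions on the same held-out rows. Paths and tolerance
    # are hypothetical; the point is that the check is scripted, not manual.
    import csv

    def load_scores(path):
        with open(path, newline="") as f:
            return {row["id"]: float(row["score"]) for row in csv.DictReader(f)}

    def validate(reference_path, candidate_path, tol=1e-6):
        ref, cand = load_scores(reference_path), load_scores(candidate_path)
        assert ref.keys() == cand.keys(), "row sets differ"
        worst = max(abs(ref[k] - cand[k]) for k in ref)
        assert worst <= tol, f"max deviation {worst} exceeds tolerance {tol}"
        return worst

    # validate("reference_scores.csv", "production_scores.csv")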


I've done quite a few BI/DWH projects and what I found is that the best approach is to begin from deep prototyping done by data analysts and ETL developers together. It may all start with just a few spreadsheets and a simple dashboard. After many iterations and a lot of brainstorming it grows into a rather developed working prototype that both sides have equally contributed to. Then the prototype is productionized by the engineers using standard ETL/whatever tools. So everybody gets the credit, and everybody is motivated. This experience made me create EasyMorph [1] -- a tool for quick ETL prototyping and brainstorming. It's like Excel, but for tables, and it's equally suitable for data scientists and developers.

[1] http://easymorph.com


I would take this a step further and say NOBODY should write their own ETL. In a world where:

1. SaaS services have APIs

2. Your database is hosted in the cloud

3. You use a standard SQL data warehouse that is also hosted in the cloud.

ETL from (1, 2) to (3) is a completely standard problem, and you should be able to buy a fully-automated solution. My company (Fivetran) does this as a service. We've replaced lots of homebrew data pipelines built by our customers, and we always see the same issues:

* Homebrew ETL pipelines use fancy big-data tech like Hadoop and Kafka in places where it has no relevance, like syncing your 20 GB Salesforce instance.

* Homebrew ETL pipelines don't deal with all the dark corners of the data sources, such as: what happens when someone adds a new custom column? What happens when your MySQL read replica fails over and a new binlog starts? Etc.

The lesson being, don't do this yourself.
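
To make one of those "dark corners" concrete, here is a minimal sketch of detecting a new custom column in the source and adding it to the destination table before loading. SQLite stands in for the warehouse; the table and field names are hypothetical, and a real pipeline also needs type mapping, quoting, backfills, and much more.

    # One "dark corner" of homebrew ETL: the source grows a new custom column.
    # Sketch only; table name, connection, and type handling are hypothetical.
    import sqlite3  # stand-in for the real warehouse client

    def sync_columns(conn, table, incoming_rows):
        """Add any columns present in the source rows but missing from the table."""
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        incoming = {col for row in incoming_rows for col in row}
        for col in sorted(incoming - existing):
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT, name TEXT)")
    rows = [{"id": "1", "name": "Acme", "custom_region__c": "EMEA"}]  # new field appears
    sync_columns(conn, "accounts", rows)
    print([r[1] for r in conn.execute("PRAGMA table_info(accounts)")])
    # ['id', 'name', 'custom_region__c']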


This reminds me of Spolsky's story about MS trying to create a master/slave paradigm in coding: the master would define the functions and the slave would write the actual code inside them. But of course no one wants to be the slave. Everyone needs to feel they are a thinker. Naturally.


In other words division of labor does not work quite so well for a data science department as for a pin factory. The proposed solution (letting data scientists code more) is not radical enough in my opinion. Why not muddle the roles even further? Let everybody feel the pains that people in the other roles experience. Foster empathy and personal connections. Let developers talk to the users and vice versa.

I worked at a company where the distinction between the roles was emphasized by physical separation, presumably so that they wouldn't interfere with each other's day-to-day duties. The downside is that each group starts caring only about its particular thing, feeling that they are the ones who really keep the place running and the other groups are bozos doing their jobs incredibly poorly.


ETL is just a small part of it, and engineers should probably have more of a role than they do. My last project was a data warehouse one, where the thing was obviously slapped together by PowerCenter users. They thought it was all about the ETL. They forgot the other 95%: how to engineer large-scale, complex, maintainable software solutions.

Sure, let the PowerCenter users "write the ETL", but then they need to get the heck out of the way and let the big boys actually build the warehouse.


Someone with big enough clout should utter a proper rule of thumb at some prestigious software conference. Something like "if your data can fit on a single commercially-available hard drive, it's not big data". Maybe then it has a chance to filter down to university education over the next decade or so.

(Corollary to that rule of thumb: if your data fits on a hard drive, all the "big data" tools you need are shell scripts and SQLite.)
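
In that spirit, a minimal sketch of the laptop-scale workflow using Python's built-in sqlite3 module (the file and column names are hypothetical):

    # Load a large-ish CSV into SQLite and run the analysis as plain SQL.
    # File and column names are hypothetical; no cluster required.
    import csv, sqlite3

    conn = sqlite3.connect("analysis.db")
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, amount REAL)")

    with open("events.csv", newline="") as f:
        reader = csv.DictReader(f)
        conn.executemany(
            "INSERT INTO events VALUES (?, ?)",
            ((row["user_id"], float(row["amount"])) for row in reader),
        )
    conn.commit()

    # An index plus ordinary SQL covers a surprising share of "big data" asks.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events (user_id)")
    for user_id, total in conn.execute(
        "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10"
    ):
        print(user_id, total)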


How do people do ETL these days? Using Spark? Some framework?

Personally for smaller projects I've used kiba[1] or transforms in pgloader [2]

[1] http://www.kiba-etl.org/ [2] https://github.com/dimitri/pgloader


Is ETL really even necessary anymore? Why not just run fast ad hoc queries over the raw data with something like Google BigQuery?


Yeah - just Extract it from your MySQL / Mongo / Postgres / logfiles / whatever system it's in right now, Transform it into a CSV or whatever the input needs to be and Load it into BigQuery. Once it's there, you can do whatever you need!
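
A minimal sketch of that flow, assuming psycopg2 and the google-cloud-bigquery client library are available (the DSN, query, file name, and table id are all hypothetical placeholders):

    # Extract from Postgres to CSV, then load into BigQuery.
    # Connection string, query, and table id are hypothetical placeholders.
    import psycopg2
    from google.cloud import bigquery

    # Extract + Transform: let Postgres emit CSV directly.
    with psycopg2.connect("dbname=app user=etl") as pg, open("orders.csv", "w") as out:
        with pg.cursor() as cur:
            cur.copy_expert(
                "COPY (SELECT id, total, created_at FROM orders) TO STDOUT WITH CSV HEADER",
                out,
            )

    # Load: stream the CSV into a BigQuery table.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    with open("orders.csv", "rb") as f:
        client.load_table_from_file(f, "my_project.analytics.orders", job_config=job_config).result()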


On a smaller scale, the "q" utility has been a boon for me in the handling of ad-hoc delimited data files.

http://harelba.github.io/q/

Really one of the best things I've discovered in the past 5 years. Saves so much work compared to doing stuff with sed, awk, and the like.


Wow, good find! I've been using my own script https://github.com/kahing/bin/blob/master/avg but q seems a lot more flexible and works with more than numeric types.


Getting the raw data into BigQuery or another tool is an ETL problem in itself. You have unstructured log files, you have external APIs such as Salesforce, you have various relational databases, etc etc. Someone has to come in and transform this data into a unified format that can be inserted into BigQuery.


I work at a small startup with only 2 people in analytics. We build the infrastructure, data pipelines and do the BI analysis and data science. And I really enjoy knowing how it all comes together and being able to change anything in the pipeline. Maybe not having enough money for a big data department is our blessing.


It's really amazing that there are businesses who collect a lot of data, spend most of their time transforming and loading it - then do nothing with it. Nothing effective at any rate.

I'm genuinely curious - to analyse data effectively, is there a baseline of statistical understanding you need to have? If so, what is it?


This is the same pattern as architects, coders and administrators. So yeah, nobody wants to be a code monkey, the administrator role is also tedious, and the added value of architects is actually kind of low.

Sorry, I could not get any further than the point where the sales pitch kicked in, so apologies if there was anything new after that.


I would bet that the author must be thinking that those who disagree with him are... mediocre engineers or developers.

Personally, I think there is nothing wrong with being average... people with average skills have built great things.

Mediocre is just a mean way to say average.


> Most companies structure their data science departments into 3 groups:

> Data scientists ... aka “the thinkers”

> Data engineers ... aka "the doers"

> Infrastructure engineers ... aka "the plumbers"

The author is clearly not an infrastructure engineer.


Which is why he spent four years managing the data platform at Netflix...


Why? I have certainly described work I've done as "plumbing".


I recommend it every time this topic comes up:

https://github.com/google/crush-tools

"Big" data on the command line.


What's an ETL? Electronic Transport Layer? Big Data N00b here ...



The original audience might not have been as broad as HN's, but it should've been spelled out (Extract, Transform, Load) on first use, IMO.


Cheers for the beautiful website that doesn't require JavaScript.


Maybe a bit off topic, but from the 3 roles mentioned (Data Scientist, Data engineer and Infrastructure engineer) which one (if any) is better suited for working remotely?


Everybody who writes code does ETL in some form.

That's the fundamental action of computation. Read, compute, write.

This is a stupid article.


Haha yes that's basically the Church-Turing Thesis right there.


I'm getting a DNS error - could be me though.



