OMG, the author just described the last place I worked. We processed a few TB of data, and suddenly there's this Rube Goldbergesque system of MongoDB getting transformed into Postgres...oh wait, I need Cassandra on my resume, so Mongo sux0r now...topped off with some of the worst bowl-of-spaghetti Android code I've ever witnessed. The technical debt hole was dug so deep you could hide an Abrams tank in it. To this day I could not tell you what confused thinking led them to believe any of this was necessary, rather than just slapping it all into Postgres and calling it a day.
All because they were processing data sets sooooo huge, that they would fit on my laptop.
I quit reading about the time the article turned into a pitch for Stitch Fix, but leading up to that point it made a good case for what happens when companies think they have "big data" when they really don't. In summary: either a company hires skills it doesn't really need and the hires end up bored, or it hires mediocre people who make the kind of convoluted mess I worked with.
I really hate pat-myself-on-the-back stories, but I'm really proud of this moment, so I'm gonna share. One time a principal engineer came to me with a data analysis request and told me that the data would be available to me soon, only to come to me an hour later with the bad news that the data was 2 terabytes and I'd probably have to spin up an EMR cluster. I borrowed a spinning disk USB drive, loaded all the data into a SQLite database, and had his analysis done before he could even set up a cluster with Spark. The proud moment comes when he tells his boss that we already had the analysis done despite his warning that it might take a few days because "big data". It was then that I got to tell him about this phenomenal new technology called SQLite and he set up a seminar where I got to teach big data engineers how to use it :)
P.S. If you do any of this sort of large-dataset analysis in SQLite, upgrade to the latest version with every release, even if it means you have to `make; make install;`. Seemingly every new release since about 3.8.0 has given me usable new features and noticeable query optimizations relevant to large-scale data analysis.
"I guarantee that somewhere, sometime, an engineer has been like 'hay guys, I loaded our big data into SQLite on my laptop and it ended up being faster than our fancy cluster'". We then joked that the engineer would be fired a few weeks later for not being a "cultural fit".
A few minutes later you commented with your story. I hope you didn't get fired? :)
Here is an example for SQLite: https://news.ycombinator.com/item?id=9359568
I quite enjoyed it as well.
(I'm getting better at it!)
EDIT: And my desktop machine has -- let's see -- about 500000 times the RAM my first computer had. Truly astounding if you think about it.
On a somewhat related note: The original Another World would probably fit into the caches that your CPU has as a matter of course these days.
Was it 2 TB before compression? Because SQLite usually inflates the data size relative to the raw input (depending on the original format, obviously). It can be kind of wasteful, and I ended up storing some fields as compressed JSON for this reason (which actually beats the SQLite format).
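The compressed-JSON trick mentioned above can be sketched like this (the field names and the choice of zlib are mine for illustration, not the commenter's):

```python
import json
import sqlite3
import zlib

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, payload BLOB)")

# compress the verbose field before storing it as a BLOB
record = {"user": "alice", "events": ["click"] * 100}
raw = json.dumps(record).encode("utf-8")
con.execute("INSERT INTO docs (payload) VALUES (?)", (zlib.compress(raw),))

# transparent on the way out: decompress, then parse
stored = con.execute("SELECT payload FROM docs").fetchone()[0]
restored = json.loads(zlib.decompress(stored))
print(len(raw), "->", len(stored), "bytes; round-trips:", restored == record)
```

For repetitive JSON like this, the compressed BLOB is a fraction of the raw text size, at the cost of the field no longer being queryable with SQL.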
Also, the sqlite insert speed can be much slower than the disk's sequential write speed (even if you make sure you're not committing/flushing on every row, and if you have no indices to update, etc.)
So I think inserting and loading the data could be nontrivial. But the queries should be fast as long as they are indexed. In theory, SQLite queries should be slower for a lot of use cases because it is row-oriented, but in practice distributed systems usually add 10x overhead themselves anyway...
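The usual mitigations on the insert side are batching everything into one transaction and relaxing the durability PRAGMAs. A minimal sketch (settings chosen for illustration; only do this for a throwaway analysis database you can rebuild):

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "bulk.db")
con = sqlite3.connect(path)
# trade durability for load speed; a crash mid-load just means re-importing
con.execute("PRAGMA journal_mode = OFF")
con.execute("PRAGMA synchronous = OFF")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")

rows = [(i, "x" * 10) for i in range(100_000)]
start = time.perf_counter()
with con:  # one transaction for the whole batch, not one per row
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
elapsed = time.perf_counter() - start

count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count, f"rows in {elapsed:.2f}s")
```

Committing per-row instead of per-batch is the classic mistake that makes SQLite loads look orders of magnitude slower than the disk.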
The productivity factor of ruby or python in a single memory space vs hadoop is at least 10x.
Or are you talking about just rolling your own format to dump your data into? I've done that, but I'd still appreciate if I could use something that someone else had put the thought into. (Something like "cdb" by Daniel J. Bernstein, but that's an old 32-bit library that's limited to 4 GB files.)
Since this is Amazon, I suspect he is referring to SPICE (or their internal version), which was released last fall as part of AWS's QuickSight BI offering...
"SPICE: One of the key ingredients that make QuickSight so powerful is the Super-fast, Parallel, In-memory Calculation Engine (SPICE). SPICE is a new technology built from the ground up by the same team that has also built technologies such as DynamoDB, Amazon Redshift, and Amazon Aurora. SPICE enables QuickSight to scale to many terabytes of analytical data and deliver response time for most visualization queries in milliseconds. When you point QuickSight to a data source, data is automatically ingested into SPICE for optimal analytical query performance. SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, machine code generation, and data compression to allow users to run interactive queries on large datasets and get rapid responses."
As far as architecture is concerned, every team is different. At least on the retail side, we tend to center around a large data warehouse that is currently transitioning from oracle to redshift, and with daily log parsing ETL jobs from most services to populate tables in that data warehouse. Downstream of the data warehouse is fair game for pretty much anything.
* SQL Server is proprietary
* SQL Server licenses are expensive
* SQL Server runs only on Windows (or at least used to?)
Customer: "How much does an Oracle database license cost?"
Oracle Rep: "Well, how much do you have?"
To your own points:
SQL Server is as proprietary as Oracle
SQL Server is cheaper than Oracle
SQL Server is being ported to Linux in 2017 :)
Where do I learn how to do this? I've tried loading a TiB (one table one index) into SQLite on disk before, and it took forever. Granted this was a couple years ago, but I must be doing something fundamentally wrong.
I want to try this out. I've got 6TiB here of uncompressed CSV, 32 GiB ram. Is this something I could start tonight and complete a few queries before bed?
Actually, out of curiosity, I looked it up on the sqlite site. If I'm reading the docs correctly, with atomic sync turned off, I should expect 50,000 inserts per second. So, with my data set of 50B rows, I should expect to have it all loaded in ... Just 13 days. What am I missing?
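Nothing much, actually: the back-of-envelope math lands in the same neighborhood as the commenter's two-week estimate:

```python
rows = 50_000_000_000        # 50B rows
rate = 50_000                # inserts per second with sync off
days = rows / rate / 86_400  # 86,400 seconds per day
print(f"{days:.1f} days")    # roughly 11.6 days of pure insert time
```

Which is why the sibling replies point at the bulk `.import` path rather than row-by-row INSERTs.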
There are a handful of things that make a difference. First of all, don't use INSERTs; use the .import command. This alone is enough to saturate all the available write bandwidth on a 7200rpm drive. It is not transactional, so you don't have to worry about that...it bypasses the query engine entirely; it's really more like a shell command that marshals data directly into the table's on-disk representation. You can also disable journaling and increase page sizes for a tiny boost.
Once imported into SQLite you get the benefit of binary representation which (for my use case) really cut down on the dataset size for the read queries. I only had a single join and it was against a dimensional table that fit in memory, so indexes were small and took insignificant time to build. One single table scan with some aggregation, and that was it.
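That query shape, one big fact table, one small dimension join, a table scan with aggregation, looks roughly like this (table and column names are hypothetical, not from the commenter's dataset):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (dim_id INTEGER, value REAL);       -- the big fact table
    CREATE TABLE dims (id INTEGER PRIMARY KEY, name TEXT);  -- small, fits in memory
    INSERT INTO dims VALUES (1, 'a'), (2, 'b');
    INSERT INTO events VALUES (1, 10.0), (1, 5.0), (2, 7.0);
""")

# one table scan, one small join, some aggregation
query = """
    SELECT d.name, COUNT(*) AS n, SUM(e.value) AS total
    FROM events e JOIN dims d ON d.id = e.dim_id
    GROUP BY d.name
"""
for name, n, total in con.execute(query):
    print(name, n, total)
```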
question: is SQLite incrementally helpful when I'm already comfortable with a local pgsql db to handle the use case you suggested? would SQLite be redundant for me in this case?
question: between postgres and unix tools (sed, awk) is there reason to use SQLite?
Reasons to prefer sqlite: it's easier to embed in an app, you want sort, join, split-apply-combine, scale, transactions, compression, etc.
Reasons to prefer pgsql: sqlite's perf tools suck compared to pgsql (last time I got stuck anyway) and I'm sure there are lots of sql-isms that sqlite doesn't handle if that's your jam. EDIT: forgot everything-is-a-string in sqlite, just wanted to add that it has bit me before.
You will see PostgreSQL come in handy once you get beyond the initial import stage.
PostgreSQL's type system will come to your aid. SQLite essentially treats everything as a string, which can turn nasty once you get serious with your queries.
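Strictly speaking, SQLite uses dynamic typing with per-column "affinity" rather than treating literally everything as a string, but the practical gotcha is real: a declared column type doesn't reject mismatched values. A quick demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (n INTEGER)")
con.execute("INSERT INTO t VALUES ('not a number')")  # accepted, stored as text
con.execute("INSERT INTO t VALUES ('42')")            # numeric-looking text is coerced
for value, storage in con.execute("SELECT n, typeof(n) FROM t"):
    print(repr(value), storage)  # 'not a number' text, then 42 integer
```

PostgreSQL would reject the first INSERT at write time, which is exactly the kind of help the parent comment is talking about.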
I think part of the problem is many engineers can debug an application, but surprisingly few learn performance optimization and finding bottlenecks. This leads to an ignorance is bliss mindset where engineers assume they are doing things in reasonably performant ways and so the next step must be to scale, without even a simple estimate for throughput. It turns into bad software architecture and code debt that will cause high maintenance costs.
The universal knowledge is learning to run well designed experiments, and this comes from practice. It's like how you would debug code without a debugger. There are profiling tools in some contexts that help you run these experiments, but at the highest level simply calculating the number of bytes that move through components divided by the amount of time it took is very enlightening.
It's valuable to have some rough familiarity of the limits of computer architecture. You can also do this experimentally; for example, you could test disk performance by timing how long it takes to copy a file much larger than RAM. You could try copying from /dev/zero to /dev/null to get a lower bound on RAM bandwidth. You can use netcat to see network throughput.
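The bytes-moved-over-time measurement described above can be wrapped in a trivial helper (a sketch; the in-memory copy here is just a stand-in for whatever component you actually want to measure):

```python
import time

def mb_per_s(nbytes, fn):
    """Time fn() and return throughput in MB/s for nbytes moved through it."""
    start = time.perf_counter()
    fn()
    return nbytes / (time.perf_counter() - start) / 1e6

# stand-in workload: copy a 100 MB buffer in memory
size = 100 * 1024 * 1024
buf = bytearray(size)
rate = mb_per_s(size, lambda: bytes(buf))
print(f"{rate:,.0f} MB/s")
```

Swap the lambda for a file copy, a network read, or a database load, and you get the "bytes divided by time" number the parent comment recommends.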
Bandwidth is only part of the picture; in some cases latency is important (such as servicing many tiny requests). Rough latency numbers are in [0, 1], but can also be learned experimentally.
Many popular primitives actually don't perform that great per node. For example, a single MySQL or Spark node might not move more than ~10 MB/s, significantly lower than network bandwidth. You can actually use S3 to move data faster if it has a sequential access pattern :)
Now, a data warehouse where you have a horde of Hadoop engineers slaving over 1000-node clusters is probably overkill unless you are a company like Facebook or are in the data processing biz. However, some database management systems can be very complementary.
For example, while you can do a majority of things with Postgres, some are not fun or easy to do.
Have you ever set up postgis? I'd much rather make a geo index in MongoDB, do the geo queries I need and call it a day than spend a day setting up postgis.
We find that MongoDB is great for providing the data our application needs at runtime. That being said, it's not as great when it comes time for us to analyze it. We want SQL and all the business intelligence adapters that come with it.
Yeah you could do a join with the aggregation framework in MongoDB 3.2 but it just isn't as fun.
I think you should try it again. I felt the same way when it was in the 1.5 version, but post 2.0 it is really easy. And the documentation has gotten so much better. It really is as simple as `sudo apt-get install postgis` and then `CREATE EXTENSION postgis;` in psql.
Moving data from A to B and applying some transformations on the way through seems like a straightforward engineering task. However, creating a system that is fault-tolerant, handles data source changes, surfaces errors in a meaningful way, requires little maintenance, etc. is hard. Getting to a level of abstraction where data scientists can build on top of it in a way that doesn't require development skills is harder.
I don't think most data engineers are mediocre or find their job boring. The expectation from management that ETL doesn't require significant effort is unrealistic, and it leads to a technology gap between developers and scientists that tends to be filled with ad-hoc scripting and poor processes.
Disclosure: I'm the founder of Etleap, where we're creating tools to make ETL better for data teams.
I do love reading an article that supports my contention that ETL is the Charlie Work [0] of software engineering.
0 - http://www.avclub.com/tvclub/its-always-sunny-philadelphia-c...
Though I've never formally done work with the "ETL" label, what I've seen of it reminds me of the work I used to do for client years ago where I'd take some CSV file (or what have you) and turn it into something that their FoxBase-based accounting system could use. It was boring grunt work (or shall we say, "Charlie work"), but I billed it at my usual rate and it paid the mortgage. I would never, ever wish to make a career of it, however. (And if my assessment of what someone knee-deep in ETL does all day is completely off base, I apologize.)
BTW, that may sound polemic, but it's not -- that's really how a lot of business types, academic researchers, and others think of nearly all programming-related work.
Anyway, this is a shame. I thought "data science" still had nice R&D flavored jobs. Colour me disillusioned.
Well, yes, this is an embarrassing fact for data scientists. A midrange stock MacBook can easily handle a database of everyone on Earth. In RAM. While you play a game without any dropped frames.
With all the hype around 'big data' and all that crap, many people seem to forget how far you can go with plain simple SQL when it's properly configured, and I'm not talking about complicated optimizations, just solid fundamentals. And if you can't do it yourself, no problem: things like Amazon RDS will help you.
Time and again, the tools prove that it's not.
To my recollection, I can count the number of publicly-known companies dealing with these datasets on two hands, if I'm being generous.
A 4 TB drive costs about $120, and you'll spend way more than that on software development and extra computers if you do distributed computing when you don't need to.
There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
This is like... your opinion. Some people find pushing around HTML / JS / CSS absolutely soul-crushing. Considering the lion's share of websites are ugly, unusable, and slow, does this mean that front-end engineering is a breeding ground of mediocrity, so server-side devs and CFOs should all be sharing in the pain?
Some people actually enjoy working with data, and don't find ETL and pipelining horrible to do at all. It is a different set of challenges, but calling people mediocre because of ETL is a non sequitur.
It doesn't sound like he's suggesting here that some types of technology are just generally horrible to work with or that it sucks to build ETL jobs in general. Maybe he could have done a better job defining what he meant by ETL engineer here, but he does qualify it in that quote with "ETL to produce data that you yourself never get to use or consume."
In general, roles that offer little areas of ownership do draw mediocre engineers.
I've seen careers made and broken based on whether people got to play thinker or doer.
This makes rewards for thinking very lopsided. However the problem is that actual credit for success REALLY belongs with the people who did the work.
This problem shows up at every scale in every organization. For example, there are a hundred people who want to be the business side of a startup for every person who wants to build the tech. Why? The business person gets to be the thinker; the developer does the work. And then the business person expects to become the CEO and get the bulk of the payout!
However these cases are the exception, not the rule. As a rule ideas are cheap, implementations are hard. And success has more to do with iterating on the implementation than the starting ideas.
But that transformation involved nearly destroying the company first and getting himself ousted. Then recreating MacOS as a UNIX platform for a market that did not want it, only for it to be finally integrated by a dying Apple as a Hail Mary play by both Jobs and Apple.
Jobs gets a lot of credit for the vision, and even the execution. But it could have turned out lots of different ways. If Pixar had not been successful (in no small part because of Jobs dumping millions and millions of his own money into it), one could imagine NeXT not being bought by Apple.
> However the problem is that actual credit for success REALLY belongs with the people who did the work.

> And I didn't say that the doer always deserves the credit.
Take my Steve Jobs example. Do you really think that he didn't work hard?
What matters is the technical skill to make the idea go from a fantasy to a reality semi-reminiscent of the idealized fantastic version, whether that skill is in business, accounting, programming, marketing, or whatever.
You can execute as hard as you want, but if you're headed in the wrong direction, the only good thing that's going to happen is that you're going to learn when ideas really are important.
The real pain is making a decision and expending resources on your challenging/risky idea. There's very little appetite for the responsibility and risk that come with big ideas (in a BigCo).
Got the ability to think up new ideas, sell them within an organisation, and get them executed (hello 'doer') in a way that provides value to that organisation? You're gold, and worth way more than the 'doer'.
No one can "have" this ability because it's transient. Unless you control the entire corporation (in which case you don't need to influence anyone else anyway), there is always someone who can come in and break your previously-perfect ability to "sell" your ideas inside the org. You're claiming that artful politicians (or, more blatantly, "good bullshit artists") are more valuable than skilled engineers. I don't believe that.
The valuable thing in a big company isn't the idea; it's the way the idea is executed. You'd be amazed what kind of nonsensical ideas will work if your team gets it just right.
The problem arises when someone gets into a position where they can think big thoughts without having to do any nitty gritty. Effectively, they end up jumping in right when the real producers have finished the actual work, and then coming up with some polish that makes it look like they came up with some interesting result.
This is not actually a way to get work done. It's a way to play politics.
Worse yet, it's actually completely detrimental to getting things done. When you have things split up between thinkers and doers, what do the incentives look like? It's quite simple. I may order some analysis, and I may not fully understand the nuances. But whatever happens, as a thinker I'll have to have something grandiose to say, and I'll need to keep the doers busy. That way if I don't find a real conclusion, it's everyone's fault. If I do find something, it's thanks to me.
Where I worked the people with the big plans couldn't code their way out of a paper bag. Ask them what Big-O is, they draw a blank. Ask them how their trading strategy will actually send orders to the exchange, they draw a blank. But ask them something that sounds like strategy, and they will feed you plenty of unsubstantiated BS.
My new venture is coders all the way down. Strategists who can actually use git without asking what it is, understand that algorithmic complexity actually matters, and so on. Coders who understand what the market is.
In the horizontal setup you describe (layer of thinkers on top of layer of doers), credit hits a barrier at the thinkers. The gradient isn't propagated.
In the vertical setup in the article (layer of thinker-doers), of course the backprop will be good because it is only one layer thick. You gain proper incentives, proper treatment of data on the whole pipeline. And the engineers can also concentrate on a purely orthogonal thing: writing tools.
But you lose the benefit of having the layer being able to focus on one thing. The author acknowledges those efficiencies (his word). It is hard to find people with a wide set of skills. Although in this case it is balanced, as now the engineers have gained specialization.
But I digress. My point was: humans in orgs are bad at backprop. Why share the credit at all? Organisations can be seen as neural networks/graphs, and they can lack proper backprop.
I'd love to see the results of some pagerank-like backprop. Every employee gets one base point. Every week, he is asked: "Who helped you the most in doing your job this week?" Sales would credit analysts, who would credit engineers, etc. Or sales would credit analyst-engineers, who would credit tool-writers, etc. It could go both ways: engineers could credit sales or analysts for writing well-thought-out problem descriptions.
Then you would run pagerank on it, and base every promotion, every salary increase on it. Information would flow well, and everybody has a clear direction (his gradient) of what he can do to shine.
Also, by injecting revenue at the sales layer in a certain period of time, you could identify who contributed the most to an increase in revenue.
Also, I posit that managers have a tiny view of what happens in a firm. They only get to see a fraction of interactions, while the brunt of what matters happens in the long tail of one-to-one interactions. Should you choose to promote the people with the highest PageRank, you would have a truly result-based, bottom-up org.
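For what it's worth, this PageRank-style credit idea is easy to prototype with plain power iteration (the toy org graph, the roles, and the damping factor here are all made up for illustration):

```python
# toy "who helped you the most this week" credit graph
credits = {
    "sales":     ["analyst"],
    "analyst":   ["engineer"],
    "engineer":  ["toolsmith"],
    "toolsmith": ["engineer"],
}
people = list(credits)
damping = 0.85
score = {p: 1.0 for p in people}  # every employee starts with one base point

for _ in range(50):  # power iteration until scores settle
    new = {p: 1 - damping for p in people}
    for giver, helpers in credits.items():
        share = damping * score[giver] / len(helpers)
        for helper in helpers:
            new[helper] += share
    score = new

for p in sorted(score, key=score.get, reverse=True):
    print(p, round(score[p], 2))
```

On this toy graph the credit flows downhill and pools at the engineer and the tool-writer, which is exactly the "gradient propagates past the thinkers" behavior the comment is after.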
As a hardware engineer, ETL is a NRTL that competes with UL and CSA. Oh, excuse me, Thomas Edison's Electrical Testing Labs is a Nationally Recognized Testing Laboratory that competes with Underwriters Laboratories and the Canadian Standards Association.
The common abbreviation for a four letter abbreviation is ETLA. That is, Extended Three Letter Abbreviation.
I'm not sure I get why writing ETL code for data you'll never consume is any more soul-sucking than, say, refactoring JS code for a website you couldn't begin to care about (and which will never be properly re-designed anyway); or even doing "thinker"-level work but for an industry you couldn't begin to care about (advertising), etc.
In other words, what most developers of whatever technical stripe do for a living.
A lot of people think that certain DBA/ETL/BI/similar work is boring and simply don't want to do it, and so don't learn to do it well. Which is fine by me: it means those of us who can do it well get paid good money when someone needs it.
The only problem with this theory in practise is that many also think such work is easy and free of complications; so they baulk at paying for people who genuinely can do it well, get less experienced people who say they can do it well but do it badly, and then judge the rest of us by that standard and assume database people are thick and can't do easy jobs properly...
There's a large, active community of engineers who specialize in data, whose job is to technologically enable data scientists the means to perform their analyses. I know these people exist because I'm one of them, and I work with them, and I've met them at meetups and conferences. I don't know why the author doesn't think these types of engineers exist. Not all of us who code want to work with the web.
> If you read the recruiting propaganda of data science and algorithm development departments in the valley, you might be convinced that the relationship between data scientists and engineers is highly collaborative, organic, and creative. Just like peas and carrots.
Almost every data team I've worked with is structured this way. I work daily with data scientists. I have a data scientist sitting to my right, two data scientists sitting across from me. Our teams are highly integrated and I can't imagine it working any other way. If the teams the author is familiar with don't operate in this manner, then I can see why he'd think the endeavor is hopeless.
I also disagree with the author's conclusion. The data scientist's job is to analyze and interpret data. They should not be spending any time thinking about how to get that data. They should not be concerned about where the data is coming from. The more time scientists have to spend thinking about ETL, the less time they have to do what their training is in, statistical analysis.
The data we use comes from relational databases and document stores operated by different departments, external APIs and third party services, SalesForce, server log files, etc. A stats PhD does not have the training to gather this data themselves.
In terms of a hybrid scientist/engineer role, I don't know many software engineers who are also good at stochastic calculus or ensemble learning. Likewise, I don't know many data scientists who are also comfortable writing cronjobs to retrieve external API data or have the ability to diagnose server problems.
"In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?" for his appointment to the H. C. Carver Professorship at the University of Michigan. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer science, usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists."
From the same article, a quote from Nate Silver:
"I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician."
If your skillset differs from a statistician, then calling yourself a data scientist is not going to be a differentiating title in common parlance.
The notion that a data scientist's only job is to "write a statistical model" and then it's someone else's problem to run it in a distributed environment only exacerbates the problem and lowers DS code quality.
Full disclosure: my company Pachyderm is trying to solve exactly the problem Jeff is talking about in the post. We've built a data processing platform on top of the container ecosystem. Basically, the data scientist has complete control over the runtime environment for their analysis, since everything is bundled into a container. It scales to work for actual "big" data, but it's also great for small teams that don't have massive infrastructure resources.
Granted, yes, lots of solutions don't exactly require a Hadoop cluster with thousands of nodes, but the article takes a pretty gross and mean-spirited dig at "mediocre engineers" a number of times. It would be nice if we didn't treat people who don't work at Amazon/Google/Twitter/LinkedIn as lesser beings because they find their jobs at a probably-doesn't-have-Big-Data company.
(Does Stitch Fix have Big Data? If the answer is no, are their "data platform engineers" mediocre?)
A better idea imho is to have small cross-functional teams where scientists and engineers work together to build only what they need, with short iteration cycles.
If everyone involved doesn't have at least a broad perspective on the end-to-end purpose of what they're working on, they're probably going to build the wrong thing.
"Big Data" is like sex in high school. Everyone talks about it but few people really have lots of it and some just don't have any.
For many startups, even if they audit everything they won't have petabytes of data.
Because that's what it is:
- It is attempting to question, critique, override, everyday decisions made by the management (including the CEO) based on available data.
- It is doing that with maximal knowledge of the whole organization. That means all the records, finances, secrets, and what not have to be divulged to the data-science team (which is in itself an insurmountable challenge, i.e., convincing the management to allow full data access; think emails, chat logs, meeting minutes of CEOs, VPs, etc.).
This will make the management go, "so let me get this straight, I authorize you access to data of the whole organization, and you come up with a conclusion (some of the times at least) that I'm full of it?"
I highly doubt any organization would be up for this kind of internal disruption, even if that means more success for the company.
Very few people are interested in making data-driven decisions. They want an employee (subordinate) who will prove that the decision they've made or are planning to make, is correct. Anything else is, as you say, very internally disruptive.
Being a data scientist or data analyst at a startup is (for the most part) a completely miserable existence. You are relegated to doing interesting things that are usually discarded. It can make you feel like your job is pointless.
In the end, one either makes the decision to be (at best) useless, or (at worst) a puppet. That, or you quit.
Thank you so much for your comment - it's refreshing to see I'm not alone in feeling like this.
This is one of the toughest parts of building a scalable organization (with or without big data). Getting past the idea of efficiency and being OK with redundancy.
This means allowing two teams to both build a common feature they might need, rather than establishing a dependency. It means making one team's job broader even if it overlaps with another team's.
I find it interesting that we are perfectly willing to have redundancy on the software side (load balancing, slaves, etc) but not on the development side.
This just reads like a puff piece for another valley startup by some guy who's better than you. Oh, and here's how we do it, you should try doing it this way too, because we think it's totes the best.
The distinction between "Data Scientists" and "Engineers" is bogus, and the point about whether your data is "Big" is a red herring.
In reality, there should not be any distinctions between "scientists" and "engineers", you must strive to be both a "doer" and a "thinker". You can't think without doing, and can't do without thinking.
If you're in this field, and consider yourself an "engineer" but your math sucks, go read up on all you can about mathematics and statistics, just like you did back when you were learning about programming, operating systems and networking.
If you consider yourself a "data scientist" but don't know anything other than R and basic Python, go study programming and operating systems and networking, like you studied math at some point.
Somewhere on youtube I remember Dr. Donald Knuth (who is definitely an excellent programmer/engineer/computer scientist, arguably one of the best the world has known) saying that he considers himself primarily a mathematician.
Or, if you've read (or at least heard of) "the dragon book", you might find it interesting and inspiring that one of its main authors Dr. Jeffrey Ullman (whom I'd place in the same league as Knuth) went on to write another excellent (and available freely online, BTW) book "Mining of Massive Datasets", which IMHO is the one fundamental "big data" book out there.
So, data scientists: go learn some programming languages like C, study UNIX, and maybe read "The Art of Computer Programming". Engineers: go read http://www.mmds.org/.
Then you'll all get along.
Furthermore, people have these roles precisely because of their talents and their choices. As a Data Scientist, most of what I do is read ML literature, build ML models and write technical reports in Tex on what worked and what didn't. The skills to do this were acquired over many painful years of graduate work in math, statistics, ML. To suggest somebody can just read their way through that material is quite laudable, but you are underestimating the difficulty by orders of magnitude. Essentially, you are suggesting that all of the graduate study and mentoring and homeworks and assignments and all that went into the learning process be condensed into a book which one can just plow through and become a DS. Well, good luck with that.
By the same token, expecting me to have the same level of efficiency and passion as a data engineer when faced with Hadoop/Oozie/Presto/Pig/Kafka or what have you is silly. I don't care for these technologies and how to work them. I know it takes a really long time to get good at them - that's why the engineers get paid a lot of money and also get yelled at when the ETL job fails. Because it's a set of seriously valuable skills that were no doubt acquired over lots of time and practice. It's not like I can buy a book on these things, just read through them, and suddenly I'm a DE! I have neither the interest nor the time to do that.
>>the distinction between data scientists and data engineers is bogus
Not at all! DS and DE professionals do distinctly different work, and conflating everything under one umbrella buys you nothing.
That brush stroke is too broad. Smart employers will have some of the money they're paying an employee go towards learning... and if they're really smart, they can even measure their ROI. That leads to less turnover and better long-term vision for their projects.
I get your point, but give someone passionate enough 6 months in a new work environment, and with a decent mentor, and you might find they become surprisingly adept at it. The hard part is hiring for the capability to learn (fast).
And I'd argue that this is precisely the problem. I personally care about all of it.
This assumes availability of time. Obviously, given enough time, people could develop both top-tier engineering and DS skillsets!
Of course, if lots of free time were common we'd all be full-stack developers who also field sales calls and work on product strategy etc. etc.
/u/dxbydt commented on this very well, so I'll only reinforce the point that not everyone is Donald Knuth. If the distinction between DS and Engineer is blurred in your specific instance, and you're capable of Knuth-ian levels of work in both, you are almost certainly underpaid and need to lead a team or start a company yourself, stat, since a top-tier combination of those skills is exceedingly rare.
> "Rather than try to emulate the structure of well-known companies (who made the transition from BI to DS), we need to innovate and evolve the model! No more trying to design faster horses…
> A couple years ago, I moved to Stitch Fix for just that very reason. At Stitch Fix, we strive to be Best in the World at the algorithms and analytics we produce. We strive to lead the business with our output rather than to inform it."
I find this article rather peculiar. At the start, you'd be forgiven for thinking this was an article about a company looking to find a solution to a problem, but as the article progresses it's clearer that they're selling themselves as the solution to the problem they outlined.
In other words, they start off looking like a customer, but only to set up the premise required to sell the solution to the problem their company supposedly has/had. Turned me off from taking the product seriously.
I'd agree, it feels like it invalidates the claim about getting credit for 'being a thinker'.
> If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”.
And then comes this line:
>At Stitch Fix, we strive to be Best in the World at the algorithms and analytics we produce.
Without further justification, why does Stitch Fix, a subscription shopping service, need to be the "Best in the World" at algorithms and analytics? Do they have harder problems than Google, the Centers for Disease Control, or NASA?
Unless they have justification for that, it seems a bit ironic given the article's ire for over-engineering.
Another point I'd like to make is that not everyone hates ETL and pipeline management. I happen to like it. It's rewarding to stand up reliable self-healing data pipelines and ETLs.
I posit this outcome is absolutely necessary for any data science project to be worthwhile in any organization.
In the case where the project produces a report and goes no further, you still need the data and code for reproducibility, one of the main principles of the scientific method.
In the case where the project gets handed off to engineers to re-implement, reproducibility is even more critical, since the engineers' best effort to reproduce the code will almost certainly not be successful the first time, and you will need to validate many versions of the production model. Doing this by hand even once is wasteful; doing so many times is tragically so.
In the case where the data scientists can produce a service worthy of production use, kudos!! But understand the caveat that in truly big data or big compute flows, this outcome remains highly unlikely.
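The reproducibility point above is cheap to act on. A minimal sketch (the function and field names are my own invention, not any particular team's convention) that fingerprints the input data and the environment alongside an analysis run:

```python
import hashlib
import platform
import sys

def fingerprint_run(data_path: str, params: dict) -> dict:
    """Record enough context to reproduce (or audit) an analysis run:
    a checksum of the exact input data, the parameters used, and the
    interpreter/platform versions."""
    sha = hashlib.sha256()
    with open(data_path, "rb") as f:
        # Hash in 1 MB chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return {
        "data_sha256": sha.hexdigest(),
        "params": params,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

# Store the returned manifest next to the report, e.g. as JSON:
# manifest = fingerprint_run("training_data.csv", {"model": "gbm", "seed": 42})
```

When the engineers later re-implement the model, the manifest pins down exactly which bytes and settings their version must be validated against.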
1. SaaS services have APIs
2. Your database is hosted in the cloud
3. You use a standard SQL data warehouse that is also hosted in the cloud.
ETL from (1, 2) to (3) is a completely standard problem, and you should be able to buy a fully-automated solution. My company (Fivetran) does this as a service. We've replaced lots of homebrew data pipelines built by our customers, and we always see the same issues:
* Homebrew ETL pipelines use fancy big-data tech like Hadoop and Kafka in places where they have no relevance, like syncing a 20 GB Salesforce instance.
* Homebrew ETL pipelines don't deal with all the dark corners of the data sources, such as: what happens when someone adds a new custom column? What happens when your MySQL read replica fails over and a new binlog starts? Etc.
The lesson being, don't do this yourself.
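For a flavor of those dark corners, here is a deliberately naive sketch of the custom-column case (table name, column names, and the SQLite destination are hypothetical; a real connector also handles type inference, deletes, identifier quoting, and much more):

```python
import sqlite3

def sync_rows(conn, table, rows):
    """Naive incremental load that evolves the destination schema when
    the source grows a new column. All columns are typed TEXT here for
    simplicity; identifiers are not sanitized, so this is illustration
    only, not production code."""
    cur = conn.execute(f"PRAGMA table_info({table})")
    known = {r[1] for r in cur.fetchall()}  # r[1] is the column name
    for row in rows:
        for col in row.keys() - known:      # schema drift detected
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")
            known.add(col)
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                     list(row.values()))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT)")
sync_rows(conn, "accounts", [{"id": "1"}])
# Someone adds a custom column at the source; the sync adapts
# instead of crashing mid-load.
sync_rows(conn, "accounts", [{"id": "2", "region": "emea"}])
```

Even this one corner takes real code, and it is among the simplest; binlog failover, soft deletes, and API pagination quirks are each worse.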
I worked at a company where the distinction between the roles was emphasized by physical separation, presumably so that they wouldn't interfere with each other's day-to-day duties. The downside is that each group starts caring only about their particular thing, feeling that they are the ones who really keep the place running and the other groups are bozos doing their jobs incredibly poorly.
Sure, let the PowerCenter users "write the ETL", but then they need to get the heck out of the way and let the big boys actually build the warehouse.
(Corollary to that rule of thumb: if your data fits on a hard drive, all the "big data" tools you need are shell scripts and SQLite.)
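As a sketch of that corollary, using only Python's stdlib `sqlite3` and `csv` (file, table, and column names here are made up for illustration):

```python
import csv
import sqlite3

# Load a (possibly multi-GB) CSV into SQLite, then do the analysis in SQL.
conn = sqlite3.connect(":memory:")  # point this at a file for real datasets
conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")

def load_csv(conn, path):
    """Stream a two-column CSV into the events table without loading
    the whole file into memory."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO events VALUES (?, ?)", reader)

# The "cluster job" is then a single query:
# top = conn.execute("""
#     SELECT user_id, SUM(amount) AS total
#     FROM events GROUP BY user_id ORDER BY total DESC LIMIT 10
# """).fetchall()
```

For datasets in the low terabytes on a single disk, this pattern plus appropriate indexes is often all the infrastructure required.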
Personally, for smaller projects I've used Kiba or transforms in pgloader.
Really one of the best things I've discovered in the past 5 years. Saves so much work compared to doing stuff with sed, awk, and the like.
I'm genuinely curious - to analyse data effectively, is there a baseline of statistical understanding you need to have? If so, what is it?
Sorry, I could not get any further than the point where the sales pitch kicked in, so apologies if there was anything new after that.
Personally, I think there is nothing wrong with being average... people with average skills have built great things.
Mediocre is just a mean way to say average.
> Data scientists ... aka “the thinkers”
> Data engineers ... aka "the doers"
> Infrastructure engineers ... aka "the plumbers"
The author is clearly not an infrastructure engineer.
"Big" data on the command line.
That's the fundamental action of computation. Read, compute, write.
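That read/compute/write loop, stripped to a toy Python version (a word count, the classic command-line example; the function name is mine):

```python
from collections import Counter

def word_counts(lines):
    """Read lines, compute a frequency table, return it sorted by
    descending count: the whole 'big data' pattern in miniature."""
    counts = Counter()
    for line in lines:                # read
        counts.update(line.split())   # compute
    return counts.most_common()       # write (caller prints or stores)

# Used as a stdin filter it is a one-liner away from `sort | uniq -c`:
# for word, n in word_counts(sys.stdin): print(n, word)
```

Everything from an awk one-liner to a Spark job is this same shape; the only question is how much machinery the read step genuinely needs.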
This is a stupid article.