

Surviving Data Science at the Speed of Hype - mistermcgruff
http://www.john-foreman.com/blog/surviving-data-science-at-the-speed-of-hype

======
mwetzler
I'm a data scientist who works with companies on their analytics problems
every day. This article is spot on.

By far the biggest factor influencing the success of an analytics project is
that the company has a _human_ who has the time and inclination to think and
reason about the business. They figure out what questions are important to ask
and then go look at the data to see what they find. Collecting the data is the
easy part. There is no analytics product that asks & answers your most
important business questions for you.

I enjoyed the jab at predictive modeling; it's almost comical how many
companies dream about predictive when they haven't yet got basic tracking in
place for what's _already_ happening in their business.

Love the post, thanks for sharing.

~~~
mathattack
Exactly - the human with domain knowledge is vital. I get scared when I see
people trump up black boxes. Black boxes don't help with "Which questions
should we be asking?" and "What are the missing variables?"

~~~
dagw
Domain knowledge is also really useful for spotting bugs. I recently worked on
a project where I had very little domain knowledge. So anyway I wrote my code,
ran my tests, crunched the data, double checked that all the results seemed
reasonable, produced the pretty pictures and everything looked spot on.
However once I started showing the results to a domain expert it took him 30
seconds to point to one of the outputs and go "that's impossible, you have a
bug in your code". Sure enough I did. As a generalist the results looked fine
to me (right size, seemingly reasonable relationship to surrounding values
etc.), but to a domain expert the error stuck out like a sore thumb.

~~~
mathattack
True. The ability to sanity check is very important.

------
rm999
Good article. The author is completely correct that people often underestimate
the fragility of predictive models, and that summary analyses (which I group
into a more general concept I call "insights") are simpler and more robust. I
think the article is a little harsh towards predictive models, though.

The primary difference between a model and an insight is that insights require
a human to process - anything more automatic is a model. Insights are easy to
implement and are great for finding patterns and anomalies (the human mind is
basically designed to pick these out). But the human element makes insights
less scalable with significantly higher latency. For some problems these are
unacceptable tradeoffs, and this has little to do with how stable a company's
environment is. It's purely a product/strategy question, and about
understanding all the tradeoffs.

~~~
LiweiZ
Good to know the model/insight distinction. You've drawn a clearer line for
me; I always struggled to think it through, even though I was quite aware that
they are quite different.

------
LargeWu
I once worked at a major big box retailer where somebody came up with a
visualization that purported to show, for a given product category, purchases
made in other categories. One surprising purchase correlation was customers
bought TV stands after buying DVD players. So, this nugget was trumpeted at
countless meetings about the value of big data analytics. Multiple marketing
campaigns were designed around this discovery.

Of course, that made no sense, so I checked a little deeper. You know what
else people also buy when they buy DVD players? TV's. The DVD/furniture
relationship was an artifact of the high degree of correlation between TV's
and DVD players, which the visualization tool failed to account for.

I brought this up immediately, but received tepid response. Of course, months
later, I was still hearing about DVD players and furniture. It had become part
of the institutional lore, and no facts were going to replace that.
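The artifact is easy to reproduce with made-up numbers: if TV purchases drive both DVD-player and TV-stand purchases, DVD players and stands look correlated until you condition on the TV. All probabilities below are invented purely for illustration:

```python
# Toy reproduction of the confounder (all probabilities invented):
# TV purchases drive both DVD-player and TV-stand purchases, so DVD
# players and stands look correlated until you condition on the TV.
from random import random, seed

seed(42)
customers = []
for _ in range(100_000):
    tv = random() < 0.3                       # some customers buy a TV
    dvd = random() < (0.8 if tv else 0.05)    # DVD players mostly follow TVs
    stand = random() < (0.6 if tv else 0.02)  # stands follow TVs, not DVDs
    customers.append((tv, dvd, stand))

def p_stand(rows):
    """Fraction of customers in `rows` who bought a TV stand."""
    return sum(stand for _, _, stand in rows) / len(rows)

dvd_buyers = [c for c in customers if c[1]]
print("P(stand)              ~", round(p_stand(customers), 2))   # base rate
print("P(stand | dvd)        ~", round(p_stand(dvd_buyers), 2))  # looks like a find
# Condition on the TV, and the DVD player adds (almost) nothing:
print("P(stand | tv, dvd)    ~", round(p_stand([c for c in dvd_buyers if c[0]]), 2))
print("P(stand | tv, no dvd) ~", round(p_stand([c for c in customers if c[0] and not c[1]]), 2))
```

With these numbers the marginal P(stand | dvd) sits far above the base rate, while the two TV-conditioned rates come out nearly identical - exactly the pattern the visualization tool failed to account for.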

~~~
tmarthal
Whenever you say "that made no sense", I think that you are using too much
bias and not giving enough credit to what the data is telling you.

If you look at the most "controversial" data science paper from 2013, in
which a study correlated intelligence with Liking the Facebook pages "Curly
Fries" and "Thunderstorms" (here is a summary:
[http://www.wired.com/2013/03/facebook-like-research/](http://www.wired.com/2013/03/facebook-like-research/)),
there were a lot of critics saying that there was no causation, that the
correlation was unfounded, etc.

Of course, you would say the study "makes no sense": intelligence can't be
predicted by Facebook Likes, there is no correlation there, etc. But why not?
If you read the paper
([http://www.pnas.org/content/110/15/5802.full.pdf](http://www.pnas.org/content/110/15/5802.full.pdf)),
their logic is sound. Are the marketing campaigns the company bought based on
the TV Stand<>DVD Player connection any different from other marketing
campaigns? Facebook bases all of its ad display on similar data analysis, and
it seems to be working for them.

Note: there is now the not-so-hidden machine learning feedback loop (explained
better here:
[http://www.john-foreman.com/blog/the-perilous-world-of-machi...](http://www.john-foreman.com/blog/the-perilous-world-of-machine-learning-for-fun-and-profit-pipeline-jungles-and-hidden-feedback-loops)),
where people Like the 'Curly Fries' and 'Thunderstorms' pages because of the
research.

~~~
joe_the_user
_Whenever you say "that made no sense", I think that you are using too much
bias and not giving enough credit to what the data is telling you._

What? If a data scientist sees something that seems illogical, there is no
reason not to investigate and see if he/she can find a more logical
explanation. Sure, if the effect seems real but unexplained, you can accept
and use it, but _advocating_ a kind of big data mysticism ("don't
investigate, just accept") seems to be buying into the senseless hype. And if
you read the post, you'll notice the parent actually discovered the
association was just an artifact of an easily explained one.

And, no, there's not much reason for companies to advertise just a TV stand
and a DVD player. Common sense tells you what the data actually says: those
two items, _by themselves_, aren't and weren't the opportunity many people
were dreaming about.

------
mfdupuis
Very good post. Refreshing.

I think that the hype and buzzwords around Big Data and data science cause
more than just bad business decisions. I believe they are also damaging the
industry and creating a larger sense of disillusionment (I'm mostly thinking
of "deep learning"). Not sure what this means for data science in the long
term though, just thinking out loud.

I'll also add that I frequently see sledgehammers being used to hang picture
frames. By that I mean using huge clusters to run algorithms that would run
just fine in Tableau, Excel, etc.

~~~
IndianAstronaut
I had a conversation with a 'big data consultant' some time back. He mentioned
one of his clients needed to set up a Hadoop cluster and wanted me to work
with him. I said, 'why do they need a cluster, they probably don't have that
much data'. His response was 'if a client wants to jump off a building, you
don't say don't do it, you ask them what floor'.

------
threeseed
Firstly, someone needs to explain to me why smart people get worked up over
vendor marketing. Since the beginning of time it has been about exaggerated
claims and bold, specific numbers (e.g. "80% better"), and it has always
targeted those who make purchasing decisions. Do people really expect vendors
to say, "Hey, our product is great, but you probably don't need it. Maybe buy
it anyway?"

Secondly, the author seems to have conflated two different parts of the data
science picture. Yes, great analysts who do amazing work are important. But
that relies on (a) having data available and (b) having it in the right
format. For those of us doing significant-volume ingestion, this is not
trivial. Hadoop is painfully slow, and end-to-end data science tooling
overall is slow, fragmented and incomplete. Some of us do need vendors to be
bold and come up with new technologies/approaches.

And the point about IBM is just stupid. Did you ever think that maybe Watson
DID help them slow their sales losses? Weird that a data scientist would make
predictions based on inadequate data.

~~~
sgt101
>Secondly, the author seems to have conflated two different parts of the data
science picture. Yes, great analysts who do amazing work are important. But
that relies on (a) having data available and (b) having it in the right
format. For those of us doing significant-volume ingestion, this is not
trivial. Hadoop is painfully slow, and end-to-end data science tooling
overall is slow, fragmented and incomplete. Some of us do need vendors to be
bold and come up with new technologies/approaches.

I think you are doing Hadoop wrong, or confusing current technical reality
with "Hadoop". Hadoop is very cheap, and it allows _all_ the data to be in
one place. This is huge for large-scale data science, because in the past we
had to pull data across networks, fiddle, sample and chuck. The business case
for a single enterprise data warehouse was difficult to make (because of the
cost), and maintaining one when a CIO with vision did make the case was
impossible: it took about 10 minutes for some genius to start running a
tactical operational system on it, which was followed (in about 10 minutes
more) by a howling call of rage from an MD about why his operational system
was locked up due to someone running stupid queries, which was followed by a
lockdown on queries in the warehouse.

If your Hadoop cluster is slow, then: 1) move to CDH5 and use Spark, use
Impala, upgrade to 40GbE throughout, and make sure you have balance in your
architecture - and for god's sake do not go telling people Hadoop is slow if
you are using AWS; 2) brew your own cluster with GPUs and the various crazy
infrastructures supporting said architecture (good luck); or 3) go talk to an
FPGA vendor or a supercomputer vendor and up-gun (but you must be rich) -
Exalytics or Yark might work for you.

>And the point about IBM is just stupid. Did you ever think that maybe Watson
DID help them slow their sales losses? Weird that a data scientist would make
predictions based on inadequate data.

Every IBM rep I have met for the last 3 years has told me that Watson will
deal with churn and provide better offer management. I have repeatedly tried
to get POCs and have always, always failed. Then we saw the Watson tools on
Bluecloud, and all our suspicions of what Watson is and was were confirmed.
Kudos to the Watson team: they spotted that Jeopardy questions can be
rewritten as search queries, and that search responses can be rewritten as
Jeopardy answers.

BTW, did anyone get far with DeepDive?

~~~
pwang
> If your hadoop cluster is slow...

You're right, there is a lot of misinformation and hope about Hadoop out
there, and I think there is a lot of value in Hadoop as a cheap data
integration archive. But I think the parent poster's point still stands: a
Hadoop-based infrastructure currently has a lot of impedance mismatch for
full end-to-end advanced analytics involving stats, linear algebra, or graph
work in native, non-Java code.

I would love to see a TCO analysis on Hadoop+analytics versus buying a more
traditional "supercomputer" stack with infiniband or one of the nifty Cray/SGI
NUMA systems. Current data warehouse and BI folks are fixated on cost per PB
of storage, and Hadoop is very cheap based on that single metric. I suspect
that if enough human factors and accuracy/agility of modeling results are
considered, the latter may be quite cost effective. It's just that the "big
iron" vendors are still in the middle of retooling their marketing for the
BI/DW/ETL crowd. When they finally figure it out, it's going to be a
bloodbath.

For instance, SGI UVs can give me 24TB-64TB of RAM in a single "system". I
still have to make sure I do multithreading/multiprocessing well, but the
interconnects are lower latency than 40GbE.
[https://www.sgi.com/products/servers/uv/](https://www.sgi.com/products/servers/uv/)

HP ProLiants now can fit 48-60 cores and 6TB in a single 4U system:
[http://www8.hp.com/us/en/products/servers/proliant-servers.h...](http://www8.hp.com/us/en/products/servers/proliant-servers.html?compURI=1479235#.VMz3SWTF-Yo)

Buying a few of these scale-up systems is a LOT cheaper than hundreds of
Hadoop nodes sitting around maxing out I/O while their expensive Xeons run at
10% CPU load. Especially given that you can hire anyone out of
science/engineering grad school and they can program these scale-up systems,
whereas writing a bunch of Java MR jobs for Hadoop is quite foreign to them.

~~~
sgt101
I think that the disruptions are:

- Twill with everything (incl. unikernels) under YARN (or Mesos)

- The Machine (if it's real)

- Datacentre-scale integration (things like 500 different processors in each
U, powered up by the fabric manager to efficiently meet the workload at hand)

I think any vendor who wants to compete with the open-source/commodity world
will need to do as well as or better than the above to get anywhere!

Programming MR is all done - I wrote MR in Java from 2008 to 2012, and never
will again: it's RDDs, transformations and actions now, and it's dead easy
(MR is too, but the API wasn't)!
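The ergonomic difference can be sketched even in plain Python - a toy word count, not the real Hadoop or Spark APIs:

```python
# Toy word count contrasting explicit map/shuffle/reduce phases with a
# chained, pipeline style. Plain Python, not actual Hadoop/Spark APIs.
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

def mr_wordcount(lines):
    """Spelled-out MR: a map phase, a shuffle phase, a reduce phase."""
    mapped = [(word, 1) for line in lines for word in line.split()]  # map
    shuffled = defaultdict(list)                                     # shuffle
    for key, value in mapped:
        shuffled[key].append(value)
    return {key: sum(values) for key, values in shuffled.items()}    # reduce

def pipeline_wordcount(lines):
    """The same job as one short chain of transformations."""
    counts = defaultdict(int)
    for word in chain.from_iterable(line.split() for line in lines):
        counts[word] += 1
    return dict(counts)

assert mr_wordcount(lines) == pipeline_wordcount(lines)
print(mr_wordcount(lines))
```

Both compute the same counts; the point is only how much ceremony the phase-by-phase version forces on you for a job this simple.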

------
tel
This is perhaps the first halfway sensible post on "big data" or "analytics"
that I've seen hit the front page of HN in a _long_ time.

------
qthrul
Timely. I did a "big data" presentation yesterday and hoped to convey how
important it was to read original source materials to form opinions and avoid
the hype.

Since slide decks get busy I moved my bibliography of links to a gist. So,
while it didn't factor into my presentation I've now added this blog post. :-)

[https://gist.github.com/JayCuthrell/8bcd9597d37a8602c639](https://gist.github.com/JayCuthrell/8bcd9597d37a8602c639)

------
dchuk
I just love the way this guy writes. His book, Data Smart, is hands down the
most approachable intro to data science you could possibly read if you don't
have a sufficient math background to dive into full-on textbooks. And it's
hilarious too.

~~~
strictnein
I have mixed feelings about that book. I enjoyed his writing style and humor,
but the amount of beating on Excel he has to do to manipulate all that data
hurts my head. I kept thinking of how much easier it would be to do with code.

Maybe it's just not a good book for developers? _shrug_ I would love to have a
copy of that book that doesn't use Excel.

~~~
dchuk
I share your same yearning for a code equivalent of the book. However, I think
writing the book using only Excel was a smart move on his part, simply
because:

1) Excel is "visual" in the sense that you can watch the data change as you
tweak things. There is no command line or program to execute, it's all
happening live

2) For programmers, there's no "well I'm a python guy and this book is written
in Java so it's not for me." None of us as coders really depend on Excel for
writing code (basically) so it's kind of a way to take the technology
decisions out of the equation. It's just the techniques.

All that being said, it's not trivial to port the logic of a spreadsheet over
to code, and I think if anything that would make a great followup book.

~~~
strictnein
I agree with both your points. For #2, the only decent option may be to make
the book more focused on R, instead of just chapter 10.

------
kiyoto
First of all, John Foreman is great. Read his book "Data Smart" and
[http://analyticsmadeskeezy.com/blog/](http://analyticsmadeskeezy.com/blog/)

(disclaimer: I am in no way tied to John Foreman. Also, I work at a company
that provides a data processing/collaboration SaaS...for big data!
[http://www.treasuredata.com](http://www.treasuredata.com))

A quote from the OP:

>If your business is currently too chaotic to support a complex model, don't
build one. Focus on providing solid, simple analysis until an opportunity
arises that is revenue-important enough and stable enough to merit the type of
investment a full-fledged data science modeling effort requires.

This is consistent with what we see in our customers. The use cases we see
most for processing big data boil down to generating reports.

Generating reports may sound really prosaic, but as I learned from our
customers, most organizations are very, very far from providing access to
their data in a cogent, accessible manner. Just to generate
reports/summaries/basic descriptive statistics, incredibly complex enterprise
architectures have been proposed, built by a cadre of enterprise architects
and deployed with obscenely high maintenance subscription fees billed by
various vendors. That's the reality at many companies.
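For what it's worth, the kind of report in question is usually nothing more than group-and-aggregate. A minimal sketch (field names and numbers are made up):

```python
# Minimal daily report from raw event rows: event count, unique users,
# and revenue per day. Field names and values are made up.
from collections import defaultdict

events = [
    {"day": "2015-01-01", "user": "a", "revenue": 9.99},
    {"day": "2015-01-01", "user": "b", "revenue": 0.0},
    {"day": "2015-01-02", "user": "a", "revenue": 4.99},
]

report = defaultdict(lambda: {"events": 0, "revenue": 0.0, "users": set()})
for e in events:
    row = report[e["day"]]
    row["events"] += 1
    row["revenue"] += e["revenue"]
    row["users"].add(e["user"])

for day in sorted(report):
    r = report[day]
    print(day, "events:", r["events"], "users:", len(r["users"]),
          "revenue:", round(r["revenue"], 2))
```

The hard part is rarely this aggregation; it's getting clean, trustworthy rows into `events` in the first place.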

As bad and confusing as the buzzword "big data" is, one good byproduct is
that it has forced slow-moving enterprises to rethink their data
collection/storage/management/reporting systems.

Finally, I am starting to see folks do meaningful predictive modelling on top
of large-ish data (in the order of terabytes). Some of them are our customers
at Treasure Data, some aren't, but they are definitely not "build[ing] a
clustering algorithm that leverages storm and the Twitter API" but actually
doing the hard work of thinking through how (or if) the data they collect is
meaningful and useful.

And that's a good thing.

------
tbjohns
An important distinction is that the author's experience is mostly with the
businessy side of data science, and his jab is at people who use buzzword
tools that add complexity rather than simple solutions.

In defense of the hype, many tools like storm are worth their hype many times
over when used for the right application.

The author makes this distinction, but it can easily be lost in the post.

------
Fomite
I'm a working scientist, rather than someone in the corporate world, but this
rings true for me as well. During a recent outbreak, we've had very fast
turnaround demands, and while we've done great work in that time, I think some
of our best ideas have come from being able to slow the hell down and think.

------
muser
There's a lot written in the credit scoring space that I think other
industries could learn from - especially when it comes to calibration of
models. It doesn't matter if the prediction is weak, as long as it is
consistent over time periods. Banks rely on this consistency to ensure they
are provisioning properly for losses.
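A toy version of that calibration check (numbers invented, field names hypothetical): per score band and time period, compare the model's mean predicted default probability with the observed default rate. A weak but stable model keeps these gaps small in every period, which is what lets a bank provision from it:

```python
# Hypothetical calibration check with invented numbers: per score band
# and time period, compare mean predicted default probability ("pd_hat")
# against the observed default rate. Small, stable gaps = usable model.
from collections import defaultdict

loans = (  # (period, score_band, predicted_pd, defaulted)
    [("2013", "A", 0.02, 0)] * 50 + [("2013", "A", 0.02, 1)] * 1 +
    [("2013", "B", 0.10, 0)] * 45 + [("2013", "B", 0.10, 1)] * 5 +
    [("2014", "A", 0.02, 0)] * 49 + [("2014", "A", 0.02, 1)] * 1 +
    [("2014", "B", 0.10, 0)] * 44 + [("2014", "B", 0.10, 1)] * 6
)

cells = defaultdict(list)
for period, band, pd_hat, defaulted in loans:
    cells[(period, band)].append((pd_hat, defaulted))

for (period, band), rows in sorted(cells.items()):
    predicted = sum(p for p, _ in rows) / len(rows)
    observed = sum(d for _, d in rows) / len(rows)
    print(f"{period} band {band}: predicted={predicted:.2f} observed={observed:.2f}")
```

With these made-up numbers the predicted and observed rates track each other in both periods, even though a 10% default band is hardly a sharp prediction - which is exactly the "weak but consistent" property described above.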

------
maxxxxx
I view IBM or especially HP jumping on a bandwagon as a strong negative signal
for that technology.

------
whatsgood
All this will change with the Internet of Things. Once every "thing" is
networked, these optimization platforms won't need to wait for some human to
input info about altered environments; the platform will "sense" it.

------
nartz
Amen

------
trhway
>And that is not primarily a tool problem.

>A lot of vendors want to cast the problem as a technological one. That if
only you had the right tools then your analytics could stay ahead of the
changing business in time for your data to inform the change rather than lag
behind it.

many people, like the author, just don't get it, and that's fine - the same
way people didn't get search before Google.

>But how do I feel good about my graduate degree if all I'm doing is pulling a
median?

the graduate degree is what allows one to receive $N×10^5/year (for a
respectable value of N) for that pulling of a median

>If your goal is to positively impact the business, not to build a clustering
algorithm that leverages storm and the Twitter API, you'll be OK.

on the other hand, if your goal is power (OK, OK) instead of just OK, then
the clustering algorithm/Storm/Twitter is the way to go.

