
Too many tools, not enough carpenters - Earlbus
https://ckmadvisors.com/b/160212.html
======
greggyb
I agree with the article in broad strokes, but I think it focuses a bit too
much on the data science portion. I work for a BI consultancy. There are so
many enterprises where the measure of success in data analysis projects is
being able to perform arithmetic accurately and in a timely manner.

Data science is a pipe dream for many companies currently. Having up-to-date
sales figures that can be sliced by dimensions as simple as product hierarchy,
customer, and region is a realistic goal for many of our clients.

I am not trying to discount data science. We have a data science practice that
does some cool things, but it's so much smaller than the BI practice, because
there's currently so much more opportunity for doing basic ETL correctly and
simple reporting and dashboarding.

We've had multiple clients looking for "real-time access to data", which
actually meant "automatic weekly refresh would make our lives so much easier.
Right now we have to wait on $overworked_analyst to get this out monthly, and
depending on workload that can lag by a week or two after the end of the
month." These are not exact quotes, but entirely accurate paraphrases.

This is the reality for so many organizations that I think sometimes falls out
of context in an HN-like environment.

~~~
IanCal
A couple of partially relevant things:

Something that drives me to really ask what the requirements are is an
experience building something for my Dad: a tool to do deconvolution of MS
spectra, which he said needed to run "quickly". I got versions down to an
hour, then 15 minutes, and bottomed out at about 5 minutes for a decent
result. After a while I talked to him about the timings and he said that
"quickly" meant "under a day". A failure on my part to clarify what is a
really fuzzy term. Linked to that, at the time they had a process which
involved someone frequently _manually_ looking up an item in a ~3-5k list.

I'm a data scientist, and I think that the amazing tools available now can
really cloud the problems that are faced by many organisations. Sure, we can
use word-sense vectors to create a deep neural net to do a thing, but 4% of
your data has a country of "NONE". Or you've got dates that don't make sense
(1000 years into the future), or a suspicious amount at 1/1/1970, 1/1/1900 and
1/1/1904. I've seen important things with "ZZ TEST DO NOT USE" as a field,
there's truncated data, broken encodings and more.
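
Those suspicious date clusters usually aren't coincidences: 1/1/1970 is the Unix epoch, and 1/1/1900 and 1/1/1904 are the two Excel date-system epochs (Windows vs. old Mac), so they tend to be defaults that leaked into real data. A minimal sketch of flagging them (the horizon cutoff is an arbitrary assumption you'd tune per dataset):

```python
from datetime import date

# Epoch defaults that masquerade as real data: the Unix epoch, plus the
# two Excel date-system epochs (Windows 1900 vs. old Mac 1904).
SENTINEL_DATES = {date(1970, 1, 1), date(1900, 1, 1), date(1904, 1, 1)}

def suspicious_dates(dates, horizon=date(2100, 1, 1)):
    """Return values that are sentinel epochs or implausibly far in the future."""
    return [d for d in dates if d in SENTINEL_DATES or d >= horizon]

sample = [date(1970, 1, 1), date(2016, 3, 4), date(3016, 3, 4), date(1900, 1, 1)]
print(suspicious_dates(sample))  # flags everything except 2016-03-04
```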

This isn't to make fun of the people with these errors, getting and keeping
good data is _hard_ and often overlooked. But unless you've got that, you're
probably not going to be able to get anything useful from the fancy
algorithms. And even then, there's a huge amount to be gained from improving
simple interactions at the point humans and computers interface.

~~~
greggyb
I can't express surprise at any of this.

A few responses:

I hope that list was sorted.

4% of Country = "NONE" sounds like they're running a tight ship compared to
some places I've seen.

At least those date fields don't have "Unknown" as a value. Yes, I have seen
the date field stored as text.

You are absolutely right. The ETL side, nailing down appropriate business
logic, chasing down source errors and inconsistencies, altering processes to
capture the data we need to give good answers; these are the challenges that
take >80% of our time. If we do our job right in these pieces, the reporting
can be done by an intern who learned how to use a pivot table yesterday.

~~~
IanCal
> I hope that list was sorted.

Ish. If I remember rightly I think they were trying to compare multiple fields
on those elements but it could be narrowed down.

> Yes, I have seen the date field stored as text.

Oh yes, that is always fun. Particularly when you see both 23/05/99 and
05/23/99 in the same column. Something I've ended up building bits and pieces
of is tooling to find these kinds of inconsistencies. I'm slowly trying to
automate a lot of the initial checks on a new dataset:

* Does it have a consistent number of columns in the CSV file?

* Does it have fields with a surprising number of question marks in them?

* Are the dates parseable with a single format? If you need to be precise, how many can be parsed by only one of the formats seen in the whole dataset?

* How many things are blank?

* How many things are blank-ish? NONE, FALSE, Empty, N/A, etc.

* What does the encoding look like? Are there any particularly weird characters?

* What control characters can you see? (after hitting an enormous XML file which failed to parse halfway through)

* What does the type look like for each column? Currency (and then proportions), date, etc.

* Are there number separators? Are they consistent? (1,000.00 vs 1.000,00)
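
A couple of those checks can be sketched in a few lines: consistent column counts, blank-ish cells, and dates that parse under more than one format (the 23/05/99 vs. 05/23/99 trap). The blank-ish set and the candidate formats below are illustrative assumptions you'd extend per dataset:

```python
import csv
from datetime import datetime
from io import StringIO

BLANKISH = {"", "none", "null", "n/a", "na", "false", "empty", "?"}
DATE_FORMATS = ["%d/%m/%y", "%m/%d/%y"]  # day-first vs. month-first

def initial_report(csv_text):
    rows = list(csv.reader(StringIO(csv_text)))
    report = {
        # Does every row have the same number of columns?
        "consistent_columns": len({len(r) for r in rows}) == 1,
        "blankish_cells": 0,
        "ambiguous_dates": 0,  # values parseable by more than one format
    }
    for row in rows:
        for cell in row:
            if cell.strip().lower() in BLANKISH:
                report["blankish_cells"] += 1
            parses = sum(1 for fmt in DATE_FORMATS
                         if _parses(cell, fmt))
            if parses > 1:
                report["ambiguous_dates"] += 1
    return report

def _parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

print(initial_report("id,country,signup\n1,NONE,05/03/99\n2,UK,23/05/99\n3,N/A,\n"))
# {'consistent_columns': True, 'blankish_cells': 3, 'ambiguous_dates': 1}
```

Here 05/03/99 is ambiguous (valid under both formats) while 23/05/99 is not, since there is no month 23.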

Basically, what will trip me up later and leave me scratching my head before
having to add yet another bit of code to ignore a field?

One of my side projects at the moment is to pull this stuff together from rag-
tag bits of scripts and split-up code into something I can just throw files at
and get an initial report.

Also, something very important in this is how things overlap. 5% of each
column being empty/broken might mean you have 7% of your data with almost no
information or 90% of your data missing at least one thing. Depending on what
you want to do, either might be OK or terrible.
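
That overlap point is worth making concrete: the same per-column missing rate can affect wildly different fractions of rows depending on whether the holes line up. A toy illustration with hypothetical data, 25% of each column missing in both cases:

```python
def rows_missing_anything(rows):
    """Fraction of rows with at least one missing (None) field."""
    return sum(any(v is None for v in row) for row in rows) / len(rows)

# 20 rows, 4 columns, 25% of each column missing in BOTH datasets.
# Concentrated: all the holes land in the same 5 rows.
concentrated = [[None] * 4] * 5 + [[1, 2, 3, 4]] * 15
# Spread out: each row loses a different column, so every row is hit.
spread = [[None if c == r % 4 else 1 for c in range(4)] for r in range(20)]

print(rows_missing_anything(concentrated))  # 0.25 -- a quarter of rows unusable
print(rows_missing_anything(spread))        # 1.0  -- every row missing something
```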

> If we do our job right in these pieces, the reporting can be done by an
> intern who learned how to use a pivot table yesterday.

Yes, exactly! The best end point is where new questions and updates can be
done either by someone else like this or by the experts who really know their
field.

Ahh, thank you, I needed a bit of a data rant :)

~~~
infinite8s
What sizes of data are you talking about here? 1e5 or 1e8 (or higher)?

------
nostrademons
I think the "natural" structure of the big-data/data-science/data-
analytics/machine-learning market is that of a number of small vertical-
focused consultancies, each using a handful of very narrowly-focused tools to
solve specific problems.

Data science is an interesting market for three main reasons:

1.) The client owns the data.

2.) The skills needed to effectively manipulate the data are very tied up with
domain knowledge of the data itself. This tends to throw people accustomed to
the rest of the software industry for a loop: if you're a frontend engineer,
your effectiveness is much more based on your knowledge of JS/iOS/Android than
on your knowledge of the client's problem. But when I watched the best data
scientists at Google work (and did some unstructured data-mining and machine-
learning myself), I found that the majority of your time is actually spent
_looking at data_ \- it's pulling examples, collecting golden sets,
determining outliers, graphing data, etc. And it's often non-transferable:
people who were very effective with the News corpus were often totally
ineffective with Social, people familiar with the Web corpus might be
ineffective with News, etc.

3.) You often need an interdisciplinary grab bag of tools to get useful
insights. The most effective data scientists at Google were the ones who knew
some basic HTML and JS or Flash charting libraries, because they could really
quickly graph their data and send the graphs around for comments. Deep
knowledge of machine-learning is usually less effective than pragmatic
knowledge of machine-learning _combined with_ a basic knowledge of stats
_combined with_ an intuitive sense of the users who're generating all this
data _combined with_ basic presentation skills.

This makes it very difficult to develop effective all-in-one tools that are
broadly applicable. Unfortunately, many VCs follow the hype machine and aren't
interested in consulting businesses, which means that a lot of money has
chased the last wave and not gone into efficiently solving problems. There's
probably a tradeable business opportunity here, but I have little interest in
going into consulting, so unfortunately I'm not the one that can take
advantage of it...

~~~
jamii
> There's probably a tradeable business opportunity here, but I have little
> interest in going into consulting, so unfortunately I'm not the one that can
> take advantage of it...

If you were, how might you trade in on it?

------
Eridrus
Heh, I thought this was going to be another post about JS frameworks, where
the title could be just as applicable.

Or just as applicable to cybersecurity, where hiring is a total disaster and
even well-known/liked multi-billion-dollar companies can't find enough good
people.

Everyone agrees that you want good people, but you just can't find them
because there are so few of them.

The ones who are really feeling the pain are hiring more junior people and
trying to train them on the job. But you don't manage to successfully train
them all, so you either need to fire them or find them something else to do,
or you buy tools that let them leverage the skills they do have to provide
something useful, and maybe they get better over time.

Reminds me of when I was an application security consultant and I thought Web
Application Firewalls were really dumb since I knew several generic ways to
bypass all the firewalls on the market, but 7 years later WAFs have gotten
better, and consultants still haven't gotten any cheaper.

So while I'm sure there is immaturity in the big data tools space, there is
probably an 80/20 solution there that lifts much of the load off your skilled
data science professionals.

------
vinceguidry
Way easier, way cheaper, and way less risky to buy a tool that a less-skilled
person can use.

I see too much of this. Blog posts that attack a business's decisions, without
demonstrating adequate insight into the tradeoffs involved in making that
decision.

Skilled people are rare and valuable and hard to keep, _even if you do
everything right_. Obviously the most effective choice in any situation is to
use and develop human capital. But if and when that person leaves your
company, your investment just went up in smoke.

Fungibility of skilled labor is something we should all learn to love. We want
it to be easier to jump around, not harder. We want more options, not fewer.

~~~
code4tee
Well I would assume the point is to build up expertise more broadly... not
just one person that knows everything and screws you when he/she leaves.

~~~
vinceguidry
Way, way more difficult. Companies are made up of lots of different people
with differing kinds of expertise. You don't want everyone learning
everything.

Derek Sivers wrote about how he had everyone in his company answer customer
phone calls. There were phones everywhere, and there were incentives for
employees to pick them up. It took a _lot_ of work, but it paid off.

------
cpg
"Just hire us" wrapped in a ton of buzzwords.

> Don't let your enterprise make the expensive mistake of thinking that buying
> tons of proprietary tools will solve your data analytics challenges.

...

> The truth is that a top team of data scientists can achieve great results
> ... This is broadly the approach that our own data science teams take when
> working with clients.

~~~
derefr
Every opinion piece can be rephrased this way. If someone believes in
something enough to tell others to do it, they very likely also believe that
thing enough to put it into practice themselves.

You'll only hear about the advantages of, say, git, from people who use git.
Does that mean they're telling you about git in order to get git some more
market-share, so that development effort will be even more concentrated on it?
Probably not; they probably just think git is a good idea, and _therefore_ use
it themselves.

------
CloudYeller
I wouldn't say it's a "lack of carpenters", but (to continue the metaphor) a
lack of good carpenters who work in nice woodshops. In other words, it's a
failure to hire high-quality people to implement high-quality ETL systems.

I know plenty of teams are using Flume/Storm/etc, with unit tests, code
reviews, etc. But that's like <0.01% of all ETL-related jobs on Earth. For
everyone else, the biggest challenge with BI is doing basic, absolutely
_trivial_ stuff the right way. Why? (Forgive my generalizations, I'm too
cynical to tone them down).

ETL is usually done by inexperienced people because it's unrewarding,
uncreative work that doesn't require a lot of skill to do. With remedial SQL
skills and some Excel magic, a high school dropout can do ETL tasks that are
up to par with businessy (read: not Engineering) standards. Why are
Engineering standards not followed for ETL? Because ETL happens whenever
EndangeredMiddleManagerX says "We need this new report because SomeGuy wants
to feel relevant/smart at his next meeting." If they can produce that report,
even once, and the numbers are believable, everyone gets a pat on the back.
Repeat ad nauseam. In that kind of work environment, you can't hire a decent
engineer even if you try. Best case, you get someone with ~2 yrs of experience
who will leave the moment they find something better. Everyone else knows to
stay the hell away from that bs.

Big surprise when the same EndangeredMiddleManager types look at the past
couple years of data and notice undocumented tables, columns with NULL+"N/a"+"
" for missing values, databases that have disappeared and can't be rebuilt
because the code was never checked in...Business schools are supposed to teach
you that focusing on the short-term is dangerous. Guess that doesn't matter
when your career plan is to keep switching companies/getting promoted every
1-3 years, while you leave untold trails of festering crap behind you.

------
sandworm101
>>> We’re just scratching the surface of what the next wave of innovations in
data science can do for large enterprises. Our business is dedicated to
helping clients realize that potential.

Whenever I see this level of execuspeak I have only one piece of advice: run.
This is
not the pitch of an efficient outside contractor. This is the pitch of a
resource-sucking nightmare of a project that will survive for years generating
nothing much beyond powerpoints.

~~~
zzleeper
IMHO the best contractors are the ones that actively try to make themselves
redundant and no longer needed. Sure, you might get called again later down
the line, but anything else feels parasitic.

~~~
swingbridge
Agreed. That's all the more reason why I dislike a lot of these software
vendors. They deploy all sorts of "solutions engineers" that are essentially
just an extension of the sales team and their solution to every problem is
either buy more licenses or buy this module you didn't get the first time
around. Sigh.

~~~
Silhouette
I have an analogous problem with my UI guy hat on.

I want to develop a UI that makes a complicated product simple, so users can
actually use the features effectively and get their job done. I don't want to
build a system that is just putting pretty colours on a convoluted mess of
configuration settings and interacting corner cases. That's what some of my
clients' competitors do, and their customers have to spend a fortune on
consultancy just to make the box work, and my clients pay me because hopefully
I'll do better.

Oh, wait, back up. My clients' competitors are making a fortune in consultancy
_because their customers can't make their boxes work_. Good UI is an _anti-
feature_ if you can sucker customers into paying for consultancy as well as
your product, but you can't reliably convince customers to pay more for a
product that is better in the first place even if it's cheaper overall, and
both of those things may be true more often than we'd like to admit.

This does not leave me in a comfortable position either commercially or
ethically.

------
meritt
If you found this article interesting, you should read Signal by Stephen Few.
Great book (as always from him) which sets out to solve this exact issue by
educating people on how to effectively analyze data.

> In this age of so-called Big Data, organizations are scrambling to implement
> new software and hardware to increase the amount of data they collect and
> store. However, in doing so they are unwittingly making it harder to find
> the needles of useful information in the rapidly growing mounds of hay. If
> you don't know how to differentiate signals from noise, adding more noise
> only makes things worse.

------
swingbridge
There are indeed a lot of companies selling expensive pre-packaged data
analysis tools and other magic black box solutions at the moment. In my
experience most of these are not that great and cost a small fortune. Large
enterprises are certainly suckers for just writing a cheque for bad software,
so I certainly understand why there was a bunch of VC investment in this
space. However, I also agree that this "a tool will solve our problems"
approach is vastly out of line with reality, so I'm not surprised many of
those firms have seen their valuations slashed recently.

------
hodgesrm
This article reminds me of a great talk by Truecar at Hadoop Summit 2015 in
San Jose about how they built their Hadoop cluster. One of the takeaways from
the talk was that Truecar focused on training their team on basic Hadoop
programming techniques, such as teaching everyone how to implement map/reduce
in Java. The presenters felt this was critical to their success, which they
measured both in cost of the analytic processing compared to existing
solutions as well as the new analytic opportunities it opened up.

Among other benefits, training allowed them to step up quickly to solving real
problems rather than doing time-wasting POCs.

Unfortunately I can't find the talk posted online, but it was a great antidote
to silver-bullet technology fixes. Ironically, the conference had a huge
vendor pavilion with about 100 companies trying to sell silver bullets.

------
KirinDave
Many large companies, even those that author software, view enterprise
software development as primarily an integration task, with components to be
sourced. While you can see how they might arrive at this conclusion and buying
a product is ostensibly easier than building a competing product, these shops
never seem to factor in the cost of integration. Legacy systems, inter-
departmental coordination, standardization on training... All of this is
really expensive.

And it's made more expensive by having to have a very expansive security group
attempting to fit external pieces into an internal puzzle.

What I think is the worst part of this is the assumption that software
architecture and design then become a task revolving around the existing and
purchased tools, and that very abstract pricing considerations can strongly
influence these toolchains. It leads to very strange forces acting on your
definition of a good technical hire. Almost none of them are beneficial for a
robust and/or diverse technical organization.

------
elliott34
Yeah, I mostly agree with this. There are some small wins, though, like how
BigML lets you export random forests into plain-text node.js that you can just
deploy in shitty PHP pipelines, and boom, you're doing machine learning.
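
The appeal is that a trained tree ensemble is just nested comparisons and an average, so exporting it as plain source drops it into any runtime with no model library at all. A toy illustration of the idea (hand-written hypothetical trees and thresholds, not BigML's actual export format):

```python
# Two trained decision trees "exported" as ordinary functions:
# no ML runtime needed, just comparisons and an average.

def tree_1(age, income):
    if income < 40_000:
        return 0.2
    return 0.7 if age < 30 else 0.9

def tree_2(age, income):
    if age < 25:
        return 0.3
    return 0.6 if income < 80_000 else 0.8

def forest_predict(age, income):
    """Average the trees' votes -- the whole 'model' is plain functions."""
    trees = [tree_1, tree_2]
    return sum(t(age, income) for t in trees) / len(trees)

print(forest_predict(35, 90_000))  # ~0.85
```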

------
boomlinde
The article content totally aside (because reading it is a pain): these guys
obviously went out of their way to design a website for mobile, so how could
they fail so horribly? In landscape mode, the fixed menu bar uses half of the
screen. In portrait mode, it _only_ uses about a fourth or a fifth, but the
social media sharing sidebar and the margins take up a fourth of the
horizontal space, leaving 3-4 words per line in the article.

At least it didn't prompt me to sign up for their newsletter in the 30 seconds
I spent on the site.

------
paulryanrogers
The premise does align with my experience. Skilled people can adapt to and
comprehend the business needs better than packaged tools can. That said, well-
made tools can include lessons learned that would be expensive to develop in
house. Often, though, those are only technical corner cases.

------
code4tee
Broadly speaking I'd say the premise of the article is spot on so far as most
large non-tech companies are concerned.

Data in the enterprise is a mess. A fancy tool won't fix that, only skilled
people can.

------
jkot
My impression is that there are many types of hammers and chisels, but the
pneumatic drill has not yet been invented. And adding workers is not going to
fix the core problem.

------
dredmorbius
Insufficient contrast; didn't read.

