
Big Data's Big Problem: Little Talent - Liu
http://online.wsj.com/article/SB10001424052702304723304577365700368073674.html
======
_delirium
I'm not sure that the kinds of employees that this article describes will ever
be a large number. There could be more of them in the future, but someone who
is top-notch at all of statistics, programming, and data-presentation has long
been less common than someone who's good at one or two of those. Companies
might consider looking at better ways to build teams that combine talent that
exists, instead of pining for more superstars.

I'm reminded indirectly of an acquaintance of mine who works on repairing
industrial machinery, where companies complain of a big skills shortage. They
either fail to realize or are in denial about what that means in the 21st
century, though. It might've been a one-person job in the 1950s, a skilled-
labor type of repairman job. But today they want to find one person who can do
the physical work (welding, etc.), EE type work, embedded-systems programming
(and possibly reverse engineering), application-level programming to hook
things up to their network, etc. Some of these people exist, but it's more
common to find boutique consulting firms with 3-person teams of
EE/CE/machinist or some such permutation. But companies balk at paying
consulting fees equivalent to three professional salaries for something they
think "should" be doable by one person with a magical combination of skills,
who will work for maybe $80k. So they complain that there is a shortage of
people who can repair truck scales (for example).

~~~
disgruntledphd2
I completely take most of your points, but I think that pretty much all
quantitative PhDs are going to be close to "data scientists". Given that
stats and explaining your research are requirements, all that's left is to
train them to program, which a lot of people are already doing. As a matter of
fact, since I heard about this big data stuff I've been honing my skills in
this area, in case the hype actually manifests.

~~~
Dn_Ab
_delirium's post acknowledges your point but is looking for an even rarer
person:

 _"There could be more of them in the future, but someone who is top-notch at
all of statistics, programming, and data-presentation has long been less
common than someone who's good at one or two of those"_.

Someone that can program, understands statistics and can present the data in
an appealing manner without losing significant fidelity. Many people
underestimate the difficulty and skill required in presenting data in a way
that makes sense and also actually says something.

There is a significant gap between presenting data that is satisfactory to a
research advisor and something that a business person with barely enough time
to think can grasp without misconception.

~~~
disgruntledphd2
Again, I completely see the difference (and am actually in the process of
moving full time to the private sector from academia, so will probably
understand a lot more in six months) but visualising data well is not that
hard.

Step 1: Learn R.
Step 2: Learn PCA.
Step 3: Learn ggplot2.
Step 4: Play with the different geoms until you understand them (seriously
though, everyone's eyes are optimised to find patterns, and if you can apply
significance testing to these then you should be good).
Step 5: Profit!?

Note that I am being somewhat facetious here, but I suspect that the
mathematical knowledge and ability to apply this to business problems will be
the real limiting factors, as good practices in data analysis, programming and
visualisation can be learned. Granted, that will take a long time to learn, and
there will be individual differences, but it's doable.

Whether or not it will be done at all though is another matter.

Again, delirium's point is trivially true if one requires these people to know
_all of_ statistics, programming and data presentation as I don't think
there's anyone who knows all of any _one_ of these subjects.

I suppose it somewhat depends on what the skill levels for each of these areas
need to be, and that varies from person to person as well as from application
to application.

~~~
ryanlchan
Allow a short vignette from a former academic and now management consultant.

We spent six months at a major pharmaceuticals client examining their
reimbursement data. Poring over many millions of rows of transaction data and
thousands of payment codes (which, of course, were unique across sales
geographies), we determined the ten regions at highest risk of reimbursement
collapse. R was used, maps were created, beers all around.

But almost none of it was used for the executive presentation. In fact, the
only part that was included was that we had ten regions that needed fixing,
and our suggestions on how to fix it. You see, the CEO was dyslexic, the
chairman of the board was colorblind, and the COO was a white-boarding kind of
gal, so given this audience the nuts and bolts of our advanced statistical
analysis were simply irrelevant.

This is hardly surprising. If we are having so much trouble hiring people who
are fluent in Big Data, how can we expect business leaders to be even
conversant? With only slight exaggeration, the way you do your analysis and
the visualizations that you create are not important.

Companies are demanding Big Data scientists because they suddenly have lots of
data and see the term Data Scientist in the news. But what they really want is
not Data Scientists, it's business insights and implications from Big Data.
The customer needs 1/4" holes, but we're all arguing over which brand of laser
powered diamond drill they should buy.

~~~
thuzarsky
Nailed it.

------
pmb
"claims of severe talent shortage in Big Data
[http://online.wsj.com/article/SB1000142405270230472330457736...](http://online.wsj.com/article/SB10001424052702304723304577365700368073674.html)
Ok... where are the high salaries (500k$ a year)? No? No real shortage."

<https://twitter.com/#!/lemire/status/196245665951649793>

Business has a shortage of "big data" folks in much the same way I have a
"huge sailboat" shortage. Neither of us wants to pay for it. We want it, but
not for the going rate. Only one of us has a media platform, though.

~~~
jandrewrogers
The salaries are already moving north of $200k even outside of Silicon Valley
and New York City and getting more expensive by the month. How high do they
have to be before we have a "shortage"? The problem is not lack of money, it
is that demand has greatly outstripped a finite supply.

Very high wages do not automagically create new people with the requisite
skills and this is the real bottleneck. It takes significant aptitude and
years of training/experience to become useful as a "data scientist". It is not
as easy as I think people are imagining. We train people with excellent raw
skills where I work, usually strong applied mathematics backgrounds with
natural programming skills. It is much easier than trying to find someone
outside with these skills, though we do attempt outside recruitment. It still
takes years to develop the people we train into a good, basic data scientist.

~~~
bearmf
Look, this job title is at most 2 years old. How can someone have years of
experience in this? OTOH, there are plenty of people with strong applied math
and good programming skills.

~~~
jandrewrogers
The set of skills existed before it had a trendy job title so you can have the
experience even if it was called something else. This is true of most of the
people currently working as data scientists. In a similar vein, I was
designing big data systems years before "big data" became a term or trendy.
For any particular odd skill mix you can come up with, there are people with
that skill mix who are already doing a similar job. But usually people do not
intentionally build that skill mix until it becomes an official job title and
career path in the eyes of the public so it is a very small pool of people.

In the case of modern data scientists, having strong applied mathematics and
programming skills is about halfway to where you need to be and a good
starting point. The demand has temporarily grown much faster than the
convertible talent pool can develop the additional set of skills required.

~~~
bearmf
I am of the opinion that if demand is high enough, companies will start hiring
"halfway there" people. But this will happen only if the market grows big
enough. Right now it is still a niche market where companies are
cherry-picking the right candidates, it seems. At least this is the impression
I get from reading this thread.

The question of the size of the market is crucial. Small labor markets are
very inefficient. This means that the number of qualified people is small,
but the number of companies they can choose from is also small. It is
hard to find a job when the number of companies hiring is probably less than
100.

------
mjw
As an engineer who's investing in developing "deep expertise in statistics and
machine learning" I can only stand to benefit from it, but something about the
current wave of Big Data hype makes me instinctively a bit wary.

Does this skills shortage really exist to the extent claimed? Are there really
enough people out there who would know what to do with a 'data scientist' if
they were able to hire one? I see more talk than action, I see vendors
circling around looking to flog freshly-buzzword-compliant BI tools,
prognosticators trying to push nervous businesses into engaging in an arms
race over data.

Of course there's real value there too, for some at least. I hope my concerns
prove unfounded, but worth retaining a healthy skepticism I feel :-)

~~~
NyxWulf
As someone in the big data field on the ground (VP of Engineering), let me
give you my thoughts on it.

Your impression about the hype is correct. There are a lot of vendors offering
BIG solutions, if you pay them BIGGER money. Where I used to translate the
word enterprise to $$, now I translate Big Data to $$$$$$$.

When I'm hiring, I don't go looking for Big Data people, because generally
they don't exist. Statistics is a really great general addition to a
programmer's toolkit. Machine Learning is valuable as well, although in my
experience the application is more limited. What this article doesn't mention
is a whole host of other skills required.

Modeling, and not just a formal mathematical model, but applying any type of
model to your data to get insight. Check out the model-thinking class on
coursera.

Exploratory Data Analysis, much different skill than confirmatory statistics.

Design of Experiments, specialized subfield within statistics.

Logistics, how to set up, maintain, and maximally utilize an efficient
distributed cluster and build a pipeline getting your data to the cluster,
cleaning it, building it into a model, and then extracting insight and
delivering that end value.

Those are a couple of the skills at a high level. At a more nuts and bolts
level, Hadoop is the de facto standard for Big Data. Learning how to build a
data pipeline out of the Linux tool chain is very common in the data science
world.

The overall value stream for Big Data is deep and wide. Most companies don't
have expertise in much of these, and so at the current time you have to learn
them yourself or find a company focused on building a team around it.

If you are just learning this yourself, you'll probably get an academic
knowledge. If you want to make yourself valuable in the marketplace, you'll
really want to get hands on experience. Knowing a z-score is one thing,
building a process to gather data and compute a model against it is a whole
different ball game. As the article mentions, if you have nice clean data it's
easy to apply a model. If you have messy ugly data from 20 different vendors
and 200 clients with various failures, anomalies, and you have to figure out
what type of model is helpful, oh and you have a deadline because for the 500th
time someone promised something impossible to the client, then you have
something closer to what Big Data is today.
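
To make the gap between knowing a z-score and running it against messy input a little more concrete, here is a minimal sketch in Python. The feed contents and the helper names (`parse_readings`, `z_scores`) are made up for illustration; a real pipeline would have far more failure modes than blanks and placeholder strings.

```python
import math
import statistics


def parse_readings(raw_values):
    """Keep entries that parse as finite numbers; count everything else."""
    clean, rejected = [], 0
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            # Blanks, None, and vendor placeholders such as "N/A" end up here.
            rejected += 1
            continue
        if not math.isfinite(value):
            # "nan" and "inf" parse as floats but are useless for a model.
            rejected += 1
            continue
        clean.append(value)
    return clean, rejected


def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # needs at least two values
    if stdev == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mean) / stdev for v in values]


# Hypothetical feed mixing numbers, blanks, and junk from different sources.
raw_feed = ["10", "12", "", "N/A", "11", None, "13", "9"]
clean, rejected = parse_readings(raw_feed)
scores = z_scores(clean)
```

The formula is the two-line function at the bottom; most of the code, even in a toy like this, is spent deciding what to do with input that is not a number.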

* grammar edits

~~~
pgroves
_deadline because for the 500th time someone promised something impossible to the
client_

This is a killer in machine learning applications. The toolsets rarely cover
the entire extent of what needs to be done, so at least some custom code needs
to be written. But results aren't deterministic - you don't really know if
it's going to work until you run it. Several iterations are often needed to
get to the first useable results. It has all the problems of building any
piece of software, plus another layer of risk that the accuracy just won't be
there with the first thing(s) you try.

My point is... actually agreeing to be the machine learning guy on a project
totally sucks because time estimates are almost meaningless, and the modern
business culture is to label anything late as a failure.

~~~
tgflynn
The company I used to work for had a performance based product. They only got
paid if they actually showed improved accuracy against a given evaluation
set. Then they got a fraction of the cost savings (say 1 year's worth).

This seems like it could be a good model for machine learning consulting, and
one that I would certainly be willing to explore.

It would work something like this :

    
    
      1) You show me your problem and your data.

      2) We come to an agreement on how accuracy would translate into
         financial results and on a fair split of the savings or earnings.

      3) I develop a model.

      4) You evaluate it based on 2.

      5) I get paid based on 2.
    

If my model doesn't meet minimum performance criteria I don't get paid. If it
does very well, and assuming the problem was economically interesting in the
first place, you save a lot of money and I get a fair sized chunk of it.
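
A minimal sketch of the fee arithmetic behind steps 2 and 5, assuming the agreed measure is a yearly cost saving and the split is a fixed fraction. The function name, the `min_savings` threshold, and every figure below are hypothetical and not taken from any real contract.

```python
def performance_fee(baseline_cost, model_cost, share, min_savings=0.0):
    """Fee owed to the modeler under a simple share-of-savings agreement.

    baseline_cost: yearly cost measured without the model
    model_cost:    yearly cost measured with the model on the evaluation set
    share:         fraction of the savings paid out, between 0 and 1
    min_savings:   agreed minimum performance; below it no fee is due
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    savings = baseline_cost - model_cost
    if savings < min_savings:
        # The model missed the agreed bar, so the modeler is not paid.
        return 0.0
    return savings * share


# Hypothetical figures: $1.0M baseline, $0.8M with the model, 25% split.
fee = performance_fee(1_000_000, 800_000, share=0.25)
```

With these made-up numbers the savings are $200,000 and the fee is a quarter of that. The hard part in practice is agreeing on how `baseline_cost` and `model_cost` get measured, which this sketch takes as given.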

Feel free to explain why this business model wouldn't work.

Edited for formatting.

~~~
NyxWulf
Most business people aren't interested in model accuracy as a term. They want
something that provides benefit, e.g. cost savings, increased revenue,
increased profits, etc.

The sales process of convincing someone they need an accurate model is tough,
especially because robust models are time consuming and expensive to build.

If you can come up with a model that shows good results, and people know they
need those results, then you can start a company selling either a service or
product to get those results. If people don't know they need your results -
then you have to educate them, in which case it's a much more difficult
business to start.

I don't know many business people with the temperament, understanding, or the
pocket book to deal with general research type problems.

------
gambler
Managers frequently wail about skill shortages, but very often it's pure
hypocrisy. The real problem is the reluctance to do any training (and I don't
mean formal training) combined with the desire to get _proven experts_ in
whatever field. Proven experts must have years of experience in applying their
expertise. If _no one_ lets people with less experience work in that field,
where the hell would those experts appear from? Another dimension?

Can I be an 80% developer and 20% "data scientist" in your company to try the
new role out? The bigger your company is, the less likely the answer is to be
"yes". Since Big Data implies a big company, the resulting "shortage" is not
surprising. It's self-made.

~~~
ImprovedSilence
True statement. It seems the trend nowadays is "get a grad degree, foot the
bill and time yourself". Many companies I see and work with tend towards that
mentality, as opposed to building a base of highly skilled workers from the
inside. It's easier and cheaper for a company to ask if you have a piece of
paper, than for them to train you and get you up to speed.

------
chintan
How media sees Big Data:

BIG database => BIG machine learning algo => BIG MODELS => PREDICTIONs,
Insights => $$$

How it is actually done:

awk -F"\|" '{print $1}' SCRAPED_file_pipe.txt | sort | uniq -c | head -n 10 =>
$$$
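
For anyone who would rather read that one-liner as a script, here is a rough Python equivalent, offered as a sketch. It counts the first pipe-delimited field of each line. Note one deliberate difference: `sort | uniq -c | head -n 10` prints the first ten distinct values in sort order along with their counts, while `Counter.most_common` returns the ten most frequent values. The function name and sample lines are invented for illustration.

```python
from collections import Counter


def top_first_fields(lines, delimiter="|", limit=10):
    """Count the first delimited field of each line, most frequent first."""
    counts = Counter(
        line.rstrip("\n").split(delimiter)[0]
        for line in lines
        if line.strip()  # unlike the awk version, blank lines are skipped
    )
    return counts.most_common(limit)


# Hypothetical in-memory sample; a file object can be passed the same way,
# since iterating over an open text file yields its lines.
sample = ["a|1", "b|2", "a|3", "c|4", "a|5", "b|6"]
top = top_first_fields(sample)
```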

~~~
gaius
Ah, you've used Ab Initio then.

------
paulsutter
This article is nonsense.

Talented developers are talented developers. At Quantcast we used Hadoop in
production before it was even called Hadoop and now we process 10PB a day. We
forbid our sourcers from using Hadoop as a resume search term because it meant
absolutely nothing.

Statisticians who can code are scarce, but companies that know how to use them
are scarcer.

------
tomjen3
Actually that is silly -- McKinsey should know that there is not and never
will be a talent shortage. There will only be a shortage of talent at a
particular wage rate.

If the companies paid newly graduated 'data-scientists' (what other kind of
scientists are there? The tea-leaf reading kind?) 200k/year then they would
have a lot more. It is pretty simple economics.

~~~
yummyfajitas
Companies already do pay close to $200k/year for entry level data scientists.

 _(what other kind of scientists are there? The tea-leaf reading kind?)_

"Data scientist" refers to the guy who can set up a hadoop cluster, do
statistics on TBs worth of data, derive useful conclusions and speed it up by
tweaking the low level data formats or microoptimizing the calculation.

The issue is rarely paying these guys an extra $20k, it's simply finding them.

Setting up some lasers and a photonic crystal, imaging the output, making a
graph in excel or matlab and drawing conclusions is a different skillset.
Someone who can do the latter is a scientist who uses data, but he is not a
data scientist.

~~~
Tichy
How hard can it be, though? Like taking a normal CS person and making them
versatile with hadoop and so on? Could it be done for 20K$?

~~~
achompas
_How hard can it be?_

Very hard. You run into all types of candidates who just aren't there yet:
people working on research that's irrelevant to real world applications,
people who have done data analysis/BI work and brand themselves as "data
scientists," those who have the pedigree but cannot process and explore real-
world data, those who have good analytical chops but not the distributed or
advanced modeling experience, etc.

I've witnessed it first-hand, and it's tough to find the right person.

~~~
bearmf
If it is that hard the bar is probably set too high. Most of the skills are
learned on the job after all. Most smart PhDs who can program well and have
sound knowledge of statistics can learn to do this stuff.

~~~
achompas
Given enough time, anyone smart enough to finish a PhD can acquire a set of
skills. :)

But it's more than just solid statistics. We're talking about having enough
mathematical fluency to develop models _rigorously_ (not just "oh, we'll
minimize MSE!!"), test those models, then implement those models--possibly
using a distributed algorithm.

From what I hear, these skills take years to develop. Choosing to groom the
wrong person is an extremely costly mistake, so making the choice is
difficult.

~~~
bearmf
All mathematics consists of rigorous models. But choosing and tweaking a model
is more of an art. Most data scientists apply existing models to new data,
they do not develop new ones.

I am sure it takes much less than "years" for any smart PhD in applied
mathematics to learn most of data analysis tricks. It is not theoretical
physics after all.

~~~
achompas
_Most data scientists apply existing models to new data, they do not develop
new ones._

I meant "develop" in the software sense. Data scientists use off-the-shelf
libraries during initial research, but those libraries usually lack an
important feature preventing them from going into production (typically, no
support for concurrency).

 _I am sure it takes much less than "years" ... to learn most of data analysis
tricks._

I used to be cynical about "data science," too. After four months of working
on a data science team, though, I'm a believer.

A data scientist is really a "full-stack data developer." He or she needs the
ability to work with advanced models, use them to analyze large amounts of
data, and modify those models to work concurrently or in a distributed system
if desired (and it's often desired). It's more than just "analysis tricks."

------
radikalus
Until the pay is comparable to finance, good luck?

I'd love to work on (arguably) cooler problems, but the combination of lower
pay and the constant need to use the "hot new thing" to solve problems doesn't
make transitioning look remotely attractive.

Really, the second is the HUGE obstacle:

- You don't know anything about aNNs? Sorry, no job.
- Nobody uses aNNs anymore, SVMs are all that matters. Sorry, come back after
  you catch up.
- SVMs? Man, we need someone who's got expertise in optimizing RFs and
  Bayesian Trees. We don't want "black box" machine learning. We need to
  "understand" the results. Sorry no job.
- Decision trees? GTFO man. We're doing rNNs now.
- I'm pretty impressed with your data mining knowledge, but we're looking for
  someone with a background in DLMs and GPs. Sorry, no job.
- repeat until vomit/suicide

I kind of wonder about the need for "badass" math skills; I'm not terribly
convinced that math wizards are extraordinarily high value relative to people
with other types of data analysis skills.

------
tlogan
It seems the problem is that some companies are looking for a person who is an
expert in setting up scalable systems (Hadoop cluster, storage, high
availability, etc.) and that she/he also knows statistics and efficient ways
of processing and understanding the data. Good luck with that.

My observation is that requirements like this come from people who mainly did
web programming (which made them a lot of money, and with money they became
influential), and who assume that this is the equivalent of writing both Ruby
and JavaScript code.

Building a team is hard, and in order to solve a "big data" problem you need
to build a balanced team.

------
sandee
Basically a solution looking for a problem.

They are right, the complexity that big data caters for requires expertise at
both technical and business level that would be costly (though may not be at
infrastructure level). In the current economy, it looks even more difficult
where businesses want to squeeze the maximum out of dollar investment.

IMO, it's too early a stage for big data solution adoption. However, the stage
could be set for startups who can come up with innovative solutions that bring
the cost level down and are simple, useful, and easy to grasp.

------
thornad
I've done this kind of thing most of my career, including doing it for NASA
and Unilever Research. You can't really train an average graduate to do this.
You need someone with a pretty highly developed integration between
1:intuitive/creative abilities, 2:mathematical/analytical skills, and
3:engineering/ability to make things happen. Add to that 4:work experience in
the real world, and 5:ability to easily understand how things work in a field
you delve into for the first time... And there's very few people in the world
who can do this. At my previous work place we tried for a whole year to hire
someone who would have at least some of these skills and seemed promising to
develop the rest on the job. We couldn't find anyone although we interviewed
about 30 different people (from about 500 resumes most of them with a PhD in
ML from a good university). And this was in central London, UK.

~~~
bearmf
But you never tried training them. Sure, no one can do it right off the bat,
without prior experience.

------
why-el
So I was wondering if any fellow HNer is on a quest to be at least comfortable
around these problems. Can you share your plans? Currently I am starting with
some linear algebra and I have plans to move to statistics then pick up a book
on machine learning. I could really use some advice.

~~~
seanharnett
Andrew Ng's machine learning class on coursera is a very nice, easy
introduction to the subject.

~~~
why-el
Yes, and I started that before realizing that I need more background
knowledge, hence starting with some math. :)

------
kylemaxwell
Not entirely sure this is true, to be honest. Most of "data science" lies in
the work of collecting and cleaning the data to get it into a usable state. A
recent story on the "fallacy of the data scientist shortage"[1] goes into more
detail, but in reality what we want in this quantity are better data
_analysts_. I love the idea of data science as, essentially, viewing
statistical analysis from a computer science perspective, but the breathless
predictions of a huge shortage seem a little overblown.

[1]: [http://smartdatacollective.com/nraden/48952/fallacy-data-
sci...](http://smartdatacollective.com/nraden/48952/fallacy-data-scientist-
shortage)

------
chrisrhoden
Can anyone explain what talent refers to in this context? Is it someone who
has learned this stuff, someone who is capable of learning it, or someone who
was born with an innate understanding?

------
giardini
Looks like the prelude to yet another H-1B buildup.

~~~
tosseraccount
The only way we can really measure "shortage" is via compensation. If that's
so, we need more hedge fund managers and surgeons, not grunt data crunchers.
There are many problems with guest workers. The richest people in the world
get special access to indentured labor. It targets specific industries thus
amounting to a subsidy. It helps big business crush small business. The H-1B
in particular is a tool to increase outsourcing and keep wages down (wages
which are typically earned in the highest cost of living areas in the country
at 60 hours a week). H-1B? No, thank you! On the job training and good wages?
Yes, please!

------
ambiate
I just listened to a lecture on this at BU. Emerging Internet Technologies at
IBM or something of that nature. He was basically trying to sell us his
product that crawled the internet (mainly a firehose at Twitter) and gathered
statistics for advertisers and presented it in pretty graphics.

The main issue they had was developing language recognition. Deciding if a
user 'liked', 'loved', 'hated' or was 'neutral' about a product. Another issue
that stood out to me had to do with their reliance on the internet. Just
because 200 users tweeted that 'this movie is going to suck' does not really
represent the overall opinion.

To reiterate, the whole buzz of the lecture was the biggest turn off. He
wasn't explaining about how to expand on his product or where to go from here.
Just that they had developed a product and we could use it instead of
attempting to develop one ourselves.

~~~
ohashi
Maybe 200 is too small a sample size, but you can glean and predict stuff with
that sort of data. My master's thesis was about predicting box office sales
based off twitter data. (coincidentally these guys published a couple months
before me: [http://www.fastcompany.com/1604125/twitter-predicts-box-
offi...](http://www.fastcompany.com/1604125/twitter-predicts-box-office-sales-
better-than-anything-else)) but we had incredibly similar results. It is
pretty interesting to see what you can do with that data.

If it makes you feel any better, you can build it yourself, I certainly did. I
didn't even use any libraries like NLTK to build my sentiment analysis. Read
some research papers and built it from scratch (code wise at least, the ideas
used were fairly common). It's a fun challenge. I still work with that code
every day and use it in my startup now :)
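
For a sense of what the simplest "from scratch" version of this can look like, here is a toy lexicon-based scorer in Python. It is a generic textbook-style approach and not a reconstruction of the thesis code or of the product from the lecture. The word list, the weights, the thresholds, and the one-token negation rule are all made up for illustration.

```python
import re

# Hypothetical toy lexicon: positive weights for praise, negative for complaints.
LEXICON = {
    "love": 2, "loved": 2, "great": 1, "like": 1, "liked": 1,
    "suck": -2, "sucks": -2, "hate": -2, "hated": -2, "boring": -1,
}
NEGATIONS = {"not", "never", "no"}


def score(text):
    """Sum the lexicon weights, flipping a word that directly follows a negation."""
    total = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        weight = LEXICON.get(token, 0)
        total += -weight if negate else weight
        negate = False  # a negation only reaches the very next token
    return total


def label(text):
    """Map a score onto the four buckets mentioned in the parent comment."""
    value = score(text)
    if value >= 2:
        return "loved"
    if value == 1:
        return "liked"
    if value <= -1:
        return "hated"
    return "neutral"


labels = [label(t) for t in (
    "this movie is going to suck",
    "I loved it",
    "I did not like it",
    "saw a movie today",
)]
```

Sarcasm, slang, and negations that sit more than one word away all defeat a scorer like this, and none of it addresses the sampling concern raised above about who is tweeting in the first place.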

------
harscoat
Surprised Ben Rooney did not mention IBM acquisition of Vivisimo the day
before (Apr. 25) this article (Apr. 26)
<http://www-03.ibm.com/press/us/en/pressrelease/37491.wss> "IBM Advances Big
Data Analytics with Acquisition of Vivisimo"

------
mardack
This is good news.

------
maeon3
Why don't the people who hire doctors, dentists, and lawyers suffer from the
same talent shortage that the people who hire 'big data' computer scientists
feel?

Because it's better across the board to start your own startup than work your
ass off for a 4% raise at a place which recognizes you as top talent. I'm on
the verge of starting a startup myself, removing myself from the people in
this list. There is a shortage of talent in computer science, but never in the
other disciplines, it may take another 30 years for the suits to have the
ability to understand why.

~~~
enfilade
I agree with your overall line of thinking.

In your last sentence you wrote: "There is a shortage of talent in computer
science, but never in the other disciplines, it may take another 30 years for
the suits to have the ability to understand why."

Could you elaborate on this point -- do you feel that there is not a shortage
in disciplines such as dentistry and law because many people _are_ willing to
work very hard for only 4% raises?

Thank you!

~~~
tomjen3
You can actually make good money by going into the law or medicine. You have
to work and be skilled of course, but let's be honest, that is also required
for a start-up.

I can't help but think that ability is because you can't run a law firm
without being a lawyer so the boss has some idea of what it means to be a good
lawyer and how to treat them.

~~~
anothermachine
Most lawyers at the top-income end hate their bosses and jobs. Law firm
partnership track at large firms is a dog-eat-dog 80-hour week hell. People
are only happy when they are the few who claw to the top, or the many who drop
out. The ones in the middle are suffering as badly as any stereotypical bank
programmer.

