
Why becoming a data scientist is not easier than you think - misiti3780
http://www.josephmisiti.com/why-becoming-a-data-scientist-is-not-actually
======
svdad
I couldn't agree more with this. The ML courses that Ng and Koller teach are
really missing a lot of the statistical tools you need to do real-world data
mining and ML.

My experience: I had basically zero math background, but I took ML with Ng and
probabilistic graphical models with Koller, and later was a TA for Ng's ML
class, during my Master's degree and thought I was all set to go into machine
learning jobs. To my surprise, I consistently found myself in interviews
stumped by questions from basic stats, particularly significance testing,
which people with more traditional stats backgrounds assume is basic knowledge
(and it should be), but which wasn't taught in any of my ML classes.
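For readers who haven't run into it, here is a minimal sketch of the kind of significance test being referred to: a two-sample permutation test for a difference in means, using only the Python standard library. The function name `permutation_test`, the resample count, and the two lists of scores are all hypothetical and chosen purely for illustration. It is one simple way to do this, not the method any particular course or interviewer expects.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the pooled values and split them back into two groups
        # of the original sizes, as if the labels were assigned at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical accuracy scores for two classifiers over repeated runs.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82]
model_b = [0.74, 0.77, 0.73, 0.76, 0.75, 0.78]
p_value = permutation_test(model_a, model_b)
print(f"p = {p_value:.4f}")
```

A small p-value here says that a gap this large would rarely show up if the two sets of scores came from the same distribution. It says nothing about whether the gap matters in practice.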

I'm in a job now that involves some machine learning, but the ML component is
50% marshalling data (formatting, cleaning, moving), 40% trying to figure out
how to get enough validated training examples, and 10% thinking about the
right classifier to use (which someone else already implemented). Which to be
honest is not very interesting.

So yeah, becoming a real data scientist is hard, requires a lot more knowledge
than you get in one ML course, even from Andrew Ng, and the reality of the
work often doesn't make it some dream career. And the competition for jobs
isn't from other people who also just took that course -- it's from PhD
statisticians and statistical physicists who might have taken one ML class to
show them how to use all the mathematical tools they already have to do the
new hot thing called machine learning.

~~~
cageface
I was pretty gung ho about getting into ML two years ago and put a lot of time
into online courses like Ng's, books, and ground-up implementations of a lot
of the common algorithms. I enjoyed it, but after a while it became clear to
me that a lot of this stuff is better described as applied statistics.

And this can be powerful, of course, but it doesn't really have much of the
magic of AI.

~~~
plinkplonk
"a lot of this stuff is better described as applied statistics"

This is a _key_ insight. Bravo!

To generalize a bit more, most of ML is applied _mathematics_. Getting a good
grounding in the underlying math is the most illuminating step to learning ML
(spoken as someone who wasted a lot of time doing other things thanks to an
irrational fear of learning mathematics and am still bad at it)

_Deep_ math/stat understanding _combined_ with the engineering bits (like
programming, cleaning the data, running clusters) and the communication bits
(like visualization) brings you to (what should be) 'data science' (imvvho
ymmv etc etc).

I am still not sure one person can pull it all off; it probably needs a solid
team of specialists. But hey 'data scientist' is a hot job description, and so
you can't blame people who know bits and pieces (sometimes _very_ small bits
and pieces ;) ) for calling themselves 'data scientists' or whatever. "Machine
Learning for Hackers" and all that jazz. We've seen all this before with "HTML
coders" from the nineties.

~~~
robrenaud
Work as a search quality engineer at Google and you do pretty much all of
that.

Except for running the clusters[1], I've done pretty much all of those steps
myself. I started with a nice statistical idea, built some simple models,
played with feature selection and learning algorithms, built model viewers,
built classifiers, validated classifiers, built demos, validated demos, built
a production implementation[2], optimized the production implementation to
make it small/fast enough, and finally launched a big search quality
improvement.

[1] I certainly write distributed code that runs on them, but maintaining the
DCs definitely isn't part of my job description.

[2] Validation of the final quality in prod is actually someone else's job,
not because I couldn't do it, but you might not want me to tell you how good
my stuff is, cause you know, I might be biased.

------
rjurney
I think the idea is... at the moment there are these elite few people that
actually have the entire skill-set of 'data scientist,' to one degree or
another... but we need more of them. And the way to achieve this isn't to say,
"Get a decade's experience across an enormous area of math, computer science,
domain experience, and then come talk to me." The way to achieve this is to
make data science seem more approachable than it is at the moment... in the
hopes that it will be more approachable as we build courses like the one you
critique.

This of course devalues your own skills, as you are one of the elite few.
Unless you start writing textbooks. Which you should do, if you're one of the
few. And self-promote like hell. If that course doesn't cover it - what does?
Do you acknowledge that in a few years some shortcuts might be possible - that
budding data scientists might not need to have read every book that you have? I
bet if you try, you can make your own shortcut in the form of a book.

Which is followed by a link to my own book on this topic, Agile Data:
<http://shop.oreilly.com/product/0636920025054.do> which attempts to demystify
as much as teach.

~~~
tel
There are different classes of skills. For lots of organizations simply adding
someone who can run a MongoDB mapreduce into a t-test (or SVM or whatever
flavor you like) and print out a bar chart is going to be an upgrade.

To fill that kind of need we can use a lot of bodies with Coursera degrees.

------
greenyoda
I still don't understand why it's important to have all these areas of
expertise embodied in a single person called a "data scientist". Rather than
hire one of these rare and expensive people, why couldn't a business hire a
statistician and a couple of computer science people and have them work as a
team? Given how few data scientists there currently are and the high demand
for them, you might even be able to get these three people for less money than
one data scientist.

Also, someone who has to constantly shift their attention between statistics
and database servers might get less done than somebody who can concentrate on
the mathematics and let their co-workers handle the implementation details.

~~~
marshallp
Not only that, "data scientist" is just a marketing gimmick for consultants.
It's a standard set of skills that anyone trained in science or engineering
has. It's ok to use it in front of pointy headed bosses but in front of nerds
it's slightly dishonest.

~~~
leroix
^this. I've often wondered where the science in data science is. I don't see
it. If you're an engineer or scientist, you probably don't either.

~~~
icelancer
The science in data science involves testing hypotheses, like anyone following
the scientific method would do. It's mandatory for validating ML models.

~~~
Evbn
But most public examples of "data science" don't do this. They just publish
pretty graphs that are completely unrigorous.

~~~
icelancer
OK, but that doesn't mean "if you're an engineer or scientist, you don't see
the science in data science" is a true statement. It's categorically false.

------
benhamner
Well written, but I believe you missed the point of the original article.

No one ever claimed that taking one class made someone an expert "data
scientist." Instead, that single class whetted Luis, Jure, and Xavier's (the
three competition winners) appetites, and pushed them to learn more about
machine learning and natural language processing. They then went on to dive
much deeper, and excelled specifically in one area of applied NLP.

However, without that first class, there's a good chance none of them would
have ever focused on (or heard of) machine learning. Their story is growing
increasingly common. Like the Netflix Prize, Andrew Ng's first Coursera class
did its part in shining a spotlight onto our dark little corner of the
universe.

I'd be very cautious about a long checklist of items that are necessary to be
a successful data scientist (which is a pretty ill-defined and encompassing
term at this point). That is a decent summary of many useful tools of the
trade, but they are by no means useful for all problem domains. For example, I
could spend years working on machine learning for EEG brain-computer
interfaces without a good reason to use databases or "big data" NoSQL
technologies. I especially enjoyed MSR's take on the matter in "Nobody ever
got fired for using Hadoop on a cluster"
<http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf>

When we're hiring data scientists or seeking successful ones, we've found
focusing on demonstrated excellence in one relevant area plus general
quantitative competencies and the curiosity and tenacity to learn new tools
and techniques works far better than a laundry list of skills and experiences.

~~~
notimetorelax
Absolutely agree, also the author didn't take into account that there are software
engineers with a good math background who don't know how cool it is to do
machine learning. I'm one of them and I've spent the last month and a half doing a
lot of ML and enjoying the heck out of it! My next step is Kaggle.

------
alexatkeplar
As someone who writes software for data scientists
(<https://github.com/snowplow/snowplow>) I definitely agree with his analysis.
But I would go further: without _domain knowledge_, a data scientist is really
just an ETL guy who cleans up big data for the real analysts to make sense of.
Applying the whole toolkit to a specific domain (and SaaS B2B looks totally
different from supermarket loyalty schemes and from F2P mobile games) is key.

------
srconstantin
Look -- as long as "data scientist" is a sexy job title, a lot of different
jobs are going to claim they fall under that umbrella. I have an applied math
background, and I'm fine with scientific computing, but I have much less
experience with databases. I'm a very different candidate than a software
engineer who took a machine learning course. Maybe in a few years we'll have
more intelligent language for making those distinctions.

It shouldn't be surprising or bad news that some "data scientists" have deeper
knowledge than others. We're going through a quantitative revolution -- many
fields and industries are nearly untouched by statistical analysis/machine
learning, and so there's a lot of low-hanging fruit in going from "nothing" to
"something." Even somebody who only knows a little can add value at these
margins. But, of course, that won't be true forever -- look at quantitative
finance, which is very competitive and requires a lot of education, because
the low-hanging fruit was picked in the 90's.

There's room in this world for the statistician, the mathematician, the
database engineer, the AI guy, the data visualization expert, the codemonkey
who knows a few ML methods, etc.

------
plinkplonk
"Coursera skipped over Bayesian learning"

This probably needs to be clarified a bit to say that Ng's course skipped
this. Daphne Koller's "Probabilistic Graphical Models" (running now at
Coursera) covers this in great detail.

Minor tweak in an otherwise nice post.

~~~
icelancer
"This probably needs to be clarified a bit to say that Ng's course skipped
this."

That seems weird, considering Bayesian learning/methodologies are a big
cornerstone of ML.

~~~
Evbn
How much can you expect someone to learn in 10 1-hour sessions, starting from
a college freshman background?

~~~
icelancer
A fair amount; I've seen some of his work and I think it's pretty good. I
don't really like Octave (his program of choice), but I understand why he used
it. I would have gone with something higher-level, since a lot of decent tools
are out there that add a layer of abstraction to ML.

Anyway, I consider Bayesian logic a cornerstone of ML modeling. It's not so
much the content that needs to be memorized/understood as much as it is the
way of thinking that Bayesian methodologies present over frequentist
statistics.

Still, I'm sure it's a good class.

------
rm999
As I argued in the comments of "becoming a data scientist might be easier than
you think", the attitude that entering the field is easy is dangerous because
it is very untrue. I really wish there were more qualified people in the field
(do you know how hard it is to hire?), but entering it without the proper
knowledge doesn't help anyone.

I'm excited about what the future brings. Many industries have seen the value
in data sciences, and universities are following (see, e.g., Columbia's new
data sciences institute).

------
Rickasaurus
Really though, unless you have a strong understanding of both calculus and
statistics you'll never be a "data scientist", you'll just be a library
jockey.

~~~
tel
And probably database theory, probability calculus, matrix analysis, graphical
models, stochastic processes, information theory, &c &c...

~~~
notimetorelax
Yes, yes, it's not like you started from something. Were you born a data
scientist? Just try to accept that there may be people with an almost full skill
set who didn't know where to apply it. I'm a pure mathematician by education and
I really didn't know where to apply my skills; having followed ML, NLP, Big
data, PGM courses I have a much better understanding now.

Now even if I weren't a mathematician it doesn't mean that following Coursera
courses for a couple years and doing a lot of work at home wouldn't get me
somewhere. You don't have to switch jobs as soon as you finish an ML course but
you can certainly practice your skills at home.

~~~
tel
I'm not saying the Coursera classes are bad---just that there's a chasm
between being able to implement or derive Naïve Bayes and being able to do
meaningful work or study in applied statistics.

~~~
notimetorelax
The way I see it you also learned it somewhere sometime ago. Coursera offers
more and more courses, some of which are quite in-depth. A motivated person may
be capable of finishing a university degree in a couple of years. Coupled with the fact
that this person may be gainfully employed in a similar area, it will not be
that hard to imagine him or her to eventually switch to a position that
requires all the old and new knowledge. IMHO, examples that were given in the
original article are in line with what I described.

------
Homunculiheaded
Maybe I'm mistaken but I think most of the people interested in becoming 'data
scientists' are either currently doing lots of software with an interest in
stats, or people doing lots of stats with an interest in software. Given one
or the other, half of this list is probably already very familiar territory.

I actually found this list encouraging because the things I don't know well on
that list are things I'm working on and am aware are holes in my knowledge.

But in the end the reality will always be that the people who are "real" data
scientists will be the people that are actually solving real problems whether
or not they can check off every bullet point on a check list.

------
jkimmel
I think it's interesting to note the number of professions that may already
fulfill a "data science" role, just with a different title. I worked a job
where my primary role was data analysis: parsing data with Unix commands,
feeding it to classifiers, applying standard algorithms, drawing meaningful
conclusions, etc.

Sound familiar? The entire team I worked with had a similar workflow, but we
went by life science domain specific titles rather than "data scientist." I'm
willing to bet that other professions have similar roles, merely called
something else.

I think "data scientists" are out there in the sciences. They just don't go by
the latest buzzword.

------
Irishsteve
Replace scientist with analyst and all of a sudden 75% of the people
interested in this career path don't care anymore. I don't get the absolute
obsession and sexification of a role that has existed for a long long time
already.

~~~
michaelochurch
Data scientist: in many companies, this means a software engineer with an
additional layer of credibility that gives him dibs on the most interesting projects. A
lot of data scientists end up working on distributed systems problems that
would typically be considered closer to hard-line engineering than machine
learning or data analysis.

It's an XWP vs. JAP issue:
<http://michaelochurch.wordpress.com/2012/08/26/xwp-vs-jap/>

Once you're at or near 30, you realize that you won't be able to stand the
software career unless you get an edge in picking projects, because the vast
majority of the engineering workload is line-of-business bullshit that you
don't learn much from. To grow as a programmer, you have to beat (or cheat)
the project allocation game. One avenue is to go into management, but that
doesn't work because bosses who take all the interesting work for themselves
get undermined. An alternative is the "architect" designation, but becoming an
"architect" is even more political than moving into management. Right now,
"data science" is a title that has enough of a "+1" to it that it gives
engineers the ability to put themselves on the most interesting projects.

~~~
srconstantin
This makes sense.

It's a "The way to Tara is via Holyhead" kind of thing.

------
jnazario
what's missing from any discussion here or in many of these "this is the new
hotness" posts is this: science.

where's the science? it is, after all, a data scientist role. where is
learning to do actual science?

what the world's been describing is an analyst or an engineering position, not
science. if you don't know how to ask questions, interpret results, structure
experiments - then you don't know science, so quit calling yourself a
scientist. science involves a rigor of thinking and doing that has been
omitted here.

~~~
icelancer
Indeed. It may come as no surprise to you, then, that I push people with a
degree in physics far ahead of the line when hiring for my Data Science team.

------
svasan
While it is not easy to become an expert in any field or pursuit, one should
neither overplay the "it is a very hard field / not very easy" argument nor
should one underplay the effort involved in becoming good. 10000 hours
(roughly 5 years at 40 hours per week) to expertise seems like a good
rule of thumb to keep in mind. Someone said - "The test of a vocation is the
love of the drudgery it involves."

For any field, one has to provide positive encouragement (and a good
platform/set of tools and techniques) to people seeking to get into that
field, while being grounded in reality.

------
001sky
Isn't the real skill for a Data Scientist one of scalability and abstraction
from data? While it's critical to be able to get the data and make it more
plastic, for measurement of metrics, real-time pricing, or even various
weak-form predictive variables, it's ultimately the analysis and understanding that
is critical to monetization/value extraction. And to build a good system for
this level of data transparency, you need some good high-level understanding
for clarity of vision. There are lots of people good at all manner of quant
work, but the interview question "how much complexity can you explain in 5
minutes?" is not one that all answer equally well. The ability to scale from
granular detail to abstract levels of organization, meaning, and pattern
recognition is critical to extracting value in these contexts.

Also, it's not clear folks are using the term consistently vis-à-vis
scale/scope. Consider an analogue of knowledge and expertise (real estate
example):

L1 Architect>

L2 Contractor>

L3 Sub-contractor>

L4 Builder/Laborer>

Is a data scientist an Architect? Or the person that builds the building? Is he
the guy that does the plumbing? Although the up and coming "quantitative
system analyst" probably doesn't quite ring the same tune on a biz card. And
most refer to lower level quant mastery, e.g. social engineering or
quantitative finance, as a "black art" not a science. Without a high level
vision, the concept/title seems...grandiose, until you get to very extreme
levels of skill. And then it makes sense.

~~~
EwanToo
I think quite a few data scientists are taking the position that they're a
specialist plasterer in your metaphor, someone who does one job very well, but
can't help design the skyscraper they're working inside.

Good plasterers get paid (and treated) pretty well, but they're never going to
be the architects who designed the building.

~~~
001sky
You raise an interesting point here. A good (old world) plasterer is best
thought of as an Artisan. This word has been lost in the modern era
(Artist!=Artisan), but that's arguably our loss. The "black arts" are
inherently those that require practitioners to get their hands slightly dirty
in the details.[1]

_____________

[1] Alternatively, I'm starting to see the clearer picture: "Data Scientist"
is more a contraction than a description. viz: "Data-Analytics-Literate
Computer Science" becomes _"Data_Science"_ if you just omit the middle terms.
For marketing purposes, the moniker "Scientist" is highlighted as a status
signifier (Academically ~akin to "Artist").

------
theschwa
I have to really thank you for this article, but it may have had the opposite
effect on me. I've always felt like I have a disjointed skill set, but this
makes me a bit more confident in looking into this field and it gives me some good
ideas of what I should brush up on. I know this may not be what the author
intended, but it's appreciated nevertheless.

~~~
randomdata
I came away with similar thoughts. The article made it seem like becoming a
data scientist is far easier than I ever imagined.

------
wojt_eu
What was supposed to be a rant turned out to be a nice little list of things worth
learning in the data analysis field. Thanks!

------
niels_olson
Thanks for validating that I have been pursuing the right skill set. I just wish
that list existed 20 years ago when I started college. Instead, I have been
teaching myself off and on since 1999.

------
samg_
If I have learned anything from ml-class, pgm-class, nlp-class, and now
neural-nets, it is that becoming a data scientist is one of the hardest things
I'll ever eventually succeed in doing.

~~~
sandee
The good man was trying the opposite. Which is to make things as simple as
possible for early students and get them to do the hardest thing they will ever
do: "solve some problems using these tools"

Instead of chasing after some title ("data scientist ..."), find ways of
solving some useful problems with whatever you learnt. The article argues that
becoming an expert is difficult, which may be true. But that does not mean you
don't know enough to start digging at the problems at hand.

------
JoelJacobson
What are the best tools to visualize table data?

I've been testing Tableau, but it's only for Windows and I'm on a Mac.

I'm looking for something which easily connects to your SQL database and
allows you to produce all kinds of fancy graphs, with an easy user-interface.
Tableau comes close to what I'm looking for, but my gut feeling is there
should be a whole bunch of good commercial and free software in this field out
there, but my googling hasn't given any good results yet.

So if anyone has any good suggestions, please let me know.

------
pheon
being a good data scientist is about having enough intuition about the dataset
to _ask the right question_ aka form the hypothesis.

working out the question is what makes it a hard (and creative) process, and
then you can apply your ML toolbox.

edit: what's different about a data scientist vs. an analyst/statistician is that they
build their own tools as the datasets are too massive & non-standard for the
usual toolset.

------
jiggy2011
So is there a good middle ground between being a web developer and a data
scientist? If so what would be the most useful problems for such a skillset to
solve?

------
michaelochurch
I was a math major in college, with a focus on pure math. I did a year of grad
school (math PhD program) and left to work on Wall Street (and worked for a
couple startups, and Google, in that mix). In all, I spent 6 years as a mix of
quant, trader, software engineer, startup entrepreneur, data scientist.

The software engineering career is in somewhat of a mess right now. It comes
down to the "bozo bit" problem. Being a software engineer (even with 10+ years
of experience, because there are a lot of engineers who only do low-end work
and don't learn much) is not enough to clear the bozo bit, and you won't be
able to prove that you're good unless you have a major success, and it's hard
to have that kind of success without people already trusting you with the
autonomy to do something genuinely excellent.

It's not enough to write code, because LoC is a cost and source code is rarely
actually read at the large scale. At least for backend developers, the only
work-related (as opposed to political) way to establish that you're worth
anything as an engineer is to have an architectural success, but it's very hard
to have architectural successes unless you've established yourself as an
"architect" to begin with. So there's a permission paradox: you can't do it
until you've proven you can, and you can't prove you can do it until you've
done it. Hence, the vicious politics that characterize software "architecture"
in most companies.

Functional programming is one way to put yourself head-and-shoulders above the
FactoryFactory hoi polloi. The problem is that most business people don't
understand it. They just think Haskell's a weird language "that no one uses".
Elite programmers get that you're elite if you know these languages, but most
companies are run by people of mediocre engineering ability (and that's often
just fine, from a business standpoint).

It is true that functional programming is superior to FactoryFactory business
bullshit, but not well-enough known. Good luck making that case to someone
who's been managing Java projects for 10 years. What is better known is that
mathematics is hard. It's a barrier to entry. I doubt more than 5% of
professional programmers could derive linear regression.
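For reference, the derivation being alluded to is short once it is written in matrix form. The sketch below is the standard ordinary least squares derivation. The symbols ($X$ for the $n \times p$ design matrix, $y$ for the targets, $\beta$ for the coefficients) are notation assumed here rather than taken from the comment, and it assumes $X^\top X$ is invertible.

```latex
\begin{align*}
L(\beta) &= \lVert y - X\beta \rVert^2
          = (y - X\beta)^\top (y - X\beta) \\
\nabla_\beta L(\beta) &= -2\, X^\top (y - X\beta) \\
\nabla_\beta L(\hat\beta) = 0
  \;&\Longrightarrow\; X^\top X \hat\beta = X^\top y \\
  \;&\Longrightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y
\end{align*}
```

Since $L$ is convex in $\beta$ (its Hessian $2 X^\top X$ is positive semidefinite), the stationary point is a minimum.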

So I see the data science path (and yes, it takes a long time to learn all the
components, including statistics, machine learning, and distributed systems)
as a mechanism through which a genuinely competent software engineer can say,
"I'm good at math, and I can code; therefore, I deserve the most interesting
work your company can afford to fund." It's a way to keep getting the best
work and avoid falling into that FactoryFactory hoi polloi who stop advancing
at 25 and are unemployed by 40.

~~~
ericmoritz
I don't know how you can say that knowing functional programming inherently
makes you a better programmer than another. A bad programmer is a bad
programmer regardless of domain.

I have seen some ninja FP artists that have written some really terse
solutions in a FP style which are completely unreadable to the uninitiated.

Also, I do not think that there is anything intrinsically difficult about FP.
Knowing it proves that you are ambitious and that you have mental plasticity
but not necessarily that you are a good developer.

You're just as likely to become that weird dude that writes code no one
understands as you are to become Mr Wizard.

~~~
Evbn
Mr Wizard was accessible to children!

------
marshallp
The author clearly has a bias (I'm a data scientist so respect me and pay me a
lot). He's then gone on to describe some standard programming and maths skills
that a huge number of people have (taught to
engineers/scientists/programmers). I'm going to get downvotes and be labeled
troll but I just have to plainly disagree. Data science isn't some magic new
career field, it's simply the application of standard scientific tools to
tables of numbers. As netflix and kaggle competitions have clearly
demonstrated, literally anyone from anywhere has a shot at being the best on any
particular spreadsheet of numbers (that's what it boils down to: a (possibly
large) spreadsheet of numbers).

~~~
dantkz
You actually do not contradict the author that much. The thing is, the skills
that are required are somewhat standard programming and graduate level maths,
which narrows down the number of people to graduate level computer scientists.
As he mentioned, data science is NOT JUST the application of someone's
algorithms to the data, you need to have skills to preprocess the data, be
able to use distributed systems for big data, actually know what is going on
under the hood, etc. Also, I doubt that the netflix and kaggle competition
winners were anyones from anywhere, they probably already had quite a bit of
experience with ML.

~~~
OverKAnalytics
Agree, unless you're arguing that graduate level math(s) are only 'truly
understood' by individuals with post-graduate math(s) degrees. In my opinion,
reading (and practicing) is a viable alternative to the same skills. Or, as
more famously said by Matt Damon...
(<http://www.youtube.com/watch?v=ymsHLkB8u3s>)

~~~
icelancer
Actually, the strongest candidates I've screened have been self-learners who
list Coursera or their own OSS projects on their resume with little academic
background. (Our best analyst is a guy who is qualified to repair VCRs with
his trade skill diploma in rudimentary electronics.)

