
Becoming a data scientist might be easier than you think - EzGraphs
http://gigaom.com/data/why-becoming-a-data-scientist-might-be-easier-than-you-think/
======
rm999
Meh. I've said it before, and I'll say it again: those contests aren't
necessarily a good indicator of who will be a good data scientist, in the same
way programming contests tell you little about who will be a good software
engineer.

Being a good data scientist requires a lot more than machine learning,
including a solid understanding of the business side (deep domain knowledge),
the ability to write production-grade software and tools,
scripting/hacking/data munging, math/statistics, and common sense. Running a
sanitized dataset through machine learning algorithms is maybe 5-10% of it.

I'm not trying to discourage people; I'm thrilled so many people are taking an
interest in data science, and I want to push interested people in a direction
where they can excel at it. But this article is dangerous - becoming a data
scientist requires _a lot_ of hard work. I've seen a large sample of people
(through interviews) who think a single online class is enough to get into the
field. It's a great start, but if you want to be valuable you need a wider set
of skills.

~~~
marshallp
All the evidence from kaggle indicates that deep domain knowledge is not
required. Jeremy Howard has some youtube videos discussing this. Pretty much
all the skills you outlined (except for production grade code - which is a
software engineer problem, not a data scientist problem) are covered by the
contest.

~~~
svasan
Winning a kaggle contest is totally different from how a particular
statistical model performs under normal business conditions, where the metrics
of interest for evaluating the model are things like how much money was saved
or whether the model brought about the requisite behavior change. Performance
under real-life business conditions is what matters, not who won the contest.
And to get good models for a specific business need, you do need domain
knowledge.

Does kaggle publish how the models performed under normal business conditions?

~~~
marshallp
I'm not following your line of reasoning. Everything is data at the end of the
day. All you're doing is creating a predictive model. If the business
conditions change, you would simply send it to the data scientists to
reformulate a new model. That's no different to sending it to kaggle again.

~~~
svasan
>> Everything is data at the end of the day.

But you have to interpret the data within the context of the business
need/requirement.

Building a credit risk model is vastly different from building a (personal)
insolvency/bankruptcy model though both may entail the same set of steps in
developing the model. The variables that make it to the model depend on the
business need.

In kaggle, one of the datasets that I messed around with had variables labeled
Var_1, Var_2, ..., Var_X. So while fitting a model, I would not know why a
particular variable made it into the model. You can see that this kind of
variable labeling does not give me any insight into how that variable was
generated. I need to know whether the variable was raw/aggregated/transformed
etc. And that takes you back to understanding the data in the context of the
business/domain.

~~~
marshallp
Why not just give all data (or as much as you can afford to) to the data
scientist and let them figure it all out?

~~~
svasan
A data dump would not really help because the data could have been influenced
by

1) a key company policy
2) specific business activity
3) input coming from another model

It is better to

a) define the problem,
b) collect the data,
c) build the variable library,
d) and then fit the model

rather than jump directly to step (d), because the modeler/scientist then has
a greater understanding of the entire set of data going into the model
development. It is very likely that the modeler would uncover any or all of
the three influencing factors I mentioned above during the data collection
stage.

While kaggle is an interesting concept, from a different perspective it looks
like an "effort harvesting" operation. For a pittance, the
companies/institutions that are sponsoring the contests are getting a steal.
(I am not sure if the million dollar prize is still up for grabs.) However,
for folks who do want to break into the data science/statistics field, kaggle
certainly is a good platform to get acquainted with data science/statistics
related skills.

------
dschiptsov
_Bad programming is easy. Idiots can learn it in 21 days_..

I'm one of those who actually completed this course (with a score of
73.10/780) but it doesn't make you a data scientist. It is only the very
beginning.

The course itself is a brilliant work of a passionate top-of-the-field
professional. No wonder coursera.org is such a huge success.

~~~
Evbn
Is 73 out of 780 a good score?

~~~
dschiptsov
73 out of 80 and 780 out of 800.)

------
pitiburi
In an interesting turn of events, the article about machine learning was
trolled by a bot.

------
ahi
If you are an actuary you are already a data scientist.

~~~
EzGraphs
Yeah - particularly the mathematical/statistical side of things. But a big
part of the profession is software development / computer science. An actuary
might construct useful models and choose the proper techniques to analyze
data, but choose inferior technologies or suboptimal implementations that
won't work for large data sets.

A lot of actuaries spend most of their life inside of Excel. Excel won't cut
it with Big Data.

~~~
hessenwolf
A big part of the profession, and zero part of the training in the profession.
Actuaries write worse code than electrical engineers.

------
amalag
I guess the professor isn't kidding when he says in the videos, "After you
finish this course, you will know as much or more than the silicon valley
programmers doing machine learning." The material he presents is quite
distilled. He gives a lot of real examples, but the programming exercises are
sort of fill in the blanks. A lot of the hard work is done. You can still
learn a lot though.

------
betawolf33
Did anyone else notice the comments on this article?

~~~
ryaf
Yes. I can't tell if that's a bit or not.

~~~
ryaf
Bot _

~~~
Evbn
Aww, I would be much more excited to go read if it were a bit. (A comedy
routine, or alternatively, very very short)

------
waterlesscloud
This is a crucial step for the MOOCs: students who have completed their
classes and gone on to real-world achievements. It's precisely the way that a
school builds a reputation, by the success of its students.

------
marshallp
Which brings up the question of what all these academic machine learningists
(especially the theorists) are up to? Why aren't they winning kaggle? Why
aren't Andrew Ng's own grad students collecting the prizes? Some self-
reflection needs to go on.

~~~
imgabe
Maybe they are spending their time conducting research instead of entering
contests? I'm not trying to be snarky, but do you know that they're even
entering the contests?

~~~
marshallp
Why are they not entering contests? What is the point of their research if it
isn't to engineer better algorithms for the task of machine learning? How do
they prove that their research led to better algorithms?

~~~
imgabe
I would imagine they publish papers that are reviewed by other academics. It
is possible to use their algorithm on a dataset without entering a contest to
do so.

It's a little like asking why Electrical Engineering PhDs are not designing
the new iPhone (probably _some_ are involved somewhere, I know). Research and
applications are two separate endeavors.

~~~
marshallp
The researchers actually do set up competitions to see who is best. However,
there are few entrants. In kaggle there are 1000s of competitors, so it's a
better judge of whether they are legitimately improving the state of the art
(rather than simply schmoozing their local funding agencies).

~~~
Evbn
Oh, just noticed the username. I hope someone hires Marshall soon, or he
finishes the book he is working on, or whatever, so this long-form trolling
can come to a conclusion.

~~~
marshallp
Well, that's rather rude. Maybe you haven't realized this is HN and not 4chan.

