
How to Become a Data Scientist, Part 2 - iamjeff
https://www.experfy.com/blog/how-to-become-a-data-scientist-part-2-3
======
stevehiehn
I don't like the term 'Data Scientist' at all. I think it's far too loaded. It's
a bad thing. Many developers already building sophisticated analytics and
predictive systems will avoid identifying with what they are really doing for
fear of being challenged as to whether they are a 'real' scientist or not.

~~~
manigandham
Yes. All scientific disciplines involve data, and while there is a field of
studying "data" as a concept itself, this is most certainly not what these
people at these companies do.

The accurate title is Data Analyst, something that has already existed for a
long time and works just fine.

~~~
gaius
Or business analyst, or quant in the world of banking/finance. Someone who
calls themselves a "data scientist" lacks the awareness that people have been
doing this stuff for _decades_, and therefore also lacks the body of
knowledge that has built up over that time. Actually, anyone can download a
couple of open source libraries and run the numbers through them, so where is
the value-add of "data science"?

------
mastazi
In part 1/3, the author writes that there are 2 branches of data science:

> Data science for people (Type A), i.e. analytics to support evidence-based
> decision making

> Data science for software (Type B), for example: recommender systems as we
> see in Netflix and Spotify

Isn't "type A" business intelligence, and isn't "type B" machine learning? Why
doesn't he use those more widely known terms? Or maybe he is referring to
something else?

~~~
SatvikBeri
Both business intelligence and machine learning are narrower terms with a more
specific meaning. Business intelligence has the connotation of a certain set
of techniques (e.g. reporting, SQL, querying, PowerPoint presentations) while
machine learning is a fairly specific set of tools that is much narrower than
what most software data scientists do. E.g. I'm a software data scientist and
I spend much more time on descriptive statistics than machine learning.

~~~
mastazi
Yes, I agree that both ML and BI are narrower definitions compared to that of
data science.

------
tomrod
I liked this series and this part. I think it's important for people using
data science in the industry to continue giving insight into best practices,
feedback to academic programs, and occasional insights into the problem
applications. In my mind, this ultimately improves the quality, education, and
marketability of data science.

~~~
iamjeff
I discovered the series earlier today on HN, and the discovery could not have
been more timely. I am just about to embark on the first six to eight months
of a learning journey and see immediate utility in insightful series such as
this one. I also came across a really helpful post that gives recommendations
on progress markers for the self-taught developer: "A Better Way to Learn
Programming? Notes on The Odin Project"
([http://everydayutilitarian.com/essays/notes-on-the-odin-proj...](http://everydayutilitarian.com/essays/notes-on-the-odin-project/)).
Guides like these, while they take a lot of time to write and refine, are
complete lifesavers for entry-level professionals and prospective
practitioners (and especially if they come from professionals that have been
"tried and tested").

~~~
tomrod
Guides like these inspire me to quit being lazy and get back to writing one!

~~~
iamjeff
You are a data wrangler? Perhaps a guide would do; noobs like me have no
perspective on what to learn and how to learn it. I mean, it wasn't even a
year ago that I was convinced that I could go from 0 to 100 data science-wise
in under a year: I wanted to learn it all. It took me the better part of a
year to realize that I had wasted innumerable hours devising a curriculum and
timelines that were plain dumb. A practical guide could have spared me a lot
of hurt and while I cannot at this moment compensate you (or the community for
that matter), I am sure that opportunities will certainly arise for me to pay
my debt. Would love to see a guide from you; it would come with the added
advantage that you would be accessible to the brilliant HN community. I would
give away a limb to see such a discussion go down: What to learn and where to
learn it from (a lot of folks, I imagine, would not mind recommendations for
openly accessible material; I know I wouldn't)? How fast should you expect to
go? What time commitments? Which tools and frameworks? Motivation hacks? Where
would I go to find remote jobs? What level of proficiency should I achieve in
the first sprint?

~~~
minimaxir
The best way to learn how to wrangle data is _practice_, especially outside
of academic settings, where the example data is not necessarily reflective of
real-world data.

Helping others wrangle data is one of the reasons I publish my Jupyter
notebooks as open source. A few examples of my data wrangling with R:

Processing Stack Overflow Developer data:
[https://github.com/minimaxir/stack-overflow-survey/blob/mast...](https://github.com/minimaxir/stack-overflow-survey/blob/master/stack_overflow_dev_survey.ipynb)

Identifying related Reddit Subreddits:
[https://github.com/minimaxir/subreddit-related/blob/master/f...](https://github.com/minimaxir/subreddit-related/blob/master/find_related_subreddits.ipynb)

Determining correlation between genders of lead actors of movies and box office
revenue:
[https://github.com/minimaxir/movie-gender/blob/master/movie_...](https://github.com/minimaxir/movie-gender/blob/master/movie_gender.ipynb)

~~~
nl
This.

I'd add that Kaggle is very good for the "other end" of data science: they
generally have pretty clean data, and clear problem descriptions.

In real life the data is never clean and the problems are rarely known in
advance.

~~~
benhamner
We also are growing
[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets), which
won't necessarily have clean data, clear problem statements, and a well-
defined task.

------
apathy
COI: author is a "data science" recruiter and the field has not coalesced down
to a static definition. Caveat lector.

------
mrjaeger
In the article Alec mentions it is important to be able to read academic
papers properly. Does anyone have good resources for this? I've read some
papers before but do not have a research/academic background where I really
had to dig deeply into them.

~~~
Obi_Juan_Kenobi
I'm not sure there's a good answer to this.

The best I can say is to go to grad school. That's a terrible answer, but it's
perhaps the only realistic one. It's in that situation, or one very similar,
where you're exposed to loads of criticism and discussion. Basically any paper
that was competently written (even if it wasn't competent work) is going to
sound convincing to the naive. After hearing a few papers get torn down,
you'll see the cracks in weak arguments, the poorly supported conclusions, and
the seemingly boring stuff that's absolutely brilliant.

Very generally, the best sign of good work is a 'masochistic' author. What I
mean by this is an author that writes as though every result they get is
deeply suspect and needs to be corroborated in multiple ways. When it's almost
exhausting to read because it feels like they're just beating themselves up,
you're probably reading something really special.

Likely the most difficult thing to do as an 'outsider' is to get a sense of
how 'trustworthy' certain results are. Some methods are almost binary in that
you either get no result, or a great result. If an author shows this, there
might be very little reason to doubt it, and thus independent lines of
evidence are not really necessary, especially if there's context that supports
or is consistent with that result. Other methods are notoriously terrible and
need a great deal of careful controls and analysis to even be considered, and
then only as one angle of attack. Sometimes you can find reviews that discuss
methods like this, which would be an invaluable resource. Reviews are
generally a great way to start reading a field, anyhow.

_______

As a practical guide, a well-written paper can just be read start to finish.
Then reflect on it to see if you understand it. Could you explain the paper to
someone else? That's a good sign of whether you understand it. After that,
think of critiques. Could the results be interpreted in different ways? Was the
analysis appropriate for the data? Are the methods reliable? _All_ papers have
weaknesses; we live in a world of finite time and resources. All papers could be
better, so think about what could be done. After that, consider what would be
reasonable to do. Did the authors skip something conspicuous? That's a good
sign that there was some difficulty they were avoiding. That might be fine,
but it also might mean there's data that doesn't fit with their conclusions,
which would be a very big issue indeed.

That latter part is the most important, but also the most difficult to do. It
requires reading dozens, really hundreds, of papers so that you learn about
some 'unknown unknowns'. Hearing talks really helps with this, too, as many
people will give a sort of history of their work that includes some of the
twists and dead ends.

_______

That all said, anyone can read a paper. It's not 'magic' that lets you do it.
You'll miss some of the nuance, and occasionally be led astray, but peer
review works well enough that papers are mostly quite good with the
devil held to the details. Like most things, it likely follows the Pareto
Principle, with a little effort bearing outsize results.

------
darkhorn
BS in Statistics? If you don't have one you are very likely not a data
SCIENTIST. You will be a data guy.

I know that some meteorologists have used normality in forecasting. This is an
example of why you cannot become a data SCIENTIST. Another example: applying
regression to your data. If you think that regression is as simple as its
formula then you need at least 4 years to understand what I mean.

~~~
taeric
Does the word scientist mean something different if it is in all caps?

I mean, I get it. You would like the word to retain some pure version
of a meaning that it actually never had. Similar to getting upset about people
using the word 'literally' in a very figurative sense.

The relevant question on this style of article is not about wordsmithing.

That said, I'm in a field where we often throw the title engineer on people,
but we don't know why.

~~~
Obi_Juan_Kenobi
I don't really want to weigh in on a semantic argument, but I was considering
what would be a good definition for a 'scientist' vocation.

To me, a scientist is someone who engages in research with generalizable
results.

That would exclude someone who does experiments and analysis, but only applies
established methods to a particular problem. Call them an analyst, perhaps,
but science is an ongoing dialog that they are not participating in.

I think that's part of why many consider it really presumptuous for 'data
scientists' to call themselves such. Some of them are certainly developing new
methods and engaging in a kind of dialog that is definitely science. Others
are addressing business needs with a new sort of analysis, drawing on the
field but not giving back to it. There is absolutely nothing wrong with that,
but it does seem like that pursuit is different enough to bother calling it
something else.

~~~
taeric
Historically, I think this would exclude a lot of well-known scientists.
Consider, many of them were running experiments to find an answer. Could have
been aiming cannon balls or planting crops. Or, in the case of many of them,
looking for a way to transmute matter. :)

More recently, a lot of scientists worked to figure out atomic energy. Most of
them, in all likelihood, we no longer know by name.

I think of it as musicians. If you define musician to be rockstar, there are
not nearly as many as if you include school teachers, symphony players,
conductors, etc. Yet, for most people, this latter set would definitely be
considered musicians.

