

Ask HN: What tools do Data Scientists use most commonly? - elon_musk

I have started learning machine learning and data mining and was wondering which technologies I should try to master. Should I learn R or Python (nltk, scikit-learn, pandas, etc.)? Should I learn Hadoop or focus more on machine learning techniques? There is a lot of content on the internet, from free courses to Kaggle competitions. Where should I invest the most time?
======
DaFranker
Disclaimer: I am not a data scientist. I just know a few, and I have some
Google-fu.

> Should I learn R or Python (nltk, scikit-learn, pandas etc.)?

Honestly, _whatever is most convenient_ at first. Getting a good grip on your
first one is orders of magnitude more important than what language or system
it is, and since convenience and ease of use help a lot with focusing one's
attention on learning and mastery, that's what counts. Eventually you might
grow out of the one you picked, and if you keep coding stuff you'll inevitably
branch out to other languages and systems. All the data scientists I know know
and use both R and Python (with bells and whistles). All of them.

> Should I learn hadoop or focus more on machine learning techniques?

From what I read, and from what one of those data people told me, Hadoop can
be a waste of time in many situations. Namely, for almost anything small
enough to store and process on a desktop computer, you'll probably do it
faster using something else (and with less headache). But hop to page 8 of
this survey for a better picture from the horse's mouth:
[http://www.paradigm4.com/wp-content/uploads/2014/06/P4-data-...](http://www.paradigm4.com/wp-content/uploads/2014/06/P4-data-scientist-survey-FINAL.pdf)

The rest of the survey I linked above is probably well worth your time as
well. Obviously, take into account the survey was run and presented by
paradigm4, and do your mental corrections and debiasing accordingly.
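
To make the Hadoop point concrete: word counting is the usual first MapReduce exercise, and for data that fits on one machine it is a few lines of plain Python. This is only a minimal sketch under that assumption; the `word_counts` helper, the crude tokenizer and the sample lines are made up for illustration and are not from the survey.

```python
from collections import Counter
import re


def word_counts(lines):
    """Count words across an iterable of text lines, in a single process."""
    counts = Counter()
    for line in lines:
        # Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


# Hypothetical sample input; a real run would iterate over an open file instead.
sample = ["Hadoop is a tool", "a tool is not a goal"]
counts = word_counts(sample)
print(counts.most_common(3))
```

Because the function takes any iterable of lines, passing a file object streams the input line by line, so the text itself does not have to fit in memory, only the table of counts does.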

------
vkb
Data scientist here. It really depends on what kind/size of data you intend to
work with. If you want to work with enormous data sets, focus on Hadoop and
Machine Learning. You won't be able to learn a lot of Hadoop on your own since
it really requires having a working cluster and breaking it, but you can get
some idea.

For smaller data sets, R will work, and Python acts as glue. In general, on a
daily basis, I spend about 70% of my day finding and cleaning data (a lot of
data scientists will say the same) and getting it in the shape I need, and
only then running algorithms over it.

I work with fairly small-to-medium sized data (now; previously I was at Hadoop
scale) and my tools of choice are: SQL for getting the data I need (you will
need to become really good at SQL; almost everyone uses it everywhere, and it
is a great universal skill to have), Python (pandas) for cleaning the data and
making that cleaning process reproducible, then high-level algorithm analysis
in R, and presentation either in R or Tableau.
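
A rough illustration of that workflow (SQL to fetch, a repeatable cleaning step, then analysis) might look like the sketch below. To keep it runnable without extra installs it uses the standard library's `sqlite3` and plain Python in place of pandas and R. The `orders` table, its columns, the sample rows and the helper names are all hypothetical.

```python
import sqlite3
from statistics import mean

# Hypothetical in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "10.5"), ("Ann ", "4.5"), ("bob", "n/a"), ("bob", "20")],
)

# Step 1: use SQL to pull only the rows and columns that are needed.
rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount IS NOT NULL"
).fetchall()
conn.close()


def clean(rows):
    """Step 2: normalise names and drop amounts that are not numbers."""
    cleaned = []
    for customer, amount in rows:
        try:
            value = float(amount)
        except ValueError:
            continue  # skip entries such as "n/a"
        cleaned.append((customer.strip().lower(), value))
    return cleaned


def mean_by_customer(cleaned):
    """Step 3: a trivial stand-in for the analysis stage."""
    groups = {}
    for customer, value in cleaned:
        groups.setdefault(customer, []).append(value)
    return {customer: mean(values) for customer, values in groups.items()}


result = mean_by_customer(clean(rows))
print(result)
```

Keeping the cleaning in a function rather than doing it by hand is what makes it reproducible: the same rules can be rerun whenever the query returns fresh data.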

------
svrgn
I'm not a data scientist, but I tinker often enough to know a fair share of
languages.

I would say learn _A_ statistical computing language and a programming
language. If you can mix the two (via APIs or otherwise), even better.
I interface MATLAB+C and Ruby+R, the former for speed and performance, the
latter to prototype.

Learn something statistical, be it R/Octave/Stata/MATLAB. If you have a
budget, go with MATLAB, as there are lots of packages and scripts available,
although R is catching up quickly. If performance is an issue, R has caveats,
since the vanilla version has no multithreading support.

I don't know Python and use C for everything, but there are so many packages
for Python, including (GPU-accelerated) machine learning, that it is IMO
definitely the way to go in your case.

------
mswen
Everything that you have mentioned is good. I would really focus on
understanding a wide range of traditional statistics, ranging from various
descriptive and exploratory techniques, to dimension reduction and latent
factor discovery as well as a variety of regression techniques. Then dig into
the whole machine learning space as you already suggested.

I would also get very comfortable with SQL - much of the data that you will
work with in an enterprise setting will be accessible through SQL querying or
some close variant.

If you come from a programming background, Python will be more sensible and
accessible for you as you learn. If you come from a more classical research
background, then R will generally be a more comfortable place to start.

Enjoy the journey

------
cblock811
> Should I learn R or Python?

My vote is on Python. It's more broadly useful, and R has some limitations
when handling really big data sets.

> Should I learn Hadoop or focus on machine learning?

Focus on machine learning. If you don't need to know the Hadoop architecture
then don't learn it. If you're picking up machine learning and later find you
need to know Hadoop, you'll be able to pick it up.

If you need distributed computing power for projects you are working on, here
are some options:

[http://zillabyte.com/](http://zillabyte.com/)
[http://databricks.com/](http://databricks.com/)
[http://aws.amazon.com/elasticmapreduce/](http://aws.amazon.com/elasticmapreduce/)

------
weishigoname
I am a machine learning newbie, too. I prefer Python: because it is very
popular, you can find answers quickly when you run into a problem with the
language.

------
santa_boy
R & Julia seem to be the choices many are leaning towards.

------
ig1
What would you like to do with your data science skills?

------
kirk21
R is getting quite popular in the scientific community.

