
Ask HN: Any "Data Scientists" here at HN? - polyfractal
Does anyone here at HN consider themselves a Data Scientist (or alternatively, Data Analyst, Competitive Intelligence, etc.)?  Would you mind sharing a bit about your job, your day-to-day work, and what skills are absolutely required?  Do you work 9-5 or as a freelance contractor?  Is it possible to break into the field without a specific degree?<p>I'm a molecular/cellular biologist who wants to leave the field.  I think data science looks interesting for a number of reasons.  I have past experience programming in a number of languages and occasionally work with large data sets in my current day job.  One of my hobby projects involved data-mining Medline.  I have cursory knowledge of statistics (t-tests, ANOVA, correlations, linear regressions, etc.) but no advanced skills yet.  Essentially, I think I have the foundation skills required for the job but no marketable experience.<p>Is it possible to redefine yourself as a data scientist?  I know in programming, you just need to code up a few side projects to demonstrate your aptitude.  Is the same possible in data science?
======
dvcat
I do machine learning and would consider myself a data scientist. I was an
engineer who decided to do an advanced degree in statistics and computer
science just because I liked this stuff. I currently work in the analytics
division of a small company:

1\. It's not necessary to have the relevant degrees: I did, but a lot of people
on my team come from the social sciences and other fields. You might find it
hard to get past HR, but that can be fixed by cleaning up
the "weird parts" of your resume and highlighting the "right parts". You seem
to have a good handle on what is what on that front.

2\. Your statistics, linear algebra and probability skills need to be up to
par. People from a more statistical background will grill you on those things.
It's extremely easy to see whether a person can think statistically by giving
them a toy data problem and asking them to hack at it. The way to train for it
is to play around with small datasets, and I see you have been doing that a
bit.

3\. People who come from a more C.S. side of things will try to explore your
knowledge of "machine learning algorithms", which are typically easy to
learn if you know the underlying math. The field has a lot of jargon, which
might make it appear fancy. Again, the math behind these algorithms is not
hard, but there are things you learn about how these algorithms work in
practice that really make a difference. So again, doing small projects and
putting them up on GitHub will help you learn more and make your resume look
good.

4\. Technology: There are loads of languages that are used in practice. Make
sure you know one scripting language (R, Python with SciPy/NumPy, or even
Matlab) and are comfortable using it as your scratch pad. The people who are
statistically oriented on my team use R. Other skills that are extremely
valuable, and that it won't kill you to learn, are the Map Reduce stack
(Java (uggh), Pig).
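
As an illustration of the kind of toy data problem mentioned in point 2, here
is a minimal sketch: fit a straight line to a handful of points and check how
strong the linear relationship is. The data is made up, the `fit_line` helper
is hypothetical, and plain Python is used instead of R or NumPy only so that
it runs without installing anything.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up toy data: y is roughly 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept, r = fit_line(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```

The interview-style follow-ups are usually about interpretation: what the
slope means, how far you would trust r with five points, and what you would
plot before believing any of it.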

I am currently doing machine learning on a dataset. This typically involves
playing around with the data in NumPy and sometimes Matlab. Once I am
comfortable with a particular choice of algorithms, I try to write it up in
Pig. I fall back to Java (Hadoop) only in the worst case.
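
For readers who have not seen the Map Reduce model that Pig and Hadoop are
built around, a rough single-machine sketch of its three phases follows. This
is not Pig or the Hadoop API; the function names and the word-count job are
assumptions chosen purely for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key; here, sum the counts."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up input records standing in for lines of a large file.
records = ["big data big hype", "data beats hype"]
counts = run_job(records)
print(counts)
```

On a real cluster the map and reduce calls run on many machines and the
shuffle moves data over the network, which is why an algorithm prototyped in
NumPy usually has to be restructured before it fits this shape.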

Hope this helps...

~~~
polyfractal
Very helpful, thanks!

2) This is really my weakness since my stats and linear algebra are passable
but not great. There are several free datasets (mostly from data.gov) that
I've been playing around with. Should I "publish" the results of my practice
studies on a portfolio-esque site? Or is it sufficient to just know the
techniques well enough to answer interview questions?

3) I'm fairly well acquainted with machine learning - my interest in machine
learning is one of the driving forces for me to take up neuroscience as a
career.

4) Great information, thanks. I'm glad to see people use technology more as a
scratchpad and less as a regimented "You must know XYZ tech stack". R and
SciPy are on my to-do list, I'll add Map Reduce.

As a machine learning guy in an analytics department, are you hunting through
internally generated numbers to find trends (like sales, ad placement, etc.)?
Or are you hunting through externally generated data to find new
trends/products/markets?

~~~
dvcat
2) "Publishing" (blogging about this) will help. I am not too sure what you
mean by "know the techniques well enough". In my case, I cracked a few books
on LA/Stat, worked through them, and ran into situations where I had to
use those skills repeatedly in courses I took at school. Doing
tiny projects where you can build mathematical intuition for certain
concepts will be useful.

4) So the fact of the matter is that you should be versatile enough to absorb
their tech stack as soon as possible. That is possible only if you have worked
in depth in at least one tech stack. This is especially true for Map Reduce
based technologies.

It's hard to be specific: all of those things are done in analytics departments.
There are not many machine learning people at my place of work, so I mostly
find problems that require algorithmic solutions and try to see if I can solve
them. Things are probably more structured in larger companies.

------
stuartcw
I have become a Data Analyst after many years in Software Development. As a
team we are studying the DAMA DMBOK, the Data Management Body of Knowledge
(cf. <http://www.dama.org/i4a/pages/index.cfm?pageid=3345>), in order to
become certified Data Management Professionals.

A Data Analyst role might not be exactly what you want, as you might be
analysing a large enterprise's data to find out about its quality, what each
piece of data is defined to mean, and whether it is the master data, a copy,
or derived from some other data.

It looks like a very dry area of study initially, but an awful lot of real-world
problems are problems with data that fall into predictable patterns.

