

Ask HN: What does a data scientist actually do each day? - zkan

Any idea? :)
======
eshvk
Umm... Let us see. Assume I am trying to build a fancy new recommendation
system for movies on a completely new media device. Let us assume that for now
we have been collecting user data and recommending random movies. The first
thing we do is go through this data and try to figure out what the relevant
user features would be. While you are at it, you might find that there are
some odd issues with the way the data is being collected which are creating
statistical biases: say, for example, you only have the server time at which
the user's rating of the movie hit your system, not the actual time at which
the user rated the movie. So you go back and fix those. Rinse and repeat.

Now you find that your potential feature set is extremely sparse and high
dimensional. So you think about adding other features from the movie set. Go
through those: hopefully you are not starting the problem from scratch and
have already gone past the data collection and feature extraction problem
when it comes to movies.

Now you start thinking algorithmically: nothing too complicated, because
everything has to be made production ready and adding bugs in machine
learning algorithms is extremely easy (see Mahout ;-)). You prototype a few
algorithms in Hive or your favorite scripting language and then get some data
out for A/B testing. You run your A/B tests and hopefully you have something
that looks significantly better than baseline.

Now you go through the boring part of making that algorithm production ready.
This means thinking in terms of scale: 1) How do I deliver these
recommendations on the fly to millions of users who are using my fancy new
media device? 2) How do I do this without stressing out my system? So you
write a bunch of code, integrate it into your current APIs, run a cron job,
hope nothing breaks, and monitor graphs like a nervous little rabbit.
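The "significantly better than baseline" step above usually comes down to a standard significance test on the A/B results. A minimal sketch, assuming the metric is a conversion-style rate (the counts below are made up for illustration):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms are identical.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: baseline (random recs) vs. the new algorithm.
z, p = two_proportion_ztest(conv_a=420, n_a=10_000, conv_b=510, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

In practice you would pull these counts out of Hive and also sanity-check that the two buckets were assigned randomly, since the logging biases mentioned above can leak into the experiment itself.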

Then you repeat all of this the next week.

EDIT: Obviously depending on the company and the job, this entire story might
completely change etc etc.

------
cityhall
The first big chunk of time is getting data into a usable format. Customers
(internal or external) all have different data formats, and nothing is
consistent. So you write parsers, deal with the corner cases, and get data
into a form you can analyze.
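That parsing work tends to look like the sketch below: try every format you have seen so far, and route the corner cases somewhere visible instead of crashing. The format list and rows are hypothetical stand-ins for a customer's inconsistent data:

```python
from datetime import datetime

# Hypothetical mix of date formats seen across different customers.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%Y%m%d"]

def parse_date(raw):
    """Try each known format; return a datetime, or None for corner cases."""
    raw = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None  # queue for manual inspection rather than failing the batch

rows = ["2014-03-01", "03/01/2014", "1 Mar 2014", "20140301", "N/A"]
parsed = [parse_date(r) for r in rows]
```

Every new customer adds a format or two to the list, which is why this stage keeps eating time long after the "real" modeling has started.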

Then you have to write software that lets you both experiment and release your
models to production. That involves writing pipeline architectures to apply
things like feature extraction and pruning in a consistent way, and to make
sure the result can be serialized and deployed. Off-the-shelf packages
typically haven't solved these problems very well, so you have to make sure
the thing that looks good in the scripting environment is reproducible on new
data.
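The core of that reproducibility problem is that every transform has to be fitted once and then applied identically in production. A minimal stdlib-only sketch of one pipeline step, with pickle standing in for whatever serialization the deployment actually uses:

```python
import pickle

class Standardizer:
    """Learn mean/scale on training data; apply the same transform later."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.scale = var ** 0.5 or 1.0  # guard against zero variance
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.scale for x in xs]

# Fit once on training data, serialize, and "deploy": the production side
# must score new data with the *training-time* parameters, not refit them.
step = Standardizer().fit([1.0, 2.0, 3.0, 4.0])
blob = pickle.dumps(step)          # what actually gets shipped
deployed = pickle.loads(blob)
scored = deployed.transform([5.0])
```

Real pipelines chain many such steps (feature extraction, pruning, the model itself), but the fit/transform/serialize contract is the part that off-the-shelf tools tend to leave to you.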

Then when you have a model and can deploy it, you start working on automating
the training process so the model automatically adapts as new data comes in.
Usually the customer has gotten the impression this was happening on day one,
so you have to rush to deliver it.

Then you deal with customer complaints that the model gets something wrong
that would have been obvious to a human, even after they have corrected it.

At some point you try to measure the gains you're offering over the non-ML
system you replaced, and try to tweak those metrics until they make you look
good.

If you're lucky, you got to experiment with some interesting algorithms
somewhere in the middle, but you probably got the best results from something
fairly standard like random forests and not the latent bayesian slice sampler
you dreamed up when you first heard about the problem.

------
lsiebert
Getting data ready, clean and formatted for analysis. That is a huge part,
unless you are super lucky. Think about how to analyze said data. Come up with
a hypothesis, test, analyze results, proceed to secondary hypothesis. Repeat.

------
_delirium
I imagine it depends on the company. A friend of mine with a data-scientist
job reports that the most important part of _his_ particular job is making
convincing PowerPoints to present "insights from data" to management.

