
Ask HN: What are your biggest pain points as a data scientist? - uptownfunk
Some examples: productionizing code, cleaning data, documentation, etc.
======
magneticnorth
1\. Dealing with biased data

2\. Cleaning/understanding data - nearly all data sets I've used have
duplicates, missing data, highly anomalous distributions in some fields that
indicate we aren't measuring what we think we are, etc. So a lot of my time is
spent figuring out what's going on in the data, cleaning up the issues, figuring
out what subset of the data is reliable, and then dealing with the biases
introduced by what is missing or wrong.

3\. Dealing with people who don't understand or respect statistics and data
science. For example, I've been brought in to "do the analysis" on an "A/B
test" where a team didn't appropriately randomize their samples, and also
hadn't done a statistical power test beforehand so had an underpowered test
anyway, so there was just no hope of validating that their change was an
improvement.
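
To make the point about power tests in #3 concrete, here is a minimal sketch of a pre-test sample size estimate. It assumes a two-sided two-proportion z-test with equal group sizes; the function name `sample_size_per_group` and the example rates are illustrative assumptions, not anything the team in question used.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_control vs p_variant.

    Uses the normal approximation for a two-sided two-proportion z-test
    with equally sized groups (an assumption of this sketch).
    """
    if p_control == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p_control + p_variant) / 2            # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(
            p_control * (1 - p_control) + p_variant * (1 - p_variant)
        )
    ) ** 2
    return math.ceil(numerator / (p_control - p_variant) ** 2)


if __name__ == "__main__":
    # Hypothetical example: baseline conversion 10%, hoped-for 12%.
    print(sample_size_per_group(0.10, 0.12))
```

With these hypothetical rates the estimate comes out at a few thousand users per group, and halving the detectable lift pushes it far higher, which is why a test sized by guesswork is so often underpowered.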

~~~
chewxy
I tried scaling the task of #2 across multiple people recently. It took way
longer than if I had done it all myself.

I want to know if there is a way to spread the load for this.

------
airza
Getting the dimensionality correct between layers in an untyped language is
incredibly painful. Getting a prototype working on my personal machine and
then having to play AMI roulette on AWS to get it running on a GPU is
frustrating and expensive. Every time I read the tensorflow documentation I
think, "I wish I could pay 500 dollars for a version of this that looked like
someone cared about it."
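
One cheap way to catch dimension mismatches before anything runs on a GPU is to check the layer chain up front in plain Python. This is only a sketch: `check_layer_shapes` and the `(name, in_dim, out_dim)` layer description are hypothetical, not part of TensorFlow or any other framework.

```python
def check_layer_shapes(input_dim, layers):
    """Walk a chain of dense-style layers and verify that sizes line up.

    `layers` is a list of (name, in_dim, out_dim) tuples. Returns the final
    output size, or raises ValueError naming the first mismatched layer.
    """
    current = input_dim
    for name, in_dim, out_dim in layers:
        if in_dim != current:
            raise ValueError(
                f"{name}: expects input of size {in_dim}, but receives {current}"
            )
        current = out_dim
    return current


if __name__ == "__main__":
    # Hypothetical three-layer network on 784-dimensional inputs.
    model = [("hidden1", 784, 128), ("hidden2", 128, 64), ("output", 64, 10)]
    print(check_layer_shapes(784, model))
```

The error message names the offending layer, which is most of what a type checker would have given you here.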

~~~
gtrevize
You can skip AMI roulette by using AWS Deep Learning AMIs
(https://aws.amazon.com/machine-learning/amis/) or the new container
equivalent (https://aws.amazon.com/machine-learning/containers/).

------
bsg75
That other groups in the business want simple, black and white answers to very
complex questions.

An expectation that data can eliminate the need for reason and thought is
problematic.

I have tried to communicate the reasoning for things like judgemental
forecasting but success is hard to achieve.

~~~
magneticnorth
Agreed. Many people don't want to think, they only want to know.

One of the things my team does is build data tools. The number of people who
want to take data they have hardly looked at, put it through a tool they don't
understand, and rely in important ways on the output, is astonishing to me.

------
sillyguy123
Stakeholders having high expectations of some ‘AI’ magic when a heuristic
will get us 80% of the way there in 10% of the time.

~~~
apohn
IME a number of Data Scientists reinforce this thinking. When people think you
are a genius, it's painful to admit that 80% of project goals can be
achieved with a simple heuristic.

It's incredible how many "Data Science" problems can be solved with a better
dashboard that enables people to look at data in a more useful way.

------
r0f1
That it is 90% cleaning, and only 10% modeling. Never had a dataset that was
ready to use like the ones on Kaggle. Most of the time I get a mixture of
Excel Sheets with weird formatting, .csv files and SQL dumps that have
questionable encoding, and lots of unnecessary information and missing values.
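
A first pass over a file like that can be partly automated. The sketch below uses only the standard library and assumes a comma-separated file with a header row. The helper names, the list of candidate encodings, and the set of "missing" tokens are all assumptions to adapt to your own data.

```python
import csv
import io
from collections import Counter


def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding).

    latin-1 accepts any byte sequence, so it is a last resort that can
    silently produce wrong characters.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


def audit_csv(raw, missing_tokens=("", "NA", "N/A", "null")):
    """Report encoding, row count, exact duplicate rows, and missing cells."""
    text, encoding = decode_with_fallback(raw)
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, [])
    seen = set()
    duplicates = 0
    rows = 0
    missing = Counter()
    for record in reader:
        rows += 1
        key = tuple(record)
        if key in seen:
            duplicates += 1  # exact repeat of an earlier row
        seen.add(key)
        # zip stops at the shorter sequence, so ragged rows are not flagged here
        for column, value in zip(header, record):
            if value.strip() in missing_tokens:
                missing[column] += 1
    return {
        "encoding": encoding,
        "rows": rows,
        "duplicates": duplicates,
        "missing": dict(missing),
    }


if __name__ == "__main__":
    sample = "id,city\n1,Z\u00fcrich\n1,Z\u00fcrich\n2,\n".encode("cp1252")
    print(audit_csv(sample))
```

It only detects exact duplicates and obviously empty cells. Near-duplicates and fields that do not measure what you think they measure still need a human.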

~~~
apohn
I think Kaggle participants miss out on some of the best parts of being a Data
Scientist. Fiddling with data, writing scripts to clean/transform/ingest data,
interacting with data owners and subject matter experts, etc. IME that's
actually a lot more fun than fiddling with parameters and looking at model
performance metrics.

------
p1esk
Getting lots of data (for deep learning models), cleaning that data, and
labeling it.

------
avin_regmi
What about playing with different hyperparameters? I always found that
time-consuming. What do you guys think?
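
One common way to cut down the manual fiddling is random search over a fixed trial budget. A minimal sketch, assuming continuous hyperparameters sampled uniformly and an objective where lower is better; `random_search` and the toy objective are illustrative, not taken from any library.

```python
import random


def random_search(objective, space, n_trials=20, seed=0):
    """Sample n_trials configurations uniformly and keep the lowest score.

    `space` maps each hyperparameter name to a (low, high) range.
    `objective` takes a dict of hyperparameters and returns a number.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {
            name: rng.uniform(low, high) for name, (low, high) in space.items()
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Toy objective standing in for a validation loss (hypothetical).
    def toy_loss(p):
        return (p["learning_rate"] - 0.1) ** 2

    print(random_search(toy_loss, {"learning_rate": (0.0, 1.0)}, n_trials=50))
```

In practice the objective is a full train-and-validate run, so the budget `n_trials` is what decides how much time the search consumes.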

------
Iwan-Zotow
Data!

