
The State of Data Science and Machine Learning - ishan_dikshit
https://www.kaggle.com/surveys/2017
======
kmax12
In "What barriers are faced at work?", I really wish they broke down the
"dirty data" response into more categories. In particular, I'd love to know if
people are dealing with data quality issues, feature engineering issues, or
something else altogether.

In my opinion, this is representative of the problems with data science tools
today. There is so much focus on the machine learning algorithms rather than
getting data ready for the algorithms. While there is a question that lets
respondents pick which of 15 different modeling algorithms they use, there's
nothing that talks about what technologies people use to deal with "dirty
data", which is agreed to be the biggest challenge for data scientists. I
think the formal study of data preparation and feature engineering is too
often ignored in the industry.

~~~
radarsat1
Of course everyone agrees that "cleaning data" is difficult and boring, and
it's always mentioned, but what I don't really understand is what kind of
tools people expect for this beyond those already available. E.g. pandas is
pretty good at merging tables, re-ordering, finding duplicates, filling or
dropping unknowns, etc. There are also tools for visualizing large amounts of
data, looking for outliers, etc. Beyond the basic tools, it seems to me that
each dataset requires decisions that can't be automated (e.g. do I drop
or fill the unknowns?) I don't see how this could be improved, as every
decision has a solid, semantic implication related to whatever is the
overarching research question.
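For concreteness, the routine operations mentioned above look like this in pandas (hypothetical toy data; whether to drop or fill is exactly the kind of semantic decision that can't be automated):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with the usual problems: exact duplicates
# and missing values.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "age": [34, 34, np.nan, 29, 29],
    "country": ["US", "US", "DE", None, None],
})

deduped = df.drop_duplicates()                             # find/remove doubles
filled = deduped.fillna({"age": deduped["age"].median()})  # fill unknowns...
dropped = deduped.dropna()                                 # ...or drop them
```

Both `filled` and `dropped` are defensible outputs; which is right depends on the overarching research question, which is the point above.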

So statements like "getting data ready for the algorithms" seem kind of
meaningless to me, in the sense of general methodologies. How could you
possibly "get the data ready" without considering what it is, how it will be
used, etc. How can it possibly be generalized to anything beyond the specific
requirements of each problem instance?

I'm just really curious what you are imagining when you say that better tools
are needed here.

~~~
kmax12
I am the lead contributor to a Python library called Featuretools[0]. It is
intended to perform automated feature engineering on time-varying and
multi-table datasets. We see it as bridging the gap between pandas and
libraries for machine learning like scikit-learn. It doesn't necessarily
handle data cleaning, but it does help get raw datasets ready for machine
learning algorithms.

We have actually used it successfully to compete on Kaggle without having to
perform manual feature engineering.

[0] [https://www.featuretools.com](https://www.featuretools.com)
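To make the gap between pandas and scikit-learn concrete, here is the kind of multi-table aggregation feature one would otherwise hand-roll (hypothetical toy tables; deep feature synthesis is meant to generate features like these automatically):

```python
import pandas as pd

# Hypothetical parent/child tables: customers and their transactions.
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [10.0, 20.0, 5.0],
})

# Hand-rolled aggregation features per customer: the kind of transform
# automated feature engineering applies across every table and column.
feats = transactions.groupby("customer_id")["amount"].agg(["sum", "mean", "count"])
feature_matrix = customers.merge(feats, on="customer_id", how="left")
```

The resulting `feature_matrix` is one row per customer, ready to feed to a scikit-learn estimator.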

~~~
ScottBurson
Wow, this looks very cool!

------
antgoldbloom
For interest, the raw data is published here:
[https://www.kaggle.com/kaggle/kaggle-survey-2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

And some early analysis from our community here:
[https://www.kaggle.com/crawford/analyzing-the-analyzers](https://www.kaggle.com/crawford/analyzing-the-analyzers)

Some things that jumped out at me:

1\. More people learn data science and ML from MOOCs than from university
courses.

2\. TensorFlow is the technology people most want to learn in the next year.

3\. 40% of people surveyed spend >1-2 hours per week searching for another job.
Surprising, given that companies universally complain about how hard it is to
find data scientists/machine learners.

~~~
PeachPlum
Re 3.

"There's a skills shortage (at the price we want to pay)"

~~~
antgoldbloom
Yeah. The median salary for a machine learning engineer is definitely higher
than what most companies are used to paying (even for software engineering
roles).

My argument is that machine learning is also higher leverage than most roles.
Think of an algorithm to predict loan defaults or customer churn for a bank:
in the hands of a great machine learner, one such algorithm can generate a
huge ROI.

------
godelski
Anyone else find it weird that when you click "other" for gender that the data
looks more like garbage?

I was actually trying to compare male and female salaries out of interest, but
I have a hard time believing so many people earn <$20k/yr, even when you
switch the filters around. The best I could find was filtering for the US, but
the number of respondents is so low, ~1k total (~200 women, ~800 men), that it
becomes difficult to make accurate comparisons ($22k difference, but the women
had more master's degrees and a similar share of PhDs, by percentage).

Has anyone sorted through this data and tried to account for these factors?
I'd be interested in the uncertainty and in how the information was gathered.
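One way to put error bars on such a comparison is a bootstrap confidence interval on each group's median. A minimal sketch with made-up lognormal salary samples (stand-ins for the ~200/~800 respondents above, not the actual survey data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical salary samples (USD) for two groups of different sizes.
group_a = rng.lognormal(mean=11.3, sigma=0.5, size=200)
group_b = rng.lognormal(mean=11.5, sigma=0.5, size=800)

def bootstrap_median_ci(x, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap CI for the median: resample with replacement,
    recompute the median each time, take the middle (1 - alpha) of results."""
    meds = [np.median(rng.choice(x, size=len(x), replace=True))
            for _ in range(n_boot)]
    return np.percentile(meds, [100 * alpha / 2, 100 * (1 - alpha / 2)])

lo_a, hi_a = bootstrap_median_ci(group_a)
lo_b, hi_b = bootstrap_median_ci(group_b)
```

If the two intervals overlap heavily, a raw difference in medians like "$22k" is not saying much, which is the concern here.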

~~~
jerednel
I pulled a few gender stats here.
[http://bit.ly/2zjrSJD](http://bit.ly/2zjrSJD) Accounting for country,
education, and industry really reduces the population you're sampling from,
but those deviations are huge. You need to account for industry especially.

~~~
godelski
Well, this really doesn't discuss the error associated with the data, which is
what I was trying to get at. There seems to be a lot of it, which makes
accurate predictions difficult.

------
followmeon
Next it'd be interesting to see Python 2 vs. Python 3. My own experience tells
me that the majority of top Kagglers still use Python 2, despite Kaggle
Kernels being Python 3 exclusively.

I'm also quite amazed by the predominant use of logistic regression. I wonder
if that is less about interpretability / ease of engineering, and more about
the barriers data scientists face when using more complex methods: lack of
data science talent, lack of management support, results not being used by
decision makers, and limitations of tools.

If Kaggle results are anything to go by, all businesses that care about the
best performance on structured data should be using some form of gradient
boosting.
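As a quick, non-authoritative sanity check of that claim, one can compare logistic regression with scikit-learn's gradient boosting on a synthetic tabular dataset (a sketch, not a benchmark; top Kaggle entries typically use tuned XGBoost or LightGBM rather than the default estimator shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic structured/tabular classification problem.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print(f"logistic regression: {logit.score(X_te, y_te):.3f}")
print(f"gradient boosting:   {gbm.score(X_te, y_te):.3f}")
```

On a real problem the gap (in either direction) depends heavily on how nonlinear the feature-target relationship is.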

------
technologia
Comparing my own situation, I fit pretty much at the median for my field, age,
and salary.

It'll be interesting to see what folks dig up over the coming weeks from this
dataset.

~~~
willis77
Nothing warms our icy, cold, statistical hearts quite like hearing that a
randomly chosen person is near the median. <3

~~~
antgoldbloom
Unlike the statistician who has his legs in the freezer and his head in the
oven but who is on average the right temperature.

------
glial
With the rise of Tensorflow and sklearn, the strong Python showing makes
sense.

However, I wish Python had a solid IDE for interactive work like RStudio.
Jupyter notebooks are fine, but being able to easily inspect variables is super
convenient.

Spyder doesn't cut it. Y-hat's Rodeo was still a bit buggy last time I tried
it. Any other suggestions?

~~~
reallymental
Tried PyCharm?

------
jinonoel
Interesting that the most commonly used models are the simpler ones, logistic
regression and decision trees, despite all the hype for more complicated
techniques like neural nets and GBMs. Is it just because these models are
faster to train and easier to interpret, or is it something else?

~~~
knn
In my experience, doing deep learning is a lot harder than building simpler ML
models: training times are killer, you need lots of data, overfitting is a
challenge, results are hard to interpret, and lots of things can go wrong.
Deep learning may be the future from a mathematical standpoint (neural nets
can approximate essentially arbitrary functions on some Borel space, whereas
many simpler ML models are basically special cases of them), but it's
definitely harder.

------
denfromufa
The income results do not reflect location, which is really important in the
US.

------
bootcat
Really worthy insights, especially for people like me who want to get a deeper
understanding of the ecosystem before becoming an actual data scientist!

