
One Year as a Data Scientist at Stack Overflow - var_explained
http://varianceexplained.org/r/year_data_scientist/
======
insulanian
I always liked data crunching, databases and data topics in general. Also,
most of the software development knowledge I have didn't come from the
official school curriculum, but rather from books, online courses and real-
world experience.

Now, with that in mind, how realistic is it for a guy in his mid-30s, a very
good software developer (enthusiastic about functional programming, if that
matters), to pick up enough knowledge about data science to actually take on
a data-scientist role at some company?

Edit: Rephrased the question.

~~~
IanCal
Totally realistic. It'll depend on what the company is, and what they're
looking for (and hopefully, this will match what they _should_ be looking
for).

While people often focus on applying the latest deep-learning thought-vector
approach to their BIG DATA, there's an enormous gulf between the state most
companies' data is actually in and that aspiration.

You don't need PhD level stats and machine learning to apply the things that
many companies actually could benefit from. Storing, maintaining and managing
the data properly is a start. Then working with the people in the business to
get insights from the data they have. Often simple aggregations and
visualisations can provide enormous benefit. Being able to show correlations,
and sometimes even just being able to show how noisy things are can be
important.
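
To make that concrete, even something as basic as the sketch below is often
the kind of analysis I mean (the table and column names are invented just for
illustration):

    library(dplyr)

    # A made-up orders table: one row per order
    orders <- data.frame(
      region        = sample(c("north", "south", "west"), 200, replace = TRUE),
      amount        = round(runif(200, 10, 500), 2),
      delivery_days = sample(1:14, 200, replace = TRUE)
    )

    # Simple aggregation: volume and averages per region
    orders %>%
      group_by(region) %>%
      summarise(
        n_orders     = n(),
        total_amount = sum(amount),
        avg_delivery = mean(delivery_days)
      )

    # A single correlation (or the lack of one) can be the whole finding
    cor(orders$amount, orders$delivery_days)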

Big questions: How is reality different from what we think? How are these
differences important for the company?

That might be a correlation that we don't expect, or a lack of one we think is
there. Part of a next step might be to become more "pro-active" and design new
experiments to answer questions that can't quite be answered yet due to a lack
of data. Beyond that you're heading towards bringing a new feature into a
product.

Does your current company have some data? Do they spend a lot of time emailing
spreadsheets around, or do they have Salesforce or a proper database? See if
you can make something useful for your work that's based on an analysis of
that data (and possibly do some work to smooth those workflows of passing data
around).

~~~
mswen
IanCal - I upvoted you but let me comment for extra emphasis. I remember
coming out of PhD-level quantitative social science studies in academia back
in the late 90s, where I was using k-means clustering, factor analysis,
multiple linear regression, ANOVA and more. When I moved into marketing
research and dealing with company data, it was shocking and disheartening how
little of those skills I could actually deploy. Data quality, data management
and just the cost of capturing relevant data were such obstacles that we were
reduced to much simpler analyses.

Over time I came to appreciate exactly what you were saying. Many companies
can be helped by fairly simple analyses.

Fast forward: data is getting much cheaper, and now, all these years later, my
more advanced stats skills seem to actually matter. But even in this apparent
abundance... data management and data quality assurance are often lacking or
significantly underfunded, and simple aggregations and analyses still make the
most difference in many organizations.

~~~
Declanomous
I went into marketing from a biology background, and I felt the transition was
very easy. Collecting data in biology, especially on an ecosystem scale, is
extremely difficult, and your data has a _lot_ of noise in it. How the data is
collected and processed is almost more important than how the data was
analyzed.

Collecting and processing data is going to be extremely time consuming and
expensive regardless of what you do, but you can make it easier on yourself
through smart experimental design. P-values are still widely used for analysis
in biology, to the chagrin of most statisticians. I don't think it's as big a
problem as it might seem, though. Bayesian reasoning is inherently part of the
scientific process at the experimental design stage. The reliance on p-values
becomes a problem when journalists report on research findings in a single
paper that is "significant" because of p-values. Professionals in the field
are capable of evaluating p-value-based research in a balanced way, but we end
up with issues like anti-vaxxers when research is presented outside that
circle of professionals.

Of course there are still ways the process can allow incorrect research to
present itself. For one thing, journals are not interested in publishing
studies that do not have "significant" results, which means that bad research
can stick around for far too long.

I personally think that the solution to these problems is open data, since so
much of the research depends on how the data was manipulated and 'cleaned'
prior to analysis.

------
mkagenius
> It makes me sad when brilliant software engineers open up Excel to make a
> line graph!

Why do people get religious about tech? Funny. Let him/her use Excel, for
God's sake, it's a great tool :)

~~~
Declanomous
Excel is a great tool, but its graphing ability leaves a lot to be desired. I
also feel that graphing in Excel has gotten worse as they've tried to make it
easier to use. I'm 99% sure the graphing engine in Office 2007-2016 is the
same one as in 2003 and earlier, but all of the damn menus they have added
slow down the graph-making process so much. If you are trying to use Excel to
make a graph for a reasonably technical audience, you are going to need to
make a lot of tweaks to make it acceptable, and tweaking each part of the
graph takes dozens of clicks where it used to take one or two.

I personally think that Excel is one of the best tools I have ever used. I do
about 40% of my work in Excel. I graduated from college with a degree in
biology, and I spent hundreds or thousands of hours in Excel manipulating data
and graphing. Excel can and will graph almost anything you need to graph, but
anything more complex than the simplest line graph requires getting creative
with the formatting of your data and how you use series and data sets. The
maximum complexity of a graph in Excel is technically restricted by the 254
series limit, but creating a graph with 254 series will probably take the
better part of a week.

So ultimately, Excel isn't a bad tool, but when graphing it's a bit like using
a shovel to dig a hole. It has a time and a place, but at a certain point
using an excavator will be faster, safer and less expensive than digging the
hole by hand.

~~~
nonbel
>"I graduated from college with a degree in biology, and I spent hundreds or
thousands of hours in Excel manipulating data and graphing...creating a graph
with 254 series will probably take the better part of a week."

I don't see how this is an argument for using Excel. You could have spent a
small fraction of that time learning the basics of R and gotten everything
afterwards done 10-100x faster. I know, because I used to be one of the people
who spent an insane amount of time on incredibly basic stuff. That said, I
still sometimes use Excel to inspect data and for some other simple tasks.

~~~
Declanomous
I wasn't arguing that one should use Excel. I wasn't arguing that you
shouldn't use Excel either. I was arguing the exact same point you just made -
you should use the appropriate tool for the job.

Excel is a great tool for manipulating data, especially if you are working
with other people. Excel produces serviceable graphs, but if you need to
produce a graph of any sort of complexity (and know how to program) R is
certainly a better tool. Sometimes you need a quick visualization of some very
simple data. I'd prefer to use Excel rather than R or Python at that point.
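
To give a rough idea of what the R side looks like, here's a minimal ggplot2
sketch (the data and column names are made up): a multi-series line graph
with points and labelled axes, where each tweak is one line of code rather
than a trip through the menus.

    library(ggplot2)

    # Made-up long-format data: one row per (month, product) observation
    sales <- data.frame(
      month   = rep(1:12, times = 3),
      product = rep(c("A", "B", "C"), each = 12),
      revenue = runif(36, 50, 150)
    )

    # One line and one set of points per product, with readable labels
    ggplot(sales, aes(month, revenue, colour = product)) +
      geom_line() +
      geom_point() +
      labs(x = "Month", y = "Revenue", colour = "Product")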

------
marmaduke
Ah man, I wanted that job. I guess I didn't waste enough time on the Internet
answering questions during my PhD.

~~~
amelius
The article is a very nice write-up, but in the end he is still optimizing
advertisement click-through rates (a zero-sum game).

While interesting from a technical point of view, why not make yourself 1000x
more useful to society by working, for example, on "cognitive health"
problems? These are problems that lean heavily on statistics, and are
interesting and imho more rewarding at the same time.

~~~
lovestats3
How is matching people to jobs a zero-sum game? If you are able to devise a
method to find the ideal job for people, I think society will benefit from it:
better products and services will be built, so this is in no way a zero-sum
game. It's a win-win situation if job seekers and job providers are able to
match the right candidate with the right job at the right price.

------
mooreds
> public work is not a waste of time

This is the best line in the entire (interesting) post.

Doing some kind of public work will make your next job hunt so much easier.
Whether that is SO answers, a blog, or a GitHub profile, showing beats
telling.

------
amelius
> For example, if you visit mostly Python and Javascript questions on Stack
> Overflow, you’ll end up getting Python web development jobs as
> advertisements

But what if you are an excellent C++ developer, who needs some assistance with
Python and Javascript?

~~~
rossipedia
I'm one of the Ad Server devs that David mentioned in his post. We have plans
in the works for allowing a user to specify what technologies/tags they're
more interested in seeing jobs for, as well as things like customizing the
geographical location (if any) you'd like to see prioritized. We hope to roll
those out this year (we're a small team - just got our 3rd dev)

------
lovestats3
I just read your R course lessons and found them very well explained; I
enjoyed the lessons about data.table and ggplot2.

The beta distribution: in the free book "Think Stats: Probability and
Statistics for Programmers" there is a chapter about how the beta distribution
can be used as a prior to model an unknown probability, and how Bayes' theorem
allows us to update that prior into a posterior distribution that is also a
beta distribution. That important property is conjugacy: the beta is the
conjugate prior here. Since the two parameters of the beta in that intuitive
explanation are essentially just the number of successes and the number of
failures (hits and misses, in the batting example), the example is a very
intuitive and clear way to explain what Bayesian statistics is. I think you
would enjoy the Think Stats book; it is aimed at programmers and tries hard to
build intuition.
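
As a quick sketch of that update in R (the numbers here are just illustrative,
not anything from the post):

    # Prior over an unknown probability: Beta(alpha0, beta0)
    alpha0 <- 81    # prior "successes"
    beta0  <- 219   # prior "failures"

    # New data: successes out of trials
    hits    <- 30
    at_bats <- 100

    # Conjugacy: the posterior is again a beta, with the new counts added in
    alpha1 <- alpha0 + hits
    beta1  <- beta0 + (at_bats - hits)

    alpha0 / (alpha0 + beta0)   # prior mean: 0.27
    alpha1 / (alpha1 + beta1)   # posterior mean: ~0.278

    # Plot prior (solid) and posterior (dashed) densities
    curve(dbeta(x, alpha0, beta0), from = 0, to = 1, ylab = "density")
    curve(dbeta(x, alpha1, beta1), add = TRUE, lty = 2)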

I also enjoyed how you describe the atmosphere in your office, it seems that
you work in a lovely place in which statistics is a well respected tool and
people try to explore and innovate in a fun way without excessive pain. Nice
post, I enjoyed it.

~~~
lovestats3
If I were working in your place, trying to match people to jobs, I would study
some psychology and NLP, trying to predict from what a user writes and answers
what their mental state is and how that mental state fits the jobs.

Unfortunately, hiring seems to be a very difficult problem, and I can imagine
that many key features are hidden and can't be obtained, since the candidate
can't be put into a controlled experiment that would surface the deep
information that usually stays hidden. Perhaps a way to gain a wealth of new
information is to communicate with your users and clients in such a way that
what is now hidden can be measured and new features can be obtained. That is,
you need to think about a model for your users, and that model must use
features related to mental and human capabilities.

~~~
JasonPunyon
What about the millions of people who've never written anything on Stack
Overflow?

------
amelius
> For that, I might look at another source of data, Stack Overflow Careers
> profiles, and see which technologies tend to be used by the same developers

>
> [http://varianceexplained.org/images/network2.jpeg](http://varianceexplained.org/images/network2.jpeg)

This shows "git" and "github" in a separate cluster from "C++" and "Python"? I
don't understand this. These tools are used regardless of what other
technologies are being used. For example, there are many Python and many C++
projects on github.

~~~
mikk14
That is a heavily-filtered network: if he didn't drop most of the weak
connections he'd end up with everything connected to everything else. On the
other hand, if you don't use a sophisticated way to define what a "noisy" edge
is, you'll end up with some curious cases like the one you point out. He might
have used a naive global threshold -- it's the easiest way to go about it:
drop all connections with weight lower than x. But it's also very wrong most
of the times :-) Something like the disparity filter [1] works usually well,
although you have sure that its null model hypothesis is aligned with what you
think is the generative process of your network. The field is "network
backboning" and it's a nice one from network science.

[1]
[https://en.wikipedia.org/wiki/Disparity_filter_algorithm_of_...](https://en.wikipedia.org/wiki/Disparity_filter_algorithm_of_weighted_network)
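
For anyone curious, here's a rough sketch in R of the naive threshold next to
a simplified take on the disparity filter, on a toy edge list (names, weights
and the alpha cutoff are all made up):

    # Toy edge list: weights are co-occurrence counts, all invented
    edges <- data.frame(
      from   = c("git", "github", "python", "c++", "python"),
      to     = c("github", "python", "c++", "git", "git"),
      weight = c(50, 3, 20, 2, 5),
      stringsAsFactors = FALSE
    )

    # Naive global threshold: drop every edge lighter than x
    threshold_backbone <- function(edges, x) edges[edges$weight >= x, ]

    # Disparity filter (Serrano et al. 2009): keep an edge if it carries a
    # surprisingly large share of either endpoint's total strength
    disparity_backbone <- function(edges, alpha = 0.05) {
      nodes    <- unique(c(edges$from, edges$to))
      strength <- sapply(nodes, function(n)
        sum(edges$weight[edges$from == n | edges$to == n]))
      degree   <- sapply(nodes, function(n)
        sum(edges$from == n | edges$to == n))
      pval <- function(n, w) {
        k <- degree[[n]]
        if (k <= 1) return(0)            # always keep a degree-1 node's edge
        (1 - w / strength[[n]])^(k - 1)  # null model: weight split uniformly
      }
      keep <- mapply(function(f, t, w) pval(f, w) < alpha || pval(t, w) < alpha,
                     edges$from, edges$to, edges$weight)
      edges[keep, ]
    }

    threshold_backbone(edges, x = 10)
    disparity_backbone(edges, alpha = 0.05)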

