My experience as a data scientist (and toolset) (jeffersonheard.github.io)
28 points by jeffheard 4 hours ago | 10 comments





'Data scientist' is just title inflation by statisticians.

That is sometimes true, but did you look at the tools in the blog post? Can statisticians be expected to write MongoDB code, create a web scraper, and make interactive visualizations in D3?

Title inflation exists, but there is a real-world role here that isn't really captured by "statistician" at all.

I think the discussion is moot.

The usual points made are:

Statisticians are too theoretical.

Data Scientists are more in touch with reality.

The answer lies in the French educational system where Statistics is an engineering field so you end up with statisticians who can implement their own models. What would you call them? It doesn't matter.

Actually, I’ve noticed a meaningful distinction between people who learned statistics from machine learning (and are more likely to call each other data scientists) and statisticians (the least experimental of whom used to go by the title analyst): what to do when there is either too little data or the data is too noisy. Interestingly, both groups are happy to be called data scientists, but in my experience, they rarely meet.

A traditionally trained statistician would call that a negative result, decide not to use the model, and argue for keeping the pre-existing approach. A machine learning expert might not care and would apply the coefficients out of the model as is, because they are presumably closer than a guess; that person is also more likely to be openly skeptical of human expertise.

That has led to some frustrating situations for me: I would argue that we should censor things like negative speeds, and be told that there was no problem because the results were regularised anyway. Building and picking proper factors to use in regression is something you can partially get away with skipping once you have larger databases and back-propagation can take over; before that, insights still do matter.
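To make the negative-speed disagreement concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made up for illustration: the `distance` predictor, the true slope of 3.0, the glitch value of -50, and the helper `ridge_slope` are hypothetical, not anything from the original discussion. It only shows that an L2 penalty shrinks coefficients; it does not remove physically impossible readings.

```python
import random


def ridge_slope(x, y, lam=1.0):
    """Closed-form ridge slope for y = a + b*x with an unpenalised intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / (sxx + lam)


random.seed(0)

# Hypothetical data: speed grows linearly with some predictor, plus noise.
distance = [random.uniform(1.0, 10.0) for _ in range(200)]
speed = [3.0 * d + random.gauss(0.0, 1.0) for d in distance]

# Assumed failure mode: the sensor glitches on the 10 largest distances
# and reports a physically impossible negative speed.
cutoff = sorted(distance)[-10]
speed_raw = [-50.0 if d >= cutoff else s for d, s in zip(distance, speed)]

# "It's regularised anyway": fit on everything, impossible values included.
slope_raw = ridge_slope(distance, speed_raw)

# Domain knowledge first: drop readings that cannot be real, then fit.
kept = [(d, s) for d, s in zip(distance, speed_raw) if s >= 0.0]
slope_clean = ridge_slope([d for d, _ in kept], [s for _, s in kept])

print(f"slope with impossible values kept: {slope_raw:.2f}")
print(f"slope after censoring them:        {slope_clean:.2f}")
```

With a much larger dataset, a handful of bad rows would matter less, which is the transition described above.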

I have not met many who can articulate that transition effectively.

It seems that you’ve met mostly the second category; they are possibly the larger group, but not necessarily the most influential. There is a core of people who are meaningfully different. The linked article seems to be from someone in between but closer to the second group.

More like 'analyst' in how easily it is thrown around. Calling a built-in function in Python or R is just about equivalent to calling one in Excel. Sure, you can claim that folks need to know more about what is going on, but honestly, how many have actually gone through the work of deriving the functions they're calling to begin with?

I'm wondering how useful deriving functions yourself is in the age of computers. I feel like knowing axioms about the mathematical structure you're dealing with and how to do proofs is very important, but it always struck me as odd that we're still stepping through complex applied maths functions manually with pen and paper. Programmers don't bother, say, writing our own hashtable implementation more than a handful of times in our lives, do we? Does forgetting how to derive hashtables mean we won't know how to use them effectively?

Genuine question - more than happy to be proven wrong.

I agree. A smart data scientist doesn't waste their time reinventing the wheel: they build off the hard work of others. When necessary they can create what is needed, but they don't do so typically.

They are both more and less, in my experience, than statisticians (more flexible and solution-oriented, less rigorous and classical), than analysts (they can do more, in general, but a great analyst will be better at analysing and visualizing), and than developers (they know more stats, less software engineering, and have great patience for wrestling data into submission). I like to think of data scientists as people who combine the skills of all the above to solve hard problems which exceed the domain of any one specialty (analyst, statistician, developer). It doesn't mean we're amazing at everything, just that we are effective, flexible problem solvers.

And for the record, machine learning, statistical modeling, and data mining are just a small portion of the pie. Being good at modeling and machine learning will not remotely guarantee success as a data scientist.

>stepping through complex applied maths functions manually in pen and paper.

We do that because:

A) it helps us understand them better

B) it teaches us how to think, the way Feynman said "Know how to solve every problem that has been solved". Granted, it seems pointless to work through what is easily accessible through a machine, BUT it teaches us how to solve new problems. I wouldn't consider using NumPy or Matlab as the first step towards solving a new math problem.

It's like using Assembly vs using a higher level programming language.

Completely agree. There's a lot of nuance in these algorithms: they're not as cut and dried as simply calling a package method, and oftentimes they aren't optimized to your use case. I work in Machine Learning, specifically on NLP, and it is really obvious when interviewing potential employees who knows what SVD means and who just knows the NumPy function. Most "data scientists" I've interviewed fall in the latter category.
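As a rough illustration of the gap between calling the function and knowing what SVD means, here is a small sketch assuming NumPy is installed. The matrix `A` is random and purely hypothetical. It checks three standard facts: the singular values are the square roots of the eigenvalues of AᵀA, the three factors rebuild A, and the error of a rank-k truncation (in the Frobenius norm) is determined by the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))  # hypothetical data matrix

# The call most people know.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# What it means, part 1: singular values are the square roots of the
# eigenvalues of A^T A (eigvalsh returns them in ascending order).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
s_from_eig = np.sqrt(np.clip(eigvals, 0.0, None))

# Part 2: the factors reconstruct A.
A_rebuilt = U @ np.diag(s) @ Vt

# Part 3: keeping the top k singular values gives a rank-k approximation
# whose Frobenius error is the root of the sum of the squared discarded ones.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k)  # Frobenius norm by default for matrices
expected_err = np.sqrt(np.sum(s[k:] ** 2))

print(np.allclose(s, s_from_eig), np.allclose(A, A_rebuilt), np.isclose(err, expected_err))
```

The third check is the one that matters for NLP-style dimensionality reduction: it tells you how much you lose by truncating, not just how to truncate.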


Some say [0] it's title deflation for statisticians.

[0] http://bactra.org/weblog/925.html

