
Being a Data Scientist: My Experience and Toolset - jeffheard
https://jeffersonheard.github.io/2017/01/being-a-data-scientist-my-experience-and-toolset/
======
achompas
These types of posts validate my concern about the people entering my field
right now.

Data science, as a line of work, is distinct from other technical roles in its
focus on _creating business value using machine learning and statistics_. This
quality is easily observed in the most successful data scientists I've worked
with (whether at unicorn startups, big companies like my current employer, or
"mission-driven" companies).

Implicit in this definition is _avoiding the destruction of business value by
misapplying ML/statistics_. In that sense, I am concerned about blog posts
like these (which list 50 libraries and zero textbooks or papers) and those
who comment arguing the relevance of "real math" in the era of computers.

Speaking bluntly: if you are a "data scientist" who can't derive a posterior
distribution or explain the architecture of a neural network in rigorous
detail, you're only going to solve easy problems amenable to black-box
approaches. This is code for "toss things into pandas and throw sklearn at
it". I would look for a separate line of work.
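
For concreteness, one textbook instance of "deriving a posterior" is the Beta-Binomial conjugate pair. The choice of model and the symbols $\theta$, $\alpha$, $\beta$, $k$, $n$ are illustrative assumptions, not something taken from the linked post:

```latex
\begin{aligned}
\theta &\sim \mathrm{Beta}(\alpha, \beta), \qquad k \mid \theta \sim \mathrm{Binomial}(n, \theta) \\
p(\theta \mid k) &\propto p(k \mid \theta)\, p(\theta) \\
&\propto \theta^{k} (1-\theta)^{n-k} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} \\
&= \theta^{\alpha+k-1} (1-\theta)^{\beta+n-k-1} \\
\Rightarrow\ \theta \mid k &\sim \mathrm{Beta}(\alpha + k,\ \beta + n - k)
\end{aligned}
```

With a uniform prior ($\alpha = \beta = 1$) and 3 successes in 10 trials, this gives $\mathrm{Beta}(4, 8)$, whose mean is $4/12 = 1/3$.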

~~~
SatvikBeri
I think the "Data Scientist" job title is overloaded: I see several clusters of
skills being useful, and in my ideal world they would have similar but
slightly different job titles:

–Medium Stats/ML, medium Engineering ("Data Scientist" or "Data Engineer")

–High Engineering on very large datasets, low/medium Stats/ML ("Data Engineer"
or "Backend Engineer")

–High Analysis, medium Stats/ML, low Engineering ("Analyst")

–High traditional Stats, High Analysis, low ML/Engineering ("Statistician")

–High ML, medium Stats, medium Analysis ("Data Scientist")

–High ML, medium Engineering ("Machine Learning Engineer")

~~~
tangue
One of the lessons of the web (in the 1990s everyone was a _webmaster_ until
the field matured) is that after the coders, specialists emerged in fields
like design, management, UX, SEO and content. For data science the most
obvious is data visualization, but I guess there are plenty of new jobs ahead
in addition to core data science jobs.

~~~
di4na
The second most obvious is data cleaner. 90% of the time I spend on any
analysis goes to getting data that is usable, or into a usable shape.

And let's not talk about the different CSV variants you can encounter... or
how to get data out of databases that are literally the only reason some
"guardian DBA" still has a job. It can take months to get any access, if you
ever get it...
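
To make the CSV complaint concrete, here is a minimal sketch of coping with an unknown delimiter using Python's standard `csv.Sniffer`. The helper name `read_messy_csv`, the candidate delimiter list, and the sample text are all hypothetical; real files usually need more care (encodings, quoting, ragged rows).

```python
import csv
import io


def read_messy_csv(text, candidates=",;\t|"):
    """Parse CSV text whose delimiter is unknown; return a list of dict rows.

    Assumes the first non-empty line is a header and that one of the
    `candidates` characters is the delimiter (both are assumptions).
    """
    # Let the standard library guess the dialect from the text itself.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    reader = csv.reader(io.StringIO(text), dialect)

    # Normalize header names so "Name" and " name " map to the same key.
    header = [name.strip().lower() for name in next(reader)]

    return [
        {key: value.strip() for key, value in zip(header, row)}
        for row in reader
        if any(cell.strip() for cell in row)  # skip blank lines
    ]


# Hypothetical sample: semicolon-delimited with a stray blank line.
sample = "Name;Age\nalice;30\n\nbob;25\n"
rows = read_messy_csv(sample)
```

The sniffer only guesses, so in practice it is worth logging the detected delimiter and failing loudly when the header does not match what is expected.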

------
stillsut
The roles of statistician and data scientist are not substitutes but more like
complements. This guy definitely _is_ a data scientist. Here's some ways to
tell:

\- Works on non-mission-critical components, e.g. he's not doing statistics
for when the wing will fall off your airplane, but he can help you figure
out business problems more open to interpretation, e.g. subject line open
rates.

\- His publishing tools favor flair over convention, e.g. Ctrl+F for "latex"
has zero results, but he does have D3, C3, Bokeh, and, surprisingly, no Tableau.

\- Not sure he even references a single classical statistics package. The vast
majority of people publishing in social sciences or "old school" life sciences
are using Minitab, JMP, R, or SAS (correct me if I'm wrong, please, it's an
outsider's perspective).

This skillset is not inherently "cutting edge!" or deceptively "all talk, no
walk". They really are completely different roles that use some of the same
tools and formulas and jargon. To cut to the heart of it: When a company
builds a plane and says "I wonder how unlikely it would be for the wing to
fall off?" that creates the demand for a statistician. When a company is
trying to out-compete others, or maximize profit/charitable-effectiveness,
often in a service or a field that is heavily influenced by human
psychology, that creates the potential for a data scientist to add value.

~~~
jeffheard
I knew I was forgetting packages. I _do_ in fact use Tableau. Will add it.
Thanks for the catch!

As for LaTeX, it would never have occurred to me to add it. I have no idea
why not, but it didn't. Maybe because it feels more like a _chore_ than a tool.
It's like an anti-tool. I mean, I do or did in the recent past use LaTeX, but
in more recent years I would farm that out to someone junior to me who hadn't
worked with it for long enough to prefer pouring bleach in their ears to being
faced with tweaking one more broken LaTeX template.

I probably should include classical stats packages. They really should go in
here. But I've been coding since I was a kid and typically eschewed classical
stats and math packages because of my perception that they were slow walled
gardens, and that as soon as I had a method figured out in Matlab or SPSS I'd
end up rewriting it in C, C++, or Java to make it work with other things or at
scale. That was hammered home in the first company I worked with where we did
modeling in SAS and then rewrote every model in Java because SAS couldn't keep
up.

I'm not suggesting that classical stats packages aren't data scientists' tools.
I think they are. They're just not _my_ tools because of the curious niche I
found myself in.

~~~
bigger_cheese
I think my job is similar to yours. My background is in engineering at an
industrial manufacturing plant.

I have some of the same issues. The Engineers here tend to reach for
spreadsheets first (or Access databases - these things are everywhere at my
work) and inevitably they run into scaling problems and end up with a huge
bloated mess. I step in to re-architect these monstrosities (using "real"
databases when necessary).

The other big part of my day to day work is modelling and data analysis.
Usually regression based stuff and LP optimization problems (SAS is very good
for this), especially around yield and quality control. The venerable Excel
"solver" plugin is often abused very heavily by engineers and is not always
the ideal solution.

The person who I took over from was a Stats guy and the original job title was
"Process Statistician"; my boss has since retitled my role "Data Management
Engineer". I still think of myself as an engineer first and foremost and a
"data" person second.

I use SAS heavily. We have kind of gone in the opposite direction to you. I
have rewritten some of our models in the past from C++ into SAS, mostly for
ease of maintenance, because SAS is better understood by the non-programmers
(most of the Engineers here do not have a programming/CS background, and those
that do tend to know either Fortran or Visual Basic; very few grasp C/C++
well). Speed is not really an issue, but opaqueness and ease of maintenance
are.

I'd like to learn R because I have heard it is very similar to SAS but more
transferable to outside companies. Julia is the other language I've got my eye
on; I have heard it is somewhat similar to MATLAB, which is used for some
modelling work here.

------
jordz
Cassandra is mentioned. I agree it's great for storing metadata and can be
used to build efficient graph implementations, but it's cited for Graphs and
Relationships? I think that can be misleading, as Cassandra is a distributed
column-based key-value store.

~~~
wenc
I noticed that too. I don't want to gainsay the author's experiences, but it
sounds like the author is describing the job of a data analyst who happens to
dabble with various software. I don't get the sense the author has in-depth
knowledge about the tools he lists.

Also, I don't know about putting Mongo and Cassandra under "Tools for working
with unusual datasets".

------
codr4life
Am I the only one who came here looking for someone's experience as a tool
set? For a second there I thought I might have stumbled over real honesty, a
rare treat these days. Maybe, if we stop putting each other in stupid labeled
boxes to please our bullshit peddling masters, we would get somewhere...

------
mastazi
From the article:

> Machine learning and data mining are not well distinguished, but machine
> learning techniques increasingly favor “unsupervised” learning algorithms.

The statement above puzzles me because it does not align with what I can see
in the news. Maybe I'm just uninformed, so please let me know if I'm wrong.

According to what I can read in the news:

1 - Almost all of the recent ML developments that I can think of are in the
field of supervised learning / reinforcement learning.

2 - The only field that I can think of where unsupervised learning techniques
are prevalent is data mining, which is precisely why I see it as a very
specific field.

Am I missing something?

~~~
cityhall
No, you're right. Nothing about this blog post/resume inspires confidence.

------
DarkLinkXXXX
Big Data is when you outgrow Excel.

------
mordant
'Data scientist' is just title inflation by statisticians.

~~~
thinkr42
More like 'analyst' in how easily it is thrown around. Calling a built-in
function in Python or R is just about equivalent to calling one in Excel.
Sure, you can claim that folks need to know more about what is going on, but
honestly, how many have actually gone through the work of deriving the
functions they're calling to begin with?

~~~
lacampbell
I'm wondering how useful deriving functions yourself is in the age of
computers. I feel like knowing axioms about the mathematical structure you're
dealing with and how to do proofs is very important, but it always struck me
as odd that we're still stepping through complex applied maths functions
manually in pen and paper. Programmers don't bother, say, writing our own
hashtable implementation more than a handful of times in our lives, do we?
Does forgetting how to derive hashtables mean we won't know how to use them
effectively?

Genuine question - more than happy to be proven wrong.

~~~
curiousgal
>stepping through complex applied maths functions manually in pen and paper.

We do that because:

A. It helps us understand them better.

B. It teaches us how to think, the way Feynman said "Know how to solve every
problem that has been solved". Granted, it seems pointless to work through
what is easily accessible through a machine, BUT it teaches how to solve new
problems. I wouldn't consider using NumPy or Matlab as the first step towards
solving a new math problem.

It's like using Assembly vs using a higher level programming language.

~~~
thinkr42
Completely agree. There's a lot of nuance in these algorithms; they're not as
cut and dried as simply calling a package method, and oftentimes they aren't
optimized to your use case. I work in Machine Learning, specifically on NLP,
and it is really obvious when interviewing potential employees who knows what
SVD means and who just knows the NumPy function. Most "data scientists" I've
interviewed fall in the latter category.

Edit: this is of course completely anecdotal experience.
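
As a rough illustration of the difference between knowing what SVD means and knowing the library call, here is a sketch that recovers the largest singular value and its singular vectors by power iteration on A^T A, in plain Python. The helper name `top_singular_triplet`, the start vector, and the example matrix are hypothetical choices; production SVD routines use different and far more robust algorithms.

```python
import math


def top_singular_triplet(a, iterations=200):
    """Approximate the largest singular value of `a` (a list of rows) and
    its left/right singular vectors via power iteration on A^T A.

    Assumes `a` is non-zero and that the start vector e_1 is not orthogonal
    to the top right singular vector (otherwise the iteration fails).
    """
    rows, cols = len(a), len(a[0])
    v = [1.0] + [0.0] * (cols - 1)  # start vector e_1 (an arbitrary choice)

    for _ in range(iterations):
        # w = A v
        w = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # z = A^T w = (A^T A) v
        z = [sum(a[i][j] * w[i] for i in range(rows)) for j in range(cols)]
        # Renormalize so v stays a unit vector.
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]

    # sigma = ||A v||, and u = A v / sigma is the matching left vector.
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v


# Hypothetical example: A^T A = [[25, -15], [-15, 25]] has eigenvalues
# 40 and 10, so the singular values of A are sqrt(40) and sqrt(10).
a = [[4.0, 0.0],
     [3.0, -5.0]]
sigma, u, v = top_singular_triplet(a)  # sigma is approximately sqrt(40)
```

The point of the sketch is the relationship it exposes: singular values are the square roots of the eigenvalues of A^T A, and the right singular vectors are its eigenvectors.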

------
Volt
I'm not sure this is what a data scientist is. It was supposed to be a
research scientist (which is where the _scientist_ part came from) that
wrangles data and code. This individual should have both domain knowledge and
coding chops while knowing how to conduct research.

~~~
131012
That would make me a data scientist, but I do not think I am, and I still have
to learn a few tricks from this guy (and others).

