
Data science is different now - AhtiK
https://veekaybee.github.io/2019/02/13/data-science-is-different/
======
lordnacho
There's something very relevant in this story. Data Science is too glamorous a
term. There's an implication that the DS person is some sort of magician who
maybe isn't as good at general coding but has special data magic skills,
making them more valuable than your average grunt.

In my years in finance, there was a similar problem. One guy in particular I
worked with reckoned himself an "ideas guy" and would simply spout out
gibberish that he expected the rest of us to implement. He could barely use
Excel himself, let alone code.

The fact is the best coders I met never fancied themselves as specialists.
They could certainly fit some models for you, but they could also write some
SQL, set up replication and other maintenance, write cron jobs, set up ssh
keys, merge some git branches, and write front and back end code in several
different languages, declarative and imperative. I always put it down to a mix
of curiosity and humility, giving these people a very good grasp of the
fundamentals plus a foothold in almost every area of coding that I could think
of.

~~~
Fomite
There are also specialists (mostly epidemiologists, statisticians, etc.) who
view their "special data magic skills" as often actively dangerous.

~~~
singingfish
To be fair their special skill can be actively dangerous. In the same way that
my devops skills can also be dangerous. Disclaimer: I've periodically been
doing data science since before it was called that.

~~~
geezerjay
DevOps skills don't help define public policies with gibberish and baseless
assertions disguised as facts extracted from data.

Sometimes so-called data scientists resemble shamans interpreting numerical
bones and plotted fire shapes generated from computational peyote, and people
take those conclusions as answers to their problems.

~~~
singingfish
Data science as such isn't really a thing that defines public policy. That's
more the realm of classical statistical analysis. Although there are some major
flaws in classical statistical analysis too, such that the risks of type 1 and
type 2 errors are far too high.

------
perturbation
I have been a data scientist for the last 4 years.

I think (one of) the problems with the data science career field is that there
are a lot of juniors who want to run sklearn and call it a day, following
tutorials that seem to 'just work', when real-world data never works without a
fight.

To get value out of the work, you have to be methodical, careful, and really
dig into the data. The observation that 85% of the time is cleaning doesn't
eliminate the need to know what you're doing, what approaches to use, how to
judge success, how to communicate results, etc.

Another thing to consider: I've found big, boring companies are usually better
to do DS at than small ones. Big, boring companies have better discipline in
collecting and managing data. Also, a 1% improvement to an existing process
matters a lot at BigCo, and very little at a startup - and a lot of DS models
are that sort of incremental progress over rules engines or heuristics.

~~~
bigger_cheese
In my world working on data at a BigCo (Industrial plant in my case) I'd say
there are 3 schools of people

1) 'The Old Guard' who are extremely skeptical. They tend to be extremely
dismissive of models and predictions, and distrust anything but the most basic
analysis. If they can't do the analysis in an Excel spreadsheet it's too
complicated and "will never work". These people tend to be Engineers
(mechanical and chem type) and Plant Operations roles. A lot of the time there
is value in listening to their skepticism but they tend to be extremely
conservative by nature (Fortran ought to be enough for anyone...).

2) 'The Optimists', people who think "big data" and "machine learning" are the
panacea to every problem in our org. To these people a prediction is as good as
a real measurement - they trust forecasting implicitly. They have probably
read an article somewhere about machine learning but don't really grasp any of
the intricacies. These people tend to be in logistics/accounting/finance type
roles and a large part of my job tends to be spent in phone calls with these
people explaining why their forecasts did not match the actual results.

3) 'The KPI guy' - usually a manager who is somewhat out of their depth who
wants to distill everything he can into a single number which can be displayed
on a dashboard. The end result is a Dilbert-esque situation where the 'KPI
guy' decides that to make his mark in the org he needs to come up with a new
metric. You end up with the bizarre situation where people are discussing a
'super metric' made by combining other metrics into a single number. I also
spend a lot of time on the phone with these guys because they forget what
underpins their super metrics and don't understand all the subtleties they've
distilled out of the data by focusing so much on higher level metrics. They get
angry when you question the value of their dashboard. Whenever someone starts
talking about "Yield", "OEE" or "DIFOT" there's a good chance they are a 'KPI guy'.

Most of my job is balancing out interactions between the three 'customers'.
Tempering the optimists' enthusiasm, reining in the KPI guys and nudging the
Old Guard.
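
To make the 'super metric' point concrete, here is a small Python sketch. It assumes the common textbook definition of OEE as availability times performance times quality; the plant names and all numbers are invented for illustration and do not come from any real plant.

```python
# Hypothetical numbers: two plants with the same OEE but different underlying problems.
from dataclasses import dataclass


@dataclass
class PlantWeek:
    name: str
    availability: float  # share of planned time the line was running
    performance: float   # actual rate relative to the ideal rate
    quality: float       # share of output that was good

    def oee(self) -> float:
        # Assumed textbook definition: the product of the three components.
        return self.availability * self.performance * self.quality


plant_a = PlantWeek("A", availability=0.9, performance=0.8, quality=1.0)
plant_b = PlantWeek("B", availability=0.8, performance=0.9, quality=1.0)

for plant in (plant_a, plant_b):
    print(f"Plant {plant.name}: OEE={plant.oee():.2f} "
          f"(availability={plant.availability}, performance={plant.performance})")
```

Both plants report the same headline number, yet one has a downtime problem and the other a speed problem. That is the kind of detail that disappears once only the combined figure is on the dashboard.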

~~~
dasboth
This is so spot on about Data Science in the "enterprise" or "legacy"
organisations (i.e. basically pre-dating the data hype).

Personally getting stuff done with data in this environment is more satisfying
than using the latest neural network, I presume you're the same?

------
rjbwork
According to this, I'm a data scientist. I've done and do everything on that
list except for "put python in production" and "Scaling sharing of Jupyter
notebooks". I've put R in production (albeit not my code, I am responsible for
making sure it runs correctly and surfacing errors to the system/developer of
that code). I maintain a data lake, multiple SQL servers, deal with gobs of
json, version control my SQL Schemas, vc our data types (admittedly they
change quite rarely), etc. etc.

But I'm really just a developer who's good at databases and ETL, along with my
regular tasks of writing near-realtime background processing systems, web
APIs, SQL, etc.

I think the data science industry seems to have been massively overhyped, and
now they want people who can use AI and statistical learning methods and all
this other stuff I don't know to do plain old data engineer work.

A sad outcome for a discipline that once held so much promise.

~~~
Bartweiss
> _I'm really just a developer who's good at databases and ETL_

On the other side of things, this might be rarer than you think.

My experience is that a lot of newish programmers have very little database
experience. What they do have is often centered on Mongo or other non-
relational stores, used more for persistent storage than as interactive
entities. The ability to get info out of a SQL database is pretty standard,
obviously. But handling aggregated or joined tables is not entirely standard.
(Interviewing for an entry-level backend dev job at a major company, I was
pretty startled to have the databases section cap out at 'group by' and
'join'.) And anticipating error sources (e.g. MySQL's rollup handling),
reading and responding to 'explain' plans, or knowing about backend issues
like InnoDB settings is well outside a lot of developers' familiarity.

I assume part of this is the heavy focus of bootcamps and some college
programs on building web apps, and the optional status of databases classes in
many college CS programs. But I could imagine a lot of other factors stopping
people from picking it up elsewhere, like the changing divisions among
DBA/SysAdmin/DevOps/SRE.

So on one end, a data science boom turned out a lot of people with advanced
skills in a field with lots of simple work, and at the other there's a gap in
developer knowledge which makes it convenient to hire highly-trained people
and dump them into roles that are a mix of analyst and DBA work.

~~~
oarabbus_
> The ability to get info out of a SQL database is pretty standard, obviously.

I wouldn't have a job if this were true. And on a related note, I've found
software engineers to be generally poor at writing queries (compared to
DBAs/Analysts/Data Scientists).

------
gipp
I worked in 3 DS roles over ~5 years, and recently made the "official" jump to
SWE. I've also interviewed dozens of candidates for several openings during
that time.

This post rings extremely true to my experience, and largely aligns with what
I've been telling people for the last couple of years. I see so many bootcamp
or Masters grads with a wildly skewed understanding of what the job entails. I
also see a lot of MBA types diluting the meaning of the DS term as a whole.

A "data science" curriculum as such will basically prepare you only for an
analyst role. You're not going to be able to compete with the glut of science
PhDs flooding every open role, either. DS may be your title but you will not
be doing any of the exciting things you want to be doing. To differentiate
yourself you need to specialize, and good engineering skills are a prime way
to do that.

~~~
bitL
Heh, it's like how automakers switched to describing any software position
around crappy navigation as a "self-driving car" job.

~~~
throwawaymath
Likewise, every hedge fund is quantitative, and puts “quant” or “quantitative”
into as many job titles as it can.

All these trendy terms eventually devolve into noisy marketing to attract
talent.

------
itronitron
>> ...in the past 2 years, % of any given project that involves ML: 15%, that
involves moving, monitoring, and counting data to feed ML: 85%

As it should be. In order to have confidence in your ML you need to really
understand your data and data processing.

~~~
darkxanthos
Yes. The point I took away from this is that this is not at all a focus of
most academic settings. This ends up leaving a huge gap and leaving candidates
with an academic DS background woefully unprepared and undesirable.

~~~
barbecue_sauce
That seems strange to me. People on forums like this often describe Data
Science practitioners as "statisticians that can code". If academic Data
Science programs aren't emphasizing data engineering as part of their
curriculum, what differentiates a Data Science program from statistics or
business intelligence?

~~~
Bartweiss
> _If academic Data Science programs aren't emphasizing data engineering as
> part of their curriculum, what differentiates a Data Science program from
> statistics or business intelligence?_

In my experience, they're emphasizing software-based data work like machine
learning, but not the (vital) peripherals like cleaning/studying/loading data
or monitoring and sanity-checking outputs.

A data science student might get a process-first task like making predictions
from data using KNN, regressions, t-tests, or neural nets, choosing a method
and optimizing based on performance. A statistics student might focus on
theory, choosing an appropriate analysis method in advance based on the
dataset, and reasoning about the effects of error instead of just trying to
reduce it.

But the data scientist could still be training on a clean, wholly-theoretical
dataset or a highly predictable online-training environment. The result is a
lot of entry-level data scientists who are mechanically talented but stymied
by real-world hurdles. Issues handling dirty or inconsistent data, for one. But
there are a lot of others: a tendency to do analysis in a vacuum, without
taking advantage of knowledge about the domain and data source; or judging
output effectiveness based on training accuracy, without asking whether the
dataset is (and will stay) well-matched to the actual task.

I don't mean that to sound dismissive; there are lots of people who do all of
that well, even newly-trained. But it does seem to be a common gap in a lot of
data science education.
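
A minimal Python sketch of the training-accuracy trap, using a toy 1-nearest-neighbour classifier. Everything here is invented for illustration: the ground-truth rule, the two mislabelled points and the held-out set are all assumptions, not data from any real course or project.

```python
# Toy 1-nearest-neighbour classifier; all data here is invented for illustration.
def predict(train, x):
    # Return the label of the closest training point (pure memorisation).
    _nearest_x, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def accuracy(train, examples):
    hits = sum(predict(train, x) == label for x, label in examples)
    return hits / len(examples)


def true_label(x):
    # Assumed ground-truth rule: positive class when x >= 5.
    return int(x >= 5)


# Training set with two mislabelled points (x=2 and x=7), as dirty data tends to have.
train = [(x, true_label(x)) for x in range(10)]
train[2] = (2, 1)
train[7] = (7, 0)

# Held-out points generated by the same rule, with correct labels.
held_out = [(x + 0.4, true_label(x + 0.4)) for x in range(10)]

train_acc = accuracy(train, train)
held_out_acc = accuracy(train, held_out)
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {held_out_acc:.2f}")
```

The model scores perfectly on the data it memorised, including the two bad labels, and then repeats those mistakes on nearby unseen points. Judging it by training accuracy alone would hide that.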

~~~
owlie
4th year EE undergraduate student here, taking both "Data Analysis/Pattern Rec"
and "Computer Vision" electives this term. My early courses prepared me more
for a path focused in circuit design, but I jumped ship through exposure to
wonderful, wonderful DSP. A lot of what I'm learning now is very new to me,
so, I appreciate comments like yours that give a sense of potential gaps in my
learning. Thank you.

I'm currently working on an assignment for CV in which we extract Histogram of
Oriented Gradient features from the CIFAR-10 dataset using python, then use
them to train one of three classifiers (SVM, Gaussian Naive Bayes, Logistic
Regression). I had asked about preprocessing, but was told it was outside the
scope of this assignment, so we're just using the dataset as-is. :(

The nice bit is, I have a research internship coming up in a lab that will
have me working on actual datasets, rather than toy examples. And, there's a
data science club on campus that has an explicit focus on cleaning data which
I plan on regularly attending. So... hopefully I'm on the right track!

~~~
xiphias2
Don't worry, when you have real problems you will have time to learn. Most of
the time is not even data cleaning, but debugging, getting into the details of
the data or code written by somebody else to understand why something is not
working (and there's always something that's not working :) ). The main
differentiator is whether you have interest / patience for that or not.

------
twic
Couple of notes.

 _Be prepared for most of your data scientist work to not be data science.
Adjust your skillset for that._

Same in real science - for every minute you spend thinking about what nature
might be doing, you spend tens of hours carrying things around, mixing things,
checking things, repeating things, etc. This is how all real work is.

 _Most modern languages are procedural: Java, Python, Scala, R, Go, etc._

If someone has a friend who does Scala, can they read them this quote and film
the reaction? Thanks.

~~~
scott_s
Reading the quote in context:

 _> Isn’t SQL a programming language? It is, but it’s declarative. You specify
the outputs you want (i.e. which columns from your table you want to pull),
but not how those columns are actually returned to you. SQL abstracts a lot of
what’s going on under the covers of a database.

You want a procedural language, one where you have to specify how and where
the data is selected from. Most modern languages are procedural: Java, Python,
Scala, R, Go, etc._

The author is trying to contrast fully Turing complete languages with a
declarative domain specific language like SQL. (Yes, I know that some
extensions provided by various database implementations make SQL Turing-
complete.) Unfortunately, the word she chose to express this is already a
term-of-art in the programming language world which means something different.
Luckily, we're all charitable readers, so we can correct on the fly and
understand what she meant.
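
For anyone who wants the contrast spelled out, here is a rough Python sketch using the standard-library sqlite3 module. The table, column names and rows are made up for illustration. The first query states what is wanted and leaves the how to the engine; the loop spells out each step by hand.

```python
import sqlite3

# In-memory table with made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
rows = [("ana", 31), ("bo", 17), ("cy", 45)]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

# Declarative: state which rows and columns you want; the engine picks the plan.
declarative = [name for (name,) in conn.execute(
    "SELECT name FROM users WHERE age >= 18 ORDER BY name")]

# Imperative: spell out the iteration, the test, and the ordering yourself.
imperative = []
for name, age in rows:
    if age >= 18:
        imperative.append(name)
imperative.sort()

print(declarative, imperative)
conn.close()
```

Both approaches give the same list of names; the difference is only in who decides how the work gets done.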

~~~
twic
Oh, absolutely, the meaning is perfectly clear! I just want to see a Scala
programmer cry.

------
alexgmcm
As someone working in DS for the last 4 years this is pretty accurate.

If you have a good academic background it can be possible to enter a DS role
immediately but often you will be doing work far more towards the Business
Intelligence end of things rather than deploying Deep Neural Nets in
production or whatever.

I have friends who transitioned into Data Engineering and it does seem like
the outlook is better there.

It's an excellent post.

------
minimaxir
There's nothing wrong with the data science industry becoming different, _as
long as expectations are managed_. Specifically, as this article notes, the
probability of getting hired due to the increased competition, and the
realities of the real-world job.

Both are currently not transparent enough for data science newbies, which
is why on my end I try to be as transparent as possible whenever the topic comes
up (I wrote a post similar to the OP last year:
https://minimaxir.com/2018/10/data-science-protips/).

------
binalpatel
The market value (i.e. the big bucks) I think will shift into Data Engineering
and the role that's abstractly called "Machine Learning Engineer".

Reliably getting any data science analysis or model running in a real world
setting is a demand that's naturally going to follow from the Data Science
glut.

------
wirrbel
When I started my first data-science role, the role description of my company
sounded a bit like "software engineer who happens to know stats and ml". The
description was fairly specific on the fact that data scientists would build
and deploy models and services. Nowadays it seems not to fall under the
software engineering umbrella. And I do think the change started with the deep
learning craze. It distorted a lot in the field. Nowadays I see so many
overfitting and complicated models that cannot be operated in production. But
they sure make impressive slides and reports.

------
Mortiffer
Totally agree. Have been consulting in the data world for some years now. Most
companies want to do data science but they have so many low hanging fruit that
it makes no sense to do any ML. If they actually manage to get a senior data
scientist hired then they typically torture them with boring BI dashboard
creation.

~~~
nightski
BI dashboard creation is torture in the same sense that creating web
applications might be torture for a software developer. In other words, if you
find it to be torture you are probably in the wrong field. At the end of the
day I've found myself much happier as an engineer when I am less focused on
the tools I am using and instead the impact I can have on the business I am
serving. If a BI dashboard provides that value, then that is fun because I am
making an impact. Maybe more so than if I was using a complicated RNN
somewhere when it provided little value.

------
wdavidw
I have been making the same arguments for the last 3 years. Schools have all
launched Data Science programs, flooding the market with statisticians
converted to data science who have basic programming skills and an even lighter
notion of data engineering, DevOps tooling and operational understanding. In
2015, our Big Data major was renamed Data Science, even though we are still
teaching NoSQL, Hadoop, Spark... I've been careful never to take Adaltas down
the road of DS, not because we didn't like it but because of the hype around it
and the market distortion it created. I tell my customers that we have Data
Engineers who can excel in Machine Learning if needed, placing their models
in stream processing with Spark or Flink and pushing them into production
with operational constraints in mind. Recently, we brought on a young Data
Scientist consultant with the right resume to back it up; the first thing we
did was put him on a 4-month diet of learning how to deploy and secure a
platform as an InfraOps and how to write data ingestion as a Data
Engineer.

------
pooya13
“In those early years[2012], there was no real formalized way to learn “data
science,”

Yeah they were called quants (aka mathematics/statistics graduates).

------
jillesvangurp
I'm not a data scientist but I've worked with a few over the past 10 years and
I strongly agree with this article that the work has changed a lot over that
time.

The first generation machine learning experts were proper scientists with
proper PhD degrees, academic track records, etc. who would typically be
very opinionated on what algorithms to use (and quite possibly wrote a few of
their own) but were not necessarily experienced engineers. I saw a lot of clumsy
engineering and convoluted testing and evaluation processes.

This explains a lot about the current state of the art which involves a lot of
tools that are aimed at people who are not primarily engineers and need to be
shielded from complex infrastructure and code but do know a lot about
statistics, machine learning algorithms, and all the stuff that first
generation machine learning experts would know.

The second generation of machine learning experts is basically riding an
ongoing commoditization boom. They use toolkits from Google, Facebook and
others pretty much as is. These tools are easy to use for them but not
necessarily for non expert engineers that know a lot about pumping data around
but not necessarily about machine learning algorithms. This is getting a lot
easier. I've heard of high school kids getting ML jobs with no college
training whatsoever and just high school math and a bit of online training. My
impression is that you can get nice results with a little effort.

The next generation of machine learning engineers won't be scientists and
they'll indeed mostly work on manipulating data. All the machine learning
algorithms will be provided in the form of black box libraries and tools that
will mostly work in a fully automated mode. IMHO the whole point of deep
learning is that the algorithms figure things out by themselves. Even the job
of picking the right algorithms and configuring them is ultimately going to
be something that machine learning algorithms will be better at than a junior
engineer with no relevant scientific background.

Or indeed an experienced software engineer with a classic computer science
background, like myself. I have no clue what e.g. a tensor is. Articles on the
topic seem to be very math heavy and tend to give me headaches. But should I
even have to care to be able to configure some black boxes that process data
and produce models that I can plug into my runtime? My pet theory is that
we're already past that point and that lots of companies are getting decent
results not having to care about the underlying algorithms already.

I went to a great meetup at Soundcloud last week about how they used off the
shelf machine learning tooling to improve their search ranking in
elasticsearch. It was all about the training data, the parameters in the
search query that they wanted to machine learn, their tooling for evaluating
model performance in terms of being able to rank real queries against real
data, tooling for annotating training data, integrating models with their
software, the devops for retraining the models, etc.

My experience working with the machine learning team in the search group at Nokia Maps
(now Here) eight years ago was that the tools were an obstacle to getting
results fast and that iterations on model improvements were measured in
months. A lot of engineering went into things like feature extraction, model
tuning, and other stuff that scientists do as well as building essentially all
of the tools from the ground up so that models could actually be generated,
evaluated, and integrated. Only problem: many of these people weren't
experienced engineers so the tools were kind of clunky and there were lots of
integration headaches, insanely long integration cycles, and lots of missed
opportunities to fix (rather obvious) data problems due to a bias towards
endless tweaking of algorithms instead of applying pragmatic fixes to the
data. It kind of worked and the search wasn't horrible but the biggest problem
was that the underlying data wasn't great to begin with (mis-categorized, full
of duplicates, incomplete/stale, etc.).

The people at Soundcloud got it down to iterating in hours with a few months
of engineering. That's from idea to proof of concept to having code in
production that outperformed a manually crafted query.

That sounds like something I could do but it also sounds like a greenfield for
proper tools to emerge that make all of this a lot less painful than it
currently is. The next generation hopefully won't have to build a lot of in
house tooling and reinvent a lot of wheels while doing so.

~~~
mr_toad
> articles on the topic seem to be very math heavy and tend to give me
> headaches

Of course. Academic papers (and a disturbingly large number of Wikipedia
pages) are not meant to explain things, they’re meant to emphasise just how
smart the authors are.

> I have no clue what e.g. a tensor is.

Well, even Einstein struggled with Tensors. In the context of TensorFlow
they’re just multi-dimensional arrays.
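
A quick Python sketch of that "multi-dimensional array" view. Plain nested lists stand in for tensors here, and the `shape` helper is a hypothetical function written for this example, not part of TensorFlow or any other library. It assumes the nesting is rectangular.

```python
# Plain nested lists stand in for tensors; real libraries use their own array types.
def shape(tensor):
    # Walk down the first element at each level to read off the dimensions.
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


scalar = 3.0                                                # rank 0
vector = [1.0, 2.0, 3.0]                                    # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]                           # rank 2
batch = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]   # rank 3

for value in (scalar, vector, matrix, batch):
    print(len(shape(value)), shape(value))
```

In this view the rank is just the number of nested levels, and the shape is the length at each level.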

------
TrackerFF
Yeah - tons of traditional analyst jobs (Business Intelligence / Analysis,
Marketing analyst, etc.) have been re-labeled as Data Science.

I'd be amazed if even 10% of the people are able to do anything more than just
import scikit-learn, and train a classifier through tutorials.

This is IMO no different than when the software dev. craze started, and people
with 3 weeks of coding experience started applying for entry-level jobs. You
start interviewing them, and they can't even explain the difference between a
for loop and a while loop.

In the end, there's just more noise. Both qualified candidates and employers
need to find a good way to cut through this noise.

------
anotheryou
fast.ai youtube lesson view numbers:

1. Lesson: 355k

2. Lesson: 144k

7. Lesson: 34k

Surprisingly close to those 7%.

------
tanilama
This is a pretty honest and acute description of the industry landscape and
prediction going forward.

I think DS has been abused by some people as an umbrella to not produce
quality code, yet they somehow hold themselves in higher regard in the
value chain.

However I do see there is a real position for DS in the industry, but it
should be a specialization of senior SDE when they decide to further their
career, not its own job family. Otherwise it should be renamed as data analyst
for clarity.

~~~
scomp
_I think DS has been abused by some people as an umbrella to not produce
quality code, yet they somehow hold themselves in higher regard in the
value chain._

Hit the nail on the head here. I worked in a DevOps/ETL team across from the
data science team. All they did was write SELECT * FROM sales and complain that
Teradata was slow, and when they got the result set they'd use "R" to SUM the
column and display it with ggplot.
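
A rough Python sketch of the difference, using the standard-library sqlite3 module as a stand-in. Teradata and R are not involved, and the `sales` columns and rows are invented for illustration. The first approach pulls every row to the client and adds it up there; the second asks the database for the total.

```python
import sqlite3

# Made-up sales table in an in-memory SQLite database; a stand-in, not Teradata.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120), ("south", 80), ("north", 50)])

# Pattern described above: pull every row to the client, then add it up there.
all_rows = conn.execute("SELECT * FROM sales").fetchall()
client_total = sum(amount for _region, amount in all_rows)

# Alternative: let the database aggregate and return a single row.
(db_total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()

print(client_total, db_total, len(all_rows))
conn.close()
```

The totals match, but the second query moves one row over the wire instead of the whole table, which is usually where the "database is slow" complaint comes from.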

------
triplee
I loved the tone of this article because it's fairly relevant, and with a
small facelift, could have been advice to web developers circa the early
2000s.

Data science is still a thing, and it's maturing in the way that applied
sciences do when they get to the point of needing a little more engineering
background. Tech. just is never that glamorous, but the dirty secret is that
only people in tech. seem to really get that, so we have this hype cycle every
few years.

