
The origin story of data science - sheckpar
https://www.welcometothejungle.co/fr/articles/story-origin-data-science
======
hedora
This origin story is at least a century too late.

If you visit the sorting exhibit at the Computer History Museum, you will
rapidly discover that big data and data science predate computers.

Their oldest machine sticks punch cards above a pool of mercury, and then
drops a bunch of metal pins on to the card.

The signals from the metal pins activated mechanical counters.

In this way, it could perform aggregation queries for operators with
constant-sized sufficient statistics (~= map-only jobs). It was used by the US
Census Bureau.

Later machines could sort the punch cards, which enabled map reduce queries.
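
To make the distinction concrete, here is a rough sketch in Python. It is only an analogy for what the tabulators and sorters did, not a model of the actual hardware; the `cards` records and both function names are made up for illustration. Counting needs a single pass and a fixed set of counters, while a per-group median has no constant-sized summary, so the deck has to be sorted by key first.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter
from statistics import median

# Hypothetical "punch cards": one (state, age) record per card.
cards = [("NY", 34), ("OH", 51), ("NY", 8), ("CA", 27), ("OH", 19)]


def tabulate(cards):
    """One pass over the deck, one counter per category (the map-only case)."""
    counts = Counter()
    for state, _age in cards:
        counts[state] += 1  # the only state kept is a fixed set of counters
    return dict(counts)


def sort_then_reduce(cards):
    """Sort the deck by key, then reduce each run of equal keys."""
    ordered = sorted(cards, key=itemgetter(0))
    return {
        state: median(age for _state, age in run)
        for state, run in groupby(ordered, key=itemgetter(0))
    }


counts = tabulate(cards)
median_age = sort_then_reduce(cards)
```

`groupby` only merges adjacent equal keys, which is why the sort has to come first, much like running the deck through the sorter before tabulating each pile.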

The term “data base” came about 50 years after the census machine was in use.

A military installation was running the equivalent of map reduce queries and
needed a noun to describe their mountains of punch cards (or maybe reel to
reel tapes).

~~~
jds375
Do you have any further references about what exact military installation was
doing this and when? I’m reading Cryptonomicon by Neal Stephenson and one of
the main characters builds a mercury-based data analysis machine that reads
punch cards to analyze communication data in WWII and this sounds eerily
similar.

~~~
triplee
A lot of that book took into account real-life events and people.
Unfortunately I don't have specifics on that one, so hopefully the parent
does.

Also, that book is fantastic, as are the prequels.

------
o10449366
Data scientists are the brand-name salesmen of tech. They talk about "data
science" so much that every business owner is now convinced that they need to
buy into machine learning and "Big data" to be legitimate in the current year,
but despite all of the chatter it isn't any clearer what exactly "data
science" is or how it's different from technology and methods that have
existed for decades.

~~~
minimaxir
Speaking as a current data scientist, the problem is that the concepts behind
data science/big data _are legitimate solutions_ to business problems, but the
current marketing of the field is either misleading (e.g. it's not all about
building models/AI, and I wrote a post complaining about that:
[https://minimaxir.com/2018/10/data-science-protips/](https://minimaxir.com/2018/10/data-science-protips/)),
or it's a business upsell for an enterprise product that may not be the optimal
solution to a given problem.

It's one of the reasons I've switched more to publishing open-source tooling
for data science/AI instead of blog posts about data science.

~~~
bunderbunder
The obsession with "AI" is particularly fascinating.

10 years ago, before it was a household term, people could mention without
blushing that their production implementation ended up being a simple if-
statement. If anything, being able to come up with an insight that was clear
and well-focused enough to be implemented in such a straightforward way was a
feather in one's cap.

Nowadays, if you aren't building a model with at least a few million
parameters (though billions would be better), that ends up being so complex
that the only hope for productionizing it is to stick it into a container and
ask the engineering and ops folks not to peek inside it too carefully - sort
of like a parent leaving the door to their teenager's bedroom closed - then
nobody wants to take it seriously.

Call me a curmudgeon, but it doesn't half remind me of a while back when I was
being directed to process 12MB datasets on a distributed computing cluster. At
some point the tools stopped being a means to an end, and started being the
only thing anybody cares about.

~~~
salty_biscuits
It is funny: what you are usually supposed to be replacing/automating with ML
is the nasty 100-line conditional statement that forms some bit of "business
logic" that nobody really understands, usually the time-pressured fever dream
of some dev who left the company years ago. Now it is
replaced by something that is even more incomprehensible. If it is done well
then at least you get performance guarantees and the ability to retrain. If it
is done badly, then you don't know what the features are, which model belongs
with which dataset, how to regenerate training sets, where the original
training set went, how to set the hyperparameters to get reasonable training
performance, etc, etc.

------
olooney
For an actual in-depth look at the origin and deep history of data science and
statistics, check out the book "The Seven Pillars of Statistical Wisdom"[1].
It recounts historical anecdotes such as the Trial of the Pyx[2] at the London
mint (an early example of random sampling), how the "foot" was defined by
literally having a bunch of guys put their feet end-to-end, how least squares
was applied to orbits by Gauss[3], how in the 1940s inverting a 14x14 matrix
(for Leontief's input-output analysis in economics[4]) was a big data problem,
and so on.

I personally believe data science began with Bacon's Novum Organum[5], which
describes an inductive method[6] starting from tables of facts and simplifying
them until you reason general truths that may be treated as axioms for the
purposes of deductive reasoning. This is no different in principle than the
model fitting and prediction we do today.

[1]: [https://www.amazon.com/Seven-Pillars-Statistical-
Wisdom/dp/0...](https://www.amazon.com/Seven-Pillars-Statistical-
Wisdom/dp/0674088913)

[2]:
[https://en.wikipedia.org/wiki/Trial_of_the_Pyx](https://en.wikipedia.org/wiki/Trial_of_the_Pyx)

[3]:
[https://www2.stetson.edu/~efriedma/periodictable/html/Ga.htm...](https://www2.stetson.edu/~efriedma/periodictable/html/Ga.html)

[4]:
[https://www.math.ksu.edu/~gerald/leontief.pdf](https://www.math.ksu.edu/~gerald/leontief.pdf)

[5]:
[https://en.wikipedia.org/wiki/Novum_Organum](https://en.wikipedia.org/wiki/Novum_Organum)

[6]:
[https://en.wikipedia.org/wiki/Baconian_method](https://en.wikipedia.org/wiki/Baconian_method)

------
denzil_correa
> This provided him with the subject matter for his famous paper Statistical
> Modeling: The Two Cultures (2001). Like Chambers, he divided statistics into
> two groups: The data modeling culture (Chambers’s lesser statistics) and the
> algorithmic modeling culture (Chambers’s greater statistics). He took it one
> step further, stating that 98% of statisticians were from the former, while
> only 2% were from the latter.

This paper provides an important distinction for the "Machine Learning is just
statistics" remark.

~~~
disgruntledphd2
So, the tools of machine learning are the tools of statistics. The uses to
which those tools are put are quite different.

------
KasianFranks
We used to call 'data science' data mining, as anyone who spent time on
KDnuggets might remember:
[https://en.wikipedia.org/wiki/Data_mining](https://en.wikipedia.org/wiki/Data_mining)

------
bane
This story has become the stuff of urban legend and puts entirely too much
onto John Tukey (who deserves a great deal of credit in other areas). It's
become so entrenched that people who buy into it will stop talking to you, or
at least ignore you, if you point out that this story, which makes the rounds
in stats and ML circles, is entirely wrong.

Peter Naur coined the term in the 60s as an alternative for "Computer Science"
but with a focus on the processing of Data vs abstract computation.

In the stats world, Chikio Hayashi proposed the same term in the 90s. He was
followed a year later by Jeff Wu, who outlined more or less the broad strokes
of modern data science and saw it emerging as a new discipline from stats, but
not necessarily as a computationally based discipline like CS.

Tukey argued for something a bit broader, "data analysis", as a more focused
discipline -- but as a separate thread from Naur's data science. Data has been
analyzed for centuries anyway -- Tukey's main contribution was to bring
several data analytic disciplines together under one term, but his vision
wasn't for data science, it was for data analytics which is, depending on who
you ask, either a different discipline or an umbrella so large that it
includes data science among others.

I'm old enough to also remember data processing turning into data analysis,
expert systems, data mining and so on. Statisticians were off working on
chalkboards and hand-fitting curves with calculators during all of this. Once
the stats folks started realizing, sometime in the 2000s, that Computer Science
had made better and more predictive tools and was a better approach for many
observable phenomena, statisticians the world over bundled up all the different
disciplines they could claw together in a black hole and shot at these other
disciplines, then wrapped it all in incomprehensible statistics jargon and
claimed it as their own, shouting Tukey's name to the mountains.

It's not really worth fighting at this point, but this history is revisionist
urban legend at best.

Source: I've worked in data disciplines since the 90s and remember when all
the roles I used to work in, which dated back decades, suddenly required PhDs
in statistics as stats majors flooded the market in the 2000s.

------
davidw
The joke I've always heard is "a data scientist is a statistician who works in
San Francisco"

------
mxwsn
Leo Breiman's Statistical Modeling: The Two Cultures (2001, pdf) was posted
recently (just 10 days ago) and is a fantastic read.
[https://news.ycombinator.com/item?id=19835962](https://news.ycombinator.com/item?id=19835962)

------
vcdimension
There are many more than just four people who pushed the boundary of
statistics.

------
astazangasta
So... To demonstrate that data science is a distinct field from statistics we
get the story of four statisticians?

Similarly we might find the backgrounds of programmers to be diverse - less
than 50% are CS majors. We might conclude that programming is a separate
discipline from computer science. Sure, it is. But I think data science and
statistics are related in the same way: one provides the theory for the
other's practice.

------
mswen
Back in 2015 I did a webinar session for CIOs titled

    Data Science: Trenches to the Boardroom (10 Lessons)

The definition I was using at the time was:

Multi-disciplinary knowledge, methods and skills for building systems that
acquire, manage and analyze data to deliver insight and automate quantitative
analyses like segmentation, classification and prediction in support of real-
time business processes.

I focused on the building of systems that automate analysis. I did this in an
effort to distinguish what I am now doing from my previous work as a
"research-based consultant." In that role I would take on a business
problem/question and use a variety of qualitative and quantitative techniques
to investigate and come to an evidence-based recommendation for the client.

But now, I build systems that often automate away the work of an entry-level
analyst. Or, use the models to drive systems that automatically adjust prices
to changes in the competitive landscape on a SKU by SKU basis. Or, build
systems that create added-value data sets that are in turn sold to another
company.

------
straits_ship
A key development missing from this article is Jeff Wu's coining of the term
"data science" back in '97.
[http://www2.isye.gatech.edu/~jeffwu/presentations/datascienc...](http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf)

~~~
kgwgk
"The roundtable discussion “Perspectives in classification and the Future of
IFCS” was held at the last Conference under the chairmanship of Professor
H.-H. Bock. In this panel discussion, I used the phrase ‘Data Science’. There
was a question, “What is ‘Data Science’? ” I briefly answered it. This is the
starting point of the present paper."

[https://link.springer.com/chapter/10.1007/978-4-431-65950-1_...](https://link.springer.com/chapter/10.1007/978-4-431-65950-1_3)

"This volume, Data Science, Classification, and Related Methods, contains a
selection of papers presented at the Fifth Conference of the International
Federation of Classification Societies (IFCS-96), which was held in Kobe,
Japan, from March 27 to 30, 1996."

------
lukejduncan
How do you write the origin story of Data Science and not mention DJ Patil and
Jeff Hammerbacher?

