The origin story of data science (welcometothejungle.co)
121 points by sheckpar on May 16, 2019 | 33 comments

This origin story is at least a century too late.

If you visit the sorting exhibit at the Computer History Museum, you will rapidly discover that big data and data science predate computers.

Their oldest machine sticks punch cards above a pool of mercury, and then drops a bunch of metal pins onto the card.

The signals from the metal pins activated mechanical counters.

In this way, it could perform aggregation queries for operators with constant-sized sufficient statistics (~= map-only jobs). It was used by the US Census Bureau.
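To make "constant-sized sufficient statistics" concrete, here is a rough sketch of that kind of single-pass tabulation. The card fields and values are made up for illustration, and the counters stand in for the machine's mechanical dials; this is not a model of the actual hardware.

```python
from collections import Counter

# Hypothetical deck: each card is a dict of punched fields (invented data).
cards = [
    {"state": "NY", "sex": "F", "age": 34},
    {"state": "NY", "sex": "M", "age": 51},
    {"state": "OH", "sex": "F", "age": 8},
]

def tabulate(deck):
    """One pass over the deck; the only state kept is a fixed set of counters."""
    counters = Counter()
    for card in deck:
        counters["total"] += 1            # count of cards seen
        counters["age_sum"] += card["age"]  # running sum, enough to get a mean
        if card["sex"] == "F":
            counters["female"] += 1       # conditional count
    return counters

result = tabulate(cards)
mean_age = result["age_sum"] / result["total"]
```

The memory used does not grow with the number of cards, which is why no sorting or grouping step is needed for counts, sums, and means.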

Later machines could sort the punch cards, which enabled map reduce queries.
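A minimal sketch of why sorting is the missing piece, assuming a made-up deck of (key, value) cards: once cards with the same key sit next to each other, a per-key reduce is just one more sequential pass.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical deck of (state, age) cards in arbitrary order (invented data).
cards = [("OH", 8), ("NY", 34), ("OH", 40), ("NY", 51)]

def sum_by_key(deck):
    # "Shuffle" step: sorting brings cards with the same key together.
    ordered = sorted(deck, key=itemgetter(0))
    # "Reduce" step: one sequential pass over each run of equal keys.
    return {
        key: sum(value for _, value in run)
        for key, run in groupby(ordered, key=itemgetter(0))
    }

totals = sum_by_key(cards)
```

Without the sort, `groupby` would only merge adjacent equal keys, which is the same limitation a counting-only machine has.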

The term “data base” came about 50 years after the census machine was in use.

A military installation was running the equivalent of map reduce queries and needed a noun to describe their mountains of punch cards (or maybe reel to reel tapes).

Back when the Computing-Tabulating-Recording company expanded and renamed themselves International Business Machines their bread and butter was automated card reading and managing, but I know the manual card reading process (with rods and trays) continued, amazingly, into the 1950s at least (I had professors who had used those techniques as grad students).

Hollerith, however, already had electrical (I hesitate to say "electronic") card reading in the 19th century (Hollerith was one of the progenitors of CTR, by the way). All those cool optimizations like mercury-bath readers, vacuum columns for tapes, etc. were simply expensive performance optimizations of tasks previously done mechanically, only viable for people with huge datasets.

By the 1940s the Bell long distance area code routing tables were stored as sets of metal punch cards the size of North American automobile license plates. As they were metal (they were read a lot) they were presumably read mechanically though I haven't ever seen the apparatus. Yes, they updated the routing tables by sending huge crates of metal tablets to each toll office.

Whippersnappers might not remember the ubiquity of punch cards (actually I don't remember either, and I'm the oldest of the genXers), but apparently you got one with your bills, like utilities and charge card bills, so they could assign payment to the correct account. I do remember highway tolls being collected by issuing a punch card upon entry to the turnpike and then having the toll collector use that to calculate the toll due when you exited. Apparently the phrase "do not fold, spindle or mutilate" was a commonly used tagline for jokes about the computerization of society, as all these cards bore that legend.

I remember the tolls like that as well. The Ohio Turnpike had them as late as 2000, IIRC.

The Maine Turnpike got rid of theirs in 1997.

I point to the discovery of the source of the Broad Street cholera epidemic in 1854 as an example of sophisticated pattern recognition that predates computers:


Do you have any further references about what exact military installation was doing this and when? I’m reading Cryptonomicon by Neal Stephenson and one of the main characters builds a mercury-based data analysis machine that reads punch cards to analyze communication data in WWII and this sounds eerily similar.

A lot of that book took into account real life events and people. Unfortunately I don't have specifics on that one, so hopefully the parent does.

Also, that book is fantastic, as are the prequels.

IIRC, mercury-based memories are part of a larger class of memory called "delay-line memory" that dates all the way back to the 1920s. The basic premise is that it uses the time delay of signals propagating in some medium to temporarily "store" a value.

The modern analog is the need to refresh DRAM in order to keep the bits flipped the right way.
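A toy model of the delay-line idea, as a sketch only: bits circulate through a fixed-length medium, and each bit that comes out the far end is fed back in (the "refresh"). The class and method names are invented for illustration, and real delay lines dealt with timing and signal shaping that this ignores.

```python
from collections import deque

class DelayLine:
    """Toy delay-line memory: bits circulate through a fixed-length medium."""

    def __init__(self, bits):
        self.line = deque(bits)

    def tick(self, write=None):
        # The bit emerging from the far end of the medium...
        out = self.line.popleft()
        # ...is fed back into the near end, unless it is being overwritten.
        self.line.append(out if write is None else write)
        return out

    def read(self, index):
        # Access is sequential: wait for the wanted bit to reach the output.
        for _ in range(index):
            self.tick()
        bit = self.tick()
        # Finish the lap so the line returns to its original alignment.
        for _ in range(len(self.line) - index - 1):
            self.tick()
        return bit

mem = DelayLine([1, 0, 1, 1])
second_bit = mem.read(1)
```

Every read costs a full lap through the medium, and the value only persists because it keeps being re-injected, which is where the DRAM refresh comparison comes from.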


see my reply to the parent post

Data scientists are the brand-name salesmen of tech. They talk about "data science" so much that every business owner is now convinced that they need to buy into machine learning and "Big data" to be legitimate in the current year, but despite all of the chatter it isn't any clearer what exactly "data science" is or how it's different from technology and methods that have existed for decades.

As a current data scientist, the problem is that the concepts behind data science/big data are legitimate solutions to business problems, but the current marketing of the field is either misleading (e.g. it's not all about building models/AI, and I wrote a post complaining about that: https://minimaxir.com/2018/10/data-science-protips/ ), or it's business upsell for an enterprise product that may not be the optimal solution to a given problem.

It's one of the reasons I've switched more to publishing open-source tooling for data science/AI instead of blog posts about data science.

The obsession with "AI" is particularly fascinating.

10 years ago, before it was a household term, people could mention without blushing that their production implementation ended up being a simple if-statement. If anything, being able to come up with an insight that was clear and well-focused enough to be implemented in such a straightforward way was a feather in one's cap.

Nowadays, if you aren't building a model with at least a few million parameters (though billions would be better), that ends up being so complex that the only hope for productionizing it is to stick it into a container and ask the engineering and ops folks not to peek inside it too carefully - sort of like a parent leaving the door to their teenager's bedroom closed - then nobody wants to take it seriously.

Call me a curmudgeon, but it doesn't half remind me of a while back when I was being directed to process 12MB datasets on a distributed computing cluster. At some point the tools stopped being a means to an end, and started being the only thing anybody cares about.

It is funny, what you are usually supposed to be replacing/automating with ML is the nasty 100 line conditional statement that forms some bit of "business logic" that nobody really understands. This thing usually being the time pressured fever dream of some dev that left the company years ago. Now it is replaced by something that is even more incomprehensible. If it is done well then at least you get performance guarantees and the ability to retrain. If it is done badly, then you don't know what the features are, which model belongs with which dataset, how to regenerate training sets, where the original training set went, how to set the hyperparameters to get reasonable training performance, etc, etc.

IBM's attempted marketing grab of the space with Watson has probably set back significant portions of the market by a decade at least.

Whenever people make this comment I have to seriously question if they've ever worked with any actual data scientists or at a company doing any data related work at scale? Data science is a unique skill set that is quite different than any previously existing role (most of which, for the most part, still exist).

"Big data" is not some myth. I've worked at companies big and small that are easily ingesting terabytes of usable data per day, and queries consume multiple days' worth of this semi-structured data. So starting out you need people with the basic software proficiency to efficiently query and work with this data.

Virtually all data science team meetings that I've been in have assumed a pretty strong familiarity with probability/statistics, linear algebra, and calculus. White board solutions that come up everyday involve a non-trivial amount of quantitative reasoning. It's incredibly rare to see non-DS people with these skills outside of what were traditionally research roles in larger companies.

Likewise the job requires not just knowing the math but knowing how to implement it when necessary. Sure there are plenty of black box libraries out there, but these will only get you so far. Diving into the software implementation details of these things is a pretty common issue.

What decades old methodologies do you think are being merely rebranded in the data scientists you've worked with? Nearly all of the pure stats people I've worked with in the years past did most of their work on the white board (or SAS) but left the code up to someone else. Business analysts still exist and honestly have none of the technical skills necessary. Software engineers rarely have the quantitative background to quickly prototype.

In all the places I've worked DS is a genuinely unique intersection of skills that requires a pretty reasonable proficiency across the board to be effective.

I think that both of you are right, to be honest.

There have always been stats people that code, and coding people that do stats.

Most of the useful (for structured-data problems) models are pretty old, and you can solve an awful lot of business problems with linear models with carefully chosen features.

(It is perhaps worth pointing out that, with the exception of ReLU, CNNs in their modern form have existed since the 80s.)

But on the other hand, the large-scale hiring of people to do this work is pretty new, as well as the volume of data available to even moderately sized services, so there's a quantitative change that is somewhat qualitative.

I'm also interested in the fact that there is no mention of A-B tests in the OP, given that the need for this is relatively new, and something that probably pushed a lot of companies towards needing more quantitative people, which led to the current nature of "modern" data science.

When the term was first coined it had a clear meaning - someone who was a statistician, and a programmer, and a subject matter expert.

Someone who could, for instance, write code from scratch to calculate maximum likelihood estimates, and use it to make predictions on real world data.
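For a sense of scale, "from scratch" can be quite small. Below is a sketch, not anything from the article: it assumes exponentially distributed waiting times (made-up numbers), maximizes the log-likelihood n*log(rate) - rate*sum(x) with Newton's method, and uses the fitted rate for a prediction. The closed-form answer is 1/mean, which makes it easy to check.

```python
import math

def exponential_mle(samples, iterations=50):
    """Maximum likelihood estimate of an exponential rate via Newton's method.

    Assumes all samples are positive.
    """
    n = len(samples)
    total = sum(samples)
    # Start at or below the optimum (1/max <= 1/mean) so the iterates stay positive.
    rate = 1.0 / max(samples)
    for _ in range(iterations):
        gradient = n / rate - total   # d/d(rate) of the log-likelihood
        hessian = -n / rate ** 2      # second derivative, always negative
        rate -= gradient / hessian    # Newton update
    return rate

waits = [2.0, 0.5, 1.5, 4.0, 2.0]  # invented waiting times, mean = 2.0
rate_hat = exponential_mle(waits)

# Prediction from the fitted model: probability a wait exceeds 3 time units.
p_longer_than_3 = math.exp(-rate_hat * 3.0)
```

Here the estimate converges to 0.5, matching 1/mean, so the hand-rolled optimizer is only really needed for models with no closed form.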

The term got coopted by snake oil salesmen pretty soon after.

Wait now he has to write his own optimization routines too?

I’ve been reading about “expert systems” of late for a particular gig I’m pitching for ... most of the literature is 1980-1985 and what I realised is that expert systems were all the rage back then. The IBMs of this world sold them and you clearly didn’t know what’s best for you if you weren’t buying them.

Is there anyone sufficiently old to remember AI in the 80s reading this? What was the job title du jour back then ?

For an actual in-depth look at the origin and deep history of data science and statistics, check out the book "The Seven Pillars of Statistical Wisdom"[1]. It recounts historical anecdotes such as the Trial of the Pyx[2] at the London mint (an early example of random sampling), how the "foot" was defined by literally having a bunch of guys put their feet end-to-end, how least squares was applied to orbits by Gauss[3], how in the 1940s inverting a 14x14 matrix (for Leontief's input-output analysis in economics[4]) was a big data problem, and so on.

I personally believe data science began with Bacon's Novum Organum[5], which describes an inductive method[6] starting from tables of facts and simplifying them until you reach general truths that may be treated as axioms for the purposes of deductive reasoning. This is no different in principle than the model fitting and prediction we do today.

[1]: https://www.amazon.com/Seven-Pillars-Statistical-Wisdom/dp/0...

[2]: https://en.wikipedia.org/wiki/Trial_of_the_Pyx

[3]: https://www2.stetson.edu/~efriedma/periodictable/html/Ga.htm...

[4]: https://www.math.ksu.edu/~gerald/leontief.pdf

[5]: https://en.wikipedia.org/wiki/Novum_Organum

[6]: https://en.wikipedia.org/wiki/Baconian_method

> This provided him with the subject matter for his famous paper Statistical Modeling: The Two Cultures (2001). Like Chambers, he divided statistics into two groups: The data modeling culture (Chambers’s lesser statistics) and the algorithmic modeling culture (Chambers’s greater statistics). He took it one step further, stating that 98% of statisticians were from the former, while only 2% were from the latter.

This paper provides an important distinction for the "Machine Learning is just statistics" remark.

So, the tools of machine learning are the tools of statistics. The uses to which those tools are put are quite different.

It describes why machine learning was regarded as separate from statistics twenty years ago. The world has moved on.

We used to call 'data science' data mining as anyone that spent time on KDnuggets might remember https://en.wikipedia.org/wiki/Data_mining

This story has become the stuff of urban legend and puts entirely too much onto John Tukey (who deserves a great deal of credit in other areas). It's become so common in some circles that people who buy into it will stop talking to you, or at least ignore you, if you point out that this story, which makes the rounds in stats and ML circles, is entirely wrong.

Peter Naur coined the term in the 60s as an alternative for "Computer Science" but with a focus on the processing of Data vs abstract computation.

In the stats world, Chikio Hayashi proposed the same term in the 90s, followed a year later by Jeff Wu, who outlined more or less the broad strokes of modern data science and saw it emerging as a new discipline from stats, but not necessarily as a computationally based discipline like CS.

Tukey argued for something a bit broader, "data analysis", as a more focused discipline -- but as a separate thread from Naur's data science. Data has been analyzed for centuries anyway -- Tukey's main contribution was to bring several data analytic disciplines together under one term, but his vision wasn't for data science, it was for data analytics which is, depending on who you ask, either a different discipline or an umbrella so large that it includes data science among others.

I'm old enough to also remember data processing turning into data analysis, expert systems, data mining and so on. Statisticians were off working on chalkboards and hand-fitting curves with calculators during all of this. Once the stats folks started realizing, sometime in the 2000s, that Computer Science had made better and more predictive tools and was a better approach for many observable phenomena, statisticians the world over bundled up all the different disciplines they could claw together in a black hole and shot at these other disciplines, then wrapped it all in the incomprehensible statistics jargon and claimed it as their own, shouting Tukey's name to the mountains.

It's not really worth fighting at this point, but this history is revisionist urban legend at best.

Source: I've worked in data disciplines since the 90s and remember when all the roles I used to work in that dated back decades, suddenly required PhDs in statistics as Stats majors flooded the market in the 2000s.

The joke I've always heard is "a data scientist is a statistician who works in San Francisco"

Leo Breiman's Statistical Modeling: The Two Cultures (2001, pdf) was posted recently (just 10 days ago) and is a fantastic read. https://news.ycombinator.com/item?id=19835962

There are many more than just four people who pushed the boundary of statistics.

So... To demonstrate that data science is a distinct field from statistics we get the story of four statisticians?

Similarly we might find the backgrounds of programmers to be diverse - less than 50% are CS majors. We might conclude that programming is a separate discipline from computer science. Sure, it is. But I think data science and statistics are related in the same way: one provides the theory for the other's practice.

Back in 2015 I did a webinar session for CIOs titled

    Data Science: Trenches to the Boardroom (10 Lessons)
The definition I was using at the time was:

Multi-disciplinary knowledge, methods and skills for building systems that acquire, manage and analyze data to deliver insight and automate quantitative analyses like segmentation, classification and prediction in support of real-time business processes.

I focused on the building of systems that automate analysis. I did this in an effort to distinguish what I am now doing from my previous work as a "research-based consultant." In that role I would take on a business problem/question and use a variety of qualitative and quantitative techniques to investigate and come to an evidence based recommendation for the client.

But now, I build systems that often automate away the work of an entry level analyst. Or, use the models to drive systems that automatically adjust prices to changes in the competitive landscape on a SKU by SKU basis. Or, build systems that create added-value data sets that are in turn sold to another company.

Key development missing from this article is Jeff Wu's coining of the term "data science" back in '97. http://www2.isye.gatech.edu/~jeffwu/presentations/datascienc...

"The roundtable discussion “Perspectives in classification and the Future of IFCS” was held at the last Conference under the chairmanship of Professor H. -H. Bock. In this panel discussion, I used the phrase ‘Data Science’. There was a question, “What is ‘Data Science’? ” I briefly answered it. This is the starting point of the present paper."


"This volume, Data Science, Classification, and Related Methods, contains a selection of papers presented at the Fifth Conference of the International Federation of Classification Societies (IFCS-96), which was held in Kobe, Japan, from March 27 to 30, 1996."

How do you write the origin story of Data Science and not mention DJ Patil and Jeff Hammerbacher?
