If you visit the sorting exhibit at the Computer History Museum, you will rapidly discover that big data and data science predate computers.
Its oldest machine held a punch card above a pool of mercury and then dropped a set of metal pins onto the card; any pin that passed through a hole touched the mercury and completed a circuit.
The signals from the metal pins activated mechanical counters.
In this way, it could perform aggregation queries for operators with constant-sized sufficient statistics (roughly, map-only jobs). It was used by the US Census Bureau.
Later machines could sort the punch cards, which enabled map-reduce-style queries.
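In modern terms, the tabulating workflow above is a single pass that folds each record into constant-sized per-group state. A minimal Python sketch of such a map-only aggregation (the card fields and values are invented for illustration):

```python
from collections import Counter

# Each "card" is one record; the counters below hold constant-sized state
# per group no matter how many cards pass through -- a map-only aggregation.
cards = [
    {"state": "NY", "household_size": 4},
    {"state": "PA", "household_size": 2},
    {"state": "NY", "household_size": 3},
]

counts = Counter()  # per-state card counts
totals = Counter()  # per-state sums (a sufficient statistic for the mean)
for card in cards:  # one pass, like cards dropping through the reader
    counts[card["state"]] += 1
    totals[card["state"]] += card["household_size"]

mean_size = {s: totals[s] / counts[s] for s in counts}
print(counts["NY"], mean_size["NY"])  # → 2 3.5
```

Sorting the cards first is what additionally enables grouped reduce steps over arbitrarily large groups, which is the map-reduce upgrade the later machines provided.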
The term “data base” arose about 50 years after the census machine was in use.
A military installation was running the equivalent of map reduce queries and needed a noun to describe their mountains of punch cards (or maybe reel to reel tapes).
Hollerith, however, already had electrical (I hesitate to say "electronic") card reading in the 19th century (Hollerith's company was one of the progenitors of CTR, by the way). All those cool optimizations like mercury-bath readers, vacuum columns for tapes, etc. were simply expensive performance optimizations of tasks previously done mechanically, only viable for people with huge datasets.
By the 1940s, the Bell long distance area code routing tables were stored as sets of metal punch cards the size of North American automobile license plates. Since they were metal (they were read a lot), they were presumably read mechanically, though I haven't ever seen the apparatus. Yes, they updated the routing tables by sending huge crates of metal tablets to each toll office.
Whippersnappers might not remember the ubiquity of punch cards (actually I don't remember either, and I'm the oldest of the genXers), but apparently you got one with your bills, like utilities and charge card bills, so they could assign payment to the correct account. I do remember highway tolls being collected by issuing a punch card upon entry to the turnpike and then having the toll collector use that to calculate the toll due when you exited. Apparently the phrase "do not fold, spindle or mutilate" was a commonly used tagline for jokes about the computerization of society, as all these cards bore that legend.
The modern analog is the need to refresh DRAM in order to keep the bits flipped the right way.
Also, that book is fantastic, as are the prequels.
It's one of the reasons I've switched more to publishing open-source tooling for data science/AI instead of blog posts about data science.
10 years ago, before it was a household term, people could mention without blushing that their production implementation ended up being a simple if-statement. If anything, being able to come up with an insight that was clear and well-focused enough to be implemented in such a straightforward way was a feather in one's cap.
Nowadays, if you aren't building a model with at least a few million parameters (though billions would be better), that ends up being so complex that the only hope for productionizing it is to stick it into a container and ask the engineering and ops folks not to peek inside it too carefully - sort of like a parent leaving the door to their teenager's bedroom closed - then nobody wants to take it seriously.
Call me a curmudgeon, but it doesn't half remind me of a while back when I was being directed to process 12MB datasets on a distributed computing cluster. At some point the tools stopped being a means to an end, and started being the only thing anybody cares about.
"Big data" is not some myth. I've worked at companies big and small that are easily ingesting terabytes of usable data per day, and queries consume multiple days' worth of this semi-structured data. So starting out, you need people with the basic software proficiency to efficiently query and work with this data.
Virtually all data science team meetings that I've been in have assumed a pretty strong familiarity with probability/statistics, linear algebra, and calculus. White board solutions that come up everyday involve a non-trivial amount of quantitative reasoning. It's incredibly rare to see non-DS people with these skills outside of what were traditionally research roles in larger companies.
Likewise the job requires not just knowing the math but knowing how to implement it when necessary. Sure there are plenty of black box libraries out there, but these will only get you so far. Diving into the software implementation details of these things is a pretty common issue.
What decades-old methodologies do you think are being merely rebranded in the data scientists you've worked with? Nearly all of the pure stats people I've worked with in years past did most of their work on the whiteboard (or SAS) but left the code up to someone else. Business analysts still exist and honestly have none of the technical skills necessary. Software engineers rarely have the quantitative background to quickly prototype.
In all the places I've worked, DS is a genuinely unique intersection of skills that requires reasonable proficiency across the board to be effective.
There have always been stats people that code, and coding people that do stats.
Most of the useful (for structured-data problems) models are pretty old, and you can solve an awful lot of business problems with linear models with carefully chosen features.
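To illustrate that point (a sketch on synthetic data; the demand/price setup and all numbers are invented): a plain least-squares fit, decades old, recovers the relationship as soon as the carefully chosen feature, here log(price), is in the design matrix.

```python
import numpy as np

# Hypothetical example: demand falls with log(price); the engineered
# feature does most of the work, the model is just linear regression.
rng = np.random.default_rng(0)
price = rng.uniform(1, 20, size=200)
demand = 100 - 30 * np.log(price) + rng.normal(0, 1, size=200)

# Design matrix: intercept column plus the hand-chosen log(price) feature.
X = np.column_stack([np.ones_like(price), np.log(price)])
coef, *_ = np.linalg.lstsq(X, demand, rcond=None)
print(coef)  # ≈ [100, -30], up to noise
```

Fit the same data on raw price instead of log(price) and the residuals give the misspecification away, which is the whole craft the comment is pointing at.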
(It is perhaps worth pointing out that, with the exception of ReLU, CNNs in their modern form have existed since the '80s.)
But on the other hand, the large-scale hiring of people to do this work is pretty new, as well as the volume of data available to even moderately sized services, so there's a quantitative change that is somewhat qualitative.
I'm also interested in the fact that there is no mention of A-B tests in the OP, given that the need for this is relatively new, and something that probably pushed a lot of companies towards needing more quantitative people, which led to the current nature of "modern" data science.
Someone who could, for instance, write code from scratch to calculate maximum likelihood estimates, and use it to make predictions on real world data.
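For instance (a from-scratch sketch on synthetic data, not anyone's production code): logistic regression fit by gradient ascent on the log-likelihood, then used to predict.

```python
import numpy as np

# Toy data: two features, labels drawn from a logistic model (all invented).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_w = np.array([2.0, -1.0])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

# Maximum likelihood estimation by gradient ascent on the mean log-likelihood.
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))       # model probabilities
    w += 0.1 * X.T @ (y - p) / len(y)  # score (gradient) step

# Use the fitted weights to predict, and measure training accuracy.
preds = (1 / (1 + np.exp(-X @ w)) > 0.5).astype(float)
acc = (preds == y).mean()
```

The loop (write down the likelihood, differentiate, iterate, predict) is the skill set in miniature: no black-box library required.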
The term got coopted by snake oil salesmen pretty soon after.
Is there anyone sufficiently old to remember AI in the '80s reading this? What was the job title du jour back then?
I personally believe data science began with Bacon's Novum Organum, which describes an inductive method starting from tables of facts and simplifying them until you reason general truths that may be treated as axioms for the purposes of deductive reasoning. This is no different in principle than the model fitting and prediction we do today.
This paper provides an important distinction for the "Machine Learning is just statistics" remark.
Peter Naur coined the term in the '60s as an alternative to "Computer Science", but with a focus on the processing of data versus abstract computation.
In the stats world, Chikio Hayashi proposed the same term in the '90s, followed by Jeff Wu a year later; their version outlines more or less the broad strokes of modern data science and saw it emerging as a new discipline from stats, but not necessarily as a computationally based discipline like CS.
Tukey argued for something a bit broader, "data analysis", as a discipline in its own right -- but as a separate thread from Naur's data science. Data has been analyzed for centuries anyway -- Tukey's main contribution was to bring several data analytic disciplines together under one term, but his vision wasn't for data science, it was for data analytics, which is, depending on who you ask, either a different discipline or an umbrella so large that it includes data science among others.
I'm old enough to also remember data processing turning into data analysis, expert systems, data mining and so on. Statisticians were off working on chalkboards and hand-fitting curves with calculators during all of this. Once the stats folks started realizing, sometime in the 2000s, that computer science had made better and more predictive tools and was a better approach for many observable phenomena, statisticians the world over bundled up all the different disciplines they could claw together, wrapped it all in incomprehensible statistics jargon, and claimed it as their own, shouting Tukey's name to the mountains.
It's not really worth fighting at this point, but this history is revisionist urban legend at best.
Source: I've worked in data disciplines since the 90s and remember when all the roles I used to work in that dated back decades, suddenly required PhDs in statistics as Stats majors flooded the market in the 2000s.
Similarly, we might find the backgrounds of programmers to be diverse -- less than 50% are CS majors. We might conclude that programming is a separate discipline from computer science. Sure, it is. But I think data science and statistics are related in the same way: one provides the theory for the other's practice.
Data Science: Trenches to the Boardroom (10 Lessons)
Multi-disciplinary knowledge, methods and skills for building systems that acquire, manage and analyze data to deliver insight and automate quantitative analyses like segmentation, classification and prediction in support of real-time business processes.
I focused on the building of systems that automate analysis. I did this in an effort to distinguish what I am now doing from my previous work as a "research-based consultant." In that role I would take on a business problem/question and use a variety of qualitative and quantitative techniques to investigate and come to an evidence based recommendation for the client.
But now, I build systems that often automate away the work of an entry level analyst. Or, use the models to drive systems that automatically adjust prices to changes in the competitive landscape on a SKU by SKU basis. Or, build systems that create added-value data sets that are in turn sold to another company.
"This volume, Data Science, Classification, and Related Methods, contains a selection of papers presented at the Fifth Conference of the International Federation of Classification Societies (IFCS-96), which was held in Kobe, Japan, from March 27 to 30, 1996."