Add to that a small army of people, because, you know, you need specialists from a variety of professions just to debug the integration issues between all those components that WOULD NOT BE NEEDED if you just decided to put your stuff in memory.
Frankly, the proportion of projects that really need to work on data that cannot fit in the memory of a single machine is very low. I work for one of the largest banks in the world, processing most of its trades from all over the world, and guess what: all of it fits in RAM.
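A quick back-of-envelope sketch of why that's plausible. The numbers below are my own illustrative assumptions (the poster gives no figures), but they show the order of magnitude:

```python
# Illustrative (invented) numbers for trade data volume.
# Assumptions: 10 million trades/day, ~1 KB per trade record,
# ~260 trading days per year. Not the poster's actual figures.

trades_per_day = 10_000_000
bytes_per_trade = 1_000

daily = trades_per_day * bytes_per_trade        # bytes per day
yearly = daily * 260                            # bytes per trading year

print(f"daily:  {daily / 1e9:.0f} GB")          # 10 GB
print(f"yearly: {yearly / 1e12:.1f} TB")        # 2.6 TB
```

Under those assumptions, a full year of trades is a few terabytes, which fits comfortably on a single large-memory server today.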
Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc.; 100k sensors) would fill a 40TB array every year. They had a 20-year retention policy, so Dell would just add a new server + array every year.
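Sanity-checking the story's numbers (my arithmetic, assuming the 40 TB is spread evenly across the 100k sensors over a calendar year, decimal TB):

```python
# Back-of-envelope check on the sensor-data figures above.
TB = 10**12
SECONDS_PER_YEAR = 365 * 24 * 3600          # 31,536,000

yearly_bytes = 40 * TB
sensors = 100_000

total_rate = yearly_bytes / SECONDS_PER_YEAR   # overall ingest, bytes/sec
per_sensor = total_rate / sensors              # bytes/sec per sensor
retention_tb = 20 * 40                         # TB over the 20-year policy

print(f"ingest rate: {total_rate / 1e6:.2f} MB/s")   # ~1.27 MB/s
print(f"per sensor:  {per_sensor:.1f} bytes/s")      # ~12.7 bytes/s
print(f"20-year retention: {retention_tb} TB")       # 800 TB
```

So the aggregate write rate is modest (about a megabyte per second); the pain is entirely in the 800 TB retained, not the ingest.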
I worked with another company that had 2,000 Oracle servers in some sort of franken-cluster configuration. Reports took a week to run, and they had pricing data for their industry (they were a transaction middleman) going back almost 40 years. I can't even guess the data size, because nobody could figure it out.
This is not a FAANG problem. This is an average-SME-to-large-enterprise problem. Yeah, startups don't have much data. Most companies out there aren't startups.
By the way, memory isn't the only solution. In the past 15 years, I've rarely worked on projects where everything was in memory. Disks work just fine with good database technology.
That's a lot of data, but what do you even do with it other than take minuscule slices or calculate statistics?
And for those uses, I'd put whether it fits in RAM as not applicable. It doesn't, but can you even tell the difference?
Agreed that 20-year retention is silly. We thought it was silly too, but the policies reflected the need for historical analysis for audit purposes.
It does in fact matter what you can fit in RAM, though. We had to adapt all our systems to a janky SQL Server setup that was horrible for time-series data, and make our software run on those servers. RAM availability for working sets was a huge bottleneck (hence the cost of analysis).
Yes, your company's data may fit in RAM. But does every intermediate data set also fit in RAM? Because I've also worked at a bank, and we had thousands of complex ETLs, often needing tens to hundreds of intermediate sets along the way. There is no AWS server that can keep all of that in flight at one time.
And what about your data analysts/scientists? Can all of their random data sets reside in RAM on the same server too?
$100K has always been "cheap" for a "business computer" and today you can get more computer for that money than ever.
$100K of hardware (per year or so) is small-fry compared to almost every other R&D industry out there. Just compare with the cost of debuggers, oscilloscopes and EMC labs for electronic engineers.
Buy them a machine each, at a cost of $40-60 billion?
Or would it make more sense to buy one Spark cluster and share the resources at a fraction of the cost?
Still expensive, but much less than $40 billion.
We have a Spark cluster which supports all of those users for $10-$20k a month.
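Rough arithmetic on the trade-off in this thread, assuming (my reading, not stated by the commenters) that the $40-60 billion figure comes from one $100k machine per user, set against the $10-20k/month shared-cluster cost quoted above:

```python
# Cost comparison: dedicated machines vs. a shared Spark cluster.
# All figures are the thread's quoted numbers; the per-user framing
# is my assumption.

machine_cost = 100_000                      # $ per dedicated machine
fleet_low, fleet_high = 40e9, 60e9          # $ total, from the thread

# Implied head count if everyone got their own box
users_low = fleet_low / machine_cost        # 400,000
users_high = fleet_high / machine_cost      # 600,000

# Shared Spark cluster, annualized from the $10-20k/month figure
cluster_low = 10_000 * 12                   # $120,000/year
cluster_high = 20_000 * 12                  # $240,000/year

print(f"implied users: {users_low:,.0f} to {users_high:,.0f}")
print(f"cluster cost:  ${cluster_low:,} to ${cluster_high:,} per year")
```

In other words, the shared cluster is roughly five orders of magnitude cheaper per year than buying the whole fleet outright, which is the point being made.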
There is a trend, when an application is inefficient, to spend huge amounts of resources on scaling it instead of making the application itself more efficient.
I mean, I also prefer doing things on a single machine, but if that machine gets expensive enough, or writing a program that can actually use all that power gets too difficult, why not switch to a cloud database?
Look at it this way: this is for the person who was already going to buy enough RAM sticks to fit the entire data set (or multiples of it) across many machines, and deal with the enormous overhead of running queries across many machines. The revelation is that you can fit that much RAM inside a single machine, for a much cheaper and faster experience.
We have way too many fucking copies of our DB.