
Does my data fit in RAM? - louwrentius
https://yourdatafitsinram.net/
======
Jupe
Some pedantry...

Raw RAM space is not the issue... it's indexing and structure of the data that
makes it process-able.

If you just need to spin through the data once, there's no need to even put
all of it in RAM - just stream it off disk and process it sequentially.

If you need to join the data, filter, index, query it, you'll need a lot more
RAM than your actual data. Database engines have their own overhead (system
tables, query processors, query parameter caches, etc.)

And, this all assumes read-only. If you want to update that data, you'll need
even more for extents, temp tables, index update operations, etc.

~~~
LeoPanthera
I often use RAM disks for production services. I'm sure that I shouldn't. It
feels very lazy, and I know that I'm probably missing out on the various
"right" ways to do things.

But it works _so well_ and it's _so easy_. It's a really difficult habit to
kick.

Obligatory joke: "Oh boy, virtual memory! Now I can have a _really big_ RAM
disk!"

------
siffland
Funny story: about 6 years ago we got an HP DL980 server with 1TB of memory to
move off an Itanium HP-UX server. The test database was Oracle and about 600GB
in size. We loaded the data, and they had a query test they would run; the
first run took about 45 minutes (which was several hours faster than the
HP-UX). They made changes, and all the rest of the runs took about 5 minutes
to complete. Finally someone asked me, and I manually dropped the buffers and
cache, and it was back to about 45 minutes.

Their changes hadn't done anything; everything was getting cached. It was
cool, but one needs to know what is happening with their data. I am just glad
they asked before going to management saying their tests only took 5 minutes.

~~~
Dylan16807
What an unfortunate cache-filling algorithm, though. With eight drive slots,
40 minutes of IO is about 31 megabytes per disk per second.

------
yongjik
Even more importantly, does your data _have to_ fit in RAM?

There are tons of problems that need to process large data, but touch each
item just once (or a few times). You can go a really long way by storing the
items on disk (or in some cloud storage like S3) and writing a script to scan
through them.

I know, pretty obvious, but somehow escapes many devs.
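
For illustration, a minimal Python sketch of that scan-once approach (the file
name and column are made up): read one record at a time and keep only the
running aggregate in memory.

    # Stream a large newline-delimited CSV and aggregate in constant memory.
    # "events.csv" and its "amount" column are hypothetical.
    import csv

    total = 0.0
    rows = 0
    with open("events.csv", newline="") as f:
        for row in csv.DictReader(f):   # only one row in memory at a time
            total += float(row["amount"])
            rows += 1

    print(f"rows={rows} total={total}")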

~~~
sevensor
There's also the "not all memory is RAM" trick: plan ahead with enough swap to
fit all the data you intend to process, and just _pretend_ that you have
enough RAM. Let the virtual memory subsystem worry about whether or not it
fits in RAM. Whether this works well or horribly depends on your data layout
and access patterns.

~~~
mrmrcoleman
Interesting. Can you provide some examples of where this is the correct
approach?

~~~
tmountain
This is how MongoDB originally managed all its data. It used memory-mapped
files to store the data and let the underlying OS memory management facilities
do what they were designed to do. This saved the MongoDB devs a ton of
complexity in building their own custom cache and let them get to market much
faster. The downside is that since virtual memory is shared between processes,
other competing processes could potentially mess with your working set
(pushing warm data out, etc). The other downside is that since you're turning
over the management of that “memory” to the OS, you lose the fine-grained
control that can be used to optimize for your specific use case.
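
As a rough sketch of what that looks like from user space (in Python here; the
file name and record layout are invented), the kernel's page cache decides
which parts of the mapping stay resident:

    # Map a file into the address space and let the OS page it in and out.
    # "data.bin" is a hypothetical file of fixed-width 8-byte records.
    import mmap
    import struct

    with open("data.bin", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Random access looks like plain memory reads; the kernel faults
        # pages in from disk on demand and may evict them under pressure.
        value = struct.unpack("<q", mm[80:88])[0]   # the 11th record
        print(value)
        mm.close()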

~~~
xchaotic
Except nowadays with Docker / Kubernetes you can safely assume the DB engine
will be the only tenant of a given VM / pod / whatever, so I think it's better
to let the OS do memory management than to fight it.

------
ksec
The problem is DRAM prices haven't dropped one bit.

The lowest price floor per GB has been similar for the past decade: roughly
$2.8/GB in 2012, 2016, and 2019. And all DRAM manufacturers have been enjoying
a very profitable period.

And yet our data sizes continue to grow. We can fit more data inside memory
not because DRAM capacity has increased, but because we are simply adding
memory channels.

~~~
keanzu
Everyone knows that DRAM prices have been in a collapse since early this year,
but last week DRAM prices hit a historic low point on the spot market. Based
on data the Memory Guy collected from spot-price source InSpectrum, the lowest
spot price per gigabyte for branded DRAM reached $2.59 last week.

[https://thememoryguy.com/dram-prices-hit-historic-low/](https://thememoryguy.com/dram-prices-hit-historic-low/)

You've picked out the low points on the graph: 2012, 2016, and 2019; most of
the time DRAM has not been available at these prices. Now is definitely the
time to load up on RAM.

~~~
ksec
>the lowest spot price per gigabyte for branded DRAM reached $2.59 last week.

It would be better to reference this as a quote from the article, which was
written in November 2019. So not really _last week_.

>most of the time DRAM has not been available at these prices.

I did say price floor.

~~~
satanspastaroll
If it doesn't have to be available, I could sell one stick for $0.01 and
that'd be the new floor.

~~~
Dylan16807
You sure are giving _any_ benefit of the doubt there.

Lowest massively-available price, please and thank you.

~~~
satanspastaroll
If it's the lowest massively-available price, then this

>most of the time DRAM has not been available at these prices.

should make it not the floor. If the floor doesn't have to be available, then
at what exact point does it become relevant? Otherwise the price is simply
misleading.

~~~
Dylan16807
Any price that is massively-available becomes relevant and _stays_ relevant
forever.

A price has to be massively available _at a point in time_ to matter. It
doesn't have to be available _forever_ to matter. It feels like you're
conflating the two.

The price is on a downward trend, but there are hitches and setbacks. One fair
way to measure it is to use some kind of average. Another also-fair way to
measure it is to go by the lowest "real" price, where "real" means you can buy
something like a million sticks on the open market.

When we're talking about whether we should be impressed by a price, using the
lowest historical price for comparison makes sense.

(And just to be absolutely clear, you would need to adjust the metric for a
product that goes _up_ in price over time. But for something on a downward
trend, this metric works fine.)

------
ramraj07
Legit question: I have a dataset that's a terabyte in size spread over
multiple tables, but my queries often involve complex self joins and filters;
for various reasons, I'd prefer to be able to write my queries in SQL (or
spark code) because it's the most expressive system I've seen. What tool
should I use to load this dataset into RAM and run these queries?

~~~
chmod775
Most DB engines will use what RAM is available and even if they don't, your
OS's page cache will make sure stuff is fast anyways.

> What tool should I use to load this dataset into RAM and run these queries?

The question should really be: What tool should I use to make this fast?

Postgres can be pretty fast when used correctly and you can make your data
fit.

~~~
falcolas
Exactly this. DBs are really good at utilizing all the memory you give them.
The query planners might give you some fits when they try and use disk tables
for complicated joins, but you can work around them.

------
skwb
I remember back in the early 2010s that a large selling point of SAS (besides
the ridiculous point that R/Python were freeware and therefore could not be
trusted on important projects) was that it could chew through large data sets
that perhaps couldn't be moved into RAM (but maybe it takes a week or
whatever....).

This was a fairly salient point, and I remember circa 2012/2013 struggling to
fit large bioinformatics data into an older iMac with base R.

~~~
dredmorbius
SAS Institute have long claimed this. It's been provably bullshit for decades.

In practice, an awk script frequently ran circles around SAS for such
processing. On a direct basis, awk corresponds quite closely to the SAS DATA
Step (and was intended to be paired with tools such as S, the precursor to R,
for similar types of processing).

The fact that awk had associative arrays (which SAS long lacked, it's since
... come up with something along those lines) and could perform extremely
rapid sort-merge or pattern matches (equivalent to SAS data formats, which
internally utilise a b-tree structure) helped.

With awk, sort, uniq, and a few hand-rolled statistics awk libraries /
scripts, you can replace much of the functionality of SAS. And that's without
even touching R or gnuplot, each of which offers further vast capabilities.

And at an agreeable annual license fee.
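
To make the associative-array point concrete, here is the kind of group-by
aggregation being described, sketched in Python (an awk one-liner over the
same file would do the equivalent); the file and its columns are invented:

    # Aggregate by key with an associative array (a dict), the same pattern
    # awk's arrays enable. "sales.tsv" and its two columns are hypothetical.
    from collections import defaultdict

    totals = defaultdict(float)
    with open("sales.tsv") as f:
        for line in f:
            region, amount = line.rstrip("\n").split("\t")
            totals[region] += float(amount)

    for region, total in sorted(totals.items()):
        print(region, total)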

------
bluedino
$2,000 each for 128GB LRDIMMs, 48 of those will be $100,000 and then you'll
need another $20,000 to buy the rest of the server it goes in.

~~~
wmf
If the result is faster than a $250K Hadoop cluster then you're still ahead.

~~~
trhway
the way things are today, the Hadoop cluster will be just a bit faster on that
thing.

25 years ago I was suggesting that clients upgrade from 4MB to 6-8MB as it
improved their experience with our business software; these days I've already
suggested to a couple of customers that they upgrade from 6TB and 8TB
respectively ... as it would improve their experience with our business
software. What's funny is that customer experience with business software back
then was better than it is today.

~~~
Aperocky
Everything is way too bloated now.

------
jszymborski
"Can I afford to fit my data in RAM?" is a whole other site, I presume...

~~~
vb6sp6
"Can you afford to not put your data in RAM" is another one.

~~~
birdyrooster
"Did you even need all of this data?" is the one after that one

~~~
dredmorbius
I'd argue it's the first.

------
bordercases
Short answer: It fits in RAM if it's <= 12288 GB

~~~
slumdev
The original Twitter thread is funny. This site doesn't add anything. It looks
like SEO more than anything, and now that HN has linked to it, it's been
successful.

~~~
samspenc
Could you share the link to the original Twitter thread? I guess the link was
changed on HN before some of us got here.

~~~
imron
On the linked page, it says:

"Inspired by this tweet" which links to
[https://twitter.com/garybernhardt/status/600783770925420546](https://twitter.com/garybernhardt/status/600783770925420546)

~~~
makapuf
Note that this is from 2015, in the heyday of big data.

------
ghc
Fine print: But it might cost you $300K CapEx or $800K/yr OpEx. Hope you have
a budget!

~~~
lmilcin
You know what else costs? A humongous number of servers to run silly stuff to
orchestrate other silly stuff to autoscale yet more silly stuff to do stuff on
your stuff that could fit into memory and be processed on a single server (+
backup, of course).

Add to that a small army of people, because, you know, you need specialists in
a variety of professions just to debug all the integration issues between all
those components that WOULD NOT BE NEEDED if you just decided to put your
stuff in memory.

Frankly, the proportion of projects that really need to work on data that
could not fit in the memory of a single machine is very low. I work for one of
the largest banks in the world, processing most of its trades from all over
the world, and guess what, all of it fits in RAM.

~~~
threeseed
I really don't understand comments like this.

Yes, your company's data may fit in RAM. But does every intermediate data set
also fit in RAM? Because I've also worked at a bank and we had thousands of
complex ETLs, often needing tens to hundreds of intermediate sets along the
way. There is no AWS server that can keep all of that in flight at one time.

And what about your Data Analysts/Scientists? Can all of their random data
sets reside in RAM on the same server too?

~~~
andyjpb
Buy them a machine each.

$100K has always been "cheap" for a "business computer" and today you can get
more computer for that money than ever.

$100K of hardware (per year or so) is small-fry compared to almost every other
R&D industry out there. Just compare with the cost of debuggers, oscilloscopes
and EMC labs for electronic engineers.

~~~
threeseed
My company has over 400 Data Scientists and 1000s of Data Analysts.

Buy them a machine each at a cost of $40-60 billion?

Or would it make more sense to buy one Spark cluster and then share the
resources at a fraction of the cost?

~~~
Aeolun
I don’t get your numbers: getting one for 400 people is $40M; for thousands it
may be $100-999M.

Still expensive, but much less than 40 billion.

~~~
threeseed
He said $100k for each user but it's a dumb idea anyway.

We have a Spark cluster which supports all of those users for $10-$20k a
month.

------
zaltekk
The Amazon listing has an Azure instance type and links to Azure docs.

A quick search shows that you can get at least 24TiB from AWS:
[https://aws.amazon.com/ec2/instance-types/high-memory](https://aws.amazon.com/ec2/instance-types/high-memory)

~~~
louwrentius
Thanks for pointing out my sloppy mistake.

For the cloud options, I chose to only list virtual instances that can be spun
up on demand. The high-memory instances you link to are purpose-built.

On the other hand, it is true, they do exist.

------
scarejunba
You can fit 48 TB in an HPE MC990 X, though I'm pretty sure that's got one of
those NUMA architectures that SGI had with the UV 3000 or whatever.

I remember jokingly telling my team to spend the millions of dollars we spent
expanding our clusters on one of these instead and just process in RAM. I
honestly don't think I did the analysis to make sure it would actually be
better.

It was 'jokingly' because we couldn't afford three of these machines anyway.
The clusters had the property that we could lose some large fraction of nodes
and still operate, we could expand slightly sub-linearly, etc. which are all
lovely properties.

It would have been neat, though. Ahhhh imagine the luxury of just loading
things into memory and crunching the whole thing in minutes instead of hours.
Gives me shivers.

~~~
andyjpb
This is why IBM have historically done so well.

You didn't have to have all the capital. They'd rent you the machine on a long
lease and you could say you'd bought a million dollar computer.

Sun tried to get into similar markets but it was always a tougher deal on
minis and micros.

------
rbanffy
IIRC, the largest IBM z15 can have up to 40 TiB of RAM.

The website needs updating.

~~~
wmf
And Superdome Flex supports 48 TB.

~~~
rbanffy
These seem to be the least boring x86 machines. At 16 sockets the NUMA
topology must be interesting.

------
justizin
ah and it is so fucking lovely to be hired by a company which has been running
everything on a single machine for 2 years.

------
dmos62
I just upgraded my laptop to 3 GB. Feeling a bit behind on the times.

------
randyzwitch
While pithy, the implication that you are going to process 12 TB of data in
RAM using mostly single-threaded tools doesn't reflect reality.

~~~
lotyrin
Where exactly does that implication come from? Are you from a world where you
need a map reduce framework + cluster to have parallelism of any kind?

~~~
randyzwitch
No, but I frequently see people implying that you can do your data science in
Python and R as long as you can fit the data in RAM. As you mention, it's not
RAM that's the limiting factor for larger data volumes, it's finding tools
that exploit parallelism.

~~~
lotyrin
I have done plenty of parallel work in python and R, so I'm still not sure
what you mean.

~~~
randyzwitch
Yes, that's my point. It's too simplistic to say "well, the data fits in RAM";
you have to add parallelism to make the workload tolerable. In the past, some
people have done that using MapReduce or Spark, GNU parallel, or just by
writing parallel code in their favorite language. But RAM by itself isn't the
only limiting factor in whether a problem is solvable in a reasonable amount
of time.
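
A small illustration of that last option, sketched in Python with the standard
library (the "part-*.csv" shards and the per-file work are made up):

    # Fan a scan out over all cores: each worker processes one shard.
    import glob
    from multiprocessing import Pool

    def count_rows(path):
        with open(path) as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        shards = glob.glob("part-*.csv")
        with Pool() as pool:            # one worker per CPU by default
            print(sum(pool.map(count_rows, shards)))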

------
EdwardDiego
Telling me that my (example) 24TB data set fits into RAM, because it fits into
24TB on an AWS instance designed for SAP that's so expensive its price is on
inquiry, isn't overly helpful.

May I suggest that a) they consider that applications need RAM too, and b)
that if the price is POI, it's probably better to mention "But you know, 188
m5.8xlarges might be cheaper".

------
1MachineElf
I've no idea if the intended audience here would ever run their workloads on
Solaris, but like the IBM POWER systems, Oracle & Fujitsu SPARC servers also
max out at 64TB of RAM. I didn't see those included here.

------
sriku
The question of whether my storage mechanism fits my budget is more relevant,
no? It costs less than 4 cents per hour to store 1TB on S3, but an X1 with 1TB
of RAM costs $13 or so per hour on AWS. So the issue is whether what you pay
is worth the result you're computing.

------
staticassertion
Pretty cool that single boxes have > 10 terabytes of RAM.

I would definitely imagine that most workloads rarely need more than a few to
a few hundred TB in memory, since you may have petabytes of data but you
probably touch very little of it.

------
H8crilA
One of the best things you can often do for latency/throughput improvement is
to just mlock() the data. Yes, it fits. Get a more expensive machine and 5x
the throughput in a single day with a config change.
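
A rough, Linux-only sketch of that idea (mlockall(2) called via ctypes so that
every current and future page of the process stays resident; the data file is
hypothetical, and you need a sufficient RLIMIT_MEMLOCK or CAP_IPC_LOCK):

    # Pin this process's pages so the kernel will not swap them out.
    import ctypes
    import ctypes.util

    MCL_CURRENT, MCL_FUTURE = 1, 2          # from <sys/mman.h> on Linux
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        raise OSError(ctypes.get_errno(), "mlockall failed")

    data = open("dataset.bin", "rb").read()  # now guaranteed resident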

------
iaabtpbtpnn
Sure, my data fits in RAM if you take one of those example machines, which we
have, and which are currently hosting dozens of production VMs, and instead
dedicate the whole thing to my database. I'd love that, but it's never going
to happen.

~~~
takeda
It's still way more efficient to use it for that database than running a dozen
Hadoop nodes as VMs.

~~~
sdenton4
We do what we must because we can.

------
deepsun
There is AWS and Azure, but no GCP. Why?

------
panzagl
I wonder how Abe Vigoda is doing...

------
hkt
Honestly thought this would be a page that says "yes"

------
lunixbochs
Looks like I'm all set with my ~12PB dataset fitting in RAM.

~~~
jkaptur
12TiB, not PiB!

Hope you read this before you spent $50,000,000/month (although you might be
able to negotiate a discount).

------
known
Reminds me of
[https://en.wikipedia.org/wiki/List_of_Ultrabook_models](https://en.wikipedia.org/wiki/List_of_Ultrabook_models)

------
einpoklum
> Does my data fit in RAM?

1. If you have to ask, then either it doesn't now, or it doesn't sometimes. So
assume it doesn't.

2. If you can use a cluster, then maybe.

3. In some senses, it doesn't matter. How so? Reading from and writing to RAM
is very slow, latency-wise, for today's processors. If I can bend the truth a
little, it's a bit like a fast SSD. So, if you can raise the bandwidth to disk
enough, it becomes kind of comparable. With 16 PCIe 4.0 lanes (roughly 2
GB/sec per lane before protocol overhead) you get roughly 24 GB/sec effective
bandwidth, which is roughly half of your memory bandwidth. Now it's true that
in real-life systems it's usually just 4 lanes, but it's very doable to change
that with a nice card.

4. DIMM-form-factor non-volatile memory may increase memory sizes much more.

