Does my data fit in RAM? (yourdatafitsinram.net)
216 points by louwrentius on Feb 12, 2020 | 162 comments

Some pedantry...

Raw RAM space is not the issue... it's indexing and structure of the data that makes it process-able.

If you just need to spin through the data once, there's no need to even put all of it in RAM - just stream it off disk and process it sequentially.
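To make the single-pass case concrete, here is a minimal sketch. It assumes a hypothetical text file with one number per line; the file name and the `streaming_sum` helper are illustrative only, not anything from the linked site.

```python
import os
import tempfile


def streaming_sum(path):
    """Sum one number per line without loading the whole file into memory."""
    total = 0.0
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object yields one line at a time
            line = line.strip()
            if not line:
                continue
            total += float(line)
            count += 1
    return total, count


# Hypothetical demo data: a small temp file standing in for a huge one.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(i) for i in range(1, 101)))
    demo_path = tmp.name

total, count = streaming_sum(demo_path)
os.remove(demo_path)
print(total, count)
```

Memory use stays roughly constant regardless of file size, since only the current line and two accumulators are held at any time.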

If you need to join the data, filter, index, query it, you'll need a lot more RAM than your actual data. Database engines have their own overhead (system tables, query processors, query parameter caches, etc.)

And, this all assumes read-only. If you want to update that data, you'll need even more for extents, temp tables, index update operations, etc.

I often use RAM disks for production services. I'm sure that I shouldn't. It feels very lazy, and I know that I'm probably missing out on the various "right" ways to do things.

But it works so well and it's so easy. It's a really difficult habit to kick.

Obligatory joke: "Oh boy, virtual memory! Now I can have a really big RAM disk!"

Funny story: about 6 years ago we got an HP DL980 server with 1TB of memory to move off an Itanium HP-UX server. The test database was Oracle and about 600GB in size. We loaded the data, and they had a query test they would run. The first run took about 45 minutes (which was several hours faster than the HP-UX box). They made changes, and all the remaining runs took about 5 minutes to complete. Finally someone asked me, and I manually dropped the buffers and cache, and it was back to about 45 minutes.

Their changes did not do anything, everything was getting cached. It was cool, but one needs to know what is happening with their data. I am just glad they asked before going to management saying their tests only took 5 minutes.

What an unfortunate cache-filling algorithm, though. With eight drive slots, 40 minutes of IO is about 31 megabytes per disk per second.
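For anyone checking that figure, a quick sketch of the arithmetic. The 600GB size comes from the parent comment; the eight drives and the 40 minutes of I/O (45 minutes cold minus roughly 5 minutes cached) are this comment's own assumptions.

```python
# Figures taken from the comments above; drive count and I/O time are
# the commenter's assumptions, not measured values.
db_size_gb = 600
drives = 8
io_minutes = 40

per_drive_mb_per_s = db_size_gb * 1000 / drives / (io_minutes * 60)
print(f"{per_drive_mb_per_s:.2f} MB/s per drive")  # 31.25 MB/s per drive
```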

Those were some poorly built systems. I worked on probably 10 of them and they not-infrequently had .. major issues. HP's support model was to send a tech out with 2 sticks of RAM, and try them in different places to try to trace memory failures... across 4 (or 8?) cassettes, and 64 sticks of ram, and 20+ minute POST times.

We eventually had one server entirely replaced at HP's cost after yelling at them long enough, and that one never worked well enough to ever use in production, either. I'd say we had maybe a 70-80% success rate with those servers. They were beasts, though, with 4TB of RAM as I recall, and 288 cores.

Even more importantly, does your data have to fit in RAM?

There are tons of problems that need to process large data, but touch each item just once (or a few times). You can go a really long way by storing them in disk (or some cloud storage like S3) and writing a script to scan through them.

I know, pretty obvious, but somehow escapes many devs.

There's also the "not all memory is RAM" trick: plan ahead with enough swap to fit all the data you intend to process, and just pretend that you have enough RAM. Let the virtual memory subsystem worry about whether or not it fits in RAM. Whether this works well or horribly depends on your data layout and access patterns.

Don't even need to do that. Just mmap it and the virtual memory system will handle it.
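A minimal sketch of the mmap approach using Python's standard `mmap` module. The demo file and its contents are hypothetical; a real use would map a dataset much larger than this.

```python
import mmap
import os
import tempfile

# Hypothetical demo file; in practice this would be a dataset larger than RAM.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * 10_000 + b"needle" + b"b" * 10_000)
    demo_path = tmp.name

with open(demo_path, "rb") as f:
    # Length 0 maps the whole file; the OS faults pages in lazily.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offset = mm.find(b"needle")      # scans without an explicit read()
        window = mm[offset:offset + 6]   # slicing copies only these bytes

os.remove(demo_path)
print(offset, window)
```

The program never decides what stays resident; the kernel's page cache does, which is both the convenience and the loss of control discussed in the replies.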

Interesting. Can you provide some examples of where this is the correct approach?

This is how mongodb originally managed all its data. It used memory mapped files to store the data and let the underlying OS memory management facilities do what they were designed to do. This saved the mongodb devs a ton of complexity in building their own custom cache and let them get to market much faster. The downside is that since virtual memory is shared between processes, other competing processes could potentially mess with your working set (pushing warm data out, etc). The other downside is that since you're turning over the management of that “memory” to the OS, you lose fine grained control that can be used to optimize for your specific use case.

Except nowadays with Docker / Kubernetes you can safely assume the DB engine will be the only tenant of a given VM / pod / whatever, so I think it’s better to let the OS do memory management than to fight it.

Might not be exactly the same use case, but a simple example is compiling large libraries on constrained/embedded platforms. Building OpenCV on a Pi certainly used to require adding a gig of swap.

With the Varnish HTTP cache the authors started out with a very "mmap or bust" type of approach, but later added a malloc-based backend.

Escapes many devs? Really? I used to work with biologists who thought they needed to run their scripts on a supercomputer because the first line read their entire file into an array. But if I saw someone who calls themselves a "dev" doing this I'd consider them incompetent.

I once got into an argument with a senior technical interviewer because he wanted a quick solution of an in-memory sort of an unbounded set of log files.

Needless to say I wasn't recommended for the job, and it taught me a valuable lesson: if you don't first give them what they want, you can't give them what they actually need.
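For reference, the usual alternative to an in-memory sort on unbounded input is an external merge sort: sort fixed-size chunks, spill each sorted run to disk, then merge the runs lazily. A sketch under assumptions: the function names and chunk size are hypothetical, and lines are assumed to contain no embedded newlines.

```python
import heapq
import itertools
import os
import tempfile


def _read_run(path):
    """Yield the lines of one sorted run, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def external_sort(lines, chunk_size=1000):
    """Yield lines in sorted order, holding at most chunk_size lines in
    memory while chunking and one line per run while merging."""
    run_paths = []
    it = iter(lines)
    try:
        while True:
            chunk = sorted(itertools.islice(it, chunk_size))
            if not chunk:
                break
            with tempfile.NamedTemporaryFile(
                "w", delete=False, encoding="utf-8"
            ) as run:
                run.writelines(line + "\n" for line in chunk)
            run_paths.append(run.name)
        # heapq.merge lazily merges already-sorted iterables.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


# Hypothetical log lines with sortable timestamp prefixes.
logs = [f"2020-02-12T00:00:{s:02d} event" for s in (42, 7, 19, 3, 55, 31)]
result = list(external_sort(logs, chunk_size=2))
print(result[0], result[-1])
```

One open file per run is held during the merge, so a very large input would need a bigger chunk size or a multi-level merge.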

Plenty of devs that don't do any sort of file streaming, say those who started with Game Maker or another specialized domain

I've spent a lot of time writing Spark code, and its ability to store data in a column oriented format in RAM is the only reason why - disk is goddamned slow.

As soon as you're touching it more than once, sticking it in RAM upon reading makes everything much faster.

The problem is that DRAM prices haven't dropped one bit.

The lowest price floor per GB has been similar for the past decade: roughly $2.8/GB in 2012, 2016, and 2019. And all DRAM manufacturers have been enjoying a very profitable period.

And yet our data sizes continue to grow. We can fit more data in memory not because DRAM capacity has increased, but because we are simply adding memory channels.

Everyone knows that DRAM prices have been in a collapse since early this year, but last week DRAM prices hit a historic low point on the spot market. Based on data the Memory Guy collected from spot-price source InSpectrum, the lowest spot price per gigabyte for branded DRAM reached $2.59 last week.


You've selected out the low points on the graph: 2012, 2016, and 2019, most of the time DRAM has not been available at these prices. Now is definitely the time to load up on RAM.

7.5 percent drop since 2012 (using your numbers plus the OP’s number).

That’s not what we in computing refer to as a “collapse”.

> Now is definitely the time to load up on RAM

And it is predicted to climb back up this year, due to manufacturers dropping wafer starts at a point in time when a large launch of next-gen consoles is drastically increasing consumption.

Nope, definitely no collusion there. /s

> Everyone knows that DRAM prices have been in a collapse

I'm afraid I had no idea that had happened at all.

> since early this year,

Since..when?..now? Feb 12th is pretty early in the year AFAIAC

>the lowest spot price per gigabyte for branded DRAM reached $2.59 last week.

It would be better to reference this as quoted from the article which was written in November 2019. So not really last week

>most of the time DRAM has not been available at these prices.

I did say price floor.

If it doesn't have to be available, I could sell one stick for $0.01 and that'd be the new floor.

You sure aren't giving any benefit of the doubt there.

Lowest massively-available price, please and thank you.

If it's lowest massively-available price then this

>most of the time DRAM has not been available at these prices.

should make it not the floor. If the floor doesn't have to be available, then what's the exact point at which it becomes relevant? Otherwise the price is simply misleading.

Any price that is massively-available becomes relevant and stays relevant forever.

A price has to be massively available at a point in time to matter. It doesn't have to be available forever to matter. It feels like you're conflating the two.

The price is on a downward trend, but there are hitches and setbacks. One fair way to measure it is to use some kind of average. Another also-fair way to measure it is to go by the lowest "real" price, where "real" means you can buy something like a million sticks on the open market.

When we're talking about whether we should be impressed by a price, using the lowest historical price for comparison makes sense.

(And just to be absolutely clear, you would need to adjust the metric for a product that goes up in price over time. But for something on a downward trend, this metric works fine.)

Buying the RAM is probably cheaper than buying database licenses and/or the machines to run them and/or the time you spend making things fast enough when using them.

If it was just a matter of adding more memory people would. A few thousand, or tens of thousands of dollars aren’t much to organisations that have that much data.

The trouble is that there are limits to how much memory you can fit on a motherboard.

And how much you can afford. My partner's scientific research group recently upgraded their shared server used for analysis ... to one with 32 GB of RAM and 8 TB of storage. I think they could have done better, personally, but it's telling how thin budgets are stretched in a lot of real world cases.

Something like this [1] gets you to 4TB of ram. Dual socket Zen2 Epyc gets you 8 TB ram. A lot of stuff fits in 8 TB of ram.

[1] https://www.supermicro.com/en/Aplus/system/1U/1113/AS-1113S-...

Just wait for China to enter the market in this decade...

You mean 2 years ago? https://www.anandtech.com/show/12455/chinese-xian-uniic-semi...

It doesn't seem to have brought prices down much.

Legit question: I have a dataset that's a terabyte in size spread over multiple tables, but my queries often involve complex self joins and filters; for various reasons, I'd prefer to be able to write my queries in SQL (or Spark code) because it's the most expressive system I've seen. What tool should I use to load this dataset into RAM and run these queries?

There are a few steps to consider before you are loading data into ram.

Can you partition the data in any useful way? For example if queries use separate ranges of dates, then you can partition data so that queries only need to touch the relevant date range. Can you pre-process any computations? Sometimes tricky things done within the context of multiple joins can be done once and written to a table for later use. Can you materialize any views? Do you have the proper indexes set up for your joins and filters? Are you looking at execution plans for your queries? Sometimes small changes can speed up queries by many orders of magnitude.

Smart queries + properly structured data + a well tuned postgres DB is an incredibly powerful tool.
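To illustrate the "look at execution plans" point, here is a small sketch using Python's built-in `sqlite3` as a stand-in (the advice above is about Postgres, where the equivalent is `EXPLAIN`). The `events` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events (day, value) VALUES (?, ?)",
    [(f"2020-01-{d:02d}", float(d)) for d in range(1, 29) for _ in range(10)],
)

query = (
    "SELECT sum(value) FROM events "
    "WHERE day BETWEEN '2020-01-10' AND '2020-01-12'"
)


def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)  # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_events_day ON events (day)")
plan_after = plan(query)   # with the index: a range search on day
answer = conn.execute(query).fetchone()[0]

print(plan_before)
print(plan_after)
print(answer)
```

The exact plan wording varies between SQLite versions, so treat the printed text as indicative; the point is that the same query goes from a full scan to an index search.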

Can I set up efficient indexes on parquet data to use with Spark, or is it necessary to use a DB?

Most DB engines will use what RAM is available and even if they don't, your OS's page cache will make sure stuff is fast anyways.

> What tool should I use it to load this dataset on RAM and run these queries?

The question should really be: What tool should I use to make this fast?

Postgres can be pretty fast when used correctly and you can make your data fit.

Exactly this. DBs are really good at utilizing all the memory you give them. The query planners might give you some fits when they try and use disk tables for complicated joins, but you can work around them.

Both mysql and pgsql bypass the page cache if they can and maintain their own page caches. You have to do this, otherwise you’re double caching! That is, you’d have your own page cache, which you need to manage calls to read() and to know when to flush pages, while the OS would also have the same pages in its own cache.

(mongodb I believe uses direct mmap access instead of a pagecache, and lmdb does this as well)

You can get a box with 4TB of ram on EC2 for $4/hr spot, so copy your data into /dev/shm and go hog wild.

For lots of databases, most of their time is spent locking and copying data around, so depending on your workload you might get significant speedups in Pandas/Numpy if it's just you doing manipulations, and there are multicore just-in-time compilers for lots of Pandas/Numpy operations (like Numba/Dask/etc).

If you have lots of weird merging criteria and want the flexibility of SQL I'd say use a modern Postgresql with multicore selects on that 4TB box.

How long does it take to copy your data in? And what’s the bandwidth cost involved?

People like to talk about the elasticity of compute, but startup is not free (or even cheap in most cases).

If your data is in S3, my experience is that you can push ~20-40MB/core/sec on most instances.

OP is probably talking about an x1e.32xlarge. According to Daniel Vassallo's S3 benchmark [1], it can do about 2.7GB/sec.

So your 4TB DB might take ~30min to fetch.

Bandwidth is free, you'd pay $2 for the 30 min of compute, and some fractions of pennies for the few hundred S3 requests.

[1]: https://github.com/dvassallo/s3-benchmark
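A quick sketch of that estimate, using only numbers quoted in this thread (4TB dataset, 2.7GB/sec, $4/hr spot) and assuming sustained throughput for the whole transfer. It comes out slightly under the rounded 30 minutes and $2.

```python
dataset_tb = 4            # dataset size from the comment above
throughput_gb_s = 2.7     # S3 throughput quoted from the benchmark
spot_usd_per_hour = 4.0   # spot price quoted earlier in the thread

seconds = dataset_tb * 1000 / throughput_gb_s
minutes = seconds / 60
compute_cost = spot_usd_per_hour * seconds / 3600
print(f"{minutes:.1f} min, ${compute_cost:.2f}")  # 24.7 min, $1.65
```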

It's worth noting that any data exported out of AWS will be billed at $0.09/GB, or $90/TB.

Currently I have it loaded on redshift with as much optimization as possible, and the queries are far more analytical than end-user like (often having to self join on the same dataset). This works okay, but doesn't scale with more than a handful users at a time. I'll probably run some tests with the postgres suggestion but curious if this is still a better alternative or not

[Disclaimer: I worked on BigQuery a couple lives ago]

I'd give Google BigQuery a shot. Should work fast [seconds] and scale seamlessly to [much] larger datasets and [many] more users. For a 1 TB dataset, I have a hard time imagining crafting a slow query. Maybe something outlandish like 1000[00?] joins. They also have an in-memory "BI Engine" offering, alas limited to 50GB max.

On premise, there is Tableau Data Engine. I don't think they offer a SQL interface, you have to buy into the entire ecosystem.

Long shot: I've been working on "most expressive query system over multiple tables" as an offshoot of some recent NLP work. Your use case piqued my interest. I'd love to help / understand it better. My contact is in my profile.

"My contact is in my profile"

This makes for a good laugh: Queries for the masses. Contact: f'info@${user}.com'

Every major database will load the hottest data into RAM, where the scope of "hottest" broadens to whatever amount will fit in RAM. A small percentage of them require you to confirm how much RAM they can use for this cache.

Putting the data on a ramdisk just becomes entirely redundant because it's still going to create a second memory cache that it uses.

Many operations do a local cache warming by running the common queries over the database before it is brought online for processing. As a secondary note, people often under-estimate the size of their data because they don't account for all of the keys, indexes and relationships that also would be memory cached in an ideal situation.

You don't need to store everything into RAM to get fast results. Data warehouse relational databases are designed exactly for this kind of fast SQL analysis over extremely large datasets. They use a variety of techniques like vectorized processing on compressed columnar storage to get you quick results.

Google's BigQuery, AWS Redshift, Snowflake (are all hosted), or MemSQL, Clickhouse (to run yourself). Other options include Greenplum, Vertica, Actian, YellowBrick, or even GPU-powered systems like MapD, Kinetica, and Sqream.

I recommend BigQuery for no-ops hosted version or MemSQL if you want a local install.

Some of those aren't designed for data that doesn't fit in RAM. They even have it in their name (MemSQL for example).

All of those support datasets that don't fit in RAM, or else they would be useless at data warehousing.

MemSQL uses rowstores in memory combined with columnstores on disk. Both can be joined together seamlessly and the latest release will automatically choose and transition the table type for you as data size and access patterns change.

You can just write Spark SQL, set the executor memory to whatever the machine is and not worry about whether it's in RAM or not.

Spark will naturally use RAM first and then disk as needed.

You should use the cache or persist call on the spark data frame/dataset. Persist gives you more control.

This is what most people in my org do, this is orders of magnitude slower than running queries on the same dataset on redshift with optimized presorting and distribution. Redshift doesn't scale for tens or hundreds of parallel users though, so looking for options

I'm about 75% joking: restore a new cluster from a snapshot for horizontal scaling. They did launch a feature along those lines, because my joke suggestion probably doesn't scale organizationally: take a look at Concurrency Scaling. This is also a fundamental feature of Snowflake with the separation of compute and storage.

Exactly the same game plan for us now: evaluating Snowflake, but I wanted to check whether there's a fundamentally different paradigm that can be much faster and more scalable.

I'll be That Guy and ask: what are you doing with the data, and can you change your processing or analysis to reduce the amount of data you need to touch?

In my experience, it's nearly always the case that pulling in all data is not necessary, and that thinking through your goals, data, and processing can often reduce both the amount of data touched and the processing run on it massively. Look up the article on why GNU grep is so fast for a bunch of general tricks that can be employed, many of which may apply to data processing generally.


1. Random sampling. The Law of Large Numbers applies and affords copious advantages. There are few problems a sample of 100 - 1,000 cannot offer immense insights on, and even if you need to rely on larger samples for more detailed results, these can guide further analysis at greatly reduced computational cost.

2. Stratified sampling. When you need to include exemplars of various groups, some not highly prevalent within the data.

3. Subset your data. Divide by regions, groups, accounts, corporate divisions, demographic classifications, time blocks (day, week, month, quarter, year, ...), etc. Process chunks at a time.

4. Precompute summary / period data. Computing max, min, mean, standard deviation, and a set of percentiles for data attributes (individuals, groups, age quintiles or deciles, geocoded regions, time series), and then operating on the summarised data, can be tremendously useful. Consider data as an RRD rather than a comprehensive set (may apply to time series or other entities).

Creating a set of temporary or analytic datasets / tables can be tremendously useful, as much fun as it is to write a single soup-to-nuts SQL query.

5. Linear scans typically beat random scans. If you can seek sequentially through data rather than mix-and-match, so much the better. With SSD this advantage falls markedly, but isn't completely erased. For fusion type drives (hybrid SSD/HDD) there can still be marked advantages.

6. Indexes and sorts. The rule of thumb I'd grown up with in OLAP was that indexes work when you're accessing up to 10% of a dataset, otherwise a sort might be preferred. Remember that sorts are exceedingly expensive.

If at all possible, subset or narrow (see below) data BEFORE sorting.

7. Hash lookups. If one table fits into RAM, then construct a hash table using that (all the better if your tools support this natively -- hand-rolling hashing algorithms is possible, but tedious), and use that to process larger table(s).

8. "Narrow" the data. Select only the fields you need. Most especially, write only the fields you need. In SQL this is as simple as a "SELECT <fieldlist> FROM <table>" rather than "SELECT * FROM <table>". There are times you can also reduce total data throughput by recoding long records (say, geocoded names: there are a few thousand place names in the US, using Census TIGER data, vs. placenames which may run to 22 characters ("Truth or Consequences", in NM), or even longer for international placenames). You'll need a tool to remap those later. For statistical analysis, converting to analysis variables may be necessary regardless.

The number of times I've seen people dragging all fields through extensive data is ... many.
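As a sketch of the "Random sampling" item in the list above: reservoir sampling draws a uniform sample in a single pass over data of unknown length, so the full dataset never has to be in memory. The function name and parameters are illustrative.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Uniform random sample of k rows from an iterable of unknown length,
    in one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = row       # replace with probability k / (i + 1)
    return sample


sample = reservoir_sample(range(1_000_000), k=1000, seed=0)
print(len(sample))
```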

Some of this can be performed in SQL, some wants a more data-related language (SAS DATA Step and awk are both largely equivalent here).

Otherwise: understanding your platform's storage, memory, and virtual memory subsystems can be useful. Even as simple a practice as running "cat mydatafile > /dev/null" can often speed up subsequent processing, because it pulls the file into the OS page cache.
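The "Hash lookups" and "Narrow the data" items can be sketched together: build a dict from the small table, then stream the large one once, reading only the fields needed. The CSV contents and column names here are hypothetical stand-ins for real files.

```python
import csv
import io

# Hypothetical small dimension table that fits in RAM.
places_csv = "place_id,name\n1,Truth or Consequences\n2,Santa Fe\n"
# Hypothetical large fact table; imagine this streamed from disk.
facts_csv = "place_id,amount,unused_wide_field\n1,10,x\n2,5,y\n1,7,z\n3,1,w\n"

# Build phase: load the small table into a dict keyed on the join column.
place_names = {
    row["place_id"]: row["name"]
    for row in csv.DictReader(io.StringIO(places_csv))
}

# Probe phase: stream the large table once, using only the needed fields.
totals = {}
for row in csv.DictReader(io.StringIO(facts_csv)):
    name = place_names.get(row["place_id"])
    if name is None:
        continue  # inner-join semantics: drop rows with no match
    totals[name] = totals.get(name, 0) + int(row["amount"])

print(totals)
```

Only the small table and the running totals are held in memory; the large table is never materialised or sorted.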

in memory sqlite is pretty fast

Yes, but SQLite is probably the least expressive SQL dialect there is. If you're choosing SQL because of its expressiveness, you probably aren't thinking of a dialect with only 5 types (including NULL).

I tried this once, creating an index on a 20 billion row table isn't fast :/

I haven't tried this and don't know if it would work -- but depending on the shape of your data and queries, you might not need certain indices. That is, for some workloads (especially if you're thinking of spot instances), it might be overall faster to skip the indexing and allow the query to do a full table scan. It sounds like maybe you never tried the query without the index, so I'm curious to know if there's any weight behind this theory.

If you run on MySQL/InnoDB you can set innodb_buffer_pool_size=1000G and it should cache your data after the first query.


I am not a big-data guy but wouldn't it be along the lines of A) get a big honking server B) fire up "X" SQL server C) Allocate 95-98% of the RAM to DB cache?

A single terabyte is a few orders of magnitude away from needing big-data-anything. You could probably work with that just fine on your average 64GB RAM desktop with an SSD.

Another poster already replied with a decent refutation of this claim, but a single pass over a TB of data is often not enough for 'big data' use cases, and at tens of minutes per pass, it may very well be infeasible to operate on such a dataset with only 64GB of memory.

In the machine learning world, some of the algorithms that are industrial workhorses will require you to have your dataset in memory (ie: all the common GBM libraries), and will walk over it lots of times.

You may be able to perform some gymnastics and allow the OS to swap your terabyte+ dataset around inside your 64GB of RAM, but the algorithms are now going to take forever to complete as you thrash your swap constantly while the training algorithm is running.

tl;dr - a terabyte dataset in the machine learning context may very well need that much RAM plus some overhead in terms of memory available to be able to train a model on the dataset.

A small computer with 1 SSD will take at least 10-20 minutes to make a pass over 1TB of data, if everything is perfectly pipelined.

Samsung claims their 970 Pro NVMe can read 3.5GB/s sequentially. That's about 300 seconds or 5 minutes per TB.

It can't though.

It can, and their fastest enterprise SSD can write at that speed too, or do sequential reads at 7-8GB/s, or random reads at over 4 GB/s.

I just ran `time cp /dev/nvme0n1 /dev/null` on the 1TB 970 Pro. The result:

  real    4m50.724s
  user    0m2.001s
  sys     3m10.282s
So with literally zero optimization effort, we've hit the spec (and saturated a PCIe 3.0 x4 link).

Impressive performance for a $345 consumer grade SSD.
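Converting that `time` output into throughput, as a rough sketch. It assumes the "1TB" drive is exactly 10**12 bytes, which the thread does not state, so the result is approximate.

```python
# Wall-clock time reported by `time` above: 4m50.724s.
elapsed_s = 4 * 60 + 50.724
# Assumption: "1TB" is treated as 10**12 bytes; the real device
# capacity is not given in the thread.
drive_bytes = 10**12

throughput_gb_s = drive_bytes / elapsed_s / 1e9
print(f"{throughput_gb_s:.2f} GB/s")  # 3.44 GB/s
```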


That's impressive and all, but any fragmentation or non-linear access and performance will fall off a cliff

You'd probably be surprised. For reads, there are tons of drives that will saturate PCIe 3.0 x4 with 4kB random reads. Throughput is a bit lower because of more overhead from smaller commands, but still several GB/s. Fragmentation won't appreciably slow you down any further, as long as you keep feeding the drive a reasonably large queue of requests (so you do need your software to be working with a decent degree of parallelism).

What will cause you serious and unavoidable trouble is if you cannot structure things to have any spatial locality. If you only want one 64-bit value out of the 4kB block you've fetched, and you'll come back later another 511 times to fetch the other 64b values in that block, then your performance deficit relative to DRAM will be greatly amplified (because your DRAM fetches would be 64B cachelines fetched 8x each instead of 4kB blocks fetched 512x each).
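The numbers in that worst case work out as follows. This is just the arithmetic from the comment; the variable names, and calling the ratio "amplification", are illustrative labels rather than terms from the thread.

```python
value_bytes = 8          # one 64-bit value
ssd_block_bytes = 4096   # 4kB block fetched from the SSD
cacheline_bytes = 64     # 64B cache line fetched from DRAM

values_per_block = ssd_block_bytes // value_bytes   # 512
values_per_line = cacheline_bytes // value_bytes    # 8

# Worst case: each value in a unit is fetched on a separate visit, so the
# whole unit is transferred once per value it contains.
ssd_amplification = values_per_block
dram_amplification = values_per_line
relative_penalty = ssd_amplification // dram_amplification
print(ssd_amplification, dram_amplification, relative_penalty)  # 512 8 64
```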

Option A is the best imo. I worked on many SQL DBs where the rule was to fit it into RAM. Option C will bite you in the ass eventually. The kernel and your other processes need some space to malloc, and you don't want to page in/out.

Having 4/16TiB servers or "memory db servers" as I thought of them solved a lot of problems outright. Still need huge i/o but less of it depending on your workload.

I'm pretty sure that was supposed to be a list of steps, not a list of options.

> The kernel and your other processes need some space to malloc, and you dont want to page in/out.

Some space, like "most of 20-50 gigabytes"?

You want to take into account how exactly the space used by joins will fit into memory, but 2-5% of a terabyte is an extremely generous allocation for everything else on the box.

I remember back in the early 2010's that a large selling point of SAS (besides the ridiculous point that R/python were freeware and therefore cannot be trusted on important projects ) was that it can chew through large data sets that perhaps couldn't be moved into RAM (but maybe it takes a week or whatever....).

This was a fairly salient point, and I remember circa 2012/2013 struggling to fit large bioinformatics data into an older iMac with base R.

SAS Institute have long claimed this. It's been provably bullshit for decades.

In practice, an awk script frequently ran circles around processing. On a direct basis, awk corresponds quite closely to the SAS DATA Step (and was intended to be paired with tools such as S, the precursor to R, for similar types of processing).

The fact that awk had associative arrays (which SAS long lacked, it's since ... come up with something along those lines) and could perform extremely rapid sort-merge or pattern matches (equivalent to SAS data formats, which internally utilise a b-tree structure) helped.

With awk, sort, uniq, and a few hand-rolled statistics awk libraries / scripts, you can replace much of the functionality of SAS. And that's without even touching R or gnuplot, each of which offers further vast capabilities.

And at an agreeable annual license fee.

I have not tested it myself but now there is a disk.frame ( https://github.com/xiaodaigh/disk.frame ). As far as I know, other options exist too.

$2,000 each for 128GB LRDIMMs, 48 of those will be $100,000 and then you'll need another $20,000 to buy the rest of the server it goes in.

If the result is faster than a $250K Hadoop cluster then you're still ahead.

The way things are today, the Hadoop cluster will be just a bit faster on that thing.

25 years ago I was suggesting that clients upgrade from 4MB to 6-8MB, as it improved their experience with our business software. These days I've already suggested that a couple of customers upgrade from 6TB and 8TB respectively ... as it would improve their experience with our business software. What's funny is that the customer experience with business software back then was better than today.

Everything is way too much bloat now

Yeah, except Hadoop provides redundancy, easier backups, turnkey solutions for governance and compliance and easier scaling

I’m no fan of distributed systems that sit idle at 2% utilisation when a single node would do : BUT, reducing it down to “cost” and “does it fit in RAM” is way too reductive

And reducing it to a single machine means I have an order of magnitude less time spent on setting it up, maintaining it, adapting all my code, debugging, etc.

These days I’m firmly of the opinion that if you can make it run on a single machine, you absolutely should, and you don’t get more machines until you can prove you can fully utilise one machine.

Account for parent's concerns for redundancy, backups, scalability.

If you store data in something like AWS S3, "redundancy" and "backup" is handled by AWS. And scalability is a moot point if a single box can handle your load.

When a single box cannot handle your load, you start one more, and then you start worrying about scalability.

(Of course, YMMV, I guess there are cases where "one box" -> "two boxes" requires an enormous quantum jump, like transactional DB. For everything else, there's a box.)

What about services where background load causes unacceptable latencies due to other bottlenecks on that single machine? What if you are IO bound and you couldn't possibly use all of the CPU or memory on any system which was given to you?

Waiting for a machine to be “fully utilised” before scaling just shows your lack of experience at systems engineering.

Do you know how quickly disks fail if you force them at 100% utilisation, 24/7?

Then what happens when this system dies? How much downtime do you have because you have to replace the hardware and then get your hundreds-of-gigabytes dataset back in RAM and hot again?

I’ve worked as a lead developer at companies where I’ve been personally responsible for hundreds of thousands of machines, and running a node to 100% and THEN thinking about scaling is short sighted and stupid

I don't mean "let your production systems spool up to point where you're maxing out a single machine" - that would be exceedingly silly.

I mean "when you've proven that the application you've written can fully, or near fully utilise the available power on a single machine, and that when running production-grade workloads, actually does so, then you may scale to additional machines."

What this means is not getting a 9-node Spark cluster to push a few hundred GB of data from S3 to a database because "it took too long to run in Python", when the Python was single-threaded, non-async, and not performance-tuned.

And what about redundancy in case of node failure?

> I mean "when you've proven that the application you've written can fully, or near fully utilise the available power on a single machine, and that when running production-grade workloads, actually does so, then you may scale to additional machines

How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near Fully” because:

It puts more strain on the hardware and will cause it to fail a LOT faster

There’s no redundancy so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which after all this work, will only put you in the same position again, waiting for the next failure

Single node systems make it difficult to canary deploy because a runaway bug can blow a node out - and you only have one.

Workload patterns are rarely a linear stream of homogeneous tiny events - a large memory allocation from a big query, or an unanticipated table scan, or any 5th percentile type difficult task can cause so much system contention on a single node that your operations effectively stop.

What about edge cases in kernels and network drivers? Many times we have had frozen kernel modules, zombie processes, deadlocks and so on. Again, with only one node, something as trivial as a reboot means halting operations.

There’s just so many reasons a single node is a bad idea, I’m having trouble listing them

> How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully”

You're missing the word "can". It's a very important part of that sentence.

If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.

> It puts more strain on the hardware and will cause it to fail a LOT faster

Unless you're hammering an SSD, how does that happen? CPU and RAM should be at a pretty stable amount of watts anywhere from 'moderate' load and up, which doesn't strain anything or overheat.

> redundancy


Fail over?

I’m not against redundancy/HA in production systems, I’m opposing clusters of machines to perform data workloads that could be handled more efficiently by single machines. Also note here that I’m talking about data science and machine learning workloads, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned and we resume/restart.

I’m not suggesting running your web servers and main databases on a single machine.

But that Hadoop cluster will also work just fine if the data set does not fit in RAM. And I've never met a data set that didn't expand over time. And with Spark you get a robust, scalable and industry standard way of distributing work amongst the nodes.

Also, did you know that Hadoop is open source? So that $250K is purely for hardware.

Finally a machine that can run Chrome.

Can still “only” have 6000 tabs.

(Someone did it recently with 1.5TiB of RAM on a Mac Pro. Then Linus from Linus Tech Tips did it on Windows with 2TiB)


If that allows you to simplify your architecture so that stuff just runs on a single machine instead of needing to develop, debug and maintain a distributed solution, then you save much more money in engineer salaries than this.

if you can get by with 4TB, it's less than $30/hr on AWS

What is? Anything I touch on AWS is 5k a month. /s

$30/hr is between $20,000 and $22,000 per month depending on the length of the month.
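
The range comes straight from the number of hours in a month. A quick check, assuming the flat $30/hr rate quoted above, the instance running around the clock, and no reserved or spot discounts (the function name is just for illustration):

```python
HOURLY_RATE_USD = 30  # rate quoted in the parent comment


def monthly_cost(days_in_month: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of running one instance 24 hours a day for a whole month."""
    return days_in_month * 24 * hourly_rate


# Shortest, typical and longest calendar months.
for days in (28, 30, 31):
    print(f"{days} days: ${monthly_cost(days):,.0f}")
```

So the rounded figures above correspond to roughly $20,160 for a 28-day month and $22,320 for a 31-day month.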

You don't have to run it full time tho'...

"Can I afford to fit my data in RAM?" is a whole other site, I presume...

"Can you afford to not put your data in RAM" is another one.

"Did you even need all of this data?" is the one after that one

I'd argue it's the first.

The point is that the alternative is far more expensive (taking from the tweet linked on the page that inspired it)

Short answer: It fits in RAM if it's <= 12288 GB

The original Twitter thread is funny. This site doesn't add anything. It looks like SEO more than anything, and now that HN has linked to it, it's been successful.

The original site yourdatafitsinram.com is now a get rich quick scheme site.

Also, the site that remained wasn't ever updated. So I took some time and updated some links.

I liked updating/creating this because I love the simplicity of the concept / thought behind it.

Could you share the link to the original Twitter thread? I guess the link was changed on HN before some of us got here.

On the linked page, it says:

"Inspired by this tweet" which links to https://twitter.com/garybernhardt/status/600783770925420546

Note that this is 2015, in the heyday of big data.

> Short answer: It fits in RAM if it's <= 12288 GB

For a few given system configurations, and a given definition of "fits".

Fine print: But it might cost you $300K CapEx or $800K/yr OpEx. Hope you have a budget!

You know what else costs? A humongous number of servers to run silly stuff to orchestrate other silly stuff to autoscale yet more silly stuff to do stuff on your stuff that could fit into memory and be processed on a single server (+ backup, of course).

Add to that a small army of people, because, you know, you need specialists in a variety of professions just to debug all the integration issues between all those components that WOULD NOT BE NEEDED if you just decided to put your stuff in memory.

Frankly, the proportion of projects that really need to work on data that could not fit in memory of a single machine is very low. I work for one of the largest banks in the world processing most of its trades from all over the world and guess what, all of it fits in RAM.

There's a world outside of web apps and SV tech companies. There's a lot of big datasets out there, most of which never hit the cloud at all.

Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc. 100k sensors) would fill a 40TB array every year. They had a 20 year retention policy. So Dell would just add a new server + array every year.

I worked with another company that had 2000 oracle servers in some sort of franken-cluster config. Reports took 1 week to run and they had pricing data for their industry (they were a transaction middleman) for almost 40 years. I can't even guess the data size because nobody could figure it out.

This is not a FAANG problem. This is an average SME to large enterprise problem. Yeah, startups don't have much data. Most companies out there aren't startups.

By the way, memory isn't the only solution. In the past 15 years, I've rarely worked on projects where everything was in memory. Disks work just fine with good database technology.

> Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc. 100k sensors) would fill a 40TB array every year. They had a 20 year retention policy. So Dell would just add a new server + array every year.

That's a lot of data, but what do you even do with it other than take minuscule slices or calculate statistics?

And for those uses, I'd put whether it fits in RAM as not applicable. It doesn't, but can you even tell the difference?

They paid us $600K every six months to analyze the data and suggest adjustments to their control systems (it's called continuous commissioning, but it's not really continuous due to laws in many places about requiring a person in the loop on controls). They saved millions of dollars every year doing this, because large, complex buildings drift out of optimized airflow and electricity use very quickly.

Agreed that 20 year retention is silly. We thought it was silly, but the policies reflected the need for historical analysis for audit purposes.

It does in fact matter what you can fit in RAM though. We had to adapt all our systems to a janky SQL Server setup that was horrible for time series data and make our software run on those servers. RAM availability for working sets was a huge bottleneck (hence the cost of analysis).

This is another problem. Is a 20-year retention policy really necessary for ALL sensor data? Can the data be aggregated first, with only the aggregates subject to the retention policy? Can the policy allow fidelity to be lost gradually as data ages (the way RRDtool is used by Nagios, for example)?
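
To make the "lose fidelity gradually" idea concrete, here is a minimal sketch in plain Python. It is not RRDtool's actual implementation or API; the function names, the tier layout and the sample data are all made up for illustration. Recent samples are kept raw, older ones are averaged into wider time buckets, and anything older than the last tier is dropped.

```python
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(samples, now, tiers):
    """Keep recent samples at full resolution and coarsen older ones.

    tiers is a list of (max_age_seconds, bucket_seconds) ordered by
    increasing age; bucket_seconds=None keeps that tier at raw resolution.
    Samples older than the last tier are dropped.
    """
    result = []
    min_age = 0
    for max_age, bucket_seconds in tiers:
        in_tier = [(ts, v) for ts, v in samples if min_age <= now - ts < max_age]
        if bucket_seconds is None:
            result.extend(in_tier)
        else:
            result.extend(downsample(in_tier, bucket_seconds))
        min_age = max_age
    return sorted(result)


# Hypothetical data: one reading per minute for two hours.
now = 7200
samples = [(t, float(t // 60)) for t in range(0, now, 60)]
# Last hour raw, the hour before that as 10-minute averages.
tiers = [(3600, None), (7200, 600)]
kept = apply_retention(samples, now, tiers)
print(len(samples), "samples reduced to", len(kept))
```

With wider buckets for each older tier, a 20-year archive would grow far more slowly than raw sensor output, at the cost of no longer being able to replay old data sample by sample.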

I really don't understand comments like this.

Yes, your company's data may fit in RAM. But does every intermediate data set also fit in RAM? Because I've also worked at a bank and we had thousands of complex ETLs often needing tens to hundreds of intermediate sets along the way. There is no AWS server that can keep all of that in flight at one time.

And what about your Data Analysts/Scientists? Can all of their random data sets reside in RAM on the same server too?

Buy them a machine each.

$100K has always been "cheap" for a "business computer" and today you can get more computer for that money than ever.

$100K of hardware (per year or so) is small-fry compared to almost every other R&D industry out there. Just compare with the cost of debuggers, oscilloscopes and EMC labs for electronic engineers.

My company has over 400 Data Scientists and 1000s of Data Analysts.

Buy them a machine each at a cost of $40-60 billion ?

Or would it make more sense to buy one Spark cluster and then share the resources at a fraction of the cost?

I don’t get your numbers, getting one for 400 people is 40M, for thousands it may be 100-999M.

Still expensive, but much less than 40 billion.

He said $100k for each user but it's a dumb idea anyway.

We have a Spark cluster which supports all of those users for $10-$20k a month.

Never said EVERY data set fits in RAM, but that doesn't mean MOST of them don't.

There is a trend, when the application is inefficient, to spend a huge amount of resources on scaling it instead of making the application more efficient.

There are a number of cloud database solutions that are very easy to manage and not all that expensive. For example I work for Snowflake and our product doesn't need a small army of people to babysit it.

I mean, I also prefer doing things on a single machine, but if that machine gets expensive enough, or writing a program that can actually use all that power gets too difficult, why not switch to a cloud database?

This is about saving you money, so that's not the right fine print.

Look at it this way: This is for the person that's already going to get enough RAM sticks to fit the entire data set or a multiple of it, across many machines, and deal with the enormous overhead from doing queries across many machines. The revelation is that you can fit that much RAM inside a single machine for a much cheaper and faster experience.

It will cost you $100K CapEx and will save you many hundreds of thousands in OpEx and an untold amount of money in development cost.

Meh, that’s like half the budget of the databases on our dev environment.

We have way too many fucking copies of our DB.

I prefer using mocked data in dev. Smaller dataset, no possibility of PII leaks.

I would love to know how you came up with these numbers.

The Amazon listing has an Azure instance type and links to Azure docs.

A quick search shows that you can get at least 24TiB from AWS: https://aws.amazon.com/ec2/instance-types/high-memory

Thanks for pointing out my sloppy mistake.

For the cloud options, I chose to list only virtual instances that can be spun up on demand. The high-memory instances you link to are purpose-built.

On the other hand, it is true, they do exist.

You can fit 48 TB on a HPE MC990 X though I'm pretty sure that's got one of those NUMA architectures that SGI had with the UV 3000 or whatever.

I remember jokingly telling my team to take the millions of dollars we spent expanding our clusters, buy one of these instead, and just process in RAM. I honestly don't think I did the analysis to make sure it would actually be better.

It was 'jokingly' because we couldn't afford three of these machines anyway. The clusters had the property that we could lose some large fraction of nodes and still operate, we could expand slightly sub-linearly, etc. which are all lovely properties.

It would have been neat, though. Ahhhh imagine the luxury of just loading things into memory and crunching the whole thing in minutes instead of hours. Gives me shivers.

This is why IBM have historically done so well.

You didn't have to have all the capital. They'd rent you the machine on a long lease and you could say you'd bought a million dollar computer.

Sun tried to get into similar markets but it was always a tougher deal on minis and micros.

IIRC, the largest IBM z15 can have up to 40 TiB of RAM.

The website needs updating.

And Superdome Flex supports 48 TB.

These seem to be the least boring x86 machines. At 16 sockets the NUMA topology must be interesting.

How much is that in TB?

For RAM TB = TiB.

ah and it is so fucking lovely to be hired by a company which has been running everything on a single machine for 2 years.

I just upgraded my laptop to 3 GB. Feeling a bit behind on the times.

While pithy, the implication that you are going to process 12 TB of data in RAM using mostly single-threaded tools doesn't reflect reality.

Where exactly does that implication come from? Are you from a world where you need a map reduce framework + cluster to have parallelism of any kind?

No, but I frequently see people implying that you can do your data science in Python and R as long as you can fit the data in RAM. As you mention, it's not RAM that's the limiting factor for larger data volumes, it's finding tools that exploit parallelism.

I have done plenty of parallel work in python and R, so I'm still not sure what you mean.

Yes, that's my point. It's too simplistic to say "well, the data fits in RAM", you have to add parallelism to make the workload tolerable. In the past, some people have done that using MapReduce or Spark, GNU parallel or just writing parallel code in their favorite language. But RAM by itself isn't the only limiting factor to whether a problem is solvable in a reasonable amount of time.

If you are using python and R, can’t you make your own parallelism?
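
For what it's worth, "making your own parallelism" in Python can be as small as the sketch below, which uses only the standard library. The workload (a sum of squares) is a stand-in for real CPU-bound work, and the helper names are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    """CPU-bound work on one slice of the data (stand-in for a real computation)."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Split a sequence into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, never zero
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=4):
    """Fan chunks out to worker processes, then combine the partial results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, split(data, workers)))


if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum_of_squares(data))
```

Processes rather than threads are used here because CPU-bound pure-Python code does not speed up across threads under the GIL. The catch is that each chunk gets pickled and copied to a worker, so for multi-terabyte in-memory data you would want shared memory or memory-mapped files instead, which is part of why "it fits in RAM" is not the whole story.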

Telling me that my (example) 24TB data set fits into RAM, because it fits into 24TB on an AWS instance designed for SAP that's so expensive that it's price on inquiry, isn't overly helpful.

May I suggest that a) they consider applications need RAM too and b) that if the price is POI, it's probably better to mention "But you know, 188 m5.8xlarges might be cheaper"
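
The 188 figure looks like plain division. A sketch of that arithmetic, assuming 128 GB of RAM per m5.8xlarge (taken from AWS's published instance specs, not from this thread), treating 24TB as 24,000 GB, and ignoring both the GB/GiB difference and any RAM the application itself needs:

```python
import math


def instances_needed(dataset_gb: float, ram_per_instance_gb: float) -> int:
    """Smallest number of instances whose combined RAM covers the data set."""
    return math.ceil(dataset_gb / ram_per_instance_gb)


# 24 TB example data set spread over 128 GB instances (assumed size).
print(instances_needed(24_000, 128))
```

Under those assumptions the result matches the 188 in the comment; leaving headroom for the application (point a) would push the count higher.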

The question of whether my storage mechanism fits the budget is more relevant, no? It costs less than 4 cents per hour to store 1TB on S3, but an X1 with 1TB of RAM costs $13 or so per hour on AWS. So the issue is whether what you pay for is worth the result you're computing.

I've no idea if the intended audience here would ever run their workloads on Solaris, but like the IBM POWER systems, Oracle & Fujitsu SPARC servers also max out at 64TB of RAM. I didn't see those included here.

Pretty cool that single boxes have > 10 terabytes of RAM.

I would definitely imagine that most workloads rarely need more than a few to a few hundred TB in memory, since you may have petabytes of data but you probably touch very little of it.
