
Your data fits in RAM - lukegb
http://yourdatafitsinram.com/
======
Smerity
It's probably worth extending "Your data fits in RAM" to "Your data doesn't
fit in RAM, but it does fit on an SSD". So many problems will still work with
quite reasonable performance when using an SSD instead. By using a single
machine with an array of SSDs, you also avoid the complexity and overhead of
distributed systems.

My favourite realization of this: Frank McSherry shows how simplicity and a
few optimisations can win out on graph analysis in his COST work. In his first
post[1], he shows how to beat a cluster of machines with a laptop. In his
second post[2], he applies even more optimizations, both space and speed, to
process the largest publicly available graph dataset - terabyte sized with
over a hundred billion edges - all on his laptop's SSD.

[1]:
[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

[2]:
[http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html)

~~~
cornellphds
This is a classic case of "algorithm/problem selection": if your
algorithm/problem is tailored to a task such as PageRank, surely single-threaded,
highly optimized code will beat a cluster designed for ETL tasks. In
real organizations where there are multiple workflows/algorithms, distributed
systems always win out. Systems like Hadoop take care of administration,
redundancy, monitoring and scheduling in a manner that a single machine
cannot. Sure, you can "grep" faster on a laptop than on AWS EMR with 4 medium
instances, but in reality, where you have 12 types of jobs run by a
team of 6 people, you are much better off with a distributed system.

~~~
acqq
Ditto for computationally intensive work: if it is CPU dominated, more CPUs
calculating in parallel will be an advantage, even if the data could fit in
RAM somewhere.

There's no single simple answer, but sure, whenever fewer computers are
enough, fewer should be used.

The recent problem is that some people love "clouds" so much today that they push
work there that could really be done locally.

~~~
vidarh
Part of the problem is that a lot of problems that are CPU dominated on a
single system become IO dominated once you start distributing them, even at very
low node counts, without _very_ careful attention to detail.

~~~
exelius
The entire reason for the popularity of distributed systems is that
application developers in general are very bad at managing I/O load. Most
developers think only of CPU/memory constraints, and usually not about disk
I/O. There's nothing wrong with that, because if your services are stateless,
the only I/O you should have is logging.

In a stateless microservice architecture, disk I/O is only an issue on your
database servers. That's why database servers are often still run on bare
metal: it gives you better control over your disk I/O, which you can
usually saturate anyway on a database server.

In most advanced organizations, those database servers are often managed by a
specialized team. Application servers are CPU/memory bound and can be located
pretty much anywhere and managed with a DevOps model. DBAs have to worry about
many more things, and there is a deeper reliance on hardware as well. And it
doesn't matter which database you use; NoSQL is just as finicky, as a few of
my developers recently learned when they tried to deploy a large Couchbase
cluster on SAN-backed virtual machines.

~~~
speeder
I own an ASUS gaming laptop...

It has two flaws. One, it has nVidia Optimus (which just sucks; whoever
implemented it should be shot).

Two, the I/O is not that good, even with a 7200 RPM disk, and Windows 8.1 makes
it much worse (Windows for some reason keeps running its anti-virus,
Superfetch, and other disk-intensive stuff ALL THE TIME).

This is noticeable when playing emulated games: games that run in an emulator,
even cartridge ones, need to read from the disc/cartridge on the real hardware,
so on the computer they need to read from disk quite frequently, and the frequent
slowdowns, especially during cutscenes or map changes, are very noticeable.

The funny thing is: I had old computers with poorer CPUs (this one has an i7)
that could run those games much better. It seems even hardware manufacturers
forget about I/O.

~~~
acqq
I can bet that in your case it's not actually the disk I/O; unless you have some
very strange emulator, the file access should still go through the OS disk
cache. VMware, for example, surely benefits from it. How many GB do you have on
the machine? How much is still left free when you run the emulator and the
other software you need?

~~~
speeder
I noticed it as disk I/O because I would leave Task Manager running on the second
screen, and every time the game lagged, memory and CPU use were below 30% while
disk was at 100%. If I sorted by disk usage, the first places were the
Windows stuff, and after them, the emulator.

------
SwellJoe
So, let's say my system is currently backed by MySQL or PostgreSQL, and that
is not fungible. How would one move that data into RAM, including writes? And,
how would one maintain some level of safety in the event of a crash? i.e. I
don't really care if I lose X amount of time worth of data (say, five
minutes), but I do care that when I reboot the system, the database comes back
from disk into RAM in a consistent state.

Is there some off-the-shelf solution to this problem? And, if so, why isn't it
talked about more? Every CMS ever, for example, would be very well-served by
something like this. My entire website's database, all ~100k comments and
pages and issues and all 60k users, is only 1.4GB, and performance is always a
problem. I don't care if I lose a couple minutes worth of comments in the
event of a system reboot or crash. So, why can't I just turn that feature (in-
memory with eventual on-disk consistency, or whatever you'd want to call it)
on and forget about it?

~~~
mhaymo
Does your RDBMS's built-in caching not handle this pretty well? Just up the
cache size, e.g. in PostgreSQL changing effective_cache_size
[https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...](https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server)

~~~
SwellJoe
For reads, yes. It's fine on reading data. The problem is writes, which
always wait for disk consistency before returning.
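
For the write side, PostgreSQL already has a knob aimed at exactly the
"in-memory with eventual on-disk consistency" trade-off described above:
asynchronous commit. A hedged postgresql.conf sketch (values are illustrative,
not a tuned recommendation):

    # synchronous_commit = off lets COMMIT return before the WAL record is
    # flushed to disk; a crash can lose the last moments of transactions,
    # but recovery replays whatever WAL did reach disk, so the database
    # still comes back in a consistent state.
    synchronous_commit = off
    wal_writer_delay = 200ms   # how often the background WAL writer flushes
    shared_buffers = 2GB       # plenty to keep a ~1.4GB database cached

With that, reads come from cache and commits return at RAM speed, while the
WAL writer trickles changes to disk in the background.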

------
praseodym
And if your data doesn't fit in a single server's RAM, just buy some more
servers and run Apache Spark [1] on them. It's an in-memory computation engine
that's really nice to program for: you don't have to worry about low-level
clustering details (as you do with MapReduce). And it's way (10-100x) faster
than Hadoop.

[1] [https://spark.apache.org](https://spark.apache.org)
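
As a rough illustration of the programming model, a minimal PySpark sketch
(the file name and the local master are assumptions for the example, not
anything from the article):

    # Parse a CSV once, pin it in memory, then answer repeated queries
    # against the cached RDD instead of re-reading from disk.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "fits-in-ram")

    events = sc.textFile("events.csv").map(lambda line: line.split(","))
    events.cache()                       # keep the parsed records in RAM

    total = events.count()               # first action materialises the cache
    errors = events.filter(lambda r: r[1] == "error").count()  # served from RAM
    print(total, errors)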

~~~
threeseed
Spark is fast becoming the default tool for big data.

The recent addition of SparkR in 1.4 means that data scientists can now
leverage in-memory data in the cluster that has been put there by Scala or DW
developers.

Combine it with Tachyon ([http://tachyon-project.org](http://tachyon-project.org))
and it's not hard to imagine petabytes of data all processed in memory.

~~~
studentrob
Can you explain what Tachyon does that's different from what Spark already
provides?

I haven't used either Spark or Tachyon. I thought the Spark solution was to
just put my dataset in memory, but the Tachyon page seems to say the same
thing.

~~~
nl
There's a slide deck[1] that explains it rather well.

Basically, Tachyon acts as a distributed, reliable, in memory _file_ system.

To generalise enormously, programs have problems sharing data in RAM. Tachyon
lets you share data between (say) your Spark jobs and your Hadoop Map/Reduce
jobs at RAM speed, even across machines (it understands data-locality, so will
attempt to keep data close to where it is being used).

[1]
[http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16...](http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16-Strata.pdf)
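
A hedged sketch of what that sharing looks like in practice: Tachyon exposes a
Hadoop-compatible filesystem, so a Spark job can read and write tachyon://
paths that another job (Spark or MapReduce) then picks up at memory speed. The
host, port, and paths below are invented for illustration.

    # Hypothetical word count whose input and output live in Tachyon's
    # in-memory filesystem rather than on HDFS.
    from pyspark import SparkContext

    sc = SparkContext(appName="tachyon-sketch")

    lines = sc.textFile("tachyon://tachyon-master:19998/shared/input.txt")
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("tachyon://tachyon-master:19998/shared/wordcount")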

~~~
studentrob
Neat, thanks for the link, the code examples towards the end make it clear
that this is pretty simple to use.

~~~
nl
Yeah, most things coming from the Spark team are excellent in that respect.

I've never used Tachyon, but based on the wonderful "getting started"
experience Spark gives I'd be confident it would be similarly well thought
out.

------
lukegb
Inspired by
[https://twitter.com/garybernhardt/status/600783770925420546](https://twitter.com/garybernhardt/status/600783770925420546)

~~~
mosselman
Can someone explain in a bit more detail what this is about? Is the 'joke'
that running data computation in RAM is faster than what? From disk?

~~~
JonnieCache
The subtext is that running a fancy distributed system is more exciting and
beneficial for one's résumé than simply buying a massive bloody server and
putting Postgres on it, and that people are making tech decisions on this
basis.

~~~
Anderkent
This of course ignores that it's much easier to get your hands on a cluster of
average machines than on one massive bloody server, plus all the non-performance-
oriented benefits of running a cluster (availability etc.).

It's much easier to request that a client provision 20 of their standard machines,
or to get them from AWS. People don't like custom hardware, and for good reason.

~~~
falcolas
Amazon offers some bloody huge servers... 32 cores, 256GB RAM, and 48TB of HDD
space on the d2.8xlarge.

~~~
brianwawok
That is $4k a MONTH for 256GB of RAM.

If you could do the same job on a fleet of 8-16GB servers, you can get a lot
more CPU for a lot fewer dollars. It depends on whether you really need
everything on one machine or not (as of course nothing will beat same-machine
memory locality).

~~~
jules
Not true; 8x16GB costs as much as 1x256GB on Amazon. The issue here is that
Amazon is hilariously expensive in general. Hetzner will rent you a 256GB
server for €460 per month, or you can buy one from Dell for $5000. These are
not high numbers; in 1990 you paid more than that for a "cheap" home computer.
For the price of a floppy drive back then you can now get a 32GB server.

------
rm999
Yes! As someone who frequently runs memory-intensive algorithms on large(ish)
datasets, I have a hard time explaining to many technical people that moving
from a single server to a cluster increases complexity and cost by an
incredible amount. It affects key decisions like algorithm and language, and
generally requires a lot of tweaking.

When a problem becomes big enough, moving to a cluster is absolutely the right
decision. Meanwhile, RAM is cheap and follows Moore's Law.

~~~
FooBarWidget
Complexity, sure. But cost? I thought a single 1 TB RAM server is more
expensive than 10x 100 GB RAM servers.

And many people don't want to deal with physical hardware. Dealing with
physical hardware increases operational complexity too. They want to rent a
virtual/cloud server. Which provider allows you to rent a virtual server with
1 TB RAM?

~~~
e12e
I took a look around for "high-ram" servers, and it seems one I can buy today,
is HP ProLiant DL580 Gen9. With just 256 GB of ram, it clocks in at 540.995,-
NOK (71.5k USD). It has 96 ram slots, and I can't seem to find anything bigger
than 32 GB DDR4 RAM, and rounding the price _up_ 96x32GB comes to roughly
672.000,- NOK (~90k USD). Adding that up (throwing away the puny ram
installed), gets us to a little over double the original price, or 1.212.995,-
(~161k USD). This has 4x 18-core E7s (72 cores) clocked at 2.5GHz -- and 3TB of
RAM (half of max, because of 32 GB DIMMs).

It _is_ true that while the jump from 256GB to 3TB is "just" ~2x in price, I
could get a server for 1/10 of the price of the original configuration -- but
only with 4GB of RAM, and nowhere near even 18 hardware threads.

If you are CPU limited (even at 72 hw threads) you might need more, smaller
servers.

But such a monster should scale "pretty far", I'd say. It costs about half as
much as a small apartment, or one developer-year.

~~~
joelwilliamson
Dell sells servers with 96x64GB RAM. There is a huge (7x) premium for the 64GB
DIMMs instead of 32GB, so it runs around 500k, with almost the entire price
going to RAM.

------
chao-
I love it. I was just doing some Fermi estimates for a friend on the data for
a project he has in the pipeline. I was curious whether it would be
cost-efficient for his project's budget to go with NVMe SSDs or whether he'd
have to stick with traditional SATA ones, and it turns out it doesn't even
matter (for now), because at least the first three months of data will fit in
256GB of RAM, even allowing for a 2.5x factor stemming from some (estimated)
inefficient storage or data structure use in a scripting language like Ruby or
Python.

Edit: And after those first three months he'll know more about the use and
performance demands of the project and will be able to make far more accurate
decisions about storage categories.

~~~
paulrosenzweig
Where's 2.5x from? I'd be curious to see any actual data on comparing memory
footprint for a problem in C/Go/Rust to Python/Ruby. I'm sure it varies
widely, but 2.5x might not be far off.
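
Not hard data, but a quick way to get a feel for the overhead on CPython; a
hedged sketch, and the "roughly 4-5x" in the comment is an expectation for a
64-bit build, not a measured benchmark:

    # Compare the same 10 million integers stored as a list of Python
    # objects versus a packed 8-byte-per-value array.
    import array
    import sys

    n = 10 * 1000 * 1000
    py_list = list(range(n))               # one ~28-byte int object + pointer each
    packed = array.array('q', range(n))    # 8 bytes per value, C-style

    list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
    packed_bytes = packed.buffer_info()[1] * packed.itemsize

    print(list_bytes / packed_bytes)       # roughly 4-5x on 64-bit CPython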

------
Dzidas
Today I'm working on a dataset of 1GB, which fits in memory. But that is not
enough. If a variable is a category/factor, you need to introduce dummy
variables, and your dataset starts putting on weight. Next: do you want to apply
an ML algorithm in parallel? Oops, you need more memory. Done that? Now please
use the test dataset for prediction. My point is that "data in memory" is just
the beginning...
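
To make the dummy-variable blowup concrete, a small pandas sketch (the column
names and the 50-level cardinality are invented for the example): one
categorical column becomes 50 indicator columns after encoding, and the
frame's memory footprint grows accordingly.

    # One-hot encode a categorical column and compare memory before/after.
    import numpy as np
    import pandas as pd

    n = 1000000
    df = pd.DataFrame({
        "value": np.random.rand(n),
        "category": np.random.choice(["c%d" % i for i in range(50)], size=n),
    })
    print(df.memory_usage(deep=True).sum())       # bytes before encoding

    encoded = pd.get_dummies(df, columns=["category"])
    print(encoded.memory_usage(deep=True).sum())  # bytes after: 50 indicator columns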

------
SubuSS
The problem with giant boxes (full of RAM / SSD / disk) is giant failures and
huge recovery times. This is worsened in the case of RAM because now every power
blip is a full-on recovery situation. If you have a big enough data set focused
on a single box (or two for backup purposes), your customers are going to blow a
gasket the moment one of them goes down, because workloads usually grow to
accommodate available capacity.

FB has a nice paper that talks about this problem.
[https://research.facebook.com/publications/300734513398948/x...](https://research.facebook.com/publications/300734513398948/xoring-elephants-novel-erasure-codes-for-big-data/)

~~~
CHY872
Well, you wouldn't run such a server without a hefty uninterruptible power
supply system. On your bigger server you can expect a lower frequency of
failures due to fewer points of failure, and you can make your system more
resilient (redundant RAM, filesystems, power, etc.).

------
jakozaur
A more accurate title would be "fits in RAM of a single machine".

Maybe some bonus categories:

0\. Spreadsheet is all you need.

1\. Python script is good enough.

2\. Java/Scala is the way to go.

3\. Need to manage memory (GC doesn't cut it), some custom organization.

4\. Actually needs a cluster.

~~~
baldfat
> 0\. Spreadsheet is all you need.

I HATE when people use Spreadsheets to do anything besides simple math.

[http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...](http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a-spreadsheet-for-important-work-i-mean-it/)

TL;DR: your work is not reproducible and we can't see what you did to get to
your numbers. There are a million examples of why this is bad.

Also

> 1\. Python script is good enough

You mean Python with pandas and numpy?

I use R which is also a great choice

> 2\. Java/Scala is way to go.

For you, but the vast majority of data scientists don't use either, and their
choice is not universal. Julia looks like a great newcomer. Again, I mainly
use R.

> 3 & 4 are good points.

~~~
dredmorbius
Sadly, I recall those arguments against spreadsheet computation being made in
the early 1990s. People simply won't learn.

Ray Panko, University of Hawaii.

[http://panko.shidler.hawaii.edu/SSR/](http://panko.shidler.hawaii.edu/SSR/)

Goes back to 1993:

[http://panko.shidler.hawaii.edu/SSR/auditexp.htm](http://panko.shidler.hawaii.edu/SSR/auditexp.htm)

~~~
TorKlingberg
There is even a European Spreadsheet Risks Interest Group:
[http://www.eusprig.org/](http://www.eusprig.org/)

------
a-saleh
I'm afraid that in our research lab we didn't have $10,000 up front / $200 a
month to get a PC with 1TB of RAM... we did have a large computer hall and BOINC
though :)

------
sytelus
Looks like 1.5TB of RAM with 15 cores costs $50K. But it shouldn't be just about
RAM. The problems I'm working on require 250 cores on a similar amount of data.
If there were an option to get, say, 150 cores with 2TB of RAM, things would fly
for sure.

~~~
jacquesm
Another 4 to 6 years and that should be a reality.

~~~
vegabook
4-6 months and you'll have a Knights Landing Xeon Phi with at least 72 cores
and 288 hardware threads, with vector instructions, and you'll be able to
stick 3 of them in a single blade.

------
falcolas
Seems a bit naive, saying 2.1PB probably doesn't fit in ram, "but it could"...

I get who this is aimed at, and why, but just saying that it fits in RAM isn't
as useful as it could be. This is an opportunity to teach, not just snark.

~~~
collyw
With a bit of clever reformatting of your data, 2.1 PB could probably easily be
reduced in size to something that would fit in RAM. Do you actually need
every byte?

------
jkot
Besides scale-out and scale-up there is also another solution: scale-in.
Optimize your memory usage so your data occupies less space.

I work on something like that.

------
pedrocr
So this seems to use 6.144TiB as the limit that will fit in RAM. That's
1.536TiB x 4 when using the latest Xeon I could find[1]. According to the
specs though you should be able to use 8, so the total limit should actually
be 1.536 x 8 = 12.288 TiB. 12TiB of RAM, that's quite amazing.

[1] [http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-...](http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-8893-v3-45M-Cache-3_20-GHz)

~~~
genericuser
It seemed to use 6.000000000000000444089209850...??? TiB when I tried values.

~~~
pedrocr
It seems to use different cutoff values depending on whether you are using
MiB/GiB/TiB/etc. I tested with GiB, and 6144 is OK while 6145 is not.

------
jerven
I think it's wrong. It says 64 TB does not fit in RAM, but you can get 64TB
machines from SGI, as well as 32TB ones from Oracle.

The SGI ones, with up to 2048 cores, have larger single system images than most
people have in their clusters.

The benefit of these systems is not really the ease of programming but the
speed of the interconnect.

List price of the Oracle one was 3 million a few years ago. But most of that
is actually in the high density dimms. These days I think the price must be
lower, but I won't waste my Oracle sales contact time in figuring out what it
is today. Of course it will still be expensive, it is an Oracle product after
all.

However, an equivalent Dell list-price cluster of simple 1U boxes (512 6C/64GB
ones!) will go for 1.5 million. Add the fact that you have to house 512 boxes,
i.e. 25 racks or so, plus networking. Of course you do get 1/3rd more cores than
the SGI one.

For many of us that are beyond "just use a single normal server" and yet
too small for the Google solutions, these big-memory solutions from Oracle and
SGI can make sense even if they are not the first thing that comes to mind!

------
iddqd
Everything fits in RAM if you have the budget for it.

~~~
jacquesm
No, the point is that usually fitting things in RAM _lowers_ the budget. So
it's well worth doing a proper analysis of (a) whether you can fit all
your data in RAM and (b) whether a cluster of machines has become its own
reason for existence.

Replacing a large number of nodes with a single machine with a lot of RAM is
usually a cost-saving measure rather than a larger expense (and it saves
power too!), and due to the lack of communication overhead, and the fact that
you now have access to _all_ the data in one go, you may very well find that
your algorithms run _much_ faster.

A distributed solution should be a means of last resort.

~~~
bshimmin
What does 6TB of RAM go for these days?

~~~
jacquesm
Less than 6 machines with 1T each ;) (assuming the 6 will operate only on
local data and will never need to communicate; if they do, you may well end
up needing more than 6).

But seriously: the price of RAM for servers is now ~$10 / GB.

~~~
akie
That would translate to ~$60k for 6TB of RAM. Plus the cost of the server
itself ($10k?)

~~~
jacquesm
No, that's a _very_ expensive server, and Dell will charge you a hefty premium
for the memory. There are quite a few options that will cost you less than
that (of course the maximum capacity will vary).

It would be nice to see an article comparing all the high RAM machines side by
side with specs and prices.

The largest machine I have right now will hold 512G and was a run-of-the-mill
machine at about $5K. I'd expect the more exotic ones to be substantially
more expensive, but probably not as expensive as the machines linked here.

~~~
RogerL
Can you point me to cheap 64GB LRDIMM Octal rank memory? Dell's prices for
this seem to be the market rate, but maybe I don't know where to shop.

~~~
jacquesm
I can get octal-rank in bulk for ~$1K, so that's $16 or thereabouts / GB, still
not bad for RAM that is obviously going to be sold in smaller quantities. Note
that HP or Dell will probably not be happy if you use 3rd-party RAM in their
machines (if they didn't pull tricks to make sure only their own stuff
works!).

(For contrast, if you spent that $1K on HP-branded RAM it would not even get
you four 16G dual-rank units...)

------
rootlocus
Taken from the github repository:

    var MAX_SENSIBLE = 6 * TB;

    function doesMyDataFitInRam(dataSize) {
      return dataSize <= MAX_SENSIBLE;
    }

------
br0s
And if your data doesn't fit into the RAM of a single machine you can buy a
few more and use vSMP ([http://www.scalemp.com/](http://www.scalemp.com/)) to
create a shared memory single system image.

------
cornellphds
In my opinion the correct answer is 244GB (i.e. AWS r3.8xlarge high-memory
instances).

While one can purchase servers with larger memory, most likely you will run
into limitations on the number of cores. Also note that there is at least some
overhead in processing data, so you would need at least 2x the size of the raw
data.

Finally, while it's a fine thing to tweet, joke about, and make fun of buzzwords
while trying to appear smart, the reality is that purchasing such servers
(>255GB RAM) is a costly process. Further, you would ideally need two of them to
remove the single point of failure. It is likely that the job is a batch job, and
while it might take a terabyte of RAM, you only need to run it once a week. In all
these cases you are much better off relying on a distributed system where each
node has very large memory and the task can easily be split. Just because you
have a cluster does not mean that each node has to be a small instance (4
processors, ~16GB RAM).

~~~
jacquesm
> Further you would ideally need two of them to remove the single point of
> failure.

That's assuming that everything needs to be 'high availability' and buying
_two_ of everything is a must. This is definitely not always the case. In
plenty of situations buying a single item and simply repairing it when it
breaks is a perfectly good strategy.

~~~
cornellphds
It's not about having two of everything at all times, but rather about having the
capacity whenever you need it. At 244GB you hit a sweet spot where you can
have access to large capability at a flexible price (spot market / on-demand /
on-premise). This is what separates engineers with business acumen from
run-of-the-mill "consultants" with a search engine.

~~~
jacquesm
You mentioned 'single point of failure'.

------
voidlogic
"Your data fits in RAM", vs "Your data fits in RAM on around X machines",
would be better. Any dataset fits in RAM.... but if its going to take more
machines then I am willing to buy it really doesn't.

------
karmakaze
Before core, there was tape. Tape used to be the backup medium; then disk became
the new tape. Bubble memory begat SSD, so memory has in some sense become the
new disk.

RAM is the new disk: now for some, later for others.

------
yellowapple
"Yes, your data fits in RAM... if you feel like buying a server at the same
price as 3 Tesla Model S automobiles, a mansion in the Southern U.S., or a
bachelor pad in San Francisco."

------
peter303
HP hints that its new memristor memory computer will have the cost of flash and
the speed of registers, and will mostly eliminate the multi-level memory
hierarchies we have today.

~~~
CHY872
Unlikely; the limiting factor is already distance - poor scaling from
interconnects (wires) already means that we can't have all that much global
state. This might increase the amount of state we can have, but unless you can
fit gigabytes into a single chip you won't be eliminating the multi level
memory hierarchy.

Like right now the L1 cache will have latencies of 1 or 2 cycles, and the L2
cache 15; this is due to the overheads of cache coherency protocols, moving
the data around the chip; it's not that the memory's slower, it's all SRAM.

They are probably referring to enterprise workloads. Here you have large
working sets (so caches are less useful) and you want maximum throughput.
Clever multithreading (finegrained) can reduce effective latency by scheduling
many (32?) processes at the same time, executing an instruction from each in
round-robin fashion (see Sun Niagara). In that case, you can sometimes dump
the L1 cache, and you would be able to get rid of the memory hierarchy.

There's also probably a benefit wrt hard drives/secondary storage; you can
obviously make system storage very fast, which might improve random access
times considerably. BUT this is probably not going to be transformative; it'll
improve certain types of accesses, but current algorithms are already very
highly tuned to spatial and temporal locality of reference. Furthermore,
you'll still see these structures win out, because they can take advantage of
hardware prefetching more easily.

~~~
eafpres
The property of memristors having real values instead of 0 or 1, and the fact
that their value can be path dependent, leads me to think that at least
information density can be increased over conventional memory today.

------
nickbauman
Cute but "Big Data" is really just data that's not in the building and isn't
feasible to just move around from one machine to another in your department.

------
nwenzel
Even if your data doesn't fit in RAM... and even if it does... when you're
developing, you should be using a sample of your data that fits into RAM.

------
swalsh
This is good marketing, but you know what would be even better marketing? Give
me access to that server for a week. Let me set up a demo of my biggest
customer, and then run my tasks. We've started (and are in the process of)
investing thousands of dollars in moving to Azure. A server this large is not
something I can buy and experiment on easily. Hard numbers would convince my
superiors that it's a better solution, but they're not going to give me $10k to
do the experiment.

~~~
vegabook
10k? Those sticks of RAM alone will cost you something like 75k USD. Then
you'll need the processors, arguably 4 of the top of the line 18-core XEONs at
5000 USD each. Then you'll need to put it all together with software and a
(properly cooled) rack, not to mention the terminal(s) to access it, plus the
personnel to put this baby together for you. This box could easily cost you
150 grand.

~~~
pquerna
It's not cost-effective to use non-E5-class Xeons, or to go above 32GB DIMMs,
right now... So you want a dual-processor setup with 16 DIMM slots, so 16x 32GB
= 512GB with dual proc -- which you can do for about $10,000.

~~~
vegabook
That's a very nice piece of kit for 10k, I have to say. Thanks for the
"sweet spot" price/perf advice. Seems like excellent value. I've had my heart
set on a badass Mac Pro, but these specs put it to shame.

------
Aardwolf
If I select 1KB, why does the link point to an HP server with up to 6TB of
RAM? Linking to an 80's PC seems more appropriate :)

------
tempodox
Wow, I wish I had the spare change for one of these beasts. I think I have
enough NP-hard problems to fill any RAM to the brim :)

------
nwrk
[http://www.downloadmoreram.com/](http://www.downloadmoreram.com/)

------
msellout
Although we can theoretically handle up to 2^64 bytes of RAM (16 exabytes),
the practical limit is much lower. I think someone on Wikipedia said it's
somewhere around 8TB, but I imagine the performance of random access into 8TB
of RAM is much worse than on a motherboard designed for up to 32GB of RAM.

It's not as easy as just buying more RAM. You'll have to pay more attention to
how you make use of the various caches between your CPU and RAM.

~~~
gambiting
I imagine that on a motherboard with 96 RAM slots, the access time between
the first one in the row and the last one will actually be quite different, due
to the physical distance between them.

------
polite_wine
Sorry for the simple question, but if you store it in RAM, what is the strategy
for when the server is turned off?

~~~
lukegb
The idea is more that when you process data, if you can fit it all in memory
(and you don't need lots of CPU power, etc, etc, etc) then just use one
machine and don't worry about "clusterising" it.

If you're expecting growth in the size of your dataset (beyond growth in RAM
size availability), then, well, maybe don't just use a single machine. Same
goes for a whole bunch of similar "it's too large for a single machine"
considerations.

Stored data should probably still be persisted to disk, and backed up.

------
stupidcar
Damn. my data is 6597069766657 bytes. Apparently if it was 6597069766656 bytes
it would have fitted in RAM.

~~~
lukegb
Well, hate to break it to you, but you probably have some overhead associated
with your data, like your operating system or structures related to processing
your data.

------
rplnt
Our data fits in RAM, but it proved to have no speed benefit. So the RAM just
sits there, empty.

------
starikovs
Redis as a primary data store!

------
octatoan
600 PiB "No, it probably doesn't fit in RAM (but it might)."

Well, well, well.

------
scblock
What is the point of this site other than budget shaming?

------
lurkinggrue
Great googly moogly! terabytes of ram!

------
maljx
But does it fit in the L1 cache?

------
itamarhaber
Brilliant!

------
smartpants
6.000000000000000444 TiB

------
mahouse
Is there any point to the stupidly big-ass font? It does not fit on my screen.

~~~
pcthrowaway
BUT IT FITS IN RAM!

------
imaginenore
That's like saying "you can fly first class".

If you don't have money, you can't. Very few people can afford it.

~~~
pdpi
It's more akin to saying "if you're looking at buying several economy tickets
to go from A to B, a first class ticket on a direct flight might be cheaper
and faster than stitching together several economy tickets"

------
smegel
If you are programming in R, you sure better hope it does!

~~~
baldfat
After reading the title I was sure there was something about R in the
comments.

You can now program Spark in R:
[http://blog.revolutionanalytics.com/2015/01/a-first-look-at-...](http://blog.revolutionanalytics.com/2015/01/a-first-look-at-spark.html)

Now you can work directly with SQL Server, as announced this week by MS:
[http://www.computerworld.com/article/2923214/big-data/sql-se...](http://www.computerworld.com/article/2923214/big-data/sql-server-2016-to-include-r.html)

I have had a ton of arguments about R's "biggest weakness" being that it uses
RAM. Not once in almost 3 years of working in R have I run into this roadblock,
but I am sure others have. There are several good distributed choices that will
keep getting better and better.

Using RAM instead of a distributed system is better in R, as well as in really
any other language, in terms of complexity and flexibility.

~~~
saosebastiao
For my workloads, R has always choked on its single thread long before it
choked on memory. And the parallelism options are terrible hacks.

------
lessthunk
Or you learn about data structures and algorithms and try to need less :-).
Randomized algorithms, for example, are intriguing.

------
toolslive
We build object stores... so, no, it most definitely does not.

------
josephmx
I sincerely hope nobody is using a tool like this to decide which enterprise
servers to buy...

~~~
lukegb
Me too. The links are mostly to back up my claim rather than as a suggestion
of servers to buy (or I'd have found some affiliate links!)

