Hacker News new | comments | show | ask | jobs | submit login
Your data fits in RAM (yourdatafitsinram.com)
317 points by lukegb 940 days ago | hide | past | web | favorite | 224 comments



It's probably worth extending "Your data fits in RAM" to "Your data doesn't fit in RAM, but it does fit on an SSD". So many problems will still work with quite reasonable performance when using an SSD instead. By using a single machine with an array of SSDs, you also avoid the complexity and overhead of distributed systems.

My favourite realization of this: Frank McSherry shows how simplicity and a few optimisations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, both space and speed, to process the largest publicly available graph dataset - terabyte sized with over a hundred billion edges - all on his laptop's SSD.

[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...


This is classic case of "Algorithm/Problem Selection" if your algorithm/problem is tailored to a task such as PageRank, surely a single threaded highly optimized code will beat a cluster designed for ETL tasks. In real organizations where there are multiple workflows/algorithms, distributed systems always win out. Systems like Hadoop take care of Administration, Redundancy, Monitoring and Scheduling in a manner that a single machine cannot. Sure you can "Grep" faster on a laptop than AWS EMR with 4 Medium instances, but in reality where you have 12 types of jobs which are run by team of 6 people, you are much better off with a distributed system.


Ditto for computationally intensive work: if it is CPU dominated, more CPU's calculating in parallel will be of advantage, even if the data could fit some RAM.

There's no a single simple answer, but sure, whenever less computers are enough, less should be used.

The recent problem is, some people love "clouds" so much today that they push there the work that could really be done locally.


Part of the problem is that a lot of problems that are CPU dominated on a single system becomes IO dominated once you start distributing it at very low node counts without very careful attention to detail.


Yes, and starting from the distributed version first, one can even miss the fact that everything can be immensely faster once the communication is out of the way.

There's no a single simple answer, except: don't decide about the implementation first, instead competently and without prejudice evaluate your actual possibilities.

Also don't decide the language like "Python" or "Ruby" first. If you expect that the calculations are going to take a month, then code them in Ruby and wait the month for the result, you can miss the fact that you could have had the results in one day by just using another language, most probably without sacrifying the readability of the code much. Only if the task is really CPU-bound, of course.

On another side, if you have a ready solution in Python, and you'd need a month to develop the solution for other language, the first thing you have to consider is how often you plan to repeat the calculations afterwards. Etc.


I'd also toss in ease of correctness if you don't have a really well-known problem with a good test suite: I've seen a number of cases where people developed “tight” C code which gave results which appeared reasonable and then ran it for months before noticing logic errors which would either have been impossible or easier to find in a higher-level language or, particularly, one which had better numeric error handling characteristics.

There is a lot to be said for confirming that you have the right logic and then looking into something like PyPy/Cython/etc. or porting to a lower-level language.


> one which had better numeric error handling characteristics

Which language do you suggest which matches this?

I don't know many people who use the language "numeric error handling characteristic" but I know the benefit of already having good written high-level primitives (or libraries).


The first example which comes to mind are integer overflows, which are notoriously easy to miss in testing with small / insufficiently-representative data. Languages like Python which use a arbitrary-precision system will avoid incorrect results at the expense of performance and I believe a few other languages will raise an exception to force the developer to decide the correct way to handle them.

The other area I've seen this a lot is caused by the way floating point math works not matching non-specialists’ understanding – unexpectedly-failing equality checks, precision loss making the results dependent on the order of operations, etc. That's a harder problem: switching to a Decimal type can avoid the problem at the expense of performance but otherwise it mostly comes down to diligent testing and, hopefully, some sort of linting tool for the language you're using.


The entire reason for the popularity of distributed systems is because application developers in general are very bad at managing I/O load. Most developers only think of CPU/memory constraints, but usually not about disk I/O. There's nothing wrong with that; because if your services are stateless then the only I/O you should have is logging.

In a stateless microservice architecture, disk I/O is only an issue on your database servers. Which is why database servers are often still run on bare metal as it gives you better control over your disk I/O - which you can usually saturate anyway on a database server.

In most advanced organizations, those database servers are often managed by a specialized team. Application servers are CPU/memory bound and can be located pretty much anywhere and managed with a DevOps model. DBAs have to worry about many more things, and there is a deeper reliance on hardware as well. And it doesn't matter which database you use; NoSQL is equally as finicky as a few of my developers recently learned when they tried to deploy a large Couchbase cluster on SAN-backed virtual machines.


For web services, distributed usually makes sense, because the resource usage of a single connection is rarely high. It's largely uninteresting to consider huge machines for that use case for the reasons you outline.

But that's not really what the discussion is about. In the web services case, your dataset often/generally fits in memory because your data set is tiny. You don't need large servers for that, most of the time. Even most databases people have to work with are relatively small or easily sharded.

In the context of this discussion, consider that what matters is the size of the dataset for an individual "job". If you are processing many small jobs, then the memory size to consider is the memory size of an individual job, not the total memory required for all jobs you'd like to run in parallel. In that case many small servers is often cost effective.

If you are processing large jobs, on the other hand, you should seriously consider if there are data dependencies between different parts of your problem, in which case you very easily become I/O bound.


I own a ASUS gaming laptop...

It has two flaws: One, it has nVidia Optimus (that just suck, whoever implemented it should be shot).

Two, the I/O is not that good, even with a 7200 RPM disk, and Windows 8.1 make it much worse (windows for some reason keep running his anti-virus, superfetch, and other disk intensive stuff ALL THE TIME).

This is noticeable when playing emulated games: games that use emulator, even cartridge ones, need to read from the disc/cartridge on the real hardware, on the computer they need to read from disc quite frequently, the frequent slowndowns, specially during cutscenes or map changes is very noticeable.

The funny thing is: I had old computers, with poorer CPU (this one has a i7) that could run those games much better. It seems even hardware manufacturers forget I/O


I can bet in your case it's not the disk I/O actually, unless you have some very strange emulator the file access should still go through the OS disk cache. VMware for example surely benefits from it. How many GB do you have on the machine? How much is still left free when you run the emulator and the other software you need?


I noticed it as disk i/o because I would leave task manager running on second screen and every time the game lagged memory and CPU use were below 30% and disk was 100%, and if I left sorting by disk usage, the first places are the windows stuff, and after them, the emulator.



> There's no a single simple answer, but sure, whenever less computers are enough, less should be used.

This - if you're using this website as your sole calculation then something has gone seriously wrong.

It was more intended to provoke discussion and thinking around overengineering things which can easily be done with, say, awk or a few lines of <scripting language>.

If you have large CPU requirements then sure, use Hadoop/Spark/some other distributed architecture. If you have a >6TB dataset but you only need 1GB of it at a time, then, well, maybe you still only need one computer.


The problem is getting 100% of the work to be parallel. If I have to do that last 10% in one machine there isn't much point to more than 10 machines.


You are talking about https://en.wikipedia.org/wiki/Amdahl's_law , am I right?


In most cases it's probably better to have one machine each for different jobs, then all cpu on every machine can be used for the current job. There are probably very few jobs that take extreme amount of cpu without also using a lot of data which would have to be moved around a cluster anyway.


Cloud seems to have become a Wall Street flag on par with outsourcing. It is something a company does to grab market attention...


Some things take 1 second on 1 machine beefy machine or 6 hours on a cluster due to latency issues. And yes I do mean 20,000 times slower, though 2,000 is far more common due to latency inside a datacenter being around ~500,000 ns vs ~100 ns for main memory vs 0.5 nm from L1 cache.

PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.


A step between RAM and SSD could be "Your data fits in RAM in compressed form". LZ4 compression is takes 3-4x longer than memcpy. LZ4 decompression is only 50% [1] slower. 2-3 GB/s per core.

[1]: Your mileage may vary.


If it's slower both in and out what's the benefit?

To guy below me: Ah, thanks. I thought the guy above was trying to say it's slower than paging to disk. : )


Basically, parsing the percentages, using compressed data in RAM is ~1 order of magnitude slower to get into memory, and less than 1 order of magnitude slower to read from RAM. That is still around 2 (or more depending on setup) orders of magnitude faster than the hit to go to SSD / SSD RAID.

Useful knowledge on latency: https://gist.github.com/jboner/2841832


Because it's so much faster than having to go to disk.


Scalable graph systems have been around for many years and have never taken off.

Most businesses doing big data (like ours) often have multiple disparate data sources that at the start of the pipelines are ETLing into some EDW. Trying to consolidate them into a single integrated view is very difficult and time/resource intensive. Having billions of disconnected nodes in the graph would be very hard to reason with.


I feel compelled to point out that the average SSD has an order of magnitude (or maybe 2 orders) more IOPS than a 6-disk 15k RAID6 or RAID10 array.

And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.


sounds good to me, but why people not doing that? ssd price too high?


People are doing that. If you see those PCIe "SSDs", most of them are effectively multiple SSD's + controller on a card, combined with either software or hardware RAID-0... E.g. some of the OCZ cards at least show up as 4 individual units under Linux, while the Windows drivers at least by default will show it as one device.

But yes, it is expensive if you measure per GB. If your dataset is small, though, and you are more interested in cost per IO operation, they can be very cheap.


I've built boxes for computationally intensive workloads that use these PCIe SSD cards. The IO is STUPID fast, considering you're bypassing the SATA chipset and are limited only by PCIe speeds.

You don't get a large amount of space (I believe the cards I installed were only 64 or 128GB, but this was ~4 years ago), but for small datasets that you need extremely fast access to, they get the job done.


We have a few 480GB ones, and yes, they are stupid fast - we're getting 1GB/sec writes easily. You can get multiple TB now if you can afford it (can't justify it for our uses...)


We run 8 SSD RAID 10 for all our database nodes. However, you still need the same metric ton of RAM you'd need before. However, if you have an I/O heavy load [which almost all databases do] you can cut down on total number of machines compared to SATA.


RAID SSD has the same latency as a single SSD. This setup will help where the bottleneck is I/O throughput.


SSD prices are about as high as HDDs have traditionally been. The difference is now the bottom is falling out of the HDD market and you can get 3-5 2TB HDDs for as much as a 256GB SSD. So lots of devices and appliances use a single SSD for read- and write-caching, backed by a bank of high-capacity HDDs.


The Cyrus IMAP server has a "split metadata" mode which supports this configuration. Very nice if you are running a mail server of any significant size.


* VMWare uses this as the root of their vSAN technology.

* Nimble and other storage appliance manufacturers use a bank of SSDs to cache frequently-read data, and have layers of fast- and slow-HDDs to store data relative to its access frequency.

* Hybrid HDDs have been around for years.


So, let's say my system is currently backed by MySQL or PostgreSQL, and that is not fungible. How would one move that data into RAM, including writes? And, how would one maintain some level of safety in the event of a crash? i.e. I don't really care if I lose X amount of time worth of data (say, five minutes), but I do care that when I reboot the system, the database comes back from disk into RAM in a consistent state.

Is there some off-the-shelf solution to this problem? And, if so, why isn't it talked about more? Every CMS ever, for example, would be very well-served by something like this. My entire website's database, all ~100k comments and pages and issues and all 60k users, is only 1.4GB, and performance is always a problem. I don't care if I lose a couple minutes worth of comments in the event of a system reboot or crash. So, why can't I just turn that feature (in-memory with eventual on-disk consistency, or whatever you'd want to call it) on and forget about it?


If you are using PostgreSQL, your data is operating in RAM if your data is small enough to fit. The write-ahead log gets a sequential copy of writes that are immediately moved to disk but the rest is lazily applied in the background.

Given a halfway competent I/O scheduler and some cheap SSDs, you can continuously write new data to disk at network wire speed even at 10 GbE while operating on the data in RAM and saturating outbound network. There is no slowdown at all. Even for databases that do not implement a good I/O scheduler (like PostgreSQL unfortunately) your workload is sufficiently trivial that backing it with SSD should have no performance impact. If you are having a performance problem with 1.4GB CMS, it is an architecture problem, not a database problem.


Postgres's default settings are pretty bad. You'll want to increase shared_buffers to somewhere between 25%-40% of your available system RAM (if you're using 9.2 or earlier, you'll probably need to update SHMMAX using sysctl) instead of the default 128MB.

For the thing you described, you probably also want synchronous_commit=off. That means you might lose some commits in the case of a crash, but you won't get data corruption from it, and writes will be much faster.


If performance is a problem with a 1.4GB database, your system is either extremely low end, or you have some seriously un-optimised queries or database architecture and/or should take a long hard look at what caching you are not doing that you should be (including in the form of materialised views)

It simply is not your overall disk IO capacity that is the performance problem with a dataset that small. At least not for a CMS.


So, my question wasn't about whether the system in question is shitty. I know that it's shitty. I also know that the CMS has less than optimal queries. That was also not my question.

"It simply is not your overall disk IO capacity that is the performance problem with a dataset that small."

But, it clearly, and measurably is; there's nothing to argue about there. That which comes from RAM (reads) is fast, that which waits on disk (writes) is slow. Writes take several seconds to complete, thus users wait several seconds for their comments and posts to save before being able to continue reading. That sucks, and is stupid and pointless, especially since it's not even all that important that we avoid data loss. A minute of data loss averages out to close enough to zero actual data loss, since crashes are so rare.


See my other comments re: turning off fsync etc.

But something still seems amiss to me.

How many new comments are you seeing per second? How many MB/s can the hardware actually write? How any MB/s of new comments are you actually getting?

Given the tiny dataset and small number of users, you really need to be running on abysmally slow (as in servers slow by late 80's standards) hardware or have a really bizarrely high user engagement or ludicrous levels of inefficiencies for this to make any sense with the numbers you've given.

Lets say an average comment is 8KB. Lets double that for indexes. At one comment per second you would be writing 86400 comments a day, replacing almost all old comments with 1.4GB of new comments in one day. Given you mention 100k comments total, presumably your actual rate of comments is much slower.

But even 16KB/second is at least an order of magnitude less than than my old 8-bit RLL harddrive for my 25 year old Amiga 500, and within reach of even the floppy drive on it...

Assuming your commenting rate is a tenth of that, the IO that should be required would in theory be within the reach of the floppy drive on a Commodore 64 home computer. Heck, it wouldn't take much for it to be within reach of the C64 tape drive.

Now, there will be some write amplification due to fsyncs etc. but nothing that could even remotely explain what you're describing unless there's something else wrong.

Maybe you have a degraded RAID array? Or lots of read/write errors on the drive?


> So, let's say my system is currently backed by MySQL or PostgreSQL, and that is not fungible. How would one move that data into RAM,

MySQL and PostgreSQL will totally take advantage of all of the RAM you give them.

> including writes?

This is harder, and you might not want it? It's worth noting that this argument is almost certainly directed against things like Hadoop, which claim to trade off performance for low management and easy scalability.

There's also a bunch of databases aimed at this use case (http://en.wikipedia.org/wiki/List_of_in-memory_databases), but I don't have any experience with them.

> I don't really care if I lose X amount of time worth of data (say, five minutes),

MySQL has generally got your back in the 'less safety for more performance' arena:

http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.htm...


"MySQL has generally got your back in the 'less safety for more performance' arena:"

So, this was never entirely clear to me, but now that I've read a bit more about it, this might actually be exactly what I want (which is to not have the system wait to return when posting new content, and just assume it'll end up on disk eventually). The talk of not being ACID made me nervous and maybe switched off my brain. I guess it just means I don't need or want ACID in this case, all I want is a consistent database on reboot.

So, I guess maybe this does what I want, but just to be clear: In the event of data loss, the database will still be consistent, correct? i.e. we'll lose one or more comments or written pieces of data, but the transaction it was wrapped up in won't be half finished or something in the database? (I recall MySQL had issues with this kind of thing in the very distant past, but I imagine that's just bad memories at this point.)


You're in good shape if you use innodb in MySQL. The issues in the distant past you're thinking of are with the isam drivers, and I've also had issues with the compressed write only archive driver when writing to that continuously. Innodb will remain consistent.


I've been thinking this too.

For example, it is very easy to write badly performing code using ORMs. And yet, ORM is often chosen for good reasons initially to give development speed (e.g. Django forms) The problem with quickly prototyped ORM based apps is that initially the performance is good enough but when the data grows the amount of queries goes through the roof. It is not the amount of data per se, but number of queries. Fixing these performance problems afterwards for small customer projects is often too expensive for the customer, but if there were a plug'n'play in-memory SQL cache/replica with disk-based writes, it would easily handle the problem for many sites.

Configuring PostgreSQL to do something like reading data from in-memory replica is likely possible, but I see that there would be value in plug'n'play configuration script/solution.


>and performance is always a problem.

Have you proven that your performance issues are SQL related? If you configure mysql correctly and give it enough RAM, a lot of those queries are happily waiting for you in RAM, so you have a defacto RAM disk. Finding your bottleneck in a LAMP based CMS system is fairly non-trivial. Think of all the php and such that runs for every function. Its incredible how complex WP and Drupal are. Lots and lots of code runs for even the most trivial of things.

This is why we just move up one abstration layer and dump everything in Varnish, which also puts its cache in RAM. Drupal and WP will never be fast, even if mysql is. Might as well just use a transparent reverse proxy and call it a day.


Disk writes are the problem in my case, and I'm comfortable making that assertion (I'm not new to this particular game). Reads are plenty fast, it's writes that are a problem. Certainly, it is a pathological problem with the disk subsystem that makes it suck, but it does suck, and if I could flip a switch and say "stop waiting for the disk, write it whenever you can" without risking breaking consistency on crashes, I would do so.

It seems like, from another comment, that setting innodb_flush_log_at_trx_commit to a non-1 value is roughly what I want, though it still flushes every second, which is probably more often than I need, but may resolve the problem of the application waiting for the commit to disk, which is probably enough.


In-memory database on a ramdisk, replicating to an on-disk database. All reads answered from the RAM master, while the replica only has to answer writes.

But I suspect that it's all already in the disk cache. Is the bottleneck reads or writes?


I guess I didn't mention it, but obviously (or at least, obvious to me knowing my system), it is writes that are slow. Reading pages can all come from memory, as the whole database is cached in RAM (I've given MySQL plenty of space to store the whole site in memory). But, writing a new post or page can take up to 30 seconds or so, which is simply absurd.

There is something pathological about our disk subsystem on that particular system, which is another issue, but it has often struck me as annoying that I can't just tell MySQL or MariaDB, "I don't care if you lose a few minutes of data. Just be consistent after a crash."


> writing a new post or page can take up to 30 seconds or so, which is simply absurd.

If it really waits on the disk your problems would be solved by using the SSD. Change your hosting plan. Fitting 1.4GB database isn't expensive. If it remains slow that means you use some very badly programmed solution.


We have dedicated servers in colo. I'm upgrading to new SSDs on a server-by-server basis, as I have the opportunity to do so. This one just hasn't been upgraded yet. That wasn't really the question, though. Whether writes take 30 seconds or .1 seconds, if I don't care, I think there should be an obvious path to make it take a millisecond by letting it just "commit" to RAM and then return.

Which, it seems like there is in the innodb_flush_log_at_trx_commit option, which is awesome, and which I'm trying out, now.


If you try the value 2 please write here if it's better for you. It can be a good compromise: "only an operating system crash or a power outage can erase the last second of transactions." It still doesn't "flush" so I guess it can work really good.


I feel like this is already indirectly possible with cache systems... yes there is an initial load from DB or whatever, but with Ehcache for example I think(?) all those objects just sit in the JVM, and therefore should be in RAM. If you wrote some app startup batch process to stick every object possible into the cache proactively, I think you'd have essentially what you're asking for.

http://ehcache.org/documentation/2.6/configuration/cache-siz...


Yes, there are a few systems which stores info in memory but backs it up to disk. Look at VoltDb, Aerospike, Couchbase (to a degree), Starcounter, OrigoDb etc. That said for a CMS the structure is usually the problem unless you have billions of web pages which means a db with a flexible datamodel is probably what you want (ie non-relational), no matter where it stores the info. For exmple graph database should match that niche really well.


100 K comments is actually miniscule. I guess you don't use any caching if the performance is always a problem? And I'm rather sure your website has even more issues compared to a better optimized one. It can be an issue of what you pay for, of course, if you have some virtual hosting with just a little of RAM. That is, maybe your data "doesn't fit in the RAM" you are providing right now.


Of course caching is enabled. Reads are all coming from cache. Performance problems come on writes, which currently wait for the disk, always, even though I don't care if they do...I'd be happy with it writing to disk five minutes later, but that's not an option I've been able to find.


Take a look at this option on MySQL innodb. I've used this in the past to radically improve speed:

"If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1, the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues."

From https://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.ht...


Why don't you try SSD storage? The database is so small it shouldn't be expensive.


We are moving to SSD. It just takes a while. This is the last of our servers to not have SSD. But, nonetheless, it seems obvious that for many use cases, one should be able to say, "I don't care if it's on disk immediately. Just be fast."


You sort-of can do that with Postgres.

The "nice" option is to tune "fsync", "synchronous_commit" and "wal_sync_method" in postgresql.conf

If your overall write load is low enough that you will catch up over reasonable amounts of time, the really dirty method is to set up replication (internally on a single server if you like) and put the master on a ram disk. If your server crashes, just copy the slaves data directory and remove the recovery.conf file, and you'll lose only whatever hadn't been replicated.

But in terms of time investment in solving this, it's likely going to be cheaper to just stick an SSD in there.


Does your RDBMS's built-in caching not handle this pretty well? Just up the cache size, e.g. in PostgreSQL changing effective_cache_size https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...


For reads, yes. It's fine on reading data. The problems are writes which always wait for disk consistency to return.


shared_buffers is the setting you're thinking of here.

effective_cache_size should be set to a reasonable value of course, but it does not affect the allocated cache size, it's just used by the query optimizer.


RAM is more expensive than disk space. Most small websites want to save money, so paying extra to keep it all in RAM is probably out of the question.


RAM is cheaper than people.


And if your data doesn't fit in a single server's RAM, just buy some more and run Apache Spark [1] on them. It's an in-memory computation engine that's really nice to program for: you don't have to worry about low-level clustering details (like MapReduce). And it's way (10-100x) faster than Hadoop.

[1] https://spark.apache.org


Spark is fast becoming the default tool for big data.

The recent addition of SparkR in 1.4 means that now data scientists can leverage in memory data in the cluster that has been put there by output from either Scala or DW developers.

Combine it with Tachyon (http://tachyon-project.org) and it's not hard to imagine petabytes of data all processed in memory.


Can you explain what Tachyon does that's different from what Spark already provides?

I haven't used either Spark or Tachyon. I thought the Spark solution was to just put my dataset in memory. But the Tachyon page seems to say the same thing


There's a slide deck[1] that explains it rather well.

Basically, Tachyon acts as a distributed, reliable, in memory file system.

To generalise enormously, programs have problems sharing data in RAM. Tachyon lets you share data between (say) your Spark jobs and your Hadoop Map/Reduce jobs at RAM speed, even across machines (it understands data-locality, so will attempt to keep data close to where it is being used).

[1] http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16...


Neat, thanks for the link, the code examples towards the end make it clear that this is pretty simple to use.


Yeah, most things coming from the Spark team are excellent in that respect.

I've never used Tachyon, but based on the wonderful "getting started" experience Spark gives I'd be confident it would be similarly well thought out.


As others have explained Tachyon does the "put dataset in memory" part.

Spark started off as the "in memory map reduce" but has now become a platform for Scala, Java, Python, HiveQL, SQL and R code to run. It is the most active Apache project and is getting more and more powerful by the day.

Given how easy it is to get running it wouldn't surprise me to see it in the years to come being using as the primary front end for all data needs.



Can someone explain in a bit more detail what this is about? Is the 'joke' that running data computation in RAM is faster than what? From disk?


The subtext is that running a fancy distributed system is more exciting and beneficial for ones resume than simply buying a massive bloody server and putting postgres on it, and that people are making tech decisions on this basis.


This of course ignores that it's much easier to get your hands on a cluster of average machines than one massive bloody server, and all the non-performance-oriented benefits of running a cluster (availability etc.).

Much easier to request a client provisions 20 of their standard machines, or get them from AWS. People don't like custom hardware, and for good reason.


Yes. Just yesterday in some other story everyone was arguing for the cloud because who wants to maintain their own hardware? This morning the hue and cry is "slap more RAN in that puppy".

I just speced out a 6TB Dell server. Price? It is already at $600K, and I haven't fully speced it out yet (just processor, memory, drive). Maybe that memory requirement is high (though it is about what I would need); 1TB is somewhat over $200K.

For the right situation that sort of thing maybe makes sense, though I'm SOL if I need high availability (power out, internet flakey, RAM chip goes bad, etc leaves me dead in the water).

So I would need to stand up a million or more in equipment in several places, or just use AWS and suffer the scorn of someone saying 'you could have put that in RAM'. Yes. Yes I could have.


> everyone was arguing for the cloud because who wants to maintain their own hardware?

Well, I keep arguing against that, because you still get 90%+ of the maintenance work, plus some new maintenance work you didn't have before, to avoid some some relatively minor hardware maintenance. And you can get most of the benefits of non-cloud deployment with managed hosting where you never have to touch the hardware yourself.

I work both on "cloud only" setups and on physical hardware sitting in racks I manage, and you know what? The operational effort for the cloud setup is far higher even considering it costs me 1.5 hours in just travel time (combined both ways) every time I need to visit the data centre.

For starters, while servers fail and require manual maintenance, those failures are rare compared to the litany of issues I have to protect against in cloud setups because they happen often enough to be a problem. (The majority of the servers I deal with have uptimes in the multi-year range; average server failure rate is low enough that maintenance cost per server is in the single digit percentage of server and hosting costs). Secondly I have to fight against all kind of design issues with the specific cloud providers that are often sub-optimal and require extra effort (e.g. I lose flexibility to pick the best hardware configurations).

Cloud services have their place, but far too many people just assumes they're going to be cheaper, and proceed to spend three times as much what it'd cost them to just buy or lease some hardware, or rent managed hosting services.

Even if you don't want to maintain your own hardware, AWS is almost never cost effective if you keep instances alive more than 6-8 hours of the day in general. Your mileage may wary, of course.


"The cloud" is basically a new non standard OS to learn.

I am reasonably happy configuring an old school Linux box. Heroku is much more of a pain in the arse to deploy to in my experience, despite much of the work being done for you already. Debugging deployment issues is particularly painful.


Depends on what you mean by massive bloody server. You can get a server with a terabyte of RAM for a price that's insignificant compared to the cost developing software to run on a cluster.


> You can get a server with a terabyte of RAM for a price that's insignificant compared to the cost developing software to run on a cluster.

This assumes that a) You're in the valley where average developer salary is $10k a month or more, b) You're a large company paying developers that salary.

There are lots of other places where a) Developers are cheaper, or b) You're a cash strapped startup whose developers are the founder(s) working for free.


Comparison still holds, because if you buy a cluster with X amount of RAM the price will be roughly the same as a single server with X amount of RAM. Except that for some large X there won't be any off the shelf servers you can buy with that amount of RAM (let's say 2000GB), but lets be honest here, 99% of companies needs are under that X especially if we're talking about startups.


> if you buy a cluster with X amount of RAM

That's not the only option. You can rent a cluster for a lot cheaper.


...


You're assuming there are people competent of building such systems who are ignorant of the fact they can earn that money anywhere in the world.


Amazon offers some bloody huge servers... 32 core, 256GB RAM, and 48TB HDD space. d2.8x large


That is 4k a MONTH for 256gb of ram.

If you could do the same job on a fleet of 8-16GB servers.. you can get a lot more CPU for a lot less dollars. Depends if you really need everything on 1 machine or not (as of course nothing will beat same machine in memory locality)


Not true, 8x16GB costs as much as 1x256 on Amazon. The issue here is that Amazon is hilariously expensive in general. Hetzner will rent you a 256GB server for €460 per month. Or you can buy one from Dell for $5000. These are not high numbers, in 1990 you paid more than that for a "cheap" home computer. For the price of a floppy drive back then you can now get a 32GB server.


rackspace, onmetal-memory[1]: 512 GB, $1650/mo (3.22 $/gb/mo)

softlayer, dual Xeon 2000 Series: 512GB, $1,823.00/mo (3.56 $/gb/mo)

these are on-demand prices. pre-pay, or use a term discount, and its cheaper.

Build it yourself: You can build a Dell or similar on a 2-Xeon-proc (E5 series), your main limit is getting good prices on 16x 32GB DIMMS. But lets say you can buy the RAM for ~$6500, then its just dependent on the rest of your kit, lets say $10,000 flat for the whole server. $277.77/mo over 36 months, but you still need network infrastructure, and you might want a new one in 12 months, but you get the general idea.

[1] - http://www.rackspace.com/en-us/cloud/servers/onmetal


and fwiw, costs at amazon will scale linearly with resources. the 1 beefy box with 256GB RAM box costs about as much as 16 boxes with 16GB of RAM each.


If you're running a windows system licensing costs will be smaller when scaling up than scaling out, so there is that to bear in mind.


It may be more exciting, but don't those people know about the CAP theorem?



You should post that if it hasn't been posted already, it's a much better way to make the case than the current link.


Done: https://news.ycombinator.com/item?id=9582060

I originally saw it on HN, but almost two years ago. Old comments: https://news.ycombinator.com/item?id=6398650


There is no point deploying a heavy, complex (and usually pretty slow due to the overheads involved) distributed database, when you could just buy a server with xTB ram, load any sql database on it, and run your queries in a fraction of the time. If your data is so large that it can't fit in the RAM of a single machine, then distributed databases make more sense (since loading data off disk is very slow, modulo SSD).


For some problems SQL would already be way too much overhead.


Could you give a concrete example?

If your working set is small, say 1TB -- and so fits in RAM -- for what kind of problems would using SQL be so much of an overhead that you need a different approach? And what would that approach be?

I suppose you could have a massive set of linear equations that you might be able to fit into 1TB of RAM, but would be difficult to work with as tables in Postgres?


Take a graph, you could use an SQL database to store it and do your graph analysis using SQL, or, alternatively, you could convert your graph to an extremely compact in-memory format and then do your analysis on that. Much better efficiency for the same size problem, bonus: you can now analyze much larger graphs with the same hardware.


Or maybe a bit of both: http://stackoverflow.com/questions/27967093/how-to-aggregate...

I appreciate you taking the time to answer -- and I get that there's a reason for why we have graph databases. But I really meant something more concrete, as in here's a real-world example that isn't feasible to do on machine X with postgresql, but easy(ish) with a proper graph structure/db -- rather than "not all data structures are easy to map to database tables in a space-efficient manner".


Ok, one more example: A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.

It's very much dependent on how frequently you update the data and whether or not (re)loading the data or updating your structure in memory can be done efficient or not to determine whether or not such an approach is useful or not but going from 'too long to wait for' to 'near instant' for the result of a query is a nice gain.

In the end 'programmer efficiency' versus 'program efficiency' is one trade-off and cost of the hardware to operate the solution on is another. Making those trade-offs and determining the optimum can be hard.

But a rule of thumb is that a solution built up out of generic building blocks will usually be slower, easier to set up, will use more power and will be more expensive to operate but cheaper to build initially than a custom solution that is more optimal over the longer term.

So for a one-off analysis such a custom solution would never fly, but if you need to run your queries many 100's of times per second and the power bill is something that worries you then a more optimal solution might be worth investing in.


> A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.

Right. But you've still not made a compelling argument that the "smart encoding of the data" and "clever search strategy" couldn't be done with/in SQL. I'm quite sure you could probably get a magnitude or two of improvement by diving down outside of SQL -- but there's a big difference between using the schema you have, and setting up a dedicated system.

I was perhaps not clear, but I was wondering if it's not often the case with real-world data, that you could create a new, tailored schema in SQL, and do the analysis on one system -- especially if you allow for something like advanced stored procedures (perhaps in C) and alternate storage engines.

I suppose one could argue when something stops being SQL, and start becomming SQL-like. But the initial statement was that "For some problems SQL would already be way too much overhead.". The implication being (as I understood it, at least) -- that you not only have to change how the date is stored and queried, but that SQL databases are unsuitable to that that task -- and that by going away from your dbs, you can suddenly do something that you couldn't do on a system with X resources.

I'm not saying that's wrong -- I'm just asking if a) that's what you mean, and b) if you can quantify this overhead a bit more. Are using bitfields or whatnot with postgres going to be a 20% overhead, or a 20000% overhead in your scenario? Because if it's 20%, that sounds like it can be "upgraded away".


Tough crowd.


Yeah, sorry about that :-) I really do appreciate your input -- and the previously linked page-rank example show how using manual data layout can squeeze a problem down -- but I still wonder how many of those techniques could've been made to work with SQL -- and what the resulting overhead would've been. Size on disk would probably have been a problem -- AFAIK postgres tend to compress data a bit, but probably nowhere near as much as manual rle+gzip. But I wonder how far one could've gotten with just hosting the db on a compressed filesystem...


Data that fits in RAM doesn't need any "Big Data" solutions.


I believe it's more "no, you don't need an Hadoop cluster of 20 machines, your data fits in the RAM of one machine".


People will build gigantic compute clusters with expensive storage backends when their entire dataset fits in memory.

If it fits in memory, it's going to be magnitudes faster to work with than on any other infrastructure you can build.

So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.


Right, RAM an order of magnitude faster than disk, so calculations will be performed very quickly. Big data usually implies clusters of servers because the data won't fit on one server (even on the disk).


Big Data usually implies 'big dollars', not necessarily a large amount of data. Simply use some in-efficient algorithm and a datastore with sufficient overhead and you're in Big Data territory.

Re-do the same thing using an optimal algorithm operating on a compact datastructure and you make it look easy, fast and cheap. Of course you're not going to make nearly as much money.


Most people who think they have "big data" problems actually don't have "big" data at all.


He's selling himself short.


Yes! As someone who frequently runs memory-intensive algorithms on large(ish) datasets, I have a hard time explaining to many technical people that moving from a single server to a cluster increases complexity and cost by an incredible amount. It affects key decisions like algorithm and language, and generally requires a lot of tweaking.

When a problem becomes big enough, moving to a cluster is absolutely the right decision. Meanwhile, RAM is cheap and follows Moore's Law.


Complexity, sure. But cost? I thought a single 1 TB RAM server is more expensive than 10x 100 GB RAM servers.

And many people don't want to deal with physical hardware. Dealing with physical hardware increases operational complexity too. They want to rent a virtual/cloud server. Which provider allows you to rent a virtual server with 1 TB RAM?


Cost follows complexity, even if it's not always immediately obvious.

A 1TB RAM server is more expensive than 10x 100GB RAM servers, but the hardware cost is often small compared to the business and technical cost of getting a solution to scale across a cluster.

Of course, generalizations are always dangerous—the take-home point here is perhaps that before going to a cluster because “that's the way big data is handled,” it's a good idea to do a proper cost-benefit analysis.


> generalizations are always dangerous

I've got to remember this one


I took a look around for "high-ram" servers, and it seems one I can buy today, is HP ProLiant DL580 Gen9. With just 256 GB of ram, it clocks in at 540.995,- NOK (71.5k USD). It has 96 ram slots, and I can't seem to find anything bigger than 32 GB DDR4 RAM, and rounding the price up 96x32GB comes to roughly 672.000,- NOK (~90k USD). Adding that up (throwing away the puny ram installed), gets us to a little over double the original price, or 1.212.995,- (~161k USD). This has 4 18 core E7s (72 cores) clocked at 2.5Ghz -- and 3TB of ram (half of max, because of 32 GB dimms).

It is true that while the jump from 256GB to 3TB is "just" ~2x -- I could get a server for 1/10 of the price of the original configuration -- but only with 4GB of RAM, and nowhere near even 18 hardware threads.

If you are CPU limited (even at 72 hw threads) you might need more, smaller servers.

But such a monster should scale "pretty far" I'd say. Does cost about half as much as a small apartment, or one developer/year.


Dell sells servers with 96x64GB RAM. There is a huge (7x) premium for the 64GB DIMMs instead of 32GB, so it runs around 500k, with almost the entire price going to RAM.


A Dell R920 with four E7-8880L v2 15 core processors (e.g. 60 real threads, 120 with HT), and 1024GB / 1TB of memory costs about $50,000 USD. To go to 1.5TB of memory pushes you to $60,000 USD.

Expensive in a relative to a low-end server or month of cloud usage, but that's an absurd amount of computational power.


> Complexity, sure. But cost? I thought a single 1 TB RAM server is more expensive than 10x 100 GB RAM servers.

Sure, but in the latter case you'd also have to pay for the manpower to build a cluster solution out of a formerly simple application. And people are usually more expensive than servers.


Apart from what other commenters already said about the cost of software complexity, is there a variant of Amdahl's law that could be used here? 10x 100 GB servers working on a problem together will probably never be 10x faster than 1x 100 GB server. Perhaps just the increase in the order of magnitude of the distance information needs to travel is already sufficient to set some higher bounds... So you may need to buy more than 10 of them to match 1x 1 TB server.


But you're assuming that memory is the only thing that matters here.

10x100GB will have 10x the computing power of 1x1TB server.


Ah, yes. Thank you!

Although, that means the 10x setup must cost much more. I think the idea in the comments above was taking 10 cheaper, weaker servers and somehow coming out with roughly the same price...

Well, in any case, things just got too complicated :)


SoftLayer lets you rent a server with 512 GB ram directly from their order form. (Monthly price for the ram is $1,444.00; a dual xeon 2620 server you can put it in is $380/month). It's baremetal, not virtual, but you can file tickets with them for any hardware stuff that comes up.

If you work outside the order form, you can get 768 GB, too. 1 TB is possible with their haswell servers, but availability seems limited.


10 servers with 100G will use a lot more power and will require distribution of your algorithm right along with your dataset, so instead of 10 server you will probably end up with a pretty high multiple of 10.


Isn't this what the cloud is for?

Surely you just rent an instance of the computing power you need for an hour or two then upload the data, a script and wait.


That really depends on your usecase. Not all analysis is 'one-shot' and not all businesses are free to upload their data into 'the cloud'.


Not only that, but uploading a 1TiB dataset implies a certain quality of connection which not all businesses want to take on...


As opposed to 10*100GiB that can be uploaded over a 56.6K modem, right? ;)


It's a reasonable cloud vs. on-premise argument. Obviously the scale of the data transfer to a cloud has more to do with the dataset size than the number of instances.


How often are you planning to do that? The bandwidth it takes to send data that fits in RAM somewhere else is somewhat expensive.


Many companies started with clusters by using retired machines and free software. Going back a few years that would have been a much easier solution to put in place as compared to a capital expenditure for a single top end server.


I love it. I was just doing some Fermi estimates for a friend on the data for a project he has in the pipeline. I was curious whether or not it would be cost efficient for his project's budget to go with NVMe SSDs or have to stick with traditional SATA ones, and turns out it doesn't even matter (for now) because at least the first three months of data will fit in 256GB of RAM, even allowing for a 2.5x factor stemming from some (estimated) inefficient storage or data structure use in a scripting language like Ruby or Python.

Edit: And after those first three months he'll know more about the use and performance demands of the project and will be able to make far more accurate decisions about storage categories.


Where's 2.5x from? I'd be curious to see any actual data on comparing memory footprint for a problem in C/Go/Rust to Python/Ruby. I'm sure it varies widely, but 2.5x might not be far off.


Today I'm working on dataset of 1GB, which fits in memory. But it is not enough. If a variable is category/factor you need to introduce dummy values and your dataset starts picking the weight. Next - do you want apply ML algorithm in parallel? Upst, you need more memory. Done that? Now please use test dataset for prediction. My point that "data in memory" is just the beginning...


The problem with giant boxes (full of RAM / SSD / Disk) is giant failures and huge recovery times. This is worsened in case of RAM because now every power blip is a full on recovery situation. Have a big enough data set focussed on a single box (or two for backup purposes), your customers are going to blow a gasket the moment one of them go down because workloads usually grow to accommodate available capacity.

FB has a nice paper that talks about this problem. https://research.facebook.com/publications/300734513398948/x...


Well, you wouldn't run such a server without a hefty uninterruptable power supply system. On your bigger server you can expect a smaller frequency of failures due to fewer points of failure, and can make your system more resilient (rendundant RAM, filesystems, power etc).


More accurate title would be fit in RAM of single machine.

Maybe some bonus category:

0. Spreadsheet is all you need.

1. Python script is good enough.

2. Java/Scala is way to go.

3. Need to manage memory (gc doesn't cut), some custom organization.

4. Actually needs a cluster.


> 0. Spreadsheet is all you need.

I HATE when people use Spreadsheets to do anything besides simple math.

http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...

TL:DR your work is not reproducible and we can't see what you did to get to your numbers. A million examples of why this is bad.

Also

> 1. Python script is good enough

You mean Python with pandas and numpy?

I use R which is also a great choice

> 2. Java/Scala is way to go.

For you but the vast majority of Data Scientist don't use either and their choice for people is not universal. Julia looks like a great new comer. I again mainly use R.

> 3 & 4 are good points.


Sadly, I recall those arguments against spreadsheet computation being made in the early 1990s. People simply won't learn.

Ray Panko, University of Hawaii.

http://panko.shidler.hawaii.edu/SSR/

http://panko.shidler.hawaii.edu/SSR/

Goes back to 1993:

http://panko.shidler.hawaii.edu/SSR/auditexp.htm


There is even a European Spreadsheet Risks Interest Group: http://www.eusprig.org/


Ad 0. I agree. Your article got valid point. I wouldn't do serious research based solely on complicated spreadsheet.

Though in many non-techies things, like daily sales transactions it is a way to go.

Ad 1. pandas/numpy would put it on par with 2.

Ad 2. Would disagree. I know data scientist using Spark. Mostly they like Scala API.

In general, everyone got their favorite weapon of choice and what they feel comfortable. The point is that simpler solutions sometimes are just enough do their job.

Renting r3.4xlarge on AWS for an hour and play with your favorite tool may be an orders of magnitude easier/cheaper/faster than using big data solution.


Arguably Numpy/Pandas is just as performant as Scala/Java and it certainly beats R hands down when data becomes more than a say 10-20 gigabytes after which I find R slows to a crawl.


Untrue about the speed of R. R and Python are always around the same speed, but there are always other options specially with R, where there is always more than one way to do anything.

We have data.tables and dplyr which data.tables is maybe on average 50% faster and on some points multiple faster than Python [http://datascience.la/dplyr-and-a-very-basic-benchmark/]


  > mm <- matrix(rnorm(1000000), 1000, 1000)
  > system.time(eigen(mm))
     user  system elapsed 
     5.26    0.00    5.25 



   IPy [1] >>> xx = np.random.rand(1000000).reshape(1000, 1000)

   IPy [2] >>> %timeit(np.linalg.eig(xx))
  1 loops, best of 3: 1.28 s per loop
But where R really stinks is memory access:

  > system.time(for(x in 1:1000) for(y in 1:1000) mm[x, y] <- 1)
     user  system elapsed 
     1.09    0.00    1.11 

   IPy [7] >>> def do():
          ...:     for x in range(1000):
          ...:         for y in range(1000):
          ...:             xx[x, y] = 1
          ...:
   IPy [10] >>> %timeit do()
  10 loops, best of 3: 134 ms per loop
Growing lists in R is even worse with all the append nonsense. Exponential time slower.


That's why you never ever grow lists with R. do.call('rbind',...) or even better data.table::rbindlist(). You can't blame R for being slow if you don't know how to write fast R code.


obviously I use do.call all day long because R is my primary weapon, but even if I say so myself, a happy R user, Python with Numpy is faster. I would invite you to show me a single instance where R is faster at bog-standard memory access, than Numpy. My example demonstrates exactly this. Can there be anything simpler than a matrix access? And if that's (8x) slower, everything built on this fundamental building block of all computing (accessing memory) will be slow too. It's R's primary weakness and everybody knows it. Let me make it abundantly clear:

  > xx <- rep(0, 100000000)
  > system.time(xx[] <- 1)
     user  system elapsed 
    4.890   1.080   5.977 

  In [1]: import numpy as np
  In [2]: xx = np.zeros(100000000)                                               
  In [3]: %timeit xx[:] = 1
  1 loops, best of 3: 535 ms per loop
If the very basics, namely changing stuff in memory, is so much slower, then the entire edifice built on it will be slower too, no matter how much you mess around with do.call. And to address the issue of (slow, but quickly expandable) Python lists, recall that all of data science in Python is built on Numpy so the above comparisons are fair.


> you mean Python with pandas and numpy?

Actually, I would bet that some 50% of time people are importing numpy or pandas they really don't need it

Like for calculating the square root of a number. Or the average of a short list


I am affraid the in our research lab we didn't have 10 000$ up front/200$ a month to get a pc with 1TB ram ... we did have a large computer hall and BOINC though :)


Looks like 1.5TB RAM with 15 cores costs $50K. But it shouldn't be just about RAM. The problems I'm working on requires 250 cores on similar amount of data. If there was an option to get say 150 cores with 2TB RAM, things would fly for sure.


Another 4 to 6 years and that should be a reality.


4-6 months and you'll have a Knight's Landing Xeon Phi with at least 72 cores and 288 hardware threads, with vector instructions, and you'll be able to stick 3 of them in single blade.


Seems a bit naive, saying 2.1PB probably doesn't fit in ram, "but it could"...

I get who this is aimed at, and why, but just saying that it fits in RAM isn't as useful as it could be. This is an opportunity to teach, not just snark.


A bit of clever reformatting of your data and 2.1 Pb could probably easily be reduced in size to something that would fit in RAM. Are you actually needing every byte?


It wasn't intended to snark - I apologize if it was seen that way. I whipped this up super quick and perhaps should have expanded on my meaning.


Like, "To fit 2.1PB in RAM, you could spin up 9 r3.8xlarge EC2 instances for US$3.15 per hour"?


Outside of scale-out, scale-up there is also solution: scale-in. Optimize your memory usage, so your data occupies less space.

I work on something like that.


So this seems to use 6.144TiB as the limit that will fit in RAM. That's 1.536TiB x 4 when using the latest Xeon I could find[1]. According to the specs though you should be able to use 8, so the total limit should actually be 1.536 x 8 = 12.288 TiB. 12TiB of RAM, that's quite amazing.

[1] http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-...


It seemed to use 6.000000000000000444089209850...??? TiB when I tried values.


It seems to use different values of cutoff depending on if you are using MiB/GiB/TiB/etc. I tested with GiB and 6144 is OK, 6145 is not.


Glad to see I wasn't the only one to test the limits. ?


I think its wrong. It says 64 TB does not fit in RAM, but you can get 64TB machines from SGI as well 32 TB ones from Oracle.

The SGI one with up to 2048 cores are larger in their single system images than most people have in their clusters.

The benefit of these systems is not really the ease of programming but the speed of interconnect.

List price of the Oracle one was 3 million a few years ago. But most of that is actually in the high density dimms. These days I think the price must be lower, but I won't waste my Oracle sales contact time in figuring out what it is today. Of course it will still be expensive, it is an Oracle product after all.

However, an equivalent dell list price cluster of simple 1U boxes (512 6C/64GB ones!) will go for 1.5 million. The fact that to house 512 boxes i.e. 25 racks or so plus networking. Of course you do get 1/3rd more cores than the SGI one.

For many of us that are between the just use a single normal server and yet too small for the google solutions. These big memory solutions from Oracle and SGI can make sense even if they are not the first thing that comes to mind!


Everything fits in RAM if you have the budget for it.


No, the point is that usually fitting things in RAM lowers the budget. So it's well worth doing proper analysis on whether or not you can (a) fit all your data in RAM and (b) if a cluster of machines does not become it's own reason for existence.

Replacing a large number of nodes with a single machine with a lot of RAM is usually a cost savings measure rather than a larger expense (and it saves power too!), and due to a lack of communications overhead and exploitation of the fact that you now have access to all the data in one go you may very well find that your algorithms run much faster.

A distributed solution should be a means of last resort.


What does 6TB of RAM go for these days?


Less than 6 machines with 1T each ;) (Assuming the 6 will operate only on local data and will never need to communicate in which case you may well end up with more than 6).

But seriously: the price of RAM for servers is now ~10$ / G.


That would translate to ~$60k for 6TB of RAM. Plus the cost of the server itself ($10k?)


No, that's a very expensive server, and Dell will charge you a hefty premium for the memory. There are quite a few options that will cost you less than that (of course the maximum capacity will vary).

It would be nice to see an article comparing all the high RAM machines side by side with specs and prices.

The largest machine I have right now will hold 512G and was a run-of-the-mill machine, it was about $5K, I'd expect the more exotic ones to be substantially more expensive but probably not as expensive as the machines linked here.


Can you point me to cheap 64GB LRDIMM Octal rank memory? Dell's prices for this seem to be the market rate, but maybe I don't know where to shop.


I can get Octal rank in bulk for ~$1K so that's $16 or thereabouts / G, still not bad for RAM that is obviously going to be sold in smaller quantities. Note that HP or Dell will probably not be happy if you use 3rd party RAM in their machines (if they didn't pull tricks to make sure only their own stuff works!).

(For contrast, that $1K if you'd spend it on HP branded RAM would not even get you four 16G dual rank units...).


The price of 6TB is more in the range of 1/2 million. 64GB LRDIMM Octal rank is running $4.5K.


You really don't know where to shop ;)

5 seconds of googling gets you:

http://www.allhdd.com/index.php?target=products&mode=search&...

No idea if that will work in that particular server but technically there is no reason why it should not. You can establish a relationship with a distributor to get better prices than those that you can find listed. Those 'call for quotes' things can have two meanings: you're about to be screwed or 'we will give you a better price if you promise to not divulge that we did that'.


If you go by the Dell site linked from the article, a PowerEdge maxed out with 6TB RAM (4 sockets * 24 LRDIMMs * 64 GB) will set you back US$444,000 (or thereabouts). They also list 3.2 TB NVMe PCIe cards for US$11,000 each.


Though to be fair, no larger business would ever pay Dell sticker price. I bet they would sell it to a good client for only 200k ;)


That's the point of the tool! To remind people to compare the cost of fitting the data in ram compared to the cost of not putting it in ram.


Taken from the github repository:

var MAX_SENSIBLE = 6 * TB; function doesMyDataFitInRam(dataSize) { return dataSize <= MAX_SENSIBLE; }


And if your data doesn't fit into the RAM of a single machine you can buy a few more and use vSMP (http://www.scalemp.com/) to create a shared memory single system image.


In my opinion the correct answer is 255Gb. (i.e. AWS r3X8 High Memory instances ).

While one can purchase servers with larger memory most likely you will run into limitation on number of cores. Also note that there is at least some overhead in processing data, so you would need at least 2X the size of raw data.

Finally while its a good thing to tweet, joke about and make fun of buzzword while trying to appear smart. The reality is that purchasing such servers (> 255 Gb RAM) is costly process. Further you would ideally need two of them to remove single point of failure. it is likely that the job is batch and while it might take a terabyte of RAM you only need to run it once a week, in all these cases you are much better off relying on a distributed system where each node has very large memory, and the task can easily split. Just because you have cluster does not mean that each node has to be a small instance (4 processors ~16 Gb RAM).


> Further you would ideally need two of them to remove single point of failure.

That's assuming that everything needs to be 'high availability' and buying two of everything is a must. This is definitely not always the case. In plenty of situations buying a single item and simply repairing it when it breaks is a perfectly good strategy.


Its not about having two of everything at all times, but rather about having a capacity whenever you need it. At 244Gb you hit a sweet point where you can have access to large capability at a flexible price (Spot Market / On Demand / On Premise). This is what separates engineers with business acumen from run of the mill "consultants" with a search engine.


You mentioned 'single point of failure'.


"Your data fits in RAM", vs "Your data fits in RAM on around X machines", would be better. Any dataset fits in RAM.... but if its going to take more machines then I am willing to buy it really doesn't.


Before core, there was tape. Tape used to be backup medium, then disk became the new tape. Bubble memory begat SSD, so memory has in some sense become the new disk.

RAM is the new disk: now for some, later for others.


"Yes, your data fits in RAM... if you feel like buying a server at the same price as 3 Tesla Model S automobiles, a mansion in the Southern U.S., or a bachelor pad in San Francisco."


Any point on the stupidly big ass font? It does not fit in my screen.


BUT IT FITS IN RAM!


HP hints its new memristor memory computer will have the cost of flash and the speed of registers. An will mostly eliminate the multi-level memory hierarchies we have today.


Unlikely; the limiting factor is already distance - poor scaling from interconnects (wires) already means that we can't have all that much global state. This might increase the amount of state we can have, but unless you can fit gigabytes into a single chip you won't be eliminating the multi level memory hierarchy.

Like right now the L1 cache will have latencies of 1 or 2 cycles, and the L2 cache 15; this is due to the overheads of cache coherency protocols, moving the data around the chip; it's not that the memory's slower, it's all SRAM.

They are probably referring to enterprise workloads. Here you have large working sets (so caches are less useful) and you want maximum throughput. Clever multithreading (finegrained) can reduce effective latency by scheduling many (32?) processes at the same time, executing an instruction from each in round-robin fashion (see Sun Niagara). In that case, you can sometimes dump the L1 cache, and you would be able to get rid of the memory hierarchy.

There's also probably a benefit wrt hard drives/secondary storage; you can obviously make system storage very fast, which might improve random access times considerably. BUT this is probably not going to be transformative; it'll improve certain types of accesses, but current algorithms are already very highly tuned to spatial and temporal locality of reference. Furthermore, you'll still see these structures win out, because they can take advantage of hardware prefetching more easily.


The property of memristors having real values instead of 0 or 1, and the fact that their value can be path dependent, leads me to think that at least information density can be increased over conventional memory today.


Cute but "Big Data" is really just data that's not in the building and isn't feasible to just move around from one machine to another in your department.


Even if your data doesn't fit in RAM... and even if it does... when you're developing, you should be using a sample of your data that fits into RAM.


This is good marketing, but you know what would be even better marketing? Give me access to that server for a week. Let me setup a demo of my biggest customer, and then run my tasks. We've started (and are in progress) of investing thousands of dollars in moving to Azure. A server this large is not something I can buy, and experiment on easily. Hard numbers would convince my superiors that its a better solution, but they're not going to give me $10k to do the experiment.


That's how accidents are made. If you can't spend a small fraction of the budget for the solution to experimentally verify that it is in fact the optimum solution you may very well be leading the company down a road that will cost them significantly more. It's not up to the writer of the article to provide you with the tools to run your least-cost-analysis, that's up to you and your bosses! (After all, you're the beneficiaries.)


10k? Those sticks of RAM alone will cost you something like 75k USD. Then you'll need the processors, arguably 4 of the top of the line 18-core XEONs at 5000 USD each. Then you'll need to put it all together with software and a (properly cooled) rack, not to mention the terminal(s) to access it, plus the personnel to put this baby together for you. This box could easily cost you 150 grand.


Its not cost effective to use non E5-class Xeons, or go above 32GB DIMMs right now.... So you want a Dual-Processor setup, 16 DIMM slots, so 16x 32GB = 512GB w/ Dual Proc -- which you can do for about $10,000.


That's a very nice piece of kits for 10k I have to say. Thanks for the "sweetspot" price/perf advice. Seems like excellent value. I've had my heart on a badass mac pro but these specs put it to shame.


If I select 1KB, why does the link point to an HP server with up to 6TB of RAM? Linking to an 80's PC seems more appropriate :)


Wow, I wish I had the spare change for one of these beasts. I think I have enough NP-hard problems to fill any RAM to the brim :)



That's like saying "you can fly first class".

If you don't have money, you can't. Very few people can afford it.


It's more akin to saying "if you're looking at buying several economy tickets to go from A to B, a first class ticket on a direct flight might be cheaper and faster than stitching together several economy tickets"


Although we can theoretically handle up to 2^64 bytes of RAM (16 exabytes), the practical limit is much lower. I think someone on Wikipedia said it's somewhere around 8TB, but I imagine the performance of random access into 8TB RAM is much worse than a motherboard designed for up to 32GB RAM.

It's not as easy as just buying more RAM. You'll have to pay more attention to how you make use of the various caches in between your CPU and RAM.


I imagine that on a motherboard with 96x RAM slots, the access time between the first one in row and the last one will be actually quite different, due to the physical distance between them.


Sorry for the simple question but if you store it in ram what is the strategy for when the server is turned off?


The idea is more that when you process data, if you can fit it all in memory (and you don't need lots of CPU power, etc, etc, etc) then just use one machine and don't worry about "clusterising" it.

If you're expecting growth in the size of your dataset (beyond growth in RAM size availability), then, well, maybe don't just use a single machine. Same goes for a whole bunch of similar "it's too large for a single machine" considerations.

Storing data should probably still be persisted to disk, and backed up.


You turn it back on, and load it back from the hard drive.


There are multiple strategies that are usually handled by the database that you use. For some databases a hard power off will lose the uncommitted data, for more durable ones it waits until the write is confirmed.

Generally though, these posts are geared towards machine learning people that don't really have "live" data as frequently.


This all depends on what the data is used for. You may need to persist the data to disk on write even if all your data is in RAM.


Damn. my data is 6597069766657 bytes. Apparently if it was 6597069766656 bytes it would have fitted in RAM.


Well, hate to break it to you, but you probably have some overhead associated with your data, like your operating system or structures related to processing your data.


Our data fits in ram but it proved to have no speed benefit. So the ram just sits there, being empty.


Redis as a primary data store!


600 PiB "No, it probably doesn't fit in RAM (but it might)."

Well, well, well.


What is the point of this site other budget shaming?


Great googly moogly! terabytes of ram!


But does it fit in the L1 cache?


If you are programming in R, you sure better hope it does!


After reading the title I was sure there was something about R in the comments.

You can program R in Spark you can now program in R http://blog.revolutionanalytics.com/2015/01/a-first-look-at-...

Now you can work directly with SQL Server as announced this week by MS. http://www.computerworld.com/article/2923214/big-data/sql-se...

I have had a ton of arguments about R's "biggest weakness" being that it uses RAM. I haven't once in the almost 3 years of working in R that I ran into this road block, but I am sure others have. Which there are several good distributed choices that will keep getting better and better.

Using RAM instead of Distributed is better in R as well as really any other language in terms of complexity and flexibility.


For my workloads, R has always choked on its single thread long before it choked on memory. And the parallelism options are terrible hacks.


Brilliant!


6.000000000000000444 TiB


or you learn about data structures and algorithms and try to need less :-); Randomized algorithms for example are intriguing.


We build object stores... so, no it most definitely does not.


I sincerely hope nobody is using a tool like this to decide which enterprise servers to buy...


Me too. The links are mostly to back up my claim rather than as a suggestion of servers to buy (or I'd have found some affiliate links!)




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: