It's probably worth extending "Your data fits in RAM" to "Your data doesn't fit in RAM, but it does fit on an SSD". So many problems will still work with quite reasonable performance when using an SSD instead. By using a single machine with an array of SSDs, you also avoid the complexity and overhead of distributed systems.
My favourite realization of this: Frank McSherry shows how simplicity and a few optimisations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, both space and speed, to process the largest publicly available graph dataset - terabyte sized with over a hundred billion edges - all on his laptop's SSD.
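For intuition, a toy single-threaded sketch of that access pattern (streaming a binary edge list from disk on every iteration instead of sharding the graph across a cluster) might look like the Python below. The file layout and node count are made up for the example, and this is nothing like McSherry's far more optimized implementation:

    # Stream a binary edge list (u32 src, u32 dst pairs) from disk once per
    # PageRank iteration; only the rank vectors live in RAM.
    import struct

    NODES = 1_000_000        # assumed node count for this sketch
    ALPHA = 0.85             # damping factor
    ITERATIONS = 20

    def pagerank(edge_file, out_degree):
        ranks = [1.0 / NODES] * NODES
        for _ in range(ITERATIONS):
            new_ranks = [(1.0 - ALPHA) / NODES] * NODES
            with open(edge_file, "rb") as f:
                while chunk := f.read(8 * 1_000_000):   # ~8 MB of edges at a time
                    for src, dst in struct.iter_unpack("<II", chunk):
                        if out_degree[src]:
                            new_ranks[dst] += ALPHA * ranks[src] / out_degree[src]
            ranks = new_ranks
        return ranks

Pure Python is obviously much slower than the real thing, but the shape is the point: one sequential pass over fast storage per iteration, no cluster coordination.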
This is a classic case of "Algorithm/Problem Selection": if your algorithm/problem is tailored to a task such as PageRank, then of course a single-threaded, highly optimized program will beat a cluster designed for ETL tasks. In real organizations where there are multiple workflows/algorithms, distributed systems always win out. Systems like Hadoop take care of administration, redundancy, monitoring, and scheduling in a manner that a single machine cannot. Sure, you can grep faster on a laptop than on AWS EMR with 4 medium instances, but in reality, where you have 12 types of jobs run by a team of 6 people, you are much better off with a distributed system.
Ditto for computationally intensive work: if it is CPU dominated, more CPUs calculating in parallel will be an advantage, even if the data could fit in RAM.
There's no single simple answer, but sure, whenever fewer computers are enough, fewer should be used.
The recent problem is that some people love "the cloud" so much these days that they push work there that could really be done locally.
Part of the problem is that a lot of problems that are CPU dominated on a single system become I/O dominated once you start distributing them, even at very low node counts, unless you pay very careful attention to detail.
Yes, and starting from the distributed version first, one can even miss the fact that everything can be immensely faster once the communication is out of the way.
There's no single simple answer, except: don't decide on the implementation first; instead, competently and without prejudice evaluate your actual possibilities.
Also, don't decide on a language like Python or Ruby first. If you expect the calculations to take a month, code them in Ruby, and wait the month for the result, you can miss the fact that you could have had the results in one day just by using another language, most probably without sacrificing the readability of the code much. Only if the task is really CPU-bound, of course.
On the other hand, if you already have a working solution in Python and you'd need a month to develop a solution in another language, the first thing to consider is how often you plan to repeat the calculations afterwards. Etc.
I'd also toss in ease of correctness if you don't have a really well-known problem with a good test suite: I've seen a number of cases where people developed “tight” C code which gave results which appeared reasonable and then ran it for months before noticing logic errors which would either have been impossible or easier to find in a higher-level language or, particularly, one which had better numeric error handling characteristics.
There is a lot to be said for confirming that you have the right logic and then looking into something like PyPy/Cython/etc. or porting to a lower-level language.
> one which had better numeric error handling characteristics
Which language do you suggest which matches this?
I don't know many people who use the phrase "numeric error handling characteristics", but I do know the benefit of already having well-written high-level primitives (or libraries).
The first example which comes to mind is integer overflow, which is notoriously easy to miss when testing with small or insufficiently representative data. Languages like Python which use an arbitrary-precision system will avoid incorrect results at the expense of performance, and I believe a few other languages will raise an exception to force the developer to decide the correct way to handle them.
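As a made-up illustration of that overflow point, here is the same sum done with Python's arbitrary-precision integers and with 32-bit wraparound emulated by masking (the byte counts are invented):

    # Python ints never wrap; the masked version shows what signed 32-bit
    # arithmetic would silently compute for the same (invented) inputs.
    byte_counts = [1_500_000_000, 1_200_000_000, 900_000_000]

    exact = sum(byte_counts)                 # 3600000000

    def to_int32(x):
        """Wrap x the way a signed 32-bit integer would."""
        x &= 0xFFFFFFFF
        return x - 0x100000000 if x >= 0x80000000 else x

    wrapped = 0
    for n in byte_counts:
        wrapped = to_int32(wrapped + n)      # ends up at -694967296

    print(exact, wrapped)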
The other area I've seen this a lot is caused by the way floating point math works not matching non-specialists’ understanding – unexpectedly-failing equality checks, precision loss making the results dependent on the order of operations, etc. That's a harder problem: switching to a Decimal type can avoid the problem at the expense of performance but otherwise it mostly comes down to diligent testing and, hopefully, some sort of linting tool for the language you're using.
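Both failure modes are easy to demonstrate in any language with IEEE 754 doubles; in Python:

    from decimal import Decimal
    import math

    # Equality checks that "obviously" should pass
    print(0.1 + 0.2 == 0.3)                # False
    print(math.isclose(0.1 + 0.2, 0.3))    # True -- compare with a tolerance instead

    # Results that depend on the order of operations
    big, tiny = 1e16, 1.0
    print((big + tiny) - big)              # 0.0 -- the 1.0 was absorbed
    print((big - big) + tiny)              # 1.0 -- same values, different order

    # Decimal trades speed for predictable decimal behaviour
    print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True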
The entire reason for the popularity of distributed systems is that application developers in general are very bad at managing I/O load. Most developers only think of CPU/memory constraints, but usually not about disk I/O. There's nothing wrong with that: if your services are stateless, then the only I/O you should have is logging.
In a stateless microservice architecture, disk I/O is only an issue on your database servers. Which is why database servers are often still run on bare metal as it gives you better control over your disk I/O - which you can usually saturate anyway on a database server.
In most advanced organizations, those database servers are often managed by a specialized team. Application servers are CPU/memory bound and can be located pretty much anywhere and managed with a DevOps model. DBAs have to worry about many more things, and there is a deeper reliance on hardware as well. And it doesn't matter which database you use; NoSQL is just as finicky, as a few of my developers recently learned when they tried to deploy a large Couchbase cluster on SAN-backed virtual machines.
For web services, distributed usually makes sense, because the resource usage of a single connection is rarely high. It's largely uninteresting to consider huge machines for that use case for the reasons you outline.
But that's not really what the discussion is about. In the web services case, your dataset generally fits in memory because it is tiny. You don't need large servers for that, most of the time. Even most databases people have to work with are relatively small or easily sharded.
In the context of this discussion, consider that what matters is the size of the dataset for an individual "job". If you are processing many small jobs, then the memory size to consider is the memory size of an individual job, not the total memory required for all jobs you'd like to run in parallel. In that case many small servers is often cost effective.
If you are processing large jobs, on the other hand, you should seriously consider if there are data dependencies between different parts of your problem, in which case you very easily become I/O bound.
It has two flaws. One, it has nVidia Optimus (which just sucks; whoever implemented it should be shot).
Two, the I/O is not that good, even with a 7200 RPM disk, and Windows 8.1 makes it much worse (Windows for some reason keeps running its anti-virus, Superfetch, and other disk-intensive stuff ALL THE TIME).
This is noticeable when playing emulated games: games that run in an emulator, even cartridge ones, need to read from the disc/cartridge on the real hardware, and on the computer they need to read from disk quite frequently. The resulting slowdowns, especially during cutscenes or map changes, are very noticeable.
The funny thing is: I had old computers with poorer CPUs (this one has an i7) that could run those games much better. It seems even hardware manufacturers forget I/O.
I can bet that in your case it's not actually the disk I/O; unless you have some very strange emulator, the file access should still go through the OS disk cache. VMware, for example, surely benefits from it. How many GB do you have on the machine? How much is still free when you run the emulator and the other software you need?
I noticed it as disk I/O because I would leave Task Manager running on a second screen, and every time the game lagged, memory and CPU use were below 30% while the disk was at 100%. When I sorted by disk usage, the first places were the Windows stuff and, after them, the emulator.
> There's no single simple answer, but sure, whenever fewer computers are enough, fewer should be used.
This. If you're using this website as your sole input to the decision, then something has gone seriously wrong.
It was more intended to provoke discussion and thinking around overengineering things which can easily be done with, say, awk or a few lines of <scripting language>.
If you have large CPU requirements then sure, use Hadoop/Spark/some other distributed architecture. If you have a >6TB dataset but you only need 1GB of it at a time, then, well, maybe you still only need one computer.
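For the ">6TB dataset, 1GB at a time" shape of problem, the single-machine version really can be a few lines of a scripting language. A sketch in Python, with an invented log file and field layout:

    # Stream a (hypothetical) huge access log and count status codes, never
    # holding more than one line in memory.
    from collections import Counter

    counts = Counter()
    with open("huge_access_log.txt", encoding="utf-8", errors="replace") as f:
        for line in f:                   # read lazily, line by line
            fields = line.split()
            if fields:                   # skip blank lines
                counts[fields[-1]] += 1  # assume the last field is the status code

    for status, n in counts.most_common(10):
        print(status, n)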
In most cases it's probably better to have one machine each for different jobs; then all the CPU on every machine can be used for the current job. There are probably very few jobs that take an extreme amount of CPU without also using a lot of data, which would have to be moved around a cluster anyway.
Some things take 1 second on one beefy machine or 6 hours on a cluster due to latency issues. And yes, I do mean 20,000 times slower, though 2,000x is far more common, latency inside a datacenter being around ~500,000 ns vs ~100 ns for main memory vs ~0.5 ns for L1 cache.
PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.
A step between RAM and SSD could be "Your data fits in RAM in compressed form". LZ4 compression takes 3-4x longer than memcpy. LZ4 decompression is only 50% [1] slower. 2-3 GB/s per core.
Basically, working through those numbers, using compressed data in RAM is ~1 order of magnitude slower to get into memory, and less than 1 order of magnitude slower to read from RAM. That is still around 2 (or more, depending on setup) orders of magnitude faster than the hit of going to an SSD / SSD RAID.
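A minimal sketch of the "compressed in RAM" pattern, assuming the third-party lz4 bindings for Python (pip install lz4); the actual ratios and throughput depend entirely on your data:

    # Keep a large blob LZ4-compressed in RAM and decompress only on access.
    import lz4.frame

    raw = open("big_table.bin", "rb").read()     # hypothetical dataset
    packed = lz4.frame.compress(raw)
    print(len(raw), "->", len(packed), "bytes")

    # Later, pay the (fast) decompression cost only when the data is needed:
    restored = lz4.frame.decompress(packed)
    assert restored == raw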
Scalable graph systems have been around for many years and have never taken off.
Most businesses doing big data (like ours) often have multiple disparate data sources that at the start of the pipeline are ETL'd into some EDW. Trying to consolidate them into a single integrated view is very difficult and time/resource intensive. Having billions of disconnected nodes in the graph would be very hard to reason about.
I feel compelled to point out that the average SSD has an order of magnitude (or maybe 2 orders) more IOPS than a 6-disk 15k RAID6 or RAID10 array.
And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.
People are doing that. If you look at those PCIe "SSDs", most of them are effectively multiple SSDs plus a controller on a card, combined with either software or hardware RAID-0. E.g. some of the OCZ cards show up as 4 individual units under Linux, while the Windows drivers, at least by default, show them as one device.
But yes, it is expensive if you measure per GB. If your dataset is small, though, and you are more interested in cost per IO operation, they can be very cheap.
I've built boxes for computationally intensive workloads that use these PCIe SSD cards. The IO is STUPID fast, considering you're bypassing the SATA chipset and are limited only by PCIe speeds.
You don't get a large amount of space (I believe the cards I installed were only 64 or 128GB, but this was ~4 years ago), but for small datasets that you need extremely fast access to, they get the job done.
We have a few 480GB ones, and yes, they are stupid fast - we're getting 1GB/sec writes easily. You can get multiple TB now if you can afford it (can't justify it for our uses...)
We run 8-SSD RAID 10 for all our database nodes. You still need the same metric ton of RAM you'd need before, but if you have an I/O-heavy load [which almost all databases do] you can cut down on the total number of machines compared to SATA.
SSD prices are about as high as HDDs have traditionally been. The difference is that now the bottom is falling out of the HDD market and you can get 3-5 2TB HDDs for the price of a 256GB SSD. So lots of devices and appliances use a single SSD for read- and write-caching, backed by a bank of high-capacity HDDs.
The Cyrus IMAP server has a "split metadata" mode which supports this configuration. Very nice if you are running a mail server of any significant size.
* VMWare uses this as the root of their vSAN technology.
* Nimble and other storage appliance manufacturers use a bank of SSDs to cache frequently-read data, and have layers of fast- and slow-HDDs to store data relative to its access frequency.
[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...