It's probably worth extending "Your data fits in RAM" to "Your data doesn't fit in RAM, but it does fit on an SSD". So many problems will still work with quite reasonable performance when using an SSD instead. By using a single machine with an array of SSDs, you also avoid the complexity and overhead of distributed systems.
My favourite realization of this: Frank McSherry shows how simplicity and a few optimisations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, both space and speed, to process the largest publicly available graph dataset - terabyte sized with over a hundred billion edges - all on his laptop's SSD.
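For intuition, a toy single-threaded sketch of that access pattern (streaming a binary edge list from disk on every iteration instead of sharding the graph across a cluster) might look like the Python below. The file layout and node count are made up for the example, and this is nothing like McSherry's far more optimized implementation:

    # Stream a binary edge list (u32 src, u32 dst pairs) from disk once per
    # PageRank iteration; only the rank vectors live in RAM.
    import struct

    NODES = 1_000_000        # assumed node count for this sketch
    ALPHA = 0.85             # damping factor
    ITERATIONS = 20

    def pagerank(edge_file, out_degree):
        ranks = [1.0 / NODES] * NODES
        for _ in range(ITERATIONS):
            new_ranks = [(1.0 - ALPHA) / NODES] * NODES
            with open(edge_file, "rb") as f:
                while chunk := f.read(8 * 1_000_000):   # ~8 MB of edges at a time
                    for src, dst in struct.iter_unpack("<II", chunk):
                        if out_degree[src]:
                            new_ranks[dst] += ALPHA * ranks[src] / out_degree[src]
            ranks = new_ranks
        return ranks

Pure Python is obviously much slower than the real thing, but the shape is the point: one sequential pass over fast storage per iteration, no cluster coordination.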
This is a classic case of "Algorithm/Problem Selection": if your algorithm/problem is tailored to a task such as PageRank, then of course a single-threaded, highly optimized program will beat a cluster designed for ETL tasks. In real organizations where there are multiple workflows/algorithms, distributed systems always win out. Systems like Hadoop take care of administration, redundancy, monitoring, and scheduling in a manner that a single machine cannot. Sure, you can grep faster on a laptop than on AWS EMR with 4 medium instances, but in reality, where you have 12 types of jobs run by a team of 6 people, you are much better off with a distributed system.
Ditto for computationally intensive work: if it is CPU dominated, more CPUs calculating in parallel will be an advantage, even if the data could fit in RAM.
There's no single simple answer, but sure, whenever fewer computers are enough, fewer should be used.
The recent problem is that some people love "the cloud" so much these days that they push work there that could really be done locally.
Part of the problem is that a lot of problems that are CPU dominated on a single system become I/O dominated once you start distributing them, even at very low node counts, unless you pay very careful attention to detail.
Yes, and starting from the distributed version first, one can even miss the fact that everything can be immensely faster once the communication is out of the way.
There's no single simple answer, except: don't decide on the implementation first; instead, competently and without prejudice evaluate your actual possibilities.
Also, don't decide on a language like Python or Ruby first. If you expect the calculations to take a month, code them in Ruby, and wait the month for the result, you can miss the fact that you could have had the results in one day just by using another language, most probably without sacrificing the readability of the code much. Only if the task is really CPU-bound, of course.
On the other hand, if you already have a working solution in Python and you'd need a month to develop a solution in another language, the first thing to consider is how often you plan to repeat the calculations afterwards. Etc.
I'd also toss in ease of correctness if you don't have a really well-known problem with a good test suite: I've seen a number of cases where people developed “tight” C code which gave results which appeared reasonable and then ran it for months before noticing logic errors which would either have been impossible or easier to find in a higher-level language or, particularly, one which had better numeric error handling characteristics.
There is a lot to be said for confirming that you have the right logic and then looking into something like PyPy/Cython/etc. or porting to a lower-level language.
> one which had better numeric error handling characteristics
Which language do you suggest which matches this?
I don't know many people who use the phrase "numeric error handling characteristics", but I do know the benefit of already having well-written high-level primitives (or libraries).
The first example which comes to mind is integer overflow, which is notoriously easy to miss when testing with small or insufficiently representative data. Languages like Python which use an arbitrary-precision system will avoid incorrect results at the expense of performance, and I believe a few other languages will raise an exception to force the developer to decide the correct way to handle them.
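As a made-up illustration of that overflow point, here is the same sum done with Python's arbitrary-precision integers and with 32-bit wraparound emulated by masking (the byte counts are invented):

    # Python ints never wrap; the masked version shows what signed 32-bit
    # arithmetic would silently compute for the same (invented) inputs.
    byte_counts = [1_500_000_000, 1_200_000_000, 900_000_000]

    exact = sum(byte_counts)                 # 3600000000

    def to_int32(x):
        """Wrap x the way a signed 32-bit integer would."""
        x &= 0xFFFFFFFF
        return x - 0x100000000 if x >= 0x80000000 else x

    wrapped = 0
    for n in byte_counts:
        wrapped = to_int32(wrapped + n)      # ends up at -694967296

    print(exact, wrapped)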
The other area I've seen this a lot is caused by the way floating point math works not matching non-specialists’ understanding – unexpectedly-failing equality checks, precision loss making the results dependent on the order of operations, etc. That's a harder problem: switching to a Decimal type can avoid the problem at the expense of performance but otherwise it mostly comes down to diligent testing and, hopefully, some sort of linting tool for the language you're using.
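Both failure modes are easy to demonstrate in any language with IEEE 754 doubles; in Python:

    from decimal import Decimal
    import math

    # Equality checks that "obviously" should pass
    print(0.1 + 0.2 == 0.3)                # False
    print(math.isclose(0.1 + 0.2, 0.3))    # True -- compare with a tolerance instead

    # Results that depend on the order of operations
    big, tiny = 1e16, 1.0
    print((big + tiny) - big)              # 0.0 -- the 1.0 was absorbed
    print((big - big) + tiny)              # 1.0 -- same values, different order

    # Decimal trades speed for predictable decimal behaviour
    print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True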
The entire reason for the popularity of distributed systems is that application developers in general are very bad at managing I/O load. Most developers only think of CPU/memory constraints, but usually not about disk I/O. There's nothing wrong with that: if your services are stateless, then the only I/O you should have is logging.
In a stateless microservice architecture, disk I/O is only an issue on your database servers. Which is why database servers are often still run on bare metal as it gives you better control over your disk I/O - which you can usually saturate anyway on a database server.
In most advanced organizations, those database servers are often managed by a specialized team. Application servers are CPU/memory bound and can be located pretty much anywhere and managed with a DevOps model. DBAs have to worry about many more things, and there is a deeper reliance on hardware as well. And it doesn't matter which database you use; NoSQL is just as finicky, as a few of my developers recently learned when they tried to deploy a large Couchbase cluster on SAN-backed virtual machines.
For web services, distributed usually makes sense, because the resource usage of a single connection is rarely high. It's largely uninteresting to consider huge machines for that use case for the reasons you outline.
But that's not really what the discussion is about. In the web services case, your dataset generally fits in memory because it is tiny. You don't need large servers for that, most of the time. Even most databases people have to work with are relatively small or easily sharded.
In the context of this discussion, consider that what matters is the size of the dataset for an individual "job". If you are processing many small jobs, then the memory size to consider is the memory size of an individual job, not the total memory required for all jobs you'd like to run in parallel. In that case many small servers is often cost effective.
If you are processing large jobs, on the other hand, you should seriously consider if there are data dependencies between different parts of your problem, in which case you very easily become I/O bound.
It has two flaws. One, it has nVidia Optimus (which just sucks; whoever implemented it should be shot).
Two, the I/O is not that good, even with a 7200 RPM disk, and Windows 8.1 makes it much worse (Windows for some reason keeps running its anti-virus, Superfetch, and other disk-intensive stuff ALL THE TIME).
This is noticeable when playing emulated games: games that run in an emulator, even cartridge ones, need to read from the disc/cartridge on the real hardware, and on the computer they need to read from disk quite frequently. The resulting slowdowns, especially during cutscenes or map changes, are very noticeable.
The funny thing is: I had old computers with poorer CPUs (this one has an i7) that could run those games much better. It seems even hardware manufacturers forget I/O.
I can bet that in your case it's not actually the disk I/O; unless you have some very strange emulator, the file access should still go through the OS disk cache. VMware, for example, surely benefits from it. How many GB do you have on the machine? How much is still free when you run the emulator and the other software you need?
I noticed it as disk I/O because I would leave Task Manager running on a second screen, and every time the game lagged, memory and CPU use were below 30% while the disk was at 100%. When I sorted by disk usage, the first places were the Windows stuff and, after them, the emulator.
> There's no single simple answer, but sure, whenever fewer computers are enough, fewer should be used.
This. If you're using this website as your sole input to the decision, then something has gone seriously wrong.
It was more intended to provoke discussion and thinking around overengineering things which can easily be done with, say, awk or a few lines of <scripting language>.
If you have large CPU requirements then sure, use Hadoop/Spark/some other distributed architecture. If you have a >6TB dataset but you only need 1GB of it at a time, then, well, maybe you still only need one computer.
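For the ">6TB dataset, 1GB at a time" shape of problem, the single-machine version really can be a few lines of a scripting language. A sketch in Python, with an invented log file and field layout:

    # Stream a (hypothetical) huge access log and count status codes, never
    # holding more than one line in memory.
    from collections import Counter

    counts = Counter()
    with open("huge_access_log.txt", encoding="utf-8", errors="replace") as f:
        for line in f:                   # read lazily, line by line
            fields = line.split()
            if fields:                   # skip blank lines
                counts[fields[-1]] += 1  # assume the last field is the status code

    for status, n in counts.most_common(10):
        print(status, n)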
In most cases it's probably better to have one machine each for different jobs; then all the CPU on every machine can be used for the current job. There are probably very few jobs that take an extreme amount of CPU without also using a lot of data, which would have to be moved around a cluster anyway.
Some things take 1 second on one beefy machine or 6 hours on a cluster due to latency issues. And yes, I do mean 20,000 times slower, though 2,000x is far more common, latency inside a datacenter being around ~500,000 ns vs ~100 ns for main memory vs ~0.5 ns for L1 cache.
PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.
A step between RAM and SSD could be "Your data fits in RAM in compressed form". LZ4 compression takes 3-4x longer than memcpy. LZ4 decompression is only 50% [1] slower. 2-3 GB/s per core.
Basically, working through those numbers, using compressed data in RAM is ~1 order of magnitude slower to get into memory, and less than 1 order of magnitude slower to read from RAM. That is still around 2 (or more, depending on setup) orders of magnitude faster than the hit of going to an SSD / SSD RAID.
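A minimal sketch of the "compressed in RAM" pattern, assuming the third-party lz4 bindings for Python (pip install lz4); the actual ratios and throughput depend entirely on your data:

    # Keep a large blob LZ4-compressed in RAM and decompress only on access.
    import lz4.frame

    raw = open("big_table.bin", "rb").read()     # hypothetical dataset
    packed = lz4.frame.compress(raw)
    print(len(raw), "->", len(packed), "bytes")

    # Later, pay the (fast) decompression cost only when the data is needed:
    restored = lz4.frame.decompress(packed)
    assert restored == raw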
Scalable graph systems have been around for many years and have never taken off.
Most businesses doing big data (like ours) often have multiple disparate data sources that at the start of the pipeline are ETL'd into some EDW. Trying to consolidate them into a single integrated view is very difficult and time/resource intensive. Having billions of disconnected nodes in the graph would be very hard to reason about.
I feel compelled to point out that the average SSD has an order of magnitude (or maybe 2 orders) more IOPS than a 6-disk 15k RAID6 or RAID10 array.
And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.
People are doing that. If you look at those PCIe "SSDs", most of them are effectively multiple SSDs plus a controller on a card, combined with either software or hardware RAID-0. E.g. some of the OCZ cards show up as 4 individual units under Linux, while the Windows drivers, at least by default, show them as one device.
But yes, it is expensive if you measure per GB. If your dataset is small, though, and you are more interested in cost per IO operation, they can be very cheap.
I've built boxes for computationally intensive workloads that use these PCIe SSD cards. The IO is STUPID fast, considering you're bypassing the SATA chipset and are limited only by PCIe speeds.
You don't get a large amount of space (I believe the cards I installed were only 64 or 128GB, but this was ~4 years ago), but for small datasets that you need extremely fast access to, they get the job done.
We have a few 480GB ones, and yes, they are stupid fast - we're getting 1GB/sec writes easily. You can get multiple TB now if you can afford it (can't justify it for our uses...)
We run 8-SSD RAID 10 for all our database nodes. You still need the same metric ton of RAM you'd need before, but if you have an I/O-heavy load [which almost all databases do] you can cut down on the total number of machines compared to SATA.
SSD prices are about as high as HDDs have traditionally been. The difference is that now the bottom is falling out of the HDD market and you can get 3-5 2TB HDDs for the price of a 256GB SSD. So lots of devices and appliances use a single SSD for read- and write-caching, backed by a bank of high-capacity HDDs.
The Cyrus IMAP server has a "split metadata" mode which supports this configuration. Very nice if you are running a mail server of any significant size.
* VMWare uses this as the root of their vSAN technology.
* Nimble and other storage appliance manufacturers use a bank of SSDs to cache frequently-read data, and have layers of fast- and slow-HDDs to store data relative to its access frequency.
[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...