Scalability, but at what cost? (2015) (frankmcsherry.org)
165 points by r-u-serious on June 7, 2016 | 77 comments

I've been doing data analysis and machine learning client work for quite some time now, for clients ranging from a 3-person startup to a department of the Canadian government.

Almost always, numpy matrix math plus Cython, C, or Java on a single machine is enough. Not always-always; but often you can relax requirements slightly: accept a 45-minute lag before new data impacts the model, cache the results of the top 10k most likely queries, put more effort into stripping out the garbage parts of the data, or, sometimes, just throw a $10k-a-month server or a mathematician at the problem (sure is cheaper than a bunch of cheap servers plus a larger infrastructure team).
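The "cache the top 10k queries" trick can be sketched with nothing but the standard library; `score` here is a hypothetical stand-in for an expensive model evaluation, not anything from the comment:

```python
from functools import lru_cache

# Hypothetical stand-in for an expensive model evaluation.
# lru_cache keeps the most recently used entries, which approximates
# "the top 10k most likely queries" when query popularity is skewed.
@lru_cache(maxsize=10_000)
def score(query: str) -> float:
    return float(len(query)) * 0.5  # pretend this takes seconds

score("foo bar")                # computed once
score("foo bar")                # served from the cache
print(score.cache_info().hits)  # 1
```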

The times you need real scalability you know you need it. You'd laugh at how silly someone would be for trying to put it onto one machine. You're solving the travelling salesman problem for UPS (although I can think of some hacks here - I probably can't get it down to a single machine), or you're detecting logos in every Youtube video ever made, or you're working for the NSA.

Even if you know for sure you're going to need scalability, I don't think it hurts to do it on a single box at first. Iterating quickly on the product is more important, and once you have something proven you can get money from the market or from VCs to distribute it.

This is kind of the same argument as microservices.

We could write 30 microservices deployed on 30 docker images with load balancing and FT and all that magic for a basic webapp...

Or we could just write a pretty fast webserver and do it with 1 server. (Or if it is stateless, do it with a few for still a lot less work than a giant microservice cluster).

I think in the last year or so microservices have become a little less cool, and people are more along the lines of "code cleanly so we can microservice if we need to down the road, but don't deploy it like that for 1.0"... seems similar for this.

People forget that the two methods of scaling are HORIZONTAL and VERTICAL. They think: "I can just put some micro-services behind HA proxy and boom, more capacity!".

And then they forget that if they had just modified that one query and tweaked that one for-loop, they could've had the same capacity without launching six new servers with all kinds of potential for the wiring to go down and cause downtime. Plus the dev time to build the services.

I usually think of scaling in 3 methods: horizontal (add more machines), vertical (get a bigger machine) and inward (improve your algorithms).

The largest performance gains I've seen are most often of the inward type -- like multi-thousands of percent, just by rethinking the code or the approach.

Bonus, improving inward scaling multiplies investments in horizontal and vertical scaling.
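As a toy illustration of "inward" scaling (a hypothetical dedup task, not from the thread): swapping a linear membership scan for a hash set turns O(n²) into O(n), which at large n is exactly the multi-thousand-percent kind of win described above:

```python
def dedupe_quadratic(items):
    # O(n^2): linear membership test inside a loop
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def dedupe_linear(items):
    # O(n): hash-based membership test, same output and order
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```

Both return the same result; only the second one survives a million-item input.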

Vertical scaling requires hiring good engineers instead of mediocre ones (additional cost of $100,000s per year across the team). Horizontal scaling in comparison is much, much cheaper for your average CRUD app.

Maybe, maybe not.

For example choosing Java over Ruby would give you 2-10x better perf per server... and I am not sure that Java devs cost any more or less than Ruby devs.

Now we can get into an argument about developer productivity and all that, but from a purely "I want to run 10x more users per server" standpoint, a switch like Java over Ruby gets you a long way.

I talked to a CTO once that said he brought his RoR fleet down from 60 servers to 6-8 by switching to Scala.

When he said "switching to Scala" I reckon he probably meant "rewriting our platform."

It's very difficult to do a comparison like this in practice, because switching languages inherently involves a rewrite of a platform.

I am curious how much a rewrite staying in Ruby could have saved. It's always easier to write code when you have a good specification, and a working application would fulfill that role well.

I was once involved in a large-scale government project that rewrote a Java app to RoR. They went from 50 servers to 10.

It has probably got more to do with the rewrite and the new architecture than whatever language it was written in.

What? It absolutely has to do with the language.

Ruby binary trees: 57 seconds
Scala binary trees: 11 seconds

[1] http://benchmarksgame.alioth.debian.org/u64q/ruby.html
[2] http://benchmarksgame.alioth.debian.org/u64q/scala.html

That would mean something if Ruby apps were 100% Ruby and/or were performing binary tree operations all the time, or doing the kinds of CPU-intensive operations depicted in the alioth benchmarks. But they aren't. Ruby web apps perform lots of string manipulation, memory allocation, and I/O. A lot of expensive things are offloaded to C libraries. Things like XML parsing are offloaded to native libraries like libxml; nobody uses an XML parser fully implemented in Ruby. Ruby does not reimplement gzip compression, it uses zlib. So the alioth benchmarks are not representative of real-world performance.

>> Ruby web apps perform lots of string manipulation, memory allocation, I/O. <<

    string manipulation

    memory allocation

fasta, fasta-redux, reverse-complement write 250MB

regex-dna reads 50MB; k-nucleotide, reverse-complement read 250MB

>> …offloaded to native libraries… So the alioth benchmarks are not representative of real-world performance. <<

The benchmarks game does show C programs ;-)

The benchmarks game does show scripting-languages explicitly using native libraries:


> string manipulation

That benchmark performs string manipulations that rarely occur in web apps. Web apps need: concatenation, substring, find/replace, maybe with regexps. All of those are implemented in C.

> memory allocation

Web apps don't tend to implement entire trees in pure Ruby. That benchmark is completely non-representative of real-world performance.

What exactly are you getting at? Of course it's easy to find a bunch of synthetic benchmarks that show weaknesses in particular cases. Still doesn't prove anything.

> Web apps need: concatenation, substring, find/replace, maybe with regexps. All of those are implemented in C.

join and gsub?


> Web apps don't tend to implement entire trees in pure Ruby.

Nor do other apps but that is what Hans Boehm came up with as a simple GC benchmark.


> What exactly are you getting at?

You don't seem to know what is shown on the benchmarks game website.

Doesn't your regex-dna benchmark kind of prove my point? Just look at the comparisons here: http://benchmarksgame.alioth.debian.org/u64q/performance.php...

C GCC: 2.46 sec

Java: 8.23 sec

Ruby #8: 9.35 sec

It's only about 4x slower than pure C in this case, and only a little slower than Java which has a very good JIT.

> You don't seem to know what is shown on the benchmarks game website.

How funny of you to say that while acting as if the benchmarks "prove" Ruby is the ultimate spawn of the devil that eats away any and all performance. The website itself tells you not to jump to conclusions and that the app itself is the ultimate benchmark: http://benchmarksgame.alioth.debian.org/dont-jump-to-conclus...

>> Doesn't your regex-dna benchmark kind of prove my point? <<

7 days ago, you could have used the data shown on the benchmarks game website to try and make your point to joslin01.

Instead you chose to dismiss the data.

>> acting as if <<

You seem to have confused me with joslin01.

Or the fact that new servers are an order of magnitude faster than the ones they replaced.

Order of magnitude faster? I doubt that. Unless the old ones were really old / slow to begin with. But really, we are both speculating without knowing the facts.

> I am not sure that Java devs cost any more or less than Ruby devs

I don't know about that, how many bootcamps are pumping out Java programmers instead of Python/Ruby programmers?

Well in Chicago some of the colleges are pumping out Java Devs...

I guess if you are comparing a 4 year college Java dev to a 6 week bootcamp Ruby dev you have something, but that seems weird...

If you haven't seen stackoverflows server architecture posts you'd probably enjoy them: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

I still think containers are great, and I really like the abstractions kubernetes provides, but if I ever had enough traffic to worry about scaling, I envision running a small cluster of very powerful computers rather than 100s of weak ones.

So basically, through careful engineering, they can run stackoverflow.com, one of the top-ten sites in the world, on a single web server and a single database server. Beefy machines for sure, but it kind of takes the air out of the web-scale hype balloon.

Ya I love this as a counterpoint to much scaling.

Stackoverflow has pretty good uptime and pretty good performance, can't fault the end result.

> one of the top-ten sites in the world

49th actually: http://www.alexa.com/siteinfo/stackoverflow.com

Physical servers they host themselves, mind you.

Huh? Why are they running Windows?

I don't know about the rest of the (initial) team, but Joel was a lead on Excel as I recall. If you have a group of devs that know the NT kernel, SQL Server, and IIS very well, why would you not use the MS stack?

Don't get me wrong, I prefer Linux and free software myself, but that doesn't mean the thousands of man-years Sybase and MS have spent on their stack are wasted effort.

Because they thought it was the better choice for them. You cannot argue with the result.


See also https://circleci.com/blog/its-the-future/, which is almost too true to be funny.

My solution was more about higher availability and staggered deployments... each of about 8 services (including web) would be deployed to 3 servers, identical config with dokku; then there would be a load-balancing nginx instance that pointed to the app.foo on each of the three servers. Deployment would update/cycle nginx, deploy to the first server, roll over, then deploy to the other two.

That was the plan. (I won't go into the political bs at a company I no longer work with).

> throwing a $10k a month server or mathematician at the problem

Do you have any examples of problems at which you have thrown a mathematician or could imagine doing so?

I like the analysis, basically it says "hey you don't have big data" :-) but that requires a bit more explanation.

The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory. We know that because Amdahl's law tells us that parallelizing something invariably adds overhead. So from a systems perspective that overhead has to be "paid for" by some other improvement, and we'll see that it is access to data.

If your task is to process a 2TB data set on a single machine using a 6 Gb/s SATA channel and 2TB of flash SSDs, you can read that dataset into memory in 3333 seconds (at 600 MB/sec, which is very optimistic for our system), process it, and let's say write out a 200GB reduced data set in another 333 seconds. So, conveniently, an hour of I/O time.

Now you take that same dataset and distribute it evenly across a thousand nodes. Each one then has 2GB of the data on it. Each node can read in their portion of the data set in 3 seconds, process it and write out their reduction in .3 seconds.

You have "paid" for the overhead of parallelization by trading an I/O cost of an hour for an I/O cost of about 4 seconds.
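Spelling out that arithmetic (numbers taken from the comment, nothing measured):

```python
TB, GB, MB = 10**12, 10**9, 10**6

dataset = 2 * TB        # input data set
reduced = 200 * GB      # reduced output
bw = 600 * MB           # per-machine streaming bandwidth (optimistic SATA)

# Single machine: read everything in, write the reduction out.
read_s, write_s = dataset / bw, reduced / bw
print(round(read_s), round(write_s))  # 3333 333  (~an hour combined)

# 1000 nodes, 2 GB of the data each, same per-node bandwidth.
nodes = 1000
print(round(dataset / nodes / bw, 2), round(reduced / nodes / bw, 2))  # 3.33 0.33
```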

That is when parallel data reduction architectures are better for a problem than a single-threaded architecture. And that "betterness" is purely artificial in the sense that you would be better off with a single system that had 1,000 times the I/O bandwidth (cough, mainframe, cough) than 1,000 systems with the more limited bandwidth. However, 1,000 machines with one SSD each are still cheaper to buy than one mainframe of similar capability. So if, and it's a big if, your algorithm can be expressed as a data map/reduce problem, and your data is large enough that the cost of getting it into memory significantly exceeds the cost of executing the program, then you can benefit from running it on a cluster rather than on a local machine.

For a 2 TB dataset you could also pay supermicro 50k to get a 40 core 3 TB RAM monster that can keep that whole dataset in RAM. At 50 GB / sec throughput that would keep your query roundtrip time at somewhere around the minute mark. Not quite 3 seconds, but then not quite a thousand nodes either. Of course, rebooting that machine would be awkward.
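For the in-RAM machine the comment's figure checks out: at an assumed 50 GB/s of sustained scan bandwidth, one full pass over the 2 TB lands around the minute mark:

```python
dataset_gb = 2000      # 2 TB held entirely in RAM
mem_bw_gbs = 50        # assumed sustained scan bandwidth, GB/s
scan_s = dataset_gb / mem_bw_gbs
print(scan_s)          # 40.0 seconds per full scan
```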

Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.

I think this is also a very salient observation. It is always possible to construct really large systems, right up to mainframes. And if you're asked to do so, it helps to take as wide a view of the problem as you can.

So from a technical point of view, 50GB / sec on a single machine vs 600GB / sec on a 1000 node cluster. From a cost perspective, running 1 machine is going to be a lot less than running 1,000 machines.

Consider some other aspects as well. If a machine breaks, and you have only one, you are offline, if one breaks and you have 999 left, you're still 99.9% up and running. If you work in 2TB data sets, how many do you have? One? Two? Twenty? The more you have the more storage you end up putting on a machine, and even with a SAN the ability to move terabytes around is a pain. Then there is the enterprise value of the analysis. How much does the analysis add value to the product you sell? In the paper's example of Page Rank one could argue it really made Google's engine better so a lot of value. In an oil and gas context it might be the difference between finding oil or not, so again high value. But in a twitter 'bot' analysis, killing off all the identified bot accounts might have very little relative value to the overall business.

The bottom line is that none of these sorts of choices can be made in isolation. Looking at the choices through a single lens, whether it is performance, cost, or capability, is rarely sufficient to make the best choice. What is more the best choice may seem like a "bad" choice from an engineering perspective but great from a finance perspective. Similarly a good choice from a finance perspective could be a horrible choice from an engineering perspective.

What is important is to keep in mind the strengths of the various choices available to you, and their weaknesses. Then to select from them based on the current and future requirements for the resulting system.

Idle question: is 50 GB/s a reasonable throughput to expect for such a monster? DDR4 has a peak transfer rate of ~12.8-19.2 GB/s per stick (per https://en.wikipedia.org/wiki/DDR4_SDRAM), so I'd expect quite a bit more bandwidth for predictable accesses - are you using some useful rule-of-thumb, or just unduly pessimistic? ;-)

At NetApp, when we were doing scaling analysis in the early 2000s, it became clear that memory bandwidth was limited more by the transaction rate of the memory controller than by the actual available bandwidth of the memory subsystem.

That is because a memory transaction involves "opening" a page and then "doing the operation", which can span one to several hundred locations. "Pointer chasing" — code that reads in a structure, then dereferences a pointer to another structure, then dereferences that pointer to still another structure, etc. — was really hard on the memory subsystem. It burned a lot of memory ops reading relatively small chunks of memory.

It's a great topic in systems architecture and there are a number of papers on it.

Could you suggest some good papers/articles on the topic?

I'd point out that modern SSDs are getting within an order of magnitude of RAM in terms of speed. If your dataset is larger than what can fit in RAM, you can probably come up with some kind of on-disk storage mechanism, at the cost of speed. Buy a few and RAID them and you can probably get even closer.

Still not 1000 machines, but one saving in this scenario is that you won't have to pay for the humans to run the network the 1000 machines live on, or the power for the other 999 machines, or the cooling, or the floor space, etc.

"RAM is for the stack and SSDs are for heap" is a phrase I'm starting to hear.

How about when that operation is at the core of your business logic and happens multiple times a day? How about when it's a blocking operation?

There are plenty of cases where your rule doesn't hold true, and I don't think that they're that uncommon.

Plus having many machines has the advantage of a higher granularity of control over workload and balancing.

Hey, I'm building a hadoop system myself, so I know there are workloads where the rule doesn't apply. Like all rules there are exceptions.

Amdahl's law is not about overhead - it is true that parallelizing a computation usually has some cost, and you have to make sure what you're parallelizing is a heavy-enough computation that you still win in the presence of that cost. But that's not what Amdahl's law is about.

It is about the limitation of overall performance improvement when improving the performance of a component. If, say, you improve the performance of a piece of your application that only accounts for 10% of the running time, then you're limited by a 10% total performance improvement. This matters in a parallel context because if you have heavily parallelized your application, and it has actually improved performance, you're likely to eventually be limited by the performance of the sequential parts of your application. See https://en.wikipedia.org/wiki/Amdahl%27s_law ; although the last line of the introduction reads like nonsense.
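For reference, the law itself, where p is the fraction of runtime being improved and s is the factor it is improved by; the 10% example above corresponds to p = 0.1:

```python
def amdahl_speedup(p: float, s: float) -> float:
    # Overall speedup when a fraction p of the runtime is sped up by factor s.
    return 1.0 / ((1.0 - p) + p / s)

# Infinitely speeding up a part that is only 10% of runtime caps you at ~1.11x.
print(round(amdahl_speedup(0.10, float("inf")), 2))  # 1.11
# 95%-parallelizable code on 100 workers: ~16.8x, not 100x.
print(round(amdahl_speedup(0.95, 100), 2))           # 16.81
```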

[old HPC guy here, you kids get off my lawn!]

... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.

This is a non-trivial problem. Ok, it is trivial, but it's serial in most cases. Unless you start out with a completely distributed data set, and allocate permanent space on those 1000 nodes. Then the data has to move only once, and you can amortize that across all the runs.

In reality, you can't. We have customers using PBs of data for their analytics. Even across 1000 nodes, that's still TB scale per node.

Our approach is not radical, it's simple: build a better-architected system, with much higher bandwidth / lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per TB per machine (our customers do). And if you need to scale up/out, use 100Gb networks, and other things.

On Amdahl's law: in its simplest form, the law states that your performance is bounded by the serial portion of the computation. If you can drive the parallel portion to zero time, you are still stuck with the serial portion literally bounding your performance. So let's take your example.

1000 nodes, 2TB of data; assume a standard crappy cloud network, say a 1GbE connection per node. The serial portion of this computation is the data distribution. At 1GbE you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so it's 20 s × 1000, or about a quarter of a day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently, and that type of allocation would be cost prohibitive.

Ok, use 10GbE. You'll actually get about 2-4 Gb/s in practice, but ok. So maybe 5-10k seconds to move your data. And your compute run is deeply in the noise at 4 seconds.

Still not good.
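Under the comment's assumptions (the data starts as one bolus behind a single uplink that serializes the push), the distribution step works out as follows; the exact hours depend on how much of the line rate you actually get:

```python
node_share_gb = 2      # 2 GB of the 2 TB goes to each node
nodes = 1000
link_gbps = 1          # 1 GbE at the source

per_node_s = node_share_gb * 8 / link_gbps    # 16 s per node at line rate
total_h = per_node_s * nodes / 3600
print(per_node_s, round(total_h, 1))          # 16.0 s each, ~4.4 hours total
```

With protocol overhead pushing each transfer toward the "about 20 seconds" in the comment, the total approaches the quarter day cited.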

For less than the cost of doing this with capable/fast machines in the cloud where you have to keep moving your data back and forth, you could get a simple bloody fast machine, that can handle the data read in 50-100 seconds.

Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.

Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.

Again, I am biased, as this is what my company does.

> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution.

I find your statements confusing.

The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.

My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"

Heh (on the jobs paused bits) ... most modern shared file systems are HA (or nearly HA) except when people build them cheaply. And then you get an effective RAID0.

I was pointing out that if you are doing the analytics at AWS or similar on-demand scenario (a common pattern I see people trying/using and eventually rejecting), you have a serial data motion step to distribute data to your data lake before processing. Then you extract your results, decommission all those servers. Rinse, and repeat.

The point is that for ephemeral compute/storage scenarios, you have a set of poorly architected resources tied together in a way that pretty much guarantees a large (dominant) serial step before anything goes parallel.

What we advocate (and have been doing so for more than a decade), are far more capable building blocks of storage+compute. So if you are going to build a system to process a large amount of data, instead of buying 1000 nodes and managing them, buy 20-50 far more capable nodes at a small fraction of the price, and get the same/better performance.

It also doesn't take 0 seconds. There is a distribution/management overhead (very much non-zero), as well as data motion overhead for results (depending upon the nature of the query). When you subdivide the problems finely enough, the query management overhead actually dominates the computation (which was another aspect of the point I made, but it wasn't explicit on my part).

So we are fine with building hadoop/spark/kdb+ systems ... we just want them built out of units that can move 10-20GB/s between storage and processing. Which lets you hit 50-100s per TB of processed data. Which gives you a fighting chance at dealing with PB scale problems in reasonable time frames, which a number of our users have.

Sounds like you are doing the opposite of what Joyent are doing: they basically pair (relevant parts of) the data with the program (to process that part), and then reduce/aggregate over the results (that's my takeaway from Joyent's marketing, anyway).

Which company do you work for? (unless it's a secret for some reason)

Actually Joyent's manta is quite similar in concept to what we've been doing for years (before manta came out). The idea is to build very capable systems and aggregate them. Not a bunch of fairly low end units (like typical AWS/etc). Our argument is that if you are going to build a high performance computing infrastructure, you ought to build it in an architecturally useful manner. The cost to do so is marginally more than "cheap n deep", while the savings (fewer systems needed for very large analytics) is substantial.

My company is Scalable Informatics (http://scalableinformatics.com)

Thank you for clarifying. Re-reading your first comment, in light of your second, I see that that's indeed what you were saying in the first place. But apparently that's not quite what I read :-)

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

Also, aggregate network bandwidth. A major use case for clustered processing is that it is MUCH faster to download external data in parallel across a cluster than in parallel on one box. If timeliness is a major use case of the system you're building, this basically requires a cluster, unless you want to end up re-implementing cluster functionality yourself.

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

No it isn't. There are plenty of CPU-bound tasks that run much faster if the work is distributed in parallel across multiple machines. We use Hadoop at my company primarily for that reason.

Can you say something more about your workload? Ever since dipping my feet in the MPI pool, I've wondered what kind of problems really lend themselves to running in parallell across multiple machines these days - assuming a single machine has at least 32 hardware threads.

I know there are modelling work loads, like weather forecasting and analysing seismic data - but curious what kind of work you are doing?

This is true but keep in mind that you have to set up a pipeline for collecting the data into HDFS (Storm? batch loading?) & you have to pay for the machines.

So while your analysis is valid, there are more "costs" at play like developer time, cluster maintenance, hardware. I like to play with Spark's ML libraries but am wary about designing projects specifically around them because of this overhead, especially when trying to distribute some API/tech that you'd like others to use.

Not trying to be a downer, I actually wish the choice to go distributed was more of a no-brainer, hah. Would love for some APIs to emerge that could be used locally/distributed transparently without actually having to run a dummy cluster & data migration to run locally.

I've [had] this conversation with clients, CTO level, mostly in context of microservices. A few observations:

- Peter Principle: most decision makers are/feel technically insecure in the blog driven tech age, and cave in to direction from below. Of course, young developers want to play with shiny new things (given the general drudgery of the work involved).

- Emergence of DevOps: engineers are being commoditized. There is an undeniable deskilling that goes hand in hand with having to wear all the technical hats. (A side glance here to the pattern of deskilling of pilots in the age of fly-by-wire.) Sure, you will need to learn new 'tools' as 'operators', but what's the vote, HN: what percent of these engineers could actually build one of these distributed systems? (To say nothing of being able to rein in the asynchronous distributed monster when it starts hitting its pain points.)

- You're not Google: I'm rather blunt when a team points to "Google does it". Google and the like have made a virtue out of necessity. Google/Facebook/Netflix/etc. had to resort to the pattern of lots of disposable commodity boxes. They also have the chops in house to field SREs that are simply not going to play machine room operator for enterprise IT.

The overwhelming majority of systems out there can run on a deployment scheme that 1:1 matches the logical diagram (x2 for fail over). And yes, it is amazing what one can do on a single laptop these days.

I have long offered the following advice.

If you have code that is not able to run any more in a scripting language, and it is not embarrassingly parallel, you have two choices.

1. Move to something like C++, and optimize the heck out of it. You will gain something like 1-2 orders of magnitude in performance and then hit a wall.

2. Move to a distributed architecture. You immediately lose 1-2 orders of magnitude in performance, but then can scale essentially forever.

If you expect your distributed system to need less than 100 machines, you should seriously consider option #1.

I would call those options #2 and #3. Option #1 is:

1. Parallelize on a single machine. Most scripting languages have single-threaded bottlenecks (Python, Ruby, node.js, etc.), so this means using multiple processes. xargs -P goes a long way. The coding changes are essentially a subset of what you need to distribute your program anyway.

32x or 64x speedup is nothing to sneeze at on a modern machine. The difference between 5 minutes and 5 hours usually solves your problem, practically speaking. And this means you don't have to touch every line of code, as you would if you were doing a C++ rewrite.

But also don't forget that you can often rewrite 10% of your code in C++ and keep the other 90% in Python, and get a 10x speedup. This requires fairly deep understanding of both your program and of the Python/C interface. It helps to adopt a data flow style so you are not crossing the boundary a lot. And make sure you release the GIL, and consider starting threads in C++ rather than in Python, etc.

I've also optimized an R program in C++ and gotten a 125x speedup — and that's single-threaded C++; multithreaded would be another 2 orders of magnitude! But it also involved fixing a bunch of R performance errors, so don't underestimate that either.

In practice, any program which you actually care enough about to rewrite for speed usually has some application-level performance bugs -- i.e. slowness unrelated to the slowness of the underlying language platform.

People scale out to many machines because they don't want to rewrite every line of code. But the first step is to "scale out" by using all the cores on a single machine. In some sense, this is the best of both worlds, because you are incurring neither the overhead of distribution (serialization and networking) nor the complexity of distributed error handling.

That is a worthwhile option, but parallelization hasn't generally offered me nearly that much of a win when I'm limited by memory access performance.

In my experience, you're best off optimizing relatively limited pieces of a system rather than big applications. And only consider the rewrite when you've looked at optimizing it in place. This means that the thing that needs to be optimized with a rewrite often has a chance of not having any giant stupid mistakes.

For example see http://bentilly.blogspot.com/2011/02/finding-related-items.h... for a case where I found a thousandfold speed increase by rewriting from SQL to C++. As much as was reasonably possible, I did not change the basic algorithm.

> and it is not embarrassingly parallel

Sounds like you had a problem that was "embarrassingly parallel". It's not going to be as simple if you have graph problems which require lots of IPC in python/ruby/v8. That's where moving to a language which has threading support can help.

I think the missed point is that Spark is very easy. I can get an average Java or Python developer trained up on it in less than a day. The python shell is very simple to use out of the box. And, it's incredibly convenient to be able to either run locally or on a huge cluster. I can use the same code to easily process batch jobs from 1 MiB to 100 TiB. In my mind, it's just a cost savings. Developer time is expensive, and it's hard to find great developers. Hardware is cheap.

No way am I a scalability expert, and I really don't have time to be one. I started using Spark when I had to sort 10 TiB on disk, and it scored the highest on sorting performance. I struggled with implementing a fast disk sort quickly, and I gave Spark a whirl, and it fixed my problem, fast. Since then, I've found it useful in a lot of other ways.

I agree with author that [in most cases] you don't need distributed processing for your algorithms. But sometimes you do, and when you do need it you have to understand that there is no silver bullet.

Creating a distributed system is very difficult, even when using platforms like Spark. Not all algorithms can be scaled easily or scaled at all, and not all algorithms in Spark MLLib or GraphX are actually designed to be truly distributed or work equally well for all use cases/data.

We tried to reimplement one of our algorithms (written in Java) that was taking hours on a single machine (even when using all the cores) using methods from Spark MLLib, only to find that the Spark job was constantly crashing. It turned out that some of the functions just fetch all the data to the "driver" instance and calculate the result there.

My guess is that this is what happened in the author's use case: yes, he ran it on Spark, but only one node ended up crunching all the numbers. And/or network overhead, of course.

After we found out that MLLib can't give us what we need, we reimplemented it from scratch in Scala, making sure we balance the load equally and don't have too much network (shuffle) traffic between the nodes.

As a result, we went from 2.5 hours on a single machine, to under 2 minutes on a cluster of 25 instances (same Ivy Bridge processor, just more cores per node). The algorithm scaled almost linearly, but it required carefully designing it with Spark specifics in mind.

MLLib is ostensibly open source, why not improve it? Was your final solution too specialized?

Yes, we ended up implementing just project specific parts, not generic enough for MLLib contribution...

If you're interested in reading more, Frank moved his blog to a github repo:


The author recently gave a talk at a Rust meetup about similar things:


I think graph operations are not a fair comparison; they are notoriously difficult to scale.

On the other side, AWS now offers a 2 TB RAM machine. And a single huge machine has a smaller per-GB cost than several smaller machines. I think clustered computing as we know it will soon be gone. The only reason for multiple machines will be availability.

Do you think our datasets will stop growing? It seems to me that data is growing faster than RAM, and has been for years. How do we find the upper limit of data? The human genome is finite; it will only get so big. What you did on Facebook? That seems near infinite...

I'm sure the upper end of our datasets won't stop growing for the foreseeable future. But a huge proportion of problems have growth rates well below the growth rate of RAM.

And for that matter, even when we can't stuff it in RAM, the boundary of what we can do on a single server is also constantly pushed back thanks to SSDs. Just a few years ago I was unable to get read speeds of more than 6 GB/sec out of a RAM disk. Today I have servers that easily do 2 GB/sec out of NVMe SSDs.
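Back-of-the-envelope (my numbers, not the commenter's): at those sequential read speeds, even a full scan of a multi-terabyte dataset on one box takes minutes, not hours.

```python
TB = 10**12
GB = 10**9

def scan_seconds(dataset_bytes, bytes_per_sec):
    """Time for one sequential pass over a dataset at a given throughput."""
    return dataset_bytes / bytes_per_sec

# 2 TB at the 2 GB/s NVMe figure above -> 1000 s, under 17 minutes.
t = scan_seconds(2 * TB, 2 * GB)
print(t, "seconds, i.e.", round(t / 60, 1), "minutes")
```

That's often faster than the cluster job would take once you include scheduling, serialization, and shuffle overhead.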

It's not that we never need to go beyond a single server. But people often really have no concept of when they'll need to.

For businesses? Sure, there are going to be a lot of limits. Only 7 billion humans, so that sorta limits any user tables. Only so many things those people can buy each day, so that limits your orders table.

VISA does what, 150M transactions per day? I was doing more volume than that for telephone calls a couple years ago with a $5K server and a default install of MSSQL. (Full ACID, updating balances -- yes I know a CC tx is probably heavier than a VoIP call but still.) At 4KB per tx, VISA could use VoltDB and store all tx's in RAM for a week for like a million or two.
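The arithmetic behind that last claim (my own back-of-the-envelope, using the figures in the comment):

```python
# A week of VISA-scale transactions held in RAM, at the sizes quoted above.
tx_per_day = 150_000_000
bytes_per_tx = 4 * 1024   # 4 KiB per transaction
days = 7

total_bytes = tx_per_day * bytes_per_tx * days
print(total_bytes / 10**12, "TB")  # roughly 4.3 TB
```

Roughly 4.3 TB, i.e. a handful of big-RAM nodes, well within reach of an in-memory database deployment.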

And 150M a day today is not that much more than it was 15 years ago, it seems (50-100% more?).

For many, many businesses, data size just isn't an issue any more cost-wise, and soon won't be a technical challenge at all either. Yet 10 or 20 years ago, we were talking 8-digit+ implementations.

Sure there's some things that grow faster, like all this increased tracking. But in general?

I can't remember where I saw it recently, but it was pointed out that there are diminishing returns on using more data. Your accuracy with something like Google's PageRank is good with 1 billion data points of input, but doesn't improve much with more data.

Big data systems aren't about squeezing every drop of performance out of the hardware. They're about being able to scale your solution up effortlessly. Having a uniform programming model so that you don't have to rewrite your solution N times when you go from a small test dataset to production data, to 1000x the data. It's also about using a system that other people use, so that you can actually hire admins and service companies with experience in maintaining your solution.

Note that this is from January 2015 and thus predates the stable Rust 1.0 release in May 2015, so it's possible that the code examples do not compile on post-1.0 Rust.

Thanks for the analysis -- it is good that people have this context in their heads when designing systems. The missing conversation from this article is that some people conflate scalability with performance. They are different, and you absolutely trade one for the other. At large scale you end up getting performance simply from being able to throw more hardware at it, but it takes you quite a while to catch up to where you would have been on a single machine.

This is true not just for computing algorithms, but for developer time/brain space as well. Single-threaded applications are far simpler to understand.

The takeaway shouldn't be "test it on a single laptop first", but rather "will the volume/velocity of data now/in the future absolutely preclude doing this on a single laptop". At my work, we process probably a hundred TB in a few-hour batch processing window at night, Terabytes of which remain in memory for fast access. There is no choice there but to pay the overhead.

This reminds me of the "your data fits in ram" website which was on HN last year. Basically that site asked you your data size, then answered "yes" for any size up to a few TB.

The website is down, but the HN discussion is still there: https://news.ycombinator.com/item?id=9581862.

In fact the top comment there links to the original post here.

> The website is down,

Maybe they ran out of ram?

I was at mozilla for your talk!! Very interesting stuff

I bet the author didn't count the time spent downloading the data to a single box. Scalability, sometimes, is not a choice.

I really want a content-addressed storage system with differential compression to solve this problem.
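A minimal content-addressed store (a toy sketch of the idea, nothing to do with dat's actual design) keys each chunk by its hash, so identical data is stored once and two replicas can compare digests to work out exactly what needs syncing:

```python
import hashlib

class ContentStore:
    """Toy content-addressed chunk store: chunks keyed by SHA-256."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # hex digest -> chunk bytes

    def put(self, data):
        """Store data; return the ordered list of chunk digests (a 'recipe')."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            d = hashlib.sha256(chunk).hexdigest()
            self.chunks[d] = chunk  # duplicate chunks dedupe for free
            recipe.append(d)
        return recipe

    def get(self, recipe):
        """Reassemble data from a recipe of digests."""
        return b"".join(self.chunks[d] for d in recipe)

    def missing(self, recipe):
        """Digests a peer would have to send us to satisfy this recipe."""
        return [d for d in recipe if d not in self.chunks]
```

Syncing then transfers only the missing chunks, which is a crude form of the differential compression the parent asks for; real systems add content-defined chunking so that an insertion in the middle of a file doesn't shift every subsequent chunk boundary.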

I was browsing through dat for a while, but haven't caught up with it lately:


Basically disk is so cheap that you should just keep 2 or 3 copies of your data around. And then you can sync them really quickly and do the processing on any one of N machines.
