
Scalability, but at what cost? (2015) - r-u-serious
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
======
3pt14159
I've been doing data analysis and machine learning client work for quite some
time now and for companies as small as a 3 person startup to advising a
department of the Canadian government.

Almost always, numpy matrix math + Cython or C or Java on a single machine is
enough. Not always-always; but a single machine usually suffices if you can
relax requirements _slightly_ , say by accepting a 45-minute lag before new
data impacts the total model, or by caching the results of the top 10k most
likely queries, or by putting more effort into stripping out the garbage parts
of the data, or, sometimes, by just throwing a $10k-a-month server or a
mathematician at the problem (sure is cheaper than a bunch of cheap servers
plus a larger infrastructure team).
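
The "cache the top queries" idea can be a few lines in Python. A minimal
sketch using functools.lru_cache (the function and sizes are illustrative, not
from the comment above):

    from functools import lru_cache

    @lru_cache(maxsize=10_000)   # roughly "the top 10k most likely queries"
    def score(query: str) -> float:
        # Stand-in for an expensive model evaluation.
        return sum(ord(c) for c in query) / max(len(query), 1)

    score("foo bar")   # computed once...
    score("foo bar")   # ...then served from the cache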

The times you need real scalability, you know you need it. You'd laugh at how
silly someone would be for trying to put it onto one machine. You're solving
the travelling salesman problem for UPS (although I can think of some hacks
here - I probably can't get it down to a single machine), or you're detecting
logos in every YouTube video ever made, or you're working for the NSA.

Even if you know _for sure_ you're going to need scalability, I don't think it
hurts to just do it on a single box at first. Iterating quickly on the product
is more important, and once you have something proven you can get money from
the market or from VCs to distribute it.

~~~
brianwawok
This is kind of the same argument as microservices.

We could write 30 microservices deployed on 30 docker images with load
balancing and FT and all that magic for a basic webapp...

Or we could just write a pretty fast webserver and do it with 1 server. (Or,
if it is stateless, do it with a few, which is still a lot less work than a
giant microservice cluster.)

I think in the last year or so microservices have become a little less cool,
and people are more along the lines of "code cleanly so we can microservice if
we need to down the road, but don't deploy it like that for 1.0"... seems
similar for this.

~~~
ben_jones
People forget that the two methods of scaling are HORIZONTAL and VERTICAL.
They think: "I can just put some micro-services behind HAProxy and boom, more
capacity!".

And then they forget that if they had just modified that one query and tweaked
that one for-loop, they could've had that same capacity without launching six
new servers and all their potential for the wiring to go down and cause
downtime. Plus the dev time to build the services.
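
As a hedged sketch of the "modify that one query" point, in Python with
sqlite3 (schema and numbers made up): collapsing a per-item query loop into
one grouped query often buys the capacity those six servers were meant to add.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(i % 100, 9.99) for i in range(10_000)])

    # Before: one round-trip per user, the for-loop that eats your capacity.
    slow = {uid: conn.execute(
                "SELECT SUM(total) FROM orders WHERE user_id = ?",
                (uid,)).fetchone()[0]
            for uid in range(100)}

    # After: one query, one pass over the table.
    fast = dict(conn.execute(
        "SELECT user_id, SUM(total) FROM orders GROUP BY user_id"))

    assert slow == fast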

~~~
aianus
Vertical scaling requires hiring good engineers instead of mediocre ones (an
additional cost of $100,000s per year across the team). Horizontal scaling, in
comparison, is much, much cheaper for your average CRUD app.

~~~
brianwawok
Maybe, maybe not.

For example, choosing Java over Ruby would give you 2-10x better perf per
server... and I am not sure that Java devs cost any more or less than Ruby
devs.

Now we can get into an argument about developer productivity and all that...
but from a purely "I want to run 10x more users per server" standpoint,
something like Java vs. Ruby gets you a long way.

~~~
joslin01
I talked to a CTO once who said he brought his RoR fleet down from 60 servers
to 6-8 by switching to Scala.

~~~
FooBarWidget
I was once involved in a large-scale government project that rewrote a Java
app to RoR. They went from 50 servers to 10.

It has probably got more to do with the rewrite and the new architecture than
whatever language it was written in.

~~~
joslin01
What? It absolutely has to do with the language.

Ruby binary trees: 57 seconds
Scala binary trees: 11 seconds

[1] -
[http://benchmarksgame.alioth.debian.org/u64q/ruby.html](http://benchmarksgame.alioth.debian.org/u64q/ruby.html)
[2] -
[http://benchmarksgame.alioth.debian.org/u64q/scala.html](http://benchmarksgame.alioth.debian.org/u64q/scala.html)

~~~
FooBarWidget
That would mean something if Ruby apps were 100% Ruby and/or performed binary
tree operations all the time, or did the kinds of CPU-intensive operations
depicted in the alioth benchmarks. But they don't. Ruby web apps perform lots
of string manipulation, memory allocation, and I/O. A lot of expensive things
are offloaded to C libraries. Things like XML parsing are offloaded to native
libraries like libxml; nobody uses an XML parser fully implemented in Ruby.
Ruby does not reimplement gzip compression in Ruby; it uses zlib. So the
alioth benchmarks are not representative of real-world performance.
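
The offloading pattern is easy to demonstrate in Python, which works the same
way. A minimal sketch (timings vary by machine): both lines compute the same
checksum, but zlib.adler32 runs in C while the hand-rolled version runs in the
interpreter.

    import time
    import zlib

    data = b"x" * 10_000_000

    def adler32_pure(buf, a=1, b=0):
        # The same checksum zlib computes, one interpreted step per byte.
        for byte in buf:
            a = (a + byte) % 65521
            b = (b + a) % 65521
        return (b << 16) | a

    t0 = time.perf_counter(); fast = zlib.adler32(data)
    t1 = time.perf_counter(); slow = adler32_pure(data)
    t2 = time.perf_counter()

    assert fast == slow
    print(f"C-backed: {t1 - t0:.4f}s, pure Python: {t2 - t1:.2f}s")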

~~~
igouy
>> Ruby web apps perform lots of string manipulation, memory allocation, I/O.
<<

string manipulation:

[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=revcomp&lang=yarv&id=2)

memory allocation:

[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=yarv&id=1)

I/O:

fasta, fasta-redux, reverse-complement write 250MB

regex-dna reads 50MB; k-nucleotide, reverse-complement read 250MB

>> …offloaded to native libraries… So the alioth benchmarks are not
representative of real-world performance. <<

The benchmarks game does show C programs ;-)

The benchmarks game does show scripting-languages explicitly using native
libraries:

[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=pidigits&lang=php&id=5)

~~~
FooBarWidget
> string manipulation

That benchmark performs string manipulations that rarely occur in web apps.
Web apps need: concatenation, substring, find/replace, maybe with regexps. All
of those are implemented in C.

> memory allocation

Web apps don't tend to implement entire trees in pure Ruby. That benchmark is
completely non-representative of real-world performance.

What exactly are you getting at? Of course it's easy to find a bunch of
synthetic benchmarks that show weaknesses in particular cases. Still doesn't
prove anything.

~~~
igouy
> Web apps need: concatenation, substring, find/replace, maybe with regexps.
> All of those are implemented in C.

join and gsub?

[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=yarv&id=1)

> Web apps don't tend to implement entire trees in pure Ruby.

Nor do other apps, but that is what Hans Boehm came up with as a simple GC
benchmark.

[http://hboehm.info/gc/gc_bench/](http://hboehm.info/gc/gc_bench/)
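
For a sense of what it stresses: the benchmark does little besides allocating
full binary trees and walking them, so it exercises the allocator and GC
almost exclusively. A minimal Python sketch of its shape (not the official
program):

    def make(depth):
        # A node is just a pair of children; leaves are None.
        if depth == 0:
            return None
        return (make(depth - 1), make(depth - 1))

    def check(node):
        # Walk the tree so the allocations can't be optimized away.
        if node is None:
            return 1
        left, right = node
        return 1 + check(left) + check(right)

    print(check(make(16)))  # 2**17 - 1 nodes visited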

> What exactly are you getting at?

You don't seem to know what is shown on the benchmarks game website.

~~~
FooBarWidget
Doesn't your regex-dna benchmark kind of prove my point? Just look at the
comparisons here:
[http://benchmarksgame.alioth.debian.org/u64q/performance.php...](http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=regexdna)

C GCC: 2.46 sec

Java: 8.23 sec

Ruby #8: 9.35 sec

It's only about 4x slower than pure C in this case, and only a little slower
than Java, which has a very good JIT.

> You don't seem to know what is shown on the benchmarks game website.

How funny of you to say that while acting as if the benchmarks "prove" Ruby is
the ultimate spawn of the devil that eats away any and all performance. The
website itself tells you not to jump to conclusions and that the app itself is
the ultimate benchmark: [http://benchmarksgame.alioth.debian.org/dont-jump-to-
conclus...](http://benchmarksgame.alioth.debian.org/dont-jump-to-
conclusions.html)

~~~
igouy
>> Doesn't your regex-dna benchmark kind of prove my point? <<

7 days ago, you could have used the data shown on the benchmarks game website
to try and make your point to joslin01.

Instead you chose to dismiss the data.

>> acting as if <<

You seem to have confused me with joslin01.

------
ChuckMcM
I like the analysis; basically it says "hey, you don't have big data" :-) but
that requires a bit more explanation.

The _only_ advantage of clustered systems like Spark, Hadoop, and others is
aggregate bandwidth to disk and memory. We know that because parallelizing
something invariably adds overhead (and Amdahl's law caps the resulting
speedup by the serial fraction). So from a systems perspective that overhead
has to be "paid for" by some other improvement, and we'll see that the
improvement is access to data.

If your task is to process a 2TB data set on a single machine with a 6Gb/s
SATA channel and 2TB of flash SSDs, you can read that dataset into memory in
3,333 seconds (at 600MB/sec, which is optimistic for such a system), process
it, and, let's say, write out a 200GB reduced data set in another 333 seconds.
So, conveniently, about an hour of I/O time.

Now you take that same dataset and distribute it evenly across a thousand
nodes. Each one then has 2GB of the data on it. Each node can read in its
portion of the data set in 3 seconds, process it, and write out its reduction
in 0.3 seconds.

You have "paid" for the overhead of parallelization by trading an I/O cost of
an hour for an I/O cost of about 4 seconds.

 _That_ is when parallel data reduction architectures are better for a problem
than a single-threaded architecture. And that "betterness" is purely
artificial, in the sense that you would be better off with a single system
that had 1,000 times the I/O bandwidth (cough _mainframe_ cough) than with
1,000 systems with the more limited bandwidth. _However_ , 1,000 machines with
one SSD each are still cheaper to buy than one mainframe of similar
capability. So _if_ , and it's a big if, your algorithm can be expressed as a
data map/reduce problem, and your data is large enough to push the cost of
getting it into memory for a look significantly beyond the cost of executing
the program, _then_ you can benefit by running it on a cluster rather than
running it on a local machine.
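
As a sketch, the arithmetic above (all figures are the ones from this
comment):

    DATASET = 2e12   # 2TB input
    REDUCED = 2e11   # 200GB output
    BW      = 600e6  # 600MB/sec per SATA channel
    NODES   = 1000

    single   = (DATASET + REDUCED) / BW           # ~3,666s, about an hour
    per_node = (DATASET + REDUCED) / NODES / BW   # ~3.7s, "about 4 seconds"
    print(f"single machine: {single:.0f}s, per node: {per_node:.1f}s")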

~~~
Joeri
For a 2TB dataset you could also pay Supermicro $50k to get a 40-core, 3TB-RAM
monster that keeps the whole dataset in RAM. At 50 GB/sec of throughput that
would keep your query round-trip time somewhere around the minute mark. Not
quite 3 seconds, but then not quite a thousand nodes either. Of course,
rebooting that machine would be awkward.

Still, I think the general rule applies that if you can buy a server that will
fit your dataset into RAM, probably you don't need something like Hadoop.

~~~
JoachimSchipper
Idle question: is 50 GB/s a reasonable throughput to expect for such a
monster? DDR4 has a peak transfer rate of ~12.8-19.2 GB/s per stick (per
[https://en.wikipedia.org/wiki/DDR4_SDRAM](https://en.wikipedia.org/wiki/DDR4_SDRAM)),
so I'd expect quite a bit more bandwidth for predictable accesses - are you
using some useful rule-of-thumb, or just unduly pessimistic? ;-)

~~~
ChuckMcM
At NetApp, when we were doing scaling analysis in the early 2000s, it became
clear that memory bandwidth was limited more by the transaction rate of the
memory controller than by the actual available bandwidth of the memory
subsystem.

That is because a memory transaction involves "opening" a page and then "doing
the operation", which can span one to several hundred locations. "Pointer
chasing", code that reads in a structure, then dereferences a pointer to
another structure, then dereferences that pointer to still another structure,
etc., was really hard on the memory subsystem. It burned a lot of memory ops
reading relatively small chunks of memory.

It's a great topic in systems architecture and there are a number of papers on
it.
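
The effect is visible even from user space. A minimal numpy sketch (not a
NetApp workload): the same elements are summed twice, once in sequential order
and once in a scattered order that defeats caching and prefetching the way
pointer chasing does.

    import time
    import numpy as np

    n = 20_000_000
    data = np.arange(n, dtype=np.int64)
    seq = np.arange(n)              # visit elements in order
    rnd = np.random.permutation(n)  # visit the same elements, scattered

    t0 = time.perf_counter(); data[seq].sum()
    t1 = time.perf_counter(); data[rnd].sum()
    t2 = time.perf_counter()
    print(f"sequential: {t1 - t0:.3f}s, random: {t2 - t1:.3f}s")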

~~~
teLeopardthy
Could you suggest some good papers/articles on the topic?

------
eternalban
I've [had] this conversation with clients, CTO level, mostly in the context of
microservices. A few observations:

\- Peter Principle: most decision makers are/feel technically insecure in the
blog-driven tech age, and cave in to direction from below. Of course, young
developers want to play with shiny new things (given the general drudgery of
the work involved).

\- Emergence of DevOps: engineers are being commoditized. There is an
undeniable deskilling that goes hand in hand with having to wear all the
technical hats. (A side glance here at the pattern of deskilling of pilots in
the age of fly-by-wire.) Sure, you will need to learn new 'tools' as
'operators', but what's the vote, HN: what percent of these engineers could
actually build one of these distributed systems? (To say nothing of being able
to rein in the asynchronous distributed monster when it starts hitting its
pain points.)

\- You're not Google: I'm rather blunt when a team points to "Google does it".
Google and the like have made a virtue out of necessity.
Google/Facebook/Netflix/etc. _had to_ resort to the pattern of lots of
disposable commodity boxes. They also have the chops in house to field SREs
that are simply not going to play machine room operator for enterprise IT.

The overwhelming majority of systems out there can run on a deployment scheme
that 1:1 matches the logical diagram (x2 for fail over). And yes, it is
amazing what one can do on a single laptop these days.

------
btilly
I have long offered the following advice.

If you have code that can no longer run fast enough in a scripting language,
and it is not embarrassingly parallel, you have two choices.

1\. Move to something like C++, and optimize the heck out of it. You will gain
something like 1-2 orders of magnitude in performance and then hit a wall.

2\. Move to a distributed architecture. You immediately lose 1-2 orders of
magnitude in performance, but then can scale essentially forever.

If you expect your distributed system to need less than 100 machines, you
should seriously consider option #1.

~~~
chubot
I would call those options #2 and #3. Option #1 is:

1\. Parallelize on a single machine. Most scripting languages have single-
threaded bottlenecks (Python, Ruby, node.js, etc.), so this means using
multiple processes. xargs -P goes a long way. The coding changes are
essentially a subset of what you need to distribute your program anyway.

32x or 64x speedup is nothing to sneeze at on a modern machine. The difference
between 5 minutes and 5 hours usually solves your problem, practically
speaking. And this means you don't have to touch every line of code, as you
would if you were doing a C++ rewrite.
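
A minimal sketch of that option (the work function is a placeholder): the
chunking you do here is essentially the same chunking you'd need later to
distribute across machines.

    import multiprocessing

    def process(chunk):
        # Placeholder for the real per-record work.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i, i + 1_000_000)
                  for i in range(0, 32_000_000, 1_000_000)]
        with multiprocessing.Pool() as pool:  # one worker per core by default
            results = pool.map(process, chunks)
        print(sum(results))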

But also don't forget that you can often rewrite 10% of your code in C++, keep
the other 90% in Python, and get a 10x speedup. This requires a fairly deep
understanding of both your program and the Python/C interface. It helps to
adopt a data flow style so you are not crossing the boundary a lot. And make
sure you release the GIL, consider starting threads in C++ rather than in
Python, etc.
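
A sketch of what "data flow style" buys, with numpy standing in for a custom C
extension: cross the Python/C boundary once per array, not once per element.

    import numpy as np

    values = list(range(1_000_000))

    # Crossing the boundary per element: a million tiny calls into C.
    total = 0.0
    for x in values:
        total += float(x) * float(x)

    # Crossing it once: the loop itself runs in compiled code.
    arr = np.asarray(values, dtype=np.float64)
    total = float(np.dot(arr, arr))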

I've also optimized an R program in C++ and gotten a 125x speedup -- and
that's single-threaded C++; multithreaded would be another 2 orders of
magnitude! But it also involved fixing a bunch of R performance errors, so
don't underestimate that either.

In practice, any program which you actually care enough about to rewrite for
speed usually has some application-level performance bugs -- i.e. slowness
unrelated to the slowness of the underlying language platform.

People scale out to many machines because they don't want to rewrite every
line of code. But the first step is to "scale out" by using all the cores on a
single machine. In some sense, this is the best of both worlds, because you
are incurring neither the overhead of distribution (serialization and
networking) nor the complexity of distributed error handling.

~~~
btilly
That is a worthwhile option, but parallelization hasn't generally offered me
nearly that much of a win when I'm limited by memory access performance.

In my experience, you're best off optimizing relatively limited pieces of a
system rather than big applications. And only consider the rewrite when you've
looked at optimizing it in place. That way, the thing that needs to be
optimized with a rewrite has a decent chance of not containing any giant,
stupid mistakes.

For example see [http://bentilly.blogspot.com/2011/02/finding-related-
items.h...](http://bentilly.blogspot.com/2011/02/finding-related-items.html)
for a case where I found a thousandfold speed increase by rewriting from SQL
to C++. As much as was reasonably possible, I did not change the basic
algorithm.

------
ap22213
I think the missed point is that Spark is very easy. I can get an average Java
or Python developer trained up on it in less than a day. The Python shell is
very simple to use out of the box. And it's incredibly convenient to be able
to run either locally or on a huge cluster: I can use the same code to easily
process batch jobs from 1 MiB to 100 TiB. In my mind, it's just a cost
savings. Developer time is expensive, and it's hard to find great developers.
Hardware is cheap.
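
That local-to-cluster convenience is mostly a one-line difference. A minimal
PySpark sketch (the file path and master URL are illustrative):

    from pyspark import SparkContext

    # "local[*]" uses every core on this machine; point master at
    # "spark://host:7077" and the same job runs on a cluster.
    sc = SparkContext(master="local[*]", appName="wordcount")

    counts = (sc.textFile("data.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    sc.stop()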

No way am I a scalability expert, and I really don't have time to become one.
I started using Spark when I had to sort 10 TiB on disk; it had scored the
highest on sorting performance. I struggled to implement a fast disk sort
quickly myself, so I gave Spark a whirl, and it fixed my problem, fast. Since
then, I've found it useful in a lot of other ways.

------
Eugr
I agree with the author that [in most cases] you don't need distributed
processing for your algorithms. But sometimes you do, and when you do need it,
you have to understand that there is no silver bullet.

Creating a distributed system is very difficult, even when using platforms
like Spark. Not all algorithms can be scaled easily or scaled at all, and not
all algorithms in Spark MLLib or GraphX are actually designed to be truly
distributed or work equally well for all use cases/data.

We tried to implement one of our algorithms (written in Java) that was taking
hours on a single machine (even when using all the cores) with methods from
Spark MLLib, only to find that the Spark job was constantly crashing. It
turned out that some of the functions just fetch all the data to the "driver"
instance and calculate the result there.
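
The pitfall, sketched in PySpark rather than the Scala we used (toy data,
illustrative names): collect() ships every element to the driver, while a
distributed reduce keeps the work on the executors.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "driver-pitfall")
    big = sc.parallelize(range(100_000_000), 256)

    # Crashes the driver on real data: every element lands in one process.
    # all_values = big.collect()

    # Stays distributed: executors combine partial results, the driver
    # receives a single number.
    total = big.map(lambda x: x * x).reduce(lambda a, b: a + b)
    print(total)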

My guess is that this is what happened with the author's use case: yes, he ran
it on Spark, but only one node ended up crunching all the numbers. And/or
network overhead, of course.

After we found out that MLLib couldn't give us what we needed, we reimplemented
it from scratch in Scala, making sure we balanced the load equally and didn't
generate too much network (shuffle) traffic between the nodes.

As a result, we went from 2.5 hours on a single machine, to under 2 minutes on
a cluster of 25 instances (same Ivy Bridge processor, just more cores per
node). The algorithm scaled almost linearly, but it required carefully
designing it with Spark specifics in mind.

~~~
Eridrus
MLLib is ostensibly open source; why not improve it? Was your final solution
too specialized?

~~~
Eugr
Yes, we ended up implementing just the project-specific parts, nothing generic
enough for an MLLib contribution...

------
grayrest
If you're interested in reading more, Frank moved his blog to a github repo:

[https://github.com/frankmcsherry/blog](https://github.com/frankmcsherry/blog)

------
dikaiosune
The author recently gave a talk at a Rust meetup about similar things:

[https://air.mozilla.org/bay-area-rust-meetup-
may-2016/](https://air.mozilla.org/bay-area-rust-meetup-may-2016/)

------
wueiued
I don't think graph operations are a fair comparison; they are notoriously
difficult to scale.

On the other hand, AWS now offers a 2TB-RAM machine, and a single huge machine
has a smaller per-GB cost than several smaller machines. I think clustered
computing as we know it will soon be gone. The only reason for multiple
machines will be availability.

~~~
brianwawok
Do you think our datasets will stop growing? It seems to me that data is
growing faster than RAM, and has been for years. How do we find the upper
limit of data? The human genome is finite; it will only get so big. What you
did on Facebook? Seems near-infinite....

~~~
vidarh
I'm sure the upper end of our datasets won't stop growing for the foreseeable
future. But a huge proportion of problems have growth rates well below the
growth rate of RAM.

And for that matter, even when we can't stuff it all in RAM, the boundary of
what we can do on a single server is constantly pushed back thanks to SSDs.
Just a few years ago I couldn't get read speeds of more than 6GB/sec even out
of a RAM disk; today I have servers that easily do 2GB/sec out of NVMe SSDs.

It's not that we never need to go beyond a single server. But people often
really have no concept of when they'll need to.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=8901983](https://news.ycombinator.com/item?id=8901983).

------
colin_mccabe
Big data systems aren't about squeezing every drop of performance out of the
hardware. They're about being able to scale your solution up effortlessly, and
about having a uniform programming model so that you don't have to rewrite
your solution N times as you go from a small test dataset, to production data,
to 1000x the data. It's also about using a system that other people use, so
that you can actually hire admins and service companies with experience in
maintaining your solution.

------
kibwen
Note that this is from January 2015 and thus predates the stable Rust 1.0
release in May 2015, so it's possible that the code examples do not compile on
post-1.0 Rust.

------
theanomaly
Thanks for the analysis -- it is good for people to have this context in their
heads when designing systems. The conversation missing from this article is
that some people conflate scalability with performance. They are different,
and you absolutely trade one for the other. At large scale you end up getting
performance simply from being able to throw more hardware at the problem, but
it takes you quite a while to catch up to where you would have been on a
single machine.

This is true not just for computing algorithms, but for developer time/brain
space as well. Single-threaded applications are far simpler to understand.

The takeaway shouldn't be "test it on a single laptop first", but rather "will
the volume/velocity of data, now or in the future, absolutely preclude doing
this on a single laptop?". At my work, we process probably a hundred TB in a
few-hour batch-processing window at night, terabytes of which remain in memory
for fast access. There is no choice there but to pay the overhead.

------
dzdt
This reminds me of the "your data fits in RAM" website which was on HN last
year. Basically, that site asked for your data size, then answered "yes" for
any size up to a few TB.

The website is down, but the HN discussion is still there :
[https://news.ycombinator.com/item?id=9581862](https://news.ycombinator.com/item?id=9581862).

In fact the top comment there links to the original post here.

~~~
herge
> The website is down,

Maybe they ran out of ram?

------
psiclops
I was at Mozilla for your talk!! Very interesting stuff

------
eva1984
I bet the author didn't count the time it takes to download the data to a
single box. Scalability, sometimes, is not a choice.

~~~
chubot
I really want a content-addressed storage system with differential compression
to solve this problem.

I was browsing through dat for a while, but haven't caught up with it lately:
[https://github.com/maxogden/dat](https://github.com/maxogden/dat)

Basically, disk is so cheap that you should just keep 2 or 3 copies of your
data around. Then you can sync them really quickly and do the processing on
any one of N machines.
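
A toy sketch of the content-addressed part of that idea (no differential
compression; the directory name is made up): because chunks are keyed by their
hash, a sync only ships chunks the other side doesn't already have.

    import hashlib
    import os

    STORE = "cas"  # illustrative chunk directory

    def put(chunk: bytes) -> str:
        """Store a chunk under its SHA-256 digest; duplicates are free."""
        digest = hashlib.sha256(chunk).hexdigest()
        path = os.path.join(STORE, digest)
        if not os.path.exists(path):  # a peer can skip digests it already has
            os.makedirs(STORE, exist_ok=True)
            with open(path, "wb") as f:
                f.write(chunk)
        return digest

    def get(digest: str) -> bytes:
        with open(os.path.join(STORE, digest), "rb") as f:
            return f.read()

    ref = put(b"some 4MB chunk of your dataset")
    assert get(ref) == b"some 4MB chunk of your dataset"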

