

Google: Taming The Long Latency Tail - When More Machines Equals Worse Results - yarapavan
http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html

======
dgreensp
The "insightfulness" of these points is overblown, especially if you've worked
at Google, where the solution to "our bloated application server takes 20
minutes to recompile" is "then use our distributed compiler that runs on 100
machines," and the solution to "sometimes a worker machine takes a long time
to come up or doesn't come up at all" is "then use redundant workers and fire
up 200 machines."

I wouldn't be surprised if Google does pull off some snazzy new real-time
architecture to use internally, but so far I think their strategy of farming
out execution to huge numbers of crappy machines, while innovative and very
successful, has pretty much exactly the problems you'd expect it to have.

~~~
jrockway
_the solution to "our bloated application server takes 20 minutes to
recompile" is "then use our distributed compiler that runs on 100 machines,"_

This is a bit disingenuous. The app servers are bloated because every library
is rebuilt and statically linked, avoiding deployment problems and apps that
use massively-outdated libraries. Continuous integration is expensive, but we
can afford it.

------
haberman
> With flash you can read 4KB of data in 100 microseconds or so, but if your
> read is stuck behind an erase you may have to wait 10s of milliseconds.
> That's a 100x increase in latency variance for an operation that used to be
> very fast.

I had never heard this claimed before -- is this true? Wikipedia says:

"Less expensive SSDs typically have write speeds significantly lower than
their read speeds. Higher performing SSDs have similar read and write speeds."

So it sounds like this isn't as much of an issue as the article claims.

~~~
ori_b
I'm going to over-simplify a bit (and try to remember; it's been years since
I've actually paid attention to flash). Flash effectively works by erasing a
block to all 1s; a write can then only flip bits from 1 to 0, effectively
AND'ing the new data into the block. If you have a freshly erased block, it's
already all 1s, and you don't have to clear it off before you can use it.
This is fast.

If there are no free blocks, or if for whatever reason the flash device
elects to reuse a block you've already written to, it must erase the entire
block first, even if you're only writing one bit. This is relatively slow.

Most high-end flash devices, as I understand it, try to keep a pool of pre-
erased blocks so that they can stay on the fast path, on average.

As far as I can tell, what they're saying in this article is that sometimes
you can hit the slow path, and that causes high latency. On average, it's not
a big deal, but for some random small number of writes, it can be an issue.
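
To make the fast/slow path concrete, here's a toy Python model of it. The
timing constants are invented for illustration (not from any datasheet), but
they're in the right ballpark for the ~100x gap the article describes:

    # Toy model: a page program on a pre-erased block is fast; once the
    # pool of pre-erased blocks is drained, each write also pays for a
    # block erase. Constants are hypothetical.
    PROGRAM_US = 100     # page program (fast path)
    ERASE_US = 10_000    # block erase a write can get stuck behind

    def write_latency_us(pool: int) -> tuple[int, int]:
        """Return (latency_us, remaining pre-erased pool) for one write."""
        if pool > 0:
            return PROGRAM_US, pool - 1       # fast path
        return PROGRAM_US + ERASE_US, 0       # slow path: erase first

    pool = 3
    for i in range(6):
        latency, pool = write_latency_us(pool)
        print(f"write {i}: {latency:6d} us (pool={pool})")

The first three writes come back in 100us; once the pool is empty, the same
small write takes 10,100us. That occasional 100x jump is exactly the tail
the article is talking about.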

~~~
littledanehren
Note: This is my own opinion, and not that of my employer (Google) or based on
trade secrets or other IP from my employer.

Keeping pre-erased blocks is useful, but it can only reduce the write latency
to what the chip gives you. And the chip gives you a longer write latency than
read latency, especially for MLC with smaller process sizes.

~~~
ori_b
True, there is a difference between read and write latency, but at least that
is consistent and therefore easy to plan for. Large variance makes things far
more difficult, in my opinion.

~~~
mvgoogler
It's actually not consistent. On MLC flash, individual write operations can
vary by a factor of 6.

see: <http://cmrr-star.ucsd.edu/starpapers/309-Grupp-1.pdf> (PDF)

------
strictfp
It is possible to fix the max-latency problem. If you have K distinct
queries, just make sure that each such query is answered by N servers, and
take the first response that comes back. If you do this, you go from the
worst latency amongst all servers to the worst latency amongst all N-tuples,
where each tuple contributes the latency of its fastest member. Overall
latency then approaches the average latency of the slowest distinct query as
N increases.
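
A minimal simulation of this, with an invented latency distribution (uniform
1-5ms per server, plus a 2% chance of a 100ms stall), shows the effect:

    import random

    # Each of K distinct queries is sent to N replicas and the fastest
    # answer wins; the request as a whole finishes when the slowest of
    # the K queries does (max over K of min over N).
    def server_latency_ms() -> float:
        stall = 100.0 if random.random() < 0.02 else 0.0
        return random.uniform(1.0, 5.0) + stall

    def request_latency_ms(k_queries: int, n_replicas: int) -> float:
        return max(min(server_latency_ms() for _ in range(n_replicas))
                   for _ in range(k_queries))

    random.seed(0)
    for n in (1, 2, 3):
        samples = sorted(request_latency_ms(100, n) for _ in range(2000))
        print(f"N={n}: median={samples[1000]:.1f} ms, "
              f"p99={samples[1979]:.1f} ms")

On a typical run, with N=1 a 100-query fan-out almost always hits at least
one stalled server; each added replica multiplicatively shrinks the odds of
a query getting stuck, and by N=3 the tail collapses toward the average
case.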

~~~
mvgoogler
Sure. It's trivial to arbitrarily reduce the probability of this one issue by
throwing N times the hardware at it.

Now you have to deal with the problems of:

1) Buying, powering, managing, etc. N times the hardware. When you buy
hardware by the warehouse, this will be a big bill :-)

2) Since you are now increasing your network traffic, power, etc. by _at
least_ N times, it is very likely you will hit another problem that produces
latency spikes.

Or, as Chris Colohan likes to say: "If you think solving problems in
massively parallel computing systems is easy, please come join my team" :-)

~~~
strictfp
I'm not saying that it's easy, but the problem formulation in the article is
simplified enough to be solved by this method. So I guess my major objection
is to how the problem is stated.

Also, in practice you don't really need N times the hardware. You pick out
the queries with the worst jitter and either fix them or run two or three
replicas of just those services, and you are good to go.

To summarize: I agree with you in principle, but would add that the method
is still sound, just perhaps not in its most extreme form.
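
A hypothetical variant of the earlier simulation makes the selective point:
only the queries known to be jittery (10% of them here, with invented
numbers) get three replicas, and the rest get one:

    import random

    # Replicate only the high-jitter queries (N=3); the well-behaved
    # queries stay at N=1. Latency numbers are invented for illustration.
    def latency_ms(jittery: bool) -> float:
        stall = 100.0 if (jittery and random.random() < 0.05) else 0.0
        return random.uniform(1.0, 5.0) + stall

    def request_latency_ms(queries: list[bool]) -> float:
        return max(min(latency_ms(j) for _ in range(3 if j else 1))
                   for j in queries)

    random.seed(0)
    queries = [i % 10 == 0 for i in range(100)]  # every 10th query jitters
    samples = sorted(request_latency_ms(queries) for _ in range(2000))
    print(f"p99 with selective replication: {samples[1979]:.1f} ms")

The extra cost is 10 queries times 2 extra replicas, roughly 20% more work
rather than 3x across the board, while the jittery queries still get their
tails masked.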

------
PaulHoule
Well, in the ordinary "enterprise", people put throughput first and couldn't
care less about latency... with disastrous effects on user experience.

