
Numbers everyone should know - chuhnk
http://highscalability.com/numbers-everyone-should-know
======
grammr
My databases professor at my university once brought up Jim Gray's data
latency analogy to illustrate the costs of using different types of storage.
Basically it goes something like this:

Using a register is like interacting with information in your own brain, using
an L1 cache is like walking to a bookshelf in the room you're in, using an L2
cache is like walking across campus, using RAM is like going to Sacramento (we
were in Berkeley), hitting the disk is like _going to Pluto_ , and using tape
drives is like going to Andromeda.

I never fully registered the consequences of hitting disk until then.

~~~
kingkilr
My company does this, but in reverse: whenever someone is getting something to
eat, there's a metaphor for it. Cache means you've already got it, hitting
memory means you need to go to the kitchen, disk means you need a trip to the
store, and registers of course means you're actively chewing.

~~~
Johngibb
I really like this analogy. When the article said "Numbers everyone should
know", it's not really the numbers that matter, but the general relative scale
of performance costs for common operations. IOW, memorizing the nanoseconds
of a cache hit isn't useful, but knowing how many times slower disk is than
RAM is.
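
As a rough illustration of "relative scale", here is a small sketch that turns a few of the commonly quoted figures into ratios. The values are the approximate ones usually attributed to Jeff Dean's list and should be read as orders of magnitude, not measurements; the `LATENCY_NS` table and the `times_slower` helper are names made up for this example.

```python
# Approximate latencies in nanoseconds, as commonly quoted from Jeff Dean's
# list. These are order-of-magnitude figures, not measurements.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "main memory reference": 100,
    "disk seek": 10_000_000,
}


def times_slower(slow, fast):
    """Return how many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


if __name__ == "__main__":
    print(times_slower("disk seek", "main memory reference"))
    print(times_slower("main memory reference", "L1 cache reference"))
```

With these figures a disk seek comes out roughly 100,000 times slower than a main memory reference, which is the kind of ratio worth remembering.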

------
fleitz
People should also know that, for writes, there are a variety of ways to
mitigate this. The first way is to write sequentially (no seeking). The second
way is to install a battery-backed write cache (BBWC), which will do this for
you. If you're ok with losing a little bit of data, you can also just have the
OS write to memory before flushing to disk. BBWC controllers seem to handle
write reorder optimization much better than the OS, so I'd recommend the BBWC
over writing to memory.

This is a large reason I dislike cloud hosting: most of the hardware is crap
and you could get much better performance from it with a little more expense.
Also, it's mostly pointless to bother making your writes sequential, as the
performance degrades to random IO because of all the other instances on the
machine.

------
pfedor
Well if anyone wants to learn more from Jeff Dean than just twelve numbers,
here's a talk he gave at Stanford about Google's architecture and system
building in general:

[http://ee380.stanford.edu/cgi-bin/videologger.php?target=101...](http://ee380.stanford.edu/cgi-bin/videologger.php?target=101110-ee380-300.asx)

The slides for the talk:
[http://www.stanford.edu/class/ee380/Abstracts/101110-slides....](http://www.stanford.edu/class/ee380/Abstracts/101110-slides.pdf)

------
rythie
Needs updating for SSDs. I'm not sure it's really worth the development time
and increased time to market to optimise for rotating disks.

~~~
idlewords
You seem to have accidentally traveled back in time to 2011 from whatever
golden future you live in before posting this comment.

~~~
rythie
We spent a lot of time over several years tweaking every setting imaginable,
trying to get the performance of an NFS server up, none of which really made
any difference. Installing SSDs fixed the problem completely. Disk I/O now
stays in the single-digit percentage range, when previously it would stay at
100%, sometimes for minutes.

SSDs are about twice the price of 15k rpm drives, hardly out of reach for most.
Also, for VPSs, Server Axis 20mb unmetered SSDs are cheaper than Slicehost's
rotating-disk-based systems. <http://serveraxis.com/vps-ssd.php>

~~~
kragen
Yeah, SSDs are going to totally kill 15krpm disks. But killing disks in
general is a ways off. In fact, disks might actually be getting cheaper per
bit faster than SSDs are.

------
Getahobby
I know noSQL is all the rage with the crazy kids these days and I know it has
its place but why is half of the literature regarding noSQL about solving
problems that are trivial in an RDBMS?

~~~
idlewords
This particular article is addressing problems that would not be trivial in an
RDBMS. If you're worried about concurrency when incrementing a counter, you're
at a level of scale few people get to work on.

There's a legitimate rant to be had about using elephant architecture to serve
mouse traffic, and I think that's what annoys me most about the NoSQL fad. But
that doesn't mean the problems mentioned in this article don't exist.

~~~
edanm
"If you're worried about concurrency when incrementing a counter, you're at a
level of scale few people get to work on."

I'm not sure I understand. If I make an application that has to increment a
counter, shouldn't I always be worried about concurrency?

What I mean is, sure, if I'm serving only a few requests, then of course the
probability of running into concurrency issues is lower than a site with more
requests/sec. But it's still a matter of chance, there is _some_ probability
that two requests will come in at exactly the same time and cause a problem.

~~~
j_random_hacker
You are worrying about the _correctness_ of concurrency. Yes, you should
always worry about that.

The parent poster is worrying about the _performance_ of concurrency. You
don't need to worry about that unless you are one of about 5-10 tech companies
whose name is recognisable to people on the street.
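
To make the correctness side concrete, here is a minimal in-process sketch using a lock so that no increments are lost. It is only an analogy for what a datastore transaction would do; `SafeCounter` and `run` are hypothetical names for this example, not anything from the article.

```python
import threading


class SafeCounter:
    """Counter whose read-modify-write is protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value
        # and one of the two increments would be lost.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads and return the total."""
    counter = SafeCounter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

The lock makes the result correct at any request rate. The performance question only appears when so many writers queue up on that one lock (or one row) that it becomes the bottleneck.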

------
j_random_hacker
I love the concept of sharding counters to trade write performance for read
performance! As so often happens, my understanding advances by looking at
something I thought I knew the only possible answer to ("Of course a counter
goes in a single variable!") and seeing that other alternatives are possible.
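
For anyone who wants to see the shape of it, here is a minimal in-memory sketch of a sharded counter. It assumes each shard would be a separate entity or row in a real datastore so that writes rarely collide; the `ShardedCounter` class and its shard count are made up for illustration and are not the article's code.

```python
import random


class ShardedCounter:
    """Counter split across several shards to spread write contention."""

    def __init__(self, num_shards=8):
        # In a real datastore each shard would be its own entity or row.
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # A write touches only one randomly chosen shard.
        index = random.randrange(len(self.shards))
        self.shards[index] += amount

    def value(self):
        # A read has to visit every shard and add them up.
        return sum(self.shards)
```

Writes get cheaper because they spread over `num_shards` places, and reads get more expensive because they have to sum all of them, which is the trade described above.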

------
darwinGod
An article with 'numbers everyone should know' without linking to Peter
Norvig?! Here goes: <http://norvig.com/21-days.html>

------
hammock
Not trying to be mean, but sometimes I wonder how many programmers actually
have friends outside the industry. This happens when I read titles like
"Numbers everyone should know" and then find out that it's numbers like the
time needed for an L2 cache reference or how to optimize for low write
contention. Does anyone realize that this stuff is about as esoteric as it
gets?

~~~
wingo
I think the author assumed readers were programmers. That's OK in a trade
publication.

Indeed, every programmer should know those numbers.

~~~
bigiain
There's still an argument to be made that it's largely esoteric even for
programmers. There's an enormous cross section of professional programmers
for whom this only satisfies some curiosity rather than being in any way
relevant to their day-to-day coding. There's 100% no use in knowing the
difference between an L1 and L2 cache hit compared to a branch misprediction
if your day job involves, say, writing WordPress plugins in PHP or iOS apps or
maintaining some corporate Java app.

~~~
robryan
You're right that you can't really think about this stuff too much sitting in
a high-level language. It is useful, though, to have a good idea of the
underlying architecture that you're taking advantage of at a higher level.
Sure, L1 and L2 cache optimization is out of your hands, but if you are writing
an app for GAE, knowing the relative speeds of writes in relation to reads
could influence the decisions you make.

------
mleonhard
Another way to make unique ordered keys is to use a timestamp and append a
random value, such as a random UUID.

key = timestamp + UUID4
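
A minimal sketch of that in Python, assuming string keys that sort lexicographically. The zero-padded nanosecond timestamp and the `make_key` name are choices made for this example, not part of the original suggestion.

```python
import time
import uuid


def make_key():
    """Build a key that sorts by creation time and is unique in practice."""
    # Zero-pad the timestamp so that string order matches numeric order.
    timestamp = f"{time.time_ns():020d}"
    # The random UUID breaks ties between keys created at the same instant.
    return f"{timestamp}-{uuid.uuid4().hex}"
```

Keys created at the same timestamp end up in random relative order, and the ordering is only as good as the clock that produces the timestamp.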

~~~
j_random_hacker
I didn't like this idea at first, since it struck me that if for some reason
you had to create many records in quick succession (e.g. some kind of bulk
upload) they would essentially be placed in random order. But thinking about
it, the only scenarios where you would need to do this _and_ want to preserve
the order exactly would be if all the records were coming from the same source
-- in which case you could just choose a single UUID for the batch and append
increasing values of a _local_ counter to this.

TL;DR: This would actually work really well I think!
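
Roughly what I have in mind, as a sketch: one UUID per batch plus a local counter. The key layout, the padding widths and the `batch_keys` name are assumptions made for illustration.

```python
import uuid


def batch_keys(timestamp, count):
    """Yield ordered keys for `count` records that share one timestamp."""
    batch_id = uuid.uuid4().hex  # one UUID for the whole batch
    for sequence in range(count):
        # The zero-padded local counter preserves order inside the batch.
        yield f"{timestamp:020d}-{batch_id}-{sequence:08d}"
```

Records from one source keep their exact order, while batches from different sources still cannot collide because each carries its own UUID.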

~~~
bmm6o
Isn't this exactly what the article suggests doing? Time stamp + user id +
comment count.

------
alecco
Blogspam repost of a 2yo+ presentation. A very good one, but still...

~~~
chuhnk
I'm sorry you feel that way. I was re-reading information on systems
architecture and came across the article. I thought it was relevant to HN and
thought, hey, this is something people might like to read. Not everyone here
has been here since '09 and not everyone here will have seen the article. Just
trying to help educate fellow members of this community.

~~~
mleonhard
I get connection reset when I try to comment on your blog.

~~~
chuhnk
I have a blog?

------
known
[http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc....](http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc.html#flocking)

------
velshin
Excellent tidbits!

