
The Full Stack, Part I - dimm
http://www.facebook.com/notes/facebook-engineering/the-full-stack-part-i/461505383919
======
nerfhammer
> Well, if the table is in InnoDB format it will result in one disk-seek,
> because the data is stored next to the primary key and can be deleted in one
> operation. If the table is MyISAM it will result in at least two seeks,
> because indexes and data are stored in different files. A hard drive can
> only do one seek at a time, so this detail can make the difference between
> 1X or 2X transactions per second.

Nonsense.

Innodb has clustered primary keys, which means that the row data is attached
to the leaf nodes of the primary key index as the author correctly states.
However, the leaf nodes and the non-leaf index nodes are actually stored in
different segments of the table space! While they live in the same (giant)
file, it is unlikely that they would ever be close enough together on disk to
be read in a single random IO operation.

But it's more complicated than that: if any of the index pages or data pages
have been read recently they will probably still be in the buffer pool, which
means that they will require no disk operations.

But that's just the seek operation to find the row. The write operation is a
different story yet.

What innodb will do is modify the row by marking it with the transaction id in
which it was deleted. It will keep the row in place so readers with an older
transaction id will still see it until all those transactions are complete.
The change in the row and the row page will be written to the copies of the
affected pages in memory only. Eventually the data pages and any affected
index pages will be flushed to disk, potentially grouped with other changes to
the same pages. IO operations occur on the level of reading and overwriting
whole pages only, if not more.

Concurrently it will record in the log buffer every change it makes to the
pages in memory. This won't get written to disk right away either, it will
flush the log buffer to disk once per second in the default configuration.
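A toy Python model of that delete-marking scheme (purely illustrative; InnoDB's real record format, undo logs, and visibility rules are far more involved): the "deleted" row stays in place, tagged with the deleting transaction's id, and remains visible to any reader with an older transaction id.

```python
# Toy model of MVCC delete-marking. NOT InnoDB's implementation -- just
# the visibility idea described above, in a few lines.

class Row:
    def __init__(self, data):
        self.data = data
        self.deleted_by = None  # id of the transaction that deleted this row

def delete(row, txn_id):
    # "Deleting" only marks the row; it stays in place on the page.
    row.deleted_by = txn_id

def visible_to(row, txn_id):
    # A reader with an older txn id still sees the row; newer readers don't.
    return row.deleted_by is None or txn_id < row.deleted_by

row = Row("hello")
delete(row, txn_id=100)
print(visible_to(row, txn_id=50))   # older reader: still visible
print(visible_to(row, txn_id=150))  # newer reader: gone
```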

So there are many more potential disk operations required of innodb than
myisam. Generally innodb is preferable because it is vastly more reliable, and
because it can handle concurrent reads and writes to the same data -- MyISAM
basically can't. MyISAM will in fact generally be FASTER for any single
operation than InnoDB, because it simply does less.

~~~
aristus
I think you're right. I'll try to come up with a simpler example and update
the piece. Thanks!

~~~
aristus
Corrected, though I'm not completely happy with it:

<http://carlos.bueno.org/2010/11/full-stack.html>

------
tom_b
I started as a performance software person, from memory- and cache-conscious
algorithm design in grad school, to network stack testing, to web server/db
server/client-side performance testing and optimization. It is an _excellent_
way to develop that feel for the big picture of how things are working. In
enterprise software, there used to be a lot of low hanging fruit and it was
fun to get "heroic" results with simple SQL tuning.

The flipside of that experience and mindset is that now, as I try to shift
towards more functional and declarative programming styles, I sometimes get
sucked into the premature optimization black hole - I overthink rather than
just doing some simple exploratory programming.

I keep telling myself I'm going to implement some fairly large project using
nothing but lists and a Lisp to break it down.

~~~
ohyes
I get around this by implementing the abstract version of my code first; I
then implement the abstractions that I used in the abstract version of my
code. So I'm not even worried about running code until I have a general idea
of what I want it to be doing.

I obviously then rework stuff as I learn that my assumptions were totally
wrong, but this way I've started from an incorrect conceptual model rather
than an incorrect full implementation. I think this makes it easier to
manipulate.

edit: And I guess the point of that is that according to XKCD, I start at
sociology, and work my way down the stack until I get to whatever
language/libraries (hopefully not Math) I'm actually programming in. Like some
sort of crazy fractal.

------
tedunangst
I don't think I agree with the conclusion that performance will "depend almost
completely on the random seek time." You can store the entire 10TB library on
five 2TB spinning drives. Five drives can easily serve up 500 Mbps (that's
only ~60 MB/s, one-drive territory). So, on to seeking.

2000 streams, 5 drives, that's 400 per disk. Let's say we have the world's
worst disks, that can only do 10 seeks per second. 400 / 10 means we have to
buffer 40 seconds of data per stream (per seek) and we have to read it in 0.1
seconds before moving to the next stream. 300 kbps * 40 s / 8 = 1500 KB of
data. 1.5 MB / 60 MB/s disk transfer takes 0.025 seconds, well under 0.1.
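The arithmetic above checks out; here it is as a back-of-the-envelope script (the figures are the assumed ones from this comment, not measurements from any real system):

```python
# Sanity-check the streaming-from-disk arithmetic above.
streams, drives = 2000, 5
seeks_per_sec = 10      # deliberately pessimal disk
stream_kbps = 300       # per-stream bitrate, kilobits per second
xfer_mb_s = 60          # sequential transfer rate, MB/s

per_disk = streams / drives                # 400 streams per disk
buffer_secs = per_disk / seeks_per_sec     # 40 s of data buffered per stream
chunk_kb = stream_kbps * buffer_secs / 8   # 1500 KB read per stream per visit
read_time = (chunk_kb / 1024) / xfer_mb_s  # ~0.024 s, under the 0.1 s budget

print(per_disk, buffer_secs, chunk_kb, round(read_time, 3))
```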

I guess that's alluded to by "non standard prefetching", but I don't think
it's that advanced. Especially since in a streaming video application, the
client software is already going to be doing buffering for you. The bottleneck
is bandwidth.

Check my math please? :)

~~~
aristus
This sounds reasonable. AFAIK the default readahead on Linux is 128KB, so
you'll probably need to test and tune quite a bit to get it right. I had a
paragraph roughly along those lines in an early draft, but decided it was too
much detail. "Concurrent reads" is probably more accurate than plain "seek
time".

~~~
tedunangst
The kernel readahead is actually irrelevant: if your app issues 1MB reads one
at a time, the kernel will fetch each one all at once.

------
brown9-2
Reading this makes me really curious how Netflix's Instant Watching service is
architected. Anyone have any details?

~~~
contol-m
As far as I know, Netflix does not handle the actual streaming of bits - it is
done by CDNs like Akamai or Limelight.

------
aw3c2
This is very interesting and well written, so someone "dumb" like me can learn
a lot about planning and optimisation from this. Right in the third paragraph
I just realised why some SQL query I wrote is so slow. Embarrassing but true,
heh!

~~~
lsb
Are you just inexperienced? Don't prematurely sell yourself short. I'd read
that a Rails hosting company found that 70% of their customers hadn't properly
indexed their tables, and knowing's half the battle.
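As a concrete illustration of the missing-index problem (SQLite here for portability, but the idea carries over to MySQL; the table and index names are made up), adding an index turns a full table scan into an index lookup, which you can see directly in the query plan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN describes how SQLite will execute the query.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'a@example.com'"
print(plan(query))  # without an index: a full scan of the table

conn.execute("CREATE INDEX idx_users_email ON users (email)")
print(plan(query))  # with the index: a SEARCH ... USING INDEX lookup
```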

------
ajays
A much more readable, FB-free version of this is at his blog,
<http://carlos.bueno.org/>

~~~
dsmithn
Thanks, facebook.com is blocked at work.

------
efsavage
A company well-stocked with full-stack engineers who can communicate makes the
execution of good ideas so frictionless and the death of bad ones so quick and
painless that it cannot help but succeed.

If this is the caliber of engineering talent that Facebook values, there
should be no surprise they are taking over the world.

------
gcb
From the ex-Yahoo who got 'famous' by using a microwave near his wifi laptop
to simulate dropped packets.

