
What's wrong with 1975 programming - dchest
http://varnish-cache.org/wiki/ArchitectNotes
======
tptacek
This is one of the best systems programming articles I've read in a very long
time. Short summary:

* Trust the VM system to figure out how to page things (hey, 'antirez, what's your take on that? You wrote an ad hoc pager for Redis.) instead of getting fancy, because if you get fancy you'll end up fighting with the VM system.

* Minimize memory accesses and minimize the likelihood that you'll compete with other cores for access to a cache line; for instance, instead of piecemeal allocations, make a large master allocation for a request and carve it out.

* Schedule threads in most-recently-busy order, so that when a thread goes to pick up a request it's maximally likely to have a pre-heated cache and resident set of variables to work with.

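A minimal sketch of the second point, with made-up names rather than Varnish's
actual workspace API: make one master allocation per request, carve small
pieces off it by bumping a pointer, and free the whole thing when the request
ends.

    #include <stddef.h>
    #include <stdlib.h>

    /* One master allocation per request; small pieces are carved off by
       bumping a pointer, and everything dies in a single free(). */
    struct workspace {
        char  *base;
        size_t used;
        size_t size;
    };

    static struct workspace *ws_new(size_t size)
    {
        struct workspace *ws = malloc(sizeof *ws);
        if (ws == NULL)
            return NULL;
        ws->base = malloc(size);
        if (ws->base == NULL) {
            free(ws);
            return NULL;
        }
        ws->used = 0;
        ws->size = size;
        return ws;
    }

    static void *ws_alloc(struct workspace *ws, size_t n)
    {
        n = (n + 15) & ~(size_t)15;        /* keep pieces 16-byte aligned */
        if (ws->used + n > ws->size)
            return NULL;                   /* workspace exhausted */
        void *p = ws->base + ws->used;
        ws->used += n;
        return p;
    }

    static void ws_free(struct workspace *ws)
    {
        free(ws->base);
        free(ws);
    }
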
~~~
JoachimSchipper
On one hand, mmap() is very nice. On the other, you may run into the limits on
process size, and this is essentially unsolvable [1]. So your program is
limited to 3GB of data on 32-bit machines.

I also think hierarchical allocators - a slightly-formalized version of phk's
"carve chunks off a block of pre-allocated memory" - deserve more attention.
See SAMBA's talloc, or halloc.

[1] Well, you can write some code to page things out as appropriate, but by
the time you're inventing your own virtual memory manager you're definitely
doing it wrong. Just go with old-fashioned file-based code.
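Samba's talloc gives a feel for what the hierarchical version looks like: every
allocation hangs off a parent context, and freeing the parent frees the whole
tree. A rough sketch of typical usage (as I understand the talloc API):

    #include <talloc.h>
    #include <stdio.h>

    struct request {
        const char *url;
        char       *headers;
    };

    int main(void)
    {
        /* Per-request context: everything allocated under it is released
           in one call when the request is finished. */
        TALLOC_CTX *req_ctx = talloc_new(NULL);

        struct request *req = talloc(req_ctx, struct request);
        req->url     = talloc_strdup(req_ctx, "/index.html");
        req->headers = talloc_strdup(req_ctx, "Host: example.com\r\n");

        printf("%s %s", req->url, req->headers);

        talloc_free(req_ctx);    /* frees req, url and headers together */
        return 0;
    }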

~~~
jacquesm
If you're running on 32 bit architectures for your main servers you are just
playing around and you don't need stuff like varnish.

If you're going to try to push multiple Gbps out of a single box the least you
could do is put a bunch of ram in it and install a 64 bit OS. That's a lot
more bang for the buck than installing multiple 32 bit boxes with only a bit
of memory in each.

~~~
JoachimSchipper
Certainly, the "professional" thing to do is install a 64-bit box with lots of
memory. But don't underestimate "playing around", especially as it pertains to
the popularity of OSS.

~~~
Psyonic
Can you even buy a 32-bit box anymore?

~~~
jjs
Do smartphones count? ;)

~~~
jacquesm
Only if you use them as servers.

------
herf
These systems seem to be, in practice, relying on having working sets that
mostly fit in RAM. It's a nice world when you can stack 64GB in a box and
never worry about disk, but having worked on systems that need more working
set than this, I don't think this idea actually scales up as well as is
claimed.

Two issues: 1. Small objects aren't batched by a VM, so they can't be
combined into long streaming writes and reads. Doing 4k writes and reads is
useless on a modern disk. 128k to 1MB is more appropriate if you can organize
things that way.

2. If your workload "mostly" fits in RAM, and then as your application
scales, the working set exceeds RAM, you wind up doing some queries that are
instant, and some that are a million times slower (a real, uncached disk
seek). Unless you plan for this, your app can fall over and you just won't
know what happened. You need to know the fraction of requests that will go to
disk and plan to have spindles (or RAM) to handle it. I don't like systems
that make measuring what's going on at this level any harder to do.
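A sketch of the batching described in point 1, with invented names: accumulate
small objects in a user-space buffer and issue one large sequential write()
when it fills, instead of many 4k writes.

    #include <string.h>
    #include <unistd.h>

    #define BATCH_SIZE (1024 * 1024)       /* flush in ~1MB chunks */

    struct batcher {
        int    fd;
        size_t used;
        char   buf[BATCH_SIZE];            /* heap-allocate in real code */
    };

    /* Push the accumulated objects out as one long sequential write. */
    static int batch_flush(struct batcher *b)
    {
        size_t off = 0;
        while (off < b->used) {
            ssize_t n = write(b->fd, b->buf + off, b->used - off);
            if (n < 0)
                return -1;
            off += (size_t)n;
        }
        b->used = 0;
        return 0;
    }

    /* Queue one small object; the disk is only touched when the batch fills. */
    static int batch_append(struct batcher *b, const void *obj, size_t len)
    {
        if (len > BATCH_SIZE)
            return -1;                     /* oversized objects handled elsewhere */
        if (b->used + len > BATCH_SIZE && batch_flush(b) < 0)
            return -1;
        memcpy(b->buf + b->used, obj, len);
        b->used += len;
        return 0;
    }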

The author's claim is that you get to share RAM with the VM, which is great,
but in practice a lot of systems use kernel-supported sendfile() to do work
(i.e. have a small streaming buffer per network connection), or cache super-
hot objects in a relatively small amount of RAM. The assumption that all user-
mode caches try to allocate as much RAM as possible is not true.

Alternately, larger systems separate RAM cache entirely from disk-bound boxes
(e.g., dedicated memcached).

I think a more complete treatment of this problem would explain scaling over
working set more aggressively, and it would use instrumentation to show how
the system degrades at scale.

~~~
jacquesm
Currently serving up billions of images per day from memory images that are
more than 100 times the available physical memory (and there's plenty of
that).

Small objects can be batched into a single page at the application level; the
VM will then move these in and out of the resident pool as one unit.
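A rough sketch of that application-level batching (the names are mine, not
Varnish's): objects that will live and expire together go into the same
page-aligned block, so a fault brings the whole group in and eviction throws
it out as one unit.

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* One page-sized, page-aligned slab per group of related objects. */
    struct slab {
        size_t used;
        size_t size;
        char   data[];
    };

    static struct slab *slab_new(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        struct slab *s;

        if (posix_memalign((void **)&s, (size_t)page, (size_t)page) != 0)
            return NULL;
        s->used = 0;
        s->size = (size_t)page - sizeof *s;
        return s;
    }

    static void *slab_put(struct slab *s, const void *obj, size_t len)
    {
        if (len > s->size - s->used)
            return NULL;               /* start a new slab for the next group */
        void *dst = s->data + s->used;
        memcpy(dst, obj, len);
        s->used += len;
        return dst;
    }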

The operating system uses elevator sorting and knows enough about the disk
that it will attempt to sequentially invalidate pages.

The assumption is not that the working set will fit (mostly) into RAM; the
assumption is that a page fault is a relatively rare occurrence and that other
threads will not be stalled by the IO done for one. It is the _changes_ to the
working set that determine page faults, not the size of the total set.

On my boxes I solved one issue with this (and the maximum number of systemwide
sockets) by running a varnish instance for every physical CPU in the machines.

------
drv
These "1975" programming concepts are essentially the same ideas taught in the
CS computer architecture course I took (circa 2006). The main focus was on the
memory hierarchy of modern computers, which was fine, but there was not much
advice about letting virtual memory do its job - the course covered mostly how
things work under the hood, but did not give practical programming
recommendations. I can see how budding programmers could get the idea that
"disks are slow, so I should try to manage when data gets copied to and from
RAM" without realizing that the system software is already taking care of it
and that attempting to do it manually will generally just make matters worse.

As an aside, I found it really hard to read this article because of the
grammar errors. Nearly every paragraph has a run-on sentence, and there's
plenty of missing punctuation. The style also seemed more like off-the-cuff
rambling (especially the long-winded attempts at humor near the beginning)
than carefully-written advice, which almost made me stop reading before
getting to the real content. The message is good, though, even obscured by
these problems.

------
helmut_hed
It seems to me that the first point made (on the performance of Squid's LRU
caching) is really just that Squid is doing a bad job at it, and Varnish
doesn't try at all (lets the VM system take care of it). In theory, managing
memory at the object/application level should give you some advantages over
doing it at the page/kernel level. I can imagine, for example, cases where
Squid might perform better by moving entire objects to/from disk, rather than
a page at a time in response to faults. In this case "1975 programming" really
means _trying to manage the memory hierarchy in the interest of performance_,
which is timeless. Indeed, the author later states that Varnish tries hard to
"reuse memory which is likely to be in the caches", which sounds like the same
idea applied to a different level.

~~~
JoachimSchipper
The kernel VM system has a lot more ("global") information, though.

------
_delirium
I know infinitely less about caching and VM than the author of the linked
article, but I was surprised by this part:

    
    
      Varnish also only has a single file on the disk whereas 
      squid puts one object in its own separate file. The HTTP 
      objects are not needed as filesystem objects, so there is 
      no point in wasting time in the filesystem name space 
      (directories, filenames and all that) for each object, all 
      we need to have in Varnish is a pointer into virtual 
      memory and a length, the kernel does the rest.
    

I've had more than one systems person give me the opposite advice, that yes,
using the OS's caching layer to do your disk/RAM balancing is good, but you
should write into files that are divided on logical boundaries that correlate
with how you use the data. Their argument was that this gives the caching
layer more information, e.g. it can consolidate all your tiny objects into one
part of the cache to avoid your small objects unnecessarily pinning a ton of
VM pages, and can do things like prefetch pages when you start to read a big
object, or even choose not to load a very large object into the cache at all
if you're reading it sequentially (keeping it from clobbering the cache). When
evicting pages it can also take small-versus-big-object and these-pages-go-
together issues into account, as opposed to all pages looking alike.

That's all hearsay, though, and I have no idea if it actually improves things
in practice on current OSs or with which kinds of workloads.
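For what it's worth, on Linux and the BSDs an application can hand some of that
information to the kernel explicitly with madvise() on an mmap'd region, even
when everything lives in one big file; a small sketch:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map a file and tell the kernel how we intend to use it, instead of
       relying on file boundaries to carry that information. */
    static void *map_for_streaming(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }

        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                     /* the mapping stays valid after close */
        if (p == MAP_FAILED)
            return NULL;

        /* Sequential scan: read ahead aggressively, drop pages behind us. */
        madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);
        /* Or MADV_WILLNEED to prefetch, MADV_DONTNEED once we're done. */

        *lenp = (size_t)st.st_size;
        return p;
    }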

~~~
nkurz
Both arguments might be correct. If you are dealing with a language that makes
a strong distinction between the object and the memory layout of that object,
you might be better off handling the serialization yourself. But Varnish is in
C, and data on disk is mapped to memory then used directly as a struct.

Thus there are no small objects scattered around --- each 'object' is
contiguous, and most likely on a single page. Prefetches happen automatically
--- in Linux at least, disk caching and the VM are essentially synonymous. The
benefit of using the VM directly with mmap() is that you have more control
over the details and less overhead.
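A toy illustration of that pattern, assuming the file was written with the same
struct layout (no portability or versioning handled here):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct object {
        unsigned len;
        char     body[];               /* object bytes follow the header */
    };

    /* The bytes on disk *are* the in-memory representation: no separate
       deserialization step, the kernel pages them in on first access. */
    static const struct object *open_object(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : (const struct object *)p;
    }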

~~~
_delirium
> in Linux at least, disk caching and the VM are essentially synonymous

Ah that could explain it. The people I know seem to be working on fs-level
caches that operate at least in part on file granularity, so maybe they assume
Linux/FBSD do fs-level stuff as well. They seem to try to do things like
deciding whether to cache or not based in part on how big the file being read
is, and what its historical usage patterns are.

~~~
andrewf
Windows, IIRC, has a filesystem caching system which is distinct from VM.

------
buro9
I prefer the bit in this article:
<http://www.varnish-cache.org/wiki/ArchitectureInlineC>

Where he says: "It is a particular common kind of hubris for IT architects, to
think that they know better than 100% of everybody else, this is less of a sin
in Open Source than in Closed Source, but a sin nonetheless."

------
gwern
Fighting with the OS - such as the VM system - is one of the primary arguments
for exokernel-style OSs; see

<http://en.wikipedia.org/wiki/Exokernel>

<http://pdos.csail.mit.edu/exo.html>

(In fact, I think one of their use cases showing an order of magnitude or 2
improvement specifically involves pairing an app with a custom VM algorithm.)

------
old-gregg
The MongoDB guys outsourced the entire caching/memory management to the OS VM,
and on my (modest) workloads it performs really well. It probably also means
that the OS is a bit more relevant now and that the choice b/w BSD/Linux isn't
just about personal preference anymore: I can imagine that their VM
characteristics are quite different and that "2006 style" software like mongod
won't work the same on each.

Anyone with low-level experience with BSD/Linux VMs?

------
Oxryly
Hmm... I thought the scourge of 1975 programming was ignoring the vast and
growing gap between the speed of the processor and the speed of RAM.

2010 programming has to deal with the fact that chasing a (non-cached) pointer
can consume hundreds of processor cycles. So much for trees...
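A crude way to see the cost yourself: chase a randomly ordered linked list that
is much bigger than the last-level cache and compare the per-hop time with what
the CPU's clock speed would suggest (numbers and shuffle quality are only
ballpark here).

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 21)                /* ~2M nodes * 64B = 128MB, well past L3 */

    struct node { struct node *next; long pad[7]; };   /* one cache line each */

    int main(void)
    {
        struct node *nodes = malloc(N * sizeof *nodes);
        size_t *order = malloc(N * sizeof *order);
        for (size_t i = 0; i < N; i++)
            order[i] = i;
        /* Shuffle so every ->next lands on a random line and defeats prefetch. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i + 1 < N; i++)
            nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order[N - 1]].next = NULL;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t hops = 0;
        for (struct node *p = &nodes[order[0]]; p != NULL; p = p->next)
            hops++;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per pointer chase over %zu hops\n", ns / hops, hops);
        free(nodes);
        free(order);
        return 0;
    }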

------
kaib
There is one interesting argument for not using the OS virtual memory system.
By using VM you have just turned disk errors into RAM errors. Many programs
can potentially handle disk corruption; almost none can cope with bad RAM.
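Concretely: a failed read() comes back as -1 with errno EIO and can be handled
inline, while an I/O error on an mmap'd page is delivered as a SIGBUS on
Linux/BSD, so the error handling has to look something like this sketch:

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static sigjmp_buf io_error_jmp;

    static void on_sigbus(int sig)
    {
        (void)sig;
        siglongjmp(io_error_jmp, 1);   /* unwind out of the faulting access */
    }

    /* Touch one byte of an mmap'd object; a disk error surfaces here as a
       signal rather than as an error return from read(). */
    static int probe(const volatile char *p)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigbus;
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(io_error_jmp, 1) != 0) {
            fprintf(stderr, "disk error while paging in object\n");
            return -1;
        }
        (void)*p;                      /* may fault and raise SIGBUS */
        return 0;
    }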

------
haberman
This Squid vs. Varnish comparison is quite similar to the sync vs. async
debate for network programming. In both cases, the question is: do I use some
OS abstraction (Virtual Memory or threads, respectively) as my application's
primary scheduling mechanism, or do I handle scheduling more explicitly at the
application level?

Of course the OS guys like PHK or Linus think you should use the OS
mechanisms. Linus hates O_DIRECT (<http://kerneltrap.org/node/7563>) and PHK
is taking a similar tack with this article. Just let the OS handle it.

But there are real downsides to this approach. One is that it makes you far
more dependent on the quality of the OS implementation. I'm sure PHK trusts
FreeBSD and Linus trusts Linux, but if you're writing for cross-platform you
might end up on a bad VM implementation. The last thing you want to tell your
customers is that they have to upgrade or change their OS to get decent
performance.

Also, the OS is by design a more static and less flexible piece of software
than anything you put in user-space. What if you need something that your VM
system doesn't currently provide? For example, how are you going to measure
(from user-space) the percentage of requests that incurred a disk read? Disk
reads are invisible with mmap'd VM. What if you need to prioritize your I/O so
that some requests get their disk reads serviced even if there are lots of
low-priority reads already queued? If you've bought whole-hog into an OS-
based approach and your OS doesn't support features like this, you don't have
a lot of options.
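The closest user-space answer I know of is mincore(), which reports which pages
of a mapping are currently resident; checking an object's pages before touching
it tells you whether the access would have hit the disk, though it's per-page
and inherently racy. A rough sketch (Linux prototype):

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns 1 if every page backing [addr, addr+len) is resident, i.e.
       touching the object would not have gone to disk; 0 otherwise.  Racy:
       pages can be evicted right after the check. */
    static int object_is_resident(const void *addr, size_t len)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
        size_t span = ((uintptr_t)addr + len) - start;
        size_t npages = (span + page - 1) / page;
        unsigned char vec[npages];     /* VLA: fine for a sketch */

        if (mincore((void *)start, span, vec) != 0)
            return 0;
        for (size_t i = 0; i < npages; i++)
            if (!(vec[i] & 1))
                return 0;
        return 1;
    }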

And while it's great in lots of cases that the page cache can be shared across
processes, OS's don't have great isolation between processes using the page
cache. If you run some giant "cp" and completely trash the page cache, your
Varnish process is likely to take a latency hit. In a shared server
environment, you want to be able to draw walls of isolation so that each user
gets the resources that he/she was promised. A shared page cache is hard to
fit within an isolation model like this, whereas an explicit cache in user-
space works fine.

Think about the microkernel vs. monolithic kernel debate. Maybe monolithic
kernels won, but it's still a good principle that if it can be left _out_ of
the kernel without loss of performance, it should. Why is it better to use an
interface like VM than to use some user-space library that manages disk I/O?
The kernel's one advantage is that it can handle page faults (and so can make
a memory reference into an I/O operation), but that's also the property that
makes it difficult to do good accounting of when you're actually incurring I/O
operations.

One final thing to mention: if you're using VM in this way, things degenerate
badly in low-memory situations. Since the pages of data are competing with
pages of the program itself, the program can get swapped out to service data
I/O. If you've ever seen a Linux box thrash with its HDD light flashing like
mad, you know how bad things can get when memory is temporarily too scarce to
even let programs stay resident. Using vast amounts of VM exacerbates this,
because it makes your programs and your data compete for the same RAM.

~~~
jacquesm
No matter what the quality of your VM: if you're going to have several IO hits
versus only the one where the VM (even a crappy one) pages the data in or out
just once, you will always be faster in a scenario like the one phk describes.

Doubling or even quadrupling your IO operations is very expensive.

In a situation such as the one for which this article is meant, you set things
up in advance to _never_ get into a situation where you start thrashing your
disk; programs are allocated a fixed amount of memory, and if a program does
not abide by that it is considered faulty.

The thrashing situation you describe can happen on machines that are run with
less rigid setups, but on a production server that you count on serving up a
few billion files every day you can't afford the luxury of random scripts
firing off CRON and other niceties like that.

Custom kernel, very limited set of processes that you know are 'well behaved',
as predictable as possible.

~~~
haberman
> No matter what the quality of your VM, if you're going to have several IO
> hits versus only the one where the VM (even a crappy one) pages the data in
> or out just once you will always be faster in a scenario like phk describes.

You can keep your own cache explicitly in user-space, and get multiple hits
with a single load into RAM.

> but on a production server that you count on serving up a few billion files
> every day you can't afford the luxury of random scripts firing off CRON and
> other niceties like that.

In a data center where you have tens of thousands of heterogeneous jobs
competing for thousands of machines, you can't afford the luxury of giving out
exclusive access to a machine. You have to have good enough isolation that
multiple jobs can run on the same machine without impacting each other
negatively. As CPUs get more cores this will become even more important.

~~~
jacquesm
The whole point of this article - and it is a very good point - is that
keeping your cache in user-space is not the right way to approach the problem.
And you can get multiple hits anyway if you make sure that data that will
expire together will end up in the same page.

Your other description does not match the use case of a production web server
running varnish instances as the front-end.

~~~
haberman
The whole point of my comment is that PHK's analysis leaves out many downsides
of leaving it all to the kernel.

His main argument against doing it in user-space is that you will "fight with
the kernel's elaborate memory management." But if you just turn off swap
completely and read files with read/write instead of mmap(), there is no
fight. Everything happens in user-space.

Leaving it all to the kernel has many disadvantages as I spent many paragraphs
explaining.
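For concreteness, the read()-based alternative looks something like this
(deliberately tiny, no eviction, names are made up): every miss is an explicit
pread() you can count, prioritize or reject, which is exactly the control that
disappears behind a page fault.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NBUCKETS 4096

    struct entry {
        struct entry *next;
        off_t         off;
        size_t        len;
        char         *data;
    };

    static struct entry *buckets[NBUCKETS];
    static unsigned long cache_misses;     /* trivially observable, unlike faults */

    static const char *cache_get(int fd, off_t off, size_t len)
    {
        struct entry **slot = &buckets[(unsigned long)off % NBUCKETS];
        for (struct entry *e = *slot; e != NULL; e = e->next)
            if (e->off == off && e->len == len)
                return e->data;            /* hit: no syscall, no page fault */

        cache_misses++;                    /* miss: one explicit disk read */
        struct entry *e = malloc(sizeof *e);
        if (e == NULL)
            return NULL;
        e->data = malloc(len);
        if (e->data == NULL || pread(fd, e->data, len, off) != (ssize_t)len) {
            free(e->data);
            free(e);
            return NULL;
        }
        e->off = off;
        e->len = len;
        e->next = *slot;
        *slot = e;
        return e->data;
    }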

~~~
jacquesm
I missed the 'if you just turn off swap completely' bit in the paragraphs
above.

Edit: even on re-reading it all I can't find it.

------
afhof
I was taught that the purpose of virtual memory was originally for running
multiple programs without their memory spaces overlapping rather than to use
more memory than the system actually had. Is this incorrect?

~~~
HenryR
You're correct in a sense - virtual memory gives each process its own
completely independent memory space that's addressed linearly.

In order to preserve the illusion of independence, the OS has to deal with the
possibility that the sum of the sizes of all the memory that each process
wants to use might be greater than the amount of physical memory available. So
rather than aggressively limit the amount of virtual address space that each
process can use, it simply only keeps a subset of that memory in physical
memory at any one time.

You can have virtual memory without paging, but then each process has to
compete for a _very_ limited resource. You can also have paging without
virtual memory - process A's copy of physical address X can be swapped out to
be replaced with process B's copy. However, processes are then still limited by
having only as large an address space as physical memory in the machine, and
virtual memory is such a huge win for hiding the layout of physical memory
from processes as well as isolating them from one another (so no corruption
possibilities) that it's pretty much unheard of to do this.

You get such a big advantage from having a layer of indirection sometimes...

------
vladev
Does anyone know a good book that explains these things (CPU caches, RAM,
virtual memory, etc.) in more detail?

~~~
lfittl
I've found "Inside the Machine" by Jon Stokes to be quite a good read, though
it's a bit dated by now (published 2006).

<http://www.amazon.com/dp/1593271042/>

~~~
wmf
For programmers I would recommend a real computer architecture textbook (e.g.
<http://books.google.com/books?id=57UIPoLt3tkC>) instead of Stokes's analogy-
laden book.

------
c00p3r
There is so much cheap RAM and so many idle CPU cycles that you can even build
data stores or programming languages on top of the ridiculously inefficient
JVM. ^_^

The point is that there is a kernel to manage system resources, and it is a
very good one.

------
DannoHung
I wish that this explanation could be expanded upon a little. Perhaps with
some annotations of selected sections of the source code.

