What's wrong with 1975 programming (varnish-cache.org)
293 points by dchest on July 28, 2010 | 115 comments

This is one of the best systems programming articles I've read in a very long time. Short summary:

* Trust the VM system to figure out how to page things (hey, 'antirez, what's your take on that? You wrote an ad hoc pager for Redis.) instead of getting fancy, because if you get fancy you'll end up fighting with the VM system.

* Minimize memory accesses and minimize the likelihood that you'll compete with other cores for access to a cache line; for instance, instead of piecemeal allocations, make a large master allocation for a request and carve it out.

* Schedule threads in most-recently-busy order, so that when a thread goes to pick up a request it's maximally likely to have a pre-heated cache and resident set of variables to work with.
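
The second bullet (one large master allocation per request, carved up as needed) can be sketched as a simple bump allocator. This is only a minimal illustration in C, not Varnish's actual workspace code; the names `arena`, `arena_init`, `arena_alloc`, `arena_reset` and `arena_free` are hypothetical, and 16-byte alignment is assumed to be sufficient.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-request arena: one malloc up front, then pointer bumps. */
struct arena {
    unsigned char *base;  /* start of the master allocation */
    size_t size;          /* total bytes available */
    size_t used;          /* bytes handed out so far */
};

/* Grab the whole block once; returns 0 on success, -1 on failure. */
int arena_init(struct arena *a, size_t size)
{
    assert(a != NULL && size > 0);
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Carve out n bytes, rounded up to 16 (assumed alignment); NULL when full. */
void *arena_alloc(struct arena *a, size_t n)
{
    const size_t align = 16;
    size_t rounded = (n + align - 1) & ~(align - 1);
    if (rounded < n || rounded > a->size - a->used)
        return NULL;  /* overflow, or not enough room left */
    void *p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* End of request: release everything at once by resetting the cursor. */
void arena_reset(struct arena *a)
{
    a->used = 0;
}

/* Give the master allocation back to the system. */
void arena_free(struct arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

Each request makes one trip to the system allocator instead of many, the carved-out objects sit next to each other in memory, and since an arena is owned by a single request there is no lock or shared allocator state for other cores to contend on.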

It's pretty good. Maybe it's because I'm also a kernel developer, but these suggestions seem obvious to me. That is, instead of Varnish being novel and well-optimized, it appears that Varnish is pretty conventionally optimized and Squid is just a god-awful piece of development.

As he says, this is a 2006 architecture. VM, cache line-sized working set, MRU scheduling -- this is stuff that's been around for a while. If he were writing this article about a 2010 architecture it would look a bit different. For example, trying to minimize memory access is a laudable goal but in the normal data sets the cache is usually overwhelmed pretty quickly. A 2010 architecture would feature NUMA optimization, although his practice of allocating thread data on the thread stack actually helps this a fair bit (inadvertent?). If Varnish uses threading at a significant level, this kind of optimization is only going to get more significant over time. The modern CPU architecture product cycle goes through a kind of give and take -- it alternates between having just enough fast memory and not enough. Right now, we are just about smack dab in the middle of "just enough," but the next Intel release is projected to move to "not enough." Threaded programs in NUMA start to matter a whole lot more on 16, 32, 64 cores.

P.S. We actually just ignore 32-bit architectures. Even modern commodity CPUs don't have them and anyone running a server architecture with only 4 GB of RAM has bigger problems than the proxy software.

Edit: Just saw antirez's post: very good. The VM architecture is good for general case, but it really wasn't designed for specialty applications. There are many operations that a VM can make if it knows a fair bit about the working set and the memory pattern, but that's not possible in a generic OS VM (prefetching and streaming are big ones).

> Squid is just a god-awful piece of development

This is nonsense. I was one of the developers on Squid for many years, which makes me more aware of its flaws than most folks, and it does have many flaws, even in current versions. But, to say it is god-awful development is pretty deeply ill-informed, and insulting to the small handful of dedicated volunteers who currently maintain Squid (which, I might add, is used by an order of magnitude more people on an order of magnitude more servers than Varnish; it is the most popular proxy cache in the world, serving about 80% of the market last time I saw research on the question).

Squid has different priorities and goals from Varnish. One of Squid's priorities is to run on as many architectures and operating systems as possible. Another is to provide an extreme level of functionality, making it a Swiss Army knife of web proxy caching tasks. A third is being compatible with everything; and in the HTTP protocol world, this means dealing with thousands of edge cases and broken implementations. These things are not always consonant with being the fastest proxy, or with being the simplest design. Edge cases have a cost, which Squid pays because that's one of its goals.

Certainly, Squid has vestigial design decisions that cost it performance and efficiency on modern hardware (though the select/poll loop was the best we had until just a few short years ago, and the VM layer on many UNIXes is retarded; calling the way Squid does things 1975 technology is idiotic), but there have been improvements in a number of areas over the years.

Anyway, I would advise taking the flames of a developer on a dramatically less popular and less capable project with a grain of salt. There are interesting ideas to be found in Varnish and in its developer's comments, but his knowledge of Squid is clearly limited, and his opinions are clearly motivated partly by a weird bit of anger toward the much more popular Squid.

I wonder if Squid's priorities are now less useful than they were when Squid was designed, and if that's a source of the derision lately.

Many products designed in the 90s and early 00s took compatibility very, very seriously. Things had to run on everything, and a lot of engineering effort went into patching over the differences between platforms. So you got things like Java, autoconf, Squid, Apache, Linux, various GUI abstraction layers, etc.

Now, it seems like the pendulum is swinging towards performance, simplicity, and ease-of-use, and people are saying "fuck compatibility". Since so much software has moved to the server side, people are just standardizing on one type of server (usually some Linux flavor), saying "We'll develop for this", and ignoring everything else. And as a result, products that used to be essential because things really needed to run on everything are getting a lot of hate for their byzantine config options.

I see this most with autoconf, another project that has had lots of abuse thrown at it, but it seems like it could apply to Squid or Java as well.

I think that is extremely likely, and autoconf is an excellent example. autoconf hate is almost nonsensical when you realize what it replaced (though, I kinda hate it, too, because it's very confusing, but definitely not more than what it replaced...). In a world of people only building for Linux (and where Linux has mostly standardized to the point where building for one is the same as building for any other), it seems needlessly complex, but in the world of Solaris, AIX, SCO, Irix, Linux, FreeBSD, OpenBSD, NetBSD, and a bunch of others, it was simply necessary. If you didn't let autoconf handle some of it, you simply had to write your own crazy shell scripts to figure things out.

I think it comes down to younger folks not having experienced the pre-Linux world of the server-side (or at least the pre-Linux domination). There was a time when there were dramatic differences between the various UNIXen, and only something like autoconf could make an Open Source project build on anything approaching a majority of them.

I think a good example of a weird characteristic of Squid is its cache_dir directory layout. By default, it'll distribute files into a directory/subdirectory/ hierarchy. This doesn't really serve a useful purpose on modern systems (it doesn't really hurt anything, either, it just looks weird to modern users), but on some UNIXes back when Squid was being built there were dramatic performance effects from having huge directories full of thousands of files.
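
The idea behind that layout can be sketched as follows: derive two directory levels from the object's number so that no single directory ever holds more than a bounded number of entries. This is an illustration only, not necessarily Squid's exact formula; `cache_slot`, `cache_path` and their parameters are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Pick the first- and second-level directory for object number filen.
 * Consecutive objects fill one leaf directory up to per_dir entries,
 * then move on to the next leaf, wrapping around l2 and l1.
 */
void cache_slot(unsigned filen, unsigned l1, unsigned l2, unsigned per_dir,
                unsigned *d1, unsigned *d2)
{
    assert(l1 > 0 && l2 > 0 && per_dir > 0);
    unsigned leaf = filen / per_dir;  /* index of the leaf directory */
    *d2 = leaf % l2;
    *d1 = (leaf / l2) % l1;
}

/* Format "root/D1/D2/FILEN" in hex; returns what snprintf returns. */
int cache_path(char *buf, size_t buflen, const char *root,
               unsigned l1, unsigned l2, unsigned per_dir, unsigned filen)
{
    unsigned d1, d2;
    cache_slot(filen, l1, l2, per_dir, &d1, &d2);
    return snprintf(buf, buflen, "%s/%02X/%02X/%08X", root, d1, d2, filen);
}
```

With 16 first-level directories, 256 second-level directories and 256 objects per leaf, object 0x12345 lands in `/cache/01/23/00012345`. On a filesystem whose directory lookups are linear scans, capping the entries per directory like this is what keeps lookups cheap.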

Likewise, the whole virtual memory argument that the original article talks about is complaining about something that was intentionally added to the Squid design in order to address the horrible characteristics of several major UNIX VM implementations (even now, having a good VM layer is not something you can just assume).

I feel a little bit of unease when I see projects that do choose the "we develop for this" model. I'm not sure it is a useful feeling, but I think we probably should keep our options open. I've made cross-platform a religion for most of my developer life, and it is perhaps less important now...but I don't think it's wise to just say, "OK, it runs on the latest Linux kernel. Fuck it, let's ship it. Screw everybody using anything else."

Now that DVCS has made branching and merging so much easier, maybe a middle ground approach to portability becomes reasonable: the trunk can support only Linux and people who want to run on other OSes can maintain appropriate branches. Like OpenSSH and OpenSSH-portable. (This also prevents minority platforms from externalizing their costs.)

Interestingly, Robert Collins has been one of the core developers on Squid for about a decade and also was one of the original developers on the Bazaar/bzr/arch project, and is currently employed by Canonical to develop Bazaar. And yet, to date, I think Squid still uses CVS. I think that's irony, or something.

In the past there were branches for oddball systems, like Windows. But they eventually merged into HEAD. I'm not entirely sure DVCS solves the problems that having multiple branches introduces. I'd need to see some numbers or something. OpenSSH isn't in a DVCS, is it? So they aren't a data point in either direction, though I wasn't aware of OpenSSH-portable.

OpenSSH is a bit odd, too. It was pretty much started by OpenBSD people, which makes the team more focused on OpenBSD; and the -portable version doesn't just make it compile on Linux/Solaris/..., but also adds PAM support and other stuff. Most teams are more heterogeneous, and most programs don't need to integrate that much with the OS.

I don't know, I always thought xmkmf was the more elegant solution.

>But, to say it is god-awful development is pretty deeply ill-informed, and insulting to the small handful of dedicated volunteers who currently maintain Squid

Not to poke at Squid (I'm far from qualified), but that's a horrible argument. There are plenty of extremely-highly-motivated people working on god-awful projects, believing they're being brilliant and revolutionary, when they're in fact re-doing something solved a decade earlier, built into the operating system they're using, and theirs runs 100x slower than the average naive implementation found by Googling.

Insulting it may be, but wrong it may be not.

Hold up...Which is a horrible argument?

I said it is deeply ill-informed. I said this after explaining that I was a Squid developer for many years. I was making a statement based on experience, not pleading a case. If you believe I'm incapable of determining whether a project is god-awful development or not, that's fine.

And then I said that it is insulting to dedicated volunteers. This was not an argument for why Squid is not awful development. I explained that later. I was simply stating that it's insulting.

And, as for this:

"when they're in fact re-doing something solved a decade earlier"

Yeah, in a lot of cases Squid is the "solved a decade earlier" example here.

"built into the operating system they're using, and theirs runs 100x slower than the average naive implementation found by Googling"

If you believe Squid fits this description you're really not qualified to comment on it.

As I said, Squid has many flaws, but it was not designed or developed by children or idiots. It was built by competent software developers who wrote many of the papers on the topics that it addresses, and invented many of the techniques that are now standard in proxy caching software. Varnish takes more ideas from the Squid developers than Squid developers could ever take from Varnish, whether he knows it or not.

Squid does have legacy and baggage. Squid also has capabilities that Varnish would never have need for (ICP, for example, cache digests, hierarchy features, etc.).

"Insulting it may be, but wrong it may be not."

And I said that it is both.

As an entrenched-developer in a project, you likely have emotional baggage about your project. You're more likely to unjustly-defend your own project. I probably should've included the used-by-80% portion as well, as it's part of the claim that it's not god-awful.

As I've said, I am not qualified to comment about Squid, nor was I. I was commenting about some / other developers and projects, not pointing a finger in any particular direction, and pretty clearly referring to extreme cases.

I haven't been a Squid developer since 2006. I'm defending it in the same way I would defend Apache, or BIND, or MySQL; and I'm doing so as a Linux/UNIX/Open Source old-timer that remembers what it was like before these projects existed, and realizes how much the developers of these projects have given us over the years. It is an institution for a reason (or a lot of reasons), and the people who built it and currently maintain it are deserving of a modicum of respect. And the software is deserving of an honest assessment of its flaws and strengths, rather than unbacked assertions of being "god-awful" or "1975 technology".

> But, to say it is god-awful development is pretty deeply ill-informed, and insulting to the small handful of dedicated volunteers who currently maintain Squid

PHK was obviously referring to a specific issue; Squid's implementation and usage of caching in its particular area of application. Varnish is obviously better at that. Squid is obviously better when it comes to portability and flexibility.

I'm going to think out loud here, and please take a minute to consider this, because I'll just try to be honest, not mean or elitist: Is it enough for my software to compile and run correctly on multiple platforms for me to call it 'portable', or should I take advantage of the characteristics of each platform?

I know that when it comes to my work, I choose the second, and I'm guessing that PHK would too. The "1975 programming" bunch, well, maybe not.

"Is it enough for my software to compile and run correctly on multiple platforms for me to call it 'portable', or should I take advantage of the characteristics of each platform? I know that when it comes to my work, I choose the second, and I'm guessing that PHK would too. The "1975 programming" bunch, well, maybe not."

Squid supports async IO threads on Linux, has experimental (maybe stable by now) support for epoll on Linux and kqueue on FreeBSD, it automatically chooses the best option for poll/select based on the platform it is building on, uses the best available malloc (and will use alternatives if available and the build system has a retarded one; it used to include dlmalloc in the source tree just in case no acceptable malloc was available but that's probably gone by now), can run as a service on Windows, etc., etc., etc.

So, Squid does both, in many regards. It not only builds and runs reliably on pretty much every platform you can throw at it, it runs better on platforms that provide the mechanisms it needs to run better (Linux and FreeBSD, in particular).

It can't trivially replicate the dumb (but very fast) pipeline from net to memory to net that Varnish uses because Squid has a dozen or more access points into the data passing through. It could certainly be cleaner and simpler, and there was a zero copy patch running around for a long time for Squid 2, which never reached stability (due to the huge amount of code you have to touch to make such an idea work). Maybe Squid 3 has done something about all that, I'm not sure.

Squid is not incredibly fast, but it's definitely able to take advantage of better and more modern platforms.

Fair enough, I was just going on the description given in the article. Yours sounds a lot more reasonable -- I should have checked it out first. That said, everything I said about optimization still stands (and my comments on the VM layer are still correct).

Ya but the major point of the article was to point out that certain things Squid does ostensibly to improve performance actually end up being counterproductive. Do you disagree with that?

According to your logic, since more people use Windows, it is clearly a better engineered piece of software than Linux, right?

The things that Squid does to improve performance do improve performance on many platforms and for many workloads. This is demonstrable, and was supported by evidence at the time the changes were made. Squid didn't spring up from nothing yesterday. It was developed over a decade by a number of very smart people.

Do they improve performance on the latest version of Linux in the particular workloads for which Varnish is designed? Probably not.

"According to your logic, since more people use Windows, it is clearly a better engineered piece of software than Linux, right?"

Where do you see me stating that popularity equals better engineering? My only comments about popularity were about the motivations of the Varnish developer in his criticisms of Squid.

But, I would argue that one of the reasons Squid is more popular than Varnish (and any other proxy cache) is that it works in so many situations and does so many things. It doesn't have to do with engineering, at all, but answering the needs of a broad spectrum of people. It also helps that it has reliably been answering the needs of a broad spectrum of people for over a decade.

Squid is a goddamned institution, and while there's plenty of room for other proxy caching tools, like Varnish, it pisses me off to see folks hurling pointless abuse at people I know to be extremely good software developers (two of whom probably have written an awful lot of code in places that everybody here uses on a regular basis...if you use Ubuntu, for example).

Isn't Squid like, 5 or 6 years older than any other open sourced proxy cache? That would make it more popular for the same reason that Apache is more popular than other open sourced webservers: name recognition? I remember that for a while, Squid was the only proxy cache I knew of, and the only one I had ever attempted to install.

Squid's ancestor (a caching component of the Harvest project) was the first web proxy cache, period. One of its developers (Peter Danzig) went on to NetApp and produced the NetCache, which was the first commercial proxy cache. The Open Source option predated the commercial variant by a couple of years.

There have been numerous proxy caches that have come and gone in those years, and there will likely be numerous that come and go while Squid continues doing its thing.

"That would make it more popular for the same reason that Apache is more popular than other open sourced webservers: name recognition?"

Do you really believe name recognition is the only reason Apache is the most popular web server?

It doesn't have anything to do with the huge array of capabilities that Apache has that no other web server has? Or the broad ancillary tools support? Or the huge pool of knowledge available for it? Or that it is proven reliable? Or that it can safely be expected to exist and still be a viable option in five or ten years?

But, to answer your question of whether I think Squid is popular for the same reason (or reasons, as I believe is the case) Apache is popular: Yes.

Squid is popular for exactly the same reasons Apache is popular. It is a reliable product that serves a wide variety of users very well. It is well-maintained and has a long history of being well-maintained. It is well-understood by a lot of people, so it's not going to be neglected or a source of contention if the IT guy moves to another job. It is fast enough for a lot of uses. It is extensible via a number of scriptable access points, so developers can easily make it do what they need it to do. And, despite the accusations of horrible design, it has proven itself to be quite resilient. The number of security issues in Squid, for example, has been truly minuscule in the past decade.

The funny thing is how many people are acting as though Varnish is going to somehow take the place of Squid as soon as people realize that Varnish is "better". For one small subset of problems for which Varnish is specifically built, it may make sense. But, for a huge array of other proxy and caching problems, Varnish isn't even a contender.

I should maybe explain that my first company was a web caching proxy company, building products based on Squid. I deployed Squid several thousand times over seven years. Varnish would have worked in maybe a few dozen of those deployments. Squid solves a ridiculous array of problems, even though it solves none of them as well as a purpose built application could solve any one of them. Varnish solves its handful of problems very well...and does nothing for all the rest of the use cases.

"Do you really believe name recognition is the only reason Apache is the most popular web server?"

I think that that's a huge component of it. I know that I've been in a number of shops where they deployed Apache, not because it was the best tool for the job, but because it was the only tool of which the admins knew, and they didn't know where else to look.

I have no opinions about Varnish vs Squid, but I will point out that I had no idea what Varnish was until I read this article, but I've known what Squid was for a while, so they're not competing on equal terms.

eh, from a SysAdmin point of view, popularity and familiarity are reasonable arguments. Certainly not the only arguments you should take into account, but the ease with which I can hire people who know the system, and the ease with which I can use a search engine to find answers to my problems is very relevant. I can't say I've been using squid for 10 years constantly, but I've used it on and off for 10 years. Just about any reasonably experienced sysadmin has used it some.

Caching problems can be a little tricky, and having someone experienced with the caching system when something goes wrong can help a lot.

I'd say squid is a bit like NFS. It's not the best system possible in theory, but goddamnit, it's been tested to hell and back in systems much larger than I'm dealing with, and we know where all the problems lie.

(Also note, I believe Windows is only more popular on the desktop; Windows has never been a majority player in terms of webservers and other public Internet infrastructure, unless you count private intranets. in the public Internet server space, windows is fighting with sun and apple for the scraps. Linux is the big player here. )

"(Also note, I believe Windows is only more popular on the desktop; Windows has never been a majority player in terms of webservers and other public Internet infrastructure, unless you count private intranets. in the public Internet server space, windows is fighting with sun and apple for the scraps. Linux is the big player here. )"

No, you'd be surprised. Windows, (running IIS) actually has about a quarter to a third of the public webserver market share: http://news.netcraft.com/archives/2010/07/16/july-2010-web-s...

ASP.NET is actually very popular.

huh. so it does have a reasonable lead on sun and apple. Still, according to that graph, IIS has something less than half the market share apache does. (I know you /can/ run apache on windows, I just don't know anyone who does.)

Varnish is a massive VM to network pipe, nothing 'sophisticated' happens to the data that it deals with.

If that were the case I wonder how well the model would hold up but it does what it does very well and there is as far as I have been able to determine no other program that comes close to the kind of throughput you can achieve with varnish.

Disclaimer: very happy varnish user here.

If you have mostly big objects (relative to hardware page size) and the encoding in memory and when stored on disk are pretty similar, this is indeed a good idea (to trust the virtual memory I mean).

Otherwise... no way: if you have data structures, everything is fragmented all over the place (and you want to use a lot of object sharing, caching, ... for performance, not to mention hash tables, which are very good at filling at least 1 byte of tons of pages even with 2% of data inside).
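
The hash table point can be checked with a small simulation. The sketch below assumes 4 KB pages and one 8-byte bucket per slot, and uses the splitmix64 mixer as a stand-in hash function; `pages_touched` is a hypothetical helper that counts how many pages of the bucket array contain at least one occupied bucket.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u  /* assumed hardware page size */
#define SLOT_SIZE 8u     /* assumed: one pointer-sized bucket per slot */

/* splitmix64 mixer, standing in for the table's real hash function. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/*
 * Hash nkeys keys into a table of nslots buckets and count the distinct
 * pages that hold at least one occupied bucket. The total number of
 * pages in the bucket array is stored in *total_pages if non-NULL.
 */
size_t pages_touched(size_t nslots, size_t nkeys, size_t *total_pages)
{
    assert(nslots > 0);
    size_t slots_per_page = PAGE_SIZE / SLOT_SIZE;
    size_t npages = (nslots + slots_per_page - 1) / slots_per_page;
    unsigned char *touched = calloc(npages, 1);
    assert(touched != NULL);

    size_t count = 0;
    for (size_t k = 0; k < nkeys; k++) {
        size_t slot = (size_t)(mix64(k) % nslots);
        size_t page = slot / slots_per_page;
        if (!touched[page]) {
            touched[page] = 1;
            count++;
        }
    }
    free(touched);
    if (total_pages)
        *total_pages = npages;
    return count;
}
```

At a 2% load factor each 4 KB page still expects about ten occupied buckets (512 slots per page times 0.02), so nearly every page ends up holding live data. The pager works in whole pages, so it cannot evict the 98% of the table that is empty.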

Also data structures can often be serialized on disk using 1/10 of the space.

Using the VM is cool, but not so generally applicable. The proxy stuff is perfect. Also on-disk DB is perfect using the VM the other way around, to get a memory-cache for free, if you don't need strict consistency (see MongoDB).

If your data structures can be serialized using 1/10 of the space and you can identify what you're not going to use for a while, then why not serialize it into RAM that way and let VM serialize it to disk if needed? You accomplish the same thing, but with less code and fewer system calls.

To me the big reasons to serialize to disk are as a way of sharing data between processes, and as a way of making data permanent. But not for performance. Because compared to RAM, disk has none.

Sorry I replied to the wrong comment, check one child later please ;) The best representations in memory and on disk are usually completely different, while VM will force you to use the same one for both worlds.

I can't find the "one child later" that you're referring to.

As for the best representation, I am not sure I agree. I agree that the best representation for frequently accessed data and seldom accessed data are frequently very different. Of course normally the former is in RAM while the latter is on disk. However it is not obvious to me that it is at all a bad thing to have some space used in RAM by infrequently accessed data. And if it is not, then you wind up with the solution I suggested. Just write out the data as you want it on disk, into RAM, don't touch it, and let the VM worry about when (and if) it needs to actually be paged to disk.

Of course this doesn't work if you're not using virtual memory. :-)

I've said this before, but I think the optimal would be to leverage the OS's vm by relocating subpage objects to create coherence by putting frequently coaccessed objects on the same page.

I'm pretty ignorant on systems programming, but I gather on most platforms there's a way to create a custom page fault handler, that could preserve your custom serialization format. Depending on how redis objects are linked however, allowing relocation may be complicated.

What are the difficulties in using some sort of compression algorithm like LZJB on the disk's page file?

If, as you say, data can be that easily compressible, the IO speedups from reading the compressed data from disk and space savings would outweigh the CPU cycles necessary to do the decode, but my brief Google search for compressed page file implementations came up with nothing.

A linked list is hard to compress in memory as the overhead is the metadata: pointers, malloc overhead, ... but when written on disk can be represented as prefixed length strings.

With VM the live representation and serialization format are the same.
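
A minimal sketch of that difference, assuming a singly linked list of C strings: in memory each element pays for a `next` pointer, a string pointer and allocator overhead, while the serialized form is just a length prefix followed by the bytes. The struct and function names here are hypothetical, not Redis code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Live representation: two pointers per element plus malloc overhead. */
struct node {
    struct node *next;
    const char *str;  /* NUL-terminated payload */
};

/* Bytes needed on disk: a 4-byte length prefix plus the payload. */
size_t list_serialized_size(const struct node *head)
{
    size_t total = 0;
    for (; head; head = head->next)
        total += sizeof(uint32_t) + strlen(head->str);
    return total;
}

/*
 * Write each element as [u32 length][bytes]. Returns the number of
 * bytes written, or 0 if buf is too small (an empty list also gives 0).
 */
size_t list_serialize(const struct node *head, unsigned char *buf,
                      size_t buflen)
{
    assert(buf != NULL);
    if (list_serialized_size(head) > buflen)
        return 0;

    size_t off = 0;
    for (; head; head = head->next) {
        uint32_t len = (uint32_t)strlen(head->str);
        memcpy(buf + off, &len, sizeof len);  /* host byte order, for brevity */
        off += sizeof len;
        memcpy(buf + off, head->str, len);
        off += len;
    }
    return off;
}
```

The serialized form has no pointers to follow and no per-node allocations, so it is contiguous and much denser. That density is lost if the VM simply pages the live structure out as it is.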

Paul Wilson worked on compressed paging at the turn of the century [1] and tried to address this by introducing specialized compression algorithms for different kinds of pages (e.g., x86 instructions, pointer-filled data structures). Interesting stuff, but I don't think I've ever seen it commercialized, except in Newton OS where we used compressed code pages in binaries (read-only, not swap).

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

just curious: is it not possible to have some form of unrolled-linked-list to mitigate these effects somewhat? or perhaps you have already tried it, and it doesn't really fit the bill.

It kind of sounds like you're describing a sparse array.

> It kind of sounds like you're describing a sparse array.

hmm not really. this: [ http://en.wikipedia.org/wiki/Unrolled_linked_list ] is what i had in mind

Here's something from Android: http://code.google.com/p/compcache/

The major assumption is that the OS VMM's caching behaviour is as smart or smarter than the application's caching behaviour; that is, the operating system knows better than the application when to evict something from the cache. That may be true for many stupid applications, but I seriously doubt it's always the case.

I was informed recently in a Masters level course on database implementation that databases typically wish to explicitly circumvent a lot of this caching, especially for IO buffering, since they can do the same thing at the app level, but leverage the fact they have much more well-defined access patterns than some generic kernel system that has to perform adequately across all workloads.

I believe the O_DIRECT flag can be used for this purpose, although a quick bit of googling suggests it's getting flak from all over the place, including Linus himself.

Another big reason that databases wish to circumvent this caching is that they really, really need to make sure that when a transaction is committed, they know what is on disk. Having the OS pretend stuff is written to disk is great for performance, but not so great when you're trying to recover after someone pulled the power.
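
One common way to get that guarantee without bypassing the page cache entirely is to call `fsync()` before reporting a commit as done. The sketch below is a simplified illustration using POSIX calls; `durable_append` is a hypothetical helper, and a real database would also handle `EINTR`, sync the containing directory after creating the file, and worry about drive write caches.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Append a record to a log file and do not report success until the
 * kernel says the data has reached stable storage.
 * Returns 0 on success, -1 on any error.
 */
int durable_append(const char *path, const void *rec, size_t len)
{
    assert(path != NULL && rec != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = rec;
    while (len > 0) {
        ssize_t n = write(fd, p, len);  /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* write() only fills the OS cache; fsync() forces it out to the device. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the `fsync()` call, `write()` returning success only means the data is in the OS cache, which is exactly the "pretend it is on disk" behaviour described above.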

This is where you get DBs that use their own raw block devices, as well as keeping transaction logs on other regular filesystems for integrity.... Oracle does something similar no?

For example, PostgreSQL actually has almost no caching built in and relies almost exclusively on the OS (in the default and recommended configuration).

I expect that most DBMSs that wish to circumvent OS-level caches do this because (a) it was a good idea when they were first implemented and (b) they use some highly peculiar on-disk layout that might actually benefit from this (modern databases tend not to have any special "well-defined" access patterns).

I don't think Postgres' "default config" is really the "recommended config." In particular, the default size of Postgres' buffer cache is small, to reduce footprint on non-dedicated servers, and to fit within most systems' default SysV IPC limits. (God I wish they'd just use mmap()).

Postgres does expect that in addition to its explicitly-managed cache (the shared_buffers setting), the OS will be caching a bunch of stuff as well, and its query optimizer takes this into account (the effective_cache_size setting). Reasonable conservative settings are 25% and 50% of system RAM, respectively.
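
As a concrete illustration of those percentages, here is what the two settings might look like in postgresql.conf, assuming (hypothetically) a dedicated server with 16 GB of RAM. The values are just the arithmetic from above, not a tuned recommendation.

```
# postgresql.conf fragment, assuming a dedicated 16 GB server
shared_buffers = 4GB           # Postgres' own buffer cache, about 25% of RAM
effective_cache_size = 8GB     # planner's estimate of total cache, about 50% of RAM
```

Only `shared_buffers` actually allocates memory. `effective_cache_size` is a hint to the query planner about how much data is likely to be cached by Postgres and the OS together.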

There's no One True Source on this AFAICT, but http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serve... and http://www.postgresql.org/docs/current/static/runtime-config... are handy.

Well, with the Squid example, it's that paging out to disk wasn't even really a consideration. What if your system simply has lots of ram and no swap? That sort of takes care of things right there, doesn't it, and lets the OS file caching mechanism do its thing. After all - if you are swapping, you're taking a huge performance hit.

> hey, 'antirez, what's your take on that? You wrote an ad hoc pager for Redis

Antirez referenced this article and explained why he disregarded this advice in the initial Redis VMEM post.

Neato; link?


(See comments #29 and #30 for the reference)

Mmmm crap technically he didn't reference this article (explicitly), I should have checked.

Here is a contrarian perspective.

The performance characteristics of disk and RAM are so completely different that any program which swaps to disk is going to be effectively unusable. Therefore if you care about latency, turn VM off and provision a generous amount of RAM. Because VM only seems to work, until you actually use it, and only then do you discover how broken it is.

I look at swap as a parachute. Yeah, most of the time, you don't want to use it. But especially if you are going up in a rickety old plane, or trying to save money by giving yourself /just enough/ fuel to get there, a little extra swap can make your landing, if not comfortable, a lot less painful than the crash that would have happened if you ran out of ram without swap.

Now, much of this problem is caused by Linux's /really aggressive/ default memory over commit policies. Besides, swap gives you a free out if you are running programs that load up unnecessarily large libraries.

What you say is appropriate in some use cases only.

My desktop has swap. The servers I use do not.

out of curiosity, do you leave the oom-killer enabled? or do you set your boxes to panic in out of memory situations? Do you adjust your memory overcommit levels?

Those problems have been addressed, but I really shouldn't say anything more than that.

It's actually an interesting problem most people ignore... which, if you have enough swap, is perfectly reasonable; but if you are running without swap, they can become /much/ bigger problems. The funny thing is, memory overcommit both becomes a much larger risk /and/ confers more benefit when you have no swap.

Now, if you've dealt with this already and are just keeping your techniques secret, you know all this, but I'm going to yammer on anyhow, just 'cause I think it's an interesting subject; feel free to participate or not.

So in the modern virtual memory system there are lots of situations where more memory is allocated than used. If I have some massive, 200 megabyte webapp running under mod_perl or mod_php or what have you, and I have apache fork 1024 processes, (well, in that case I probably need to increase threads... but I digress) even though I should be using 200 gigs of ram, I'm not. fork uses copy on write; it only copies the data that changes.

this is pretty cool... I get to use 200 gigs of ram, but I only actually have to buy 400 megabytes or something. The problem is that it's impossible for the virtual memory system to tell ahead of time how much of that copy on write will actually be copied. without memory overcommit, the box will keep track of every fork and figure the max allocated ram, assuming no copy on write savings, and it will fail your fork or malloc when it reaches the total amount of swap plus ram. If you are trying to run 1024 identical 200 megabyte processes on a box with 8 gigs of ram and no swap this really sucks. So memory overcommit, especially when you don't have a lot of swap, is a nice thing to have.

But, you say, what if my 1024 identical 200MiB processes become non-identical? what if they start changing their memory, and copying and start using more ram than I have ram and swap in the box? on most linux systems, you'll get the oom-killer, which will randomly (well, not randomly, but it seems that way sometimes) kill a process. Sometimes it kills something unimportant.... sometimes it kills the webapp the box was built to house... sometimes it kills some background system process you were depending on. You can tweak this to hell and back, but any way you slice it, the oom-killer is bad news. Another option is to tell the computer to just panic when it finds it runs out of memory.

Now, if you turn memory overcommit off...well, your landings are much softer. without memory overcommit, the only time the box runs out of ram is at malloc time, and it can cause malloc to return an error, and (hopefully) be handled gracefully by the program asking for more ram.
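
If you want to see which of these policies a given Linux box is actually running, a rough Python sketch follows. It only reads the standard vm.overcommit_memory and vm.panic_on_oom sysctls from /proc/sys/vm; the helper names are mine, and on a non-Linux system (or a sandbox without /proc) it just reports None.

```python
from pathlib import Path

# Meaning of vm.overcommit_memory on Linux.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return the integer value of a vm sysctl, or None if unavailable."""
    try:
        return int((Path(root) / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def describe_overcommit(mode):
    """Human-readable description of an overcommit_memory value."""
    return OVERCOMMIT_MODES.get(mode, "unknown or not Linux")


mode = read_vm_setting("overcommit_memory")
print("overcommit_memory:", mode, "->", describe_overcommit(mode))
print("panic_on_oom:", read_vm_setting("panic_on_oom"))
```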

the real problem with turning memory overcommit off is that if you are trying to run 1024 identical 200MiB processes on a box with only 4 gigabytes of combined ram and swap, your 21st fork will fail, even though, thanks to the magic of copy on write, there was plenty of unused ram to go around.
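
The arithmetic behind that, as a quick Python sketch. It assumes the simplest possible model of strict accounting, where every copy is charged its full 200 MiB against RAM plus swap and copy on write earns no credit; real kernels apply more detailed rules, and the function name is just for illustration.

```python
MIB = 1024 * 1024


def max_processes_strict(commit_limit_bytes, process_bytes):
    """How many full-size copies fit when nothing is shared on paper."""
    return commit_limit_bytes // process_bytes


commit_limit = 4 * 1024 * MIB   # 4 GiB of combined RAM and swap
per_process = 200 * MIB         # each identical process

fits = max_processes_strict(commit_limit, per_process)
print(f"{fits} copies fit; copy number {fits + 1} is refused")

# What 1024 copies would need if nothing were shared at all.
print(1024 * per_process // (1024 * MIB), "GiB committed on paper")
```

So 20 copies fit and the next one is refused, while the "on paper" total for all 1024 is the 200 gigs mentioned above.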

Now, the advantage to having a lot of swap on a system without memory overcommit is that the virtual memory manager is pretty smart; while the computer can /commit/ to allocating swap, as long as the magic of copy on write leaves it with free physical memory, it will use that physical memory. If it turns out it was too aggressive about overcommitting memory, well, it hits swap, and depending on how much ram you have that is seldom used, your box slows down by quite a lot. Of course, if you use swap for ram that is actually accessed very often, most people agree that the box might as well have just crashed or frozen. The disagreement, I think, comes when there is ram that is allocated for some seldom-used library or the like. The virtual memory system swaps that out to disk, and gives that bit of physical ram to something else it needs.

If you are running any modern program, there will be a proportion of its functionality and data you never use. That ends up paged to disk, freeing up RAM for what you do use. That's what VM is for, really. Using it as "memory" is very 1970s. It's more of a "forgettory".

I suspect you're thinking "desktop" and btilly is thinking "server", and you're going to end up talking past each other. Or, if you are thinking servers, then you're thinking "one server" and he is thinking "one million servers".

In a highly-scalable webapp, that infrequently-used functionality lives on another server, called via RPC. Instead of paging it out, you simply provision fewer machines for it, and then only make calls to it as necessary. Paging out one component of a large distributed system is dumb: you're potentially slowing everything else down because you couldn't spare one machine to keep the whole binary in RAM.

Shhh...don't give away any secrets. ;-)

However my comment applies to much smaller websites as well. If you have a website running on a number of webservers, and the working set exceeds RAM, then the website slows to a crawl, requests pile up, RAM gets even more overloaded, and you have a very bad failure mode which can be hard to sort out. By contrast if you have no swap and lots of RAM, you could have kept humming. If you hit the limits of RAM, you stay fast but get OOM messages in the error logs, which is much easier to debug. Furthermore it now obviously makes sense to monitor how much RAM is really in use, so you can head off problems before you run out of RAM. This gives you early warning of issues before they happen.

Yes, it is true that this approach creates a potential problem when you wouldn't necessarily have had one before. But creating an avoidable potential problem frequently results in fewer actual outages than having a problem which can sneak up on you in a non-obvious way.

This was pointed out to me by a very competent sysadmin several years ago when she explained why she had set up the servers with no swap. And she proved her point when she told us, before anything went wrong, that we needed to buy more RAM for the servers.

I ran a mid sized website that way briefly. Put simply: oom behavior can be far worse than a swap storm depending on how coupled your app servers are.

It sounds like you didn't monitor RAM in use and proactively make sure you had enough. In 5 years of working with the sysadmin I am talking about, we never once ran out of RAM. Nor was any performance problem ever caused by the webservers.

What if I've analyzed how much RAM I need, and ensured that I have "enough" RAM in my production system and don't have to worry about needing more? I'll know I need more when my monitoring metrics tell me I'm getting low, or, if I'm not monitoring, when the server hard fails because it can't allocate RAM. I prefer that situation to a gradual degradation eating into swap.

Ram is cheap.

If that works for you, then you obviously aren't trying to consistently hit a low latency SLA. Which brings me back to my point.

Now that's interesting. I've been asking around for a while to find out what other systems programming experts think of this article, specifically the first point about not fighting with VM, which is the most prominent one (including in phk's provocative recent ACM article). I'd also like to know what other things, besides these, Varnish does to achieve its performance. I'm sold on the idea that most of us have been doing this kind of thing wrong. Let's hear more about how to do it right.

For what it's worth, I've been in and out of the FreeBSD kernel since 1995 and am an unabashed PHK fan.

It's not so much that it's "wrong" - it's just ignored.

What if you are designing for a system that won't ever have a swapfile/pagefile? What if your application is designed so your working set fits in ram? Then you can use other methods and not worry about fighting as much with the VM.

I love this article - because a reverse proxy is a great example of this.

You want to cache transitory data to disk - you plan to roll your own caching system to keep data locally so you don't go back and query source nodes. You want some legroom to work with working sets that may be larger than your physical RAM, and you may want that RAM cache to survive reboots intact.

So memory-map the file and let the OS take care of the rest - makes sense to me.
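
A minimal sketch of that idea in Python rather than C, just to show the shape of it. The function name, the file name and the 1 MiB size are arbitrary choices for the example; this is not how Varnish itself is written.

```python
import mmap
import os
import tempfile


def open_cache(path, size):
    """Create (or reuse) a file of `size` bytes and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)          # make sure the file is big enough
        return mmap.mmap(fd, size)      # shared read/write mapping by default
    finally:
        os.close(fd)                    # the mapping keeps its own descriptor


path = os.path.join(tempfile.mkdtemp(), "cache.bin")

cache = open_cache(path, 1 << 20)
cache[0:5] = b"hello"                   # an ordinary memory write
cache.flush()                           # ask for dirty pages to be written back
cache.close()

reopened = open_cache(path, 1 << 20)    # e.g. after a restart
print(reopened[0:5])
reopened.close()
```

The program never calls read() or write() on the data; when pages hit the disk is the kernel's decision, apart from the explicit flush().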

EDIT: What I meant to say was "It's not so much not fighting with the VM - it's realizing that the VM is more than something to ignore, and that it can be used to your large advantage by treating it like what it really is - a system that deals with shuttling data between different levels of storage (disk/RAM)."

> hey, 'antirez, what's your take on that? You wrote an ad hoc pager for Redis.

I am very curious about this too. I could see doing that in Java, where you don't have the ability to advise the VM on how you're going to be accessing memory-mapped files (no madvise() or fadvise() available despite NIO support for mmap, thanks Sun!). But aside from portability to non-Linux OSes (e.g., the ZFS ARC on Solaris doesn't honour madvise()/fadvise() the same way the Linux page cache does), what would be the reasons against trusting the VM in C/C++ code, where you can explicitly "advise" the VM on your usage (Linux treats madvise() as imperative rather than advisory)?
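
For reference, this is roughly what "advising" the VM looks like from a language that does expose it. The sketch uses Python's mmap.madvise (available since Python 3.8 on platforms with the system call); the wrapper function and the deliberately bogus hint name are mine, and which MADV_* constants exist depends on the OS.

```python
import mmap


def advise(mapping, advice_name):
    """Apply a madvise() hint if this platform and Python build expose it."""
    advice = getattr(mmap, advice_name, None)
    if advice is None or not hasattr(mapping, "madvise"):
        return False                    # hint not available here
    mapping.madvise(advice)             # applies to the whole mapping
    return True


buf = mmap.mmap(-1, 1 << 20)            # anonymous mapping, for illustration
print(advise(buf, "MADV_SEQUENTIAL"))   # depends on the platform
print(advise(buf, "MADV_NO_SUCH_HINT")) # made-up name, so always False
buf.close()
```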

Redis is a cool product and antirez is a smart guy, so I am sure there must be a reason for this.

On one hand, mmap() is very nice. On the other, you may run into the limits on process size, and this is essentially unsolvable [1]. So your program is limited to 3GB of data on 32-bit machines.

I also think hierarchical allocators - a slightly-formalized version of phk's "carve chunks off a block of pre-allocated memory" - deserve more attention. See SAMBA's talloc, or halloc.
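
To show the "carve chunks off a block" part, here is a flat bump allocator sketched in Python. It is a simplification of my own: it is not talloc's or halloc's API and has no hierarchy, it only illustrates one big allocation handed out in pieces and released in one go.

```python
class Arena:
    """Bump allocator: one big allocation, carved into per-request chunks."""

    def __init__(self, size):
        self._block = bytearray(size)       # the single master allocation
        self._view = memoryview(self._block)
        self._used = 0

    def alloc(self, nbytes, align=8):
        """Return a writable slice of the block, aligned to `align` bytes."""
        start = (self._used + align - 1) // align * align
        end = start + nbytes
        if end > len(self._block):
            raise MemoryError("arena exhausted")
        self._used = end
        return self._view[start:end]

    def reset(self):
        """Release everything at once, e.g. when the request is finished."""
        self._used = 0


arena = Arena(4096)
method = arena.alloc(3)
method[:] = b"GET"
headers = arena.alloc(256)
print(bytes(method), len(headers))
arena.reset()                               # request done, reuse the block
```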

[1] Well, you can write some code to page things out as appropriate, but by the time you're inventing your own virtual memory manager you're definitely doing it wrong. Just go with old-fashioned file-based code.

If you're running on 32 bit architectures for your main servers you are just playing around and you don't need stuff like varnish.

If you're going to try to push multiple Gbps out of a single box the least you could do is put a bunch of ram in it and install a 64 bit OS. That's a lot more bang for the buck than installing multiple 32 bit boxes with only a bit of memory in each.

Certainly, the "professional" thing to do is install a 64-bit box with lots of memory. But don't underestimate "playing around", especially as it pertains to the popularity of OSS.

Can you even buy a 32-bit box anymore?

Do smartphones count? ;)

Only if you use them as servers.

I understand the issue of process size, but I'm confused why this is a concern.

1) You're not limited to having a single giant map of your entire data set for the life of the process. You can map in only the parts you are going to touch. For a CPU intensive request, a mmap()/munmap() pair isn't that much overhead.
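
A sketch of that mmap()/munmap() pair per request, in Python for brevity. The function names and the sample file are invented for the example; the one real constraint shown is that the mapping offset has to be a multiple of the allocation granularity, so the window is aligned down and the extra bytes are skipped.

```python
import mmap
import tempfile


def map_window(fd, offset, length):
    """Map only the part of the file covering [offset, offset + length)."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = offset - (offset % gran)      # mmap offsets must be aligned
    skip = offset - aligned
    mapping = mmap.mmap(fd, skip + length,
                        access=mmap.ACCESS_READ, offset=aligned)
    return mapping, skip


def read_range(fd, offset, length):
    """Map a window, copy out the bytes, and unmap again."""
    mapping, skip = map_window(fd, offset, length)
    try:
        return mapping[skip:skip + length]
    finally:
        mapping.close()                     # the munmap() half of the pair


with tempfile.TemporaryFile() as f:
    f.write(bytes(range(256)) * 1024)       # 256 KiB of sample data
    f.flush()
    print(read_range(f.fileno(), 70000, 4))
```

Only the window counts against the process's address space, which is the point on a 32-bit machine.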

2) You also can't address more than 64K of space on a 16-bit machine, but no one seems to worry too much about this anymore. Why would one worry about 32-bit machines when implementing a system that is handling gigabytes of data? Why not just run 64-bit and solve the 'problem'?

I agree with you on the allocators. A pet project that I'd like to get back to is redesigning dlmalloc to run fast out of a shared mmap(). With some simple locking, I think one could get some really fast file-backed data structures.

Yes, you can mmap() and munmap() as needed, but that makes things a lot less convenient - you'll still need to keep track of which data is on the disk and which data isn't, etc. At this point, you've already solved half of the problem of using malloc(), read() and free() instead of mmap() and munmap().

POSIX mandates a 32-bit int - UNIX hasn't ever run on a 16-bit machine (yes, I know there are some abominations.) Au contraire, 32-bit machines are still common; download one of the 'whole archive' files from http://popcon.debian.org/ and run:

    $ awk '$2 ~ /^linux-image-/ && $2 ~ /86/    { SUM += $6; }; END {print SUM;}' by_recent
    $ awk '$2 ~ /^linux-image-/ && $2 ~ /amd64/ { SUM += $6; }; END {print SUM;}' by_recent

I got 8174 for /86/ and 4420 for /amd64/. Admittedly, this is in no way representative of anything, but clearly people are still using i386 (note that I summed the 'recent' column!)

In short, i386 is not dead. And suggesting that it cannot handle "gigabytes of data" is absurd - Postfix/Dovecot, or PostgreSQL, or SAMBA, or pretty much whatever, can handle many, many gigabytes of data without issue on a i386 platform. Yes, you'll need to upgrade to 64-bit for the truly demanding tasks, but a database with a couple of (tens of) GB of data in it can work perfectly well on a scavenged i386, which is one of the good things of a unix-ish system.

(That said, you can trade off ease of implementation for being crippled on 32-bit platforms. But you are sacrificing something.)

> POSIX mandates a 32-bit int - UNIX hasn't ever run on a 16-bit machine (yes, I know there are some abominations.)

That's afaik absolutely not true, Unix was actually developed on 16 bit machines, back in the stone age of computing, roughly 1969 (that's why the unix 'epoch' starts in 1970).

After a while 32 bit machines became available and Unix was ported to them.

Unless you wish to call the earliest versions of Unix 'abominations', but I'm thinking you have Xenix and such in mind when you write that.

Yes, sorry, I knew that. But if you're targeting UNIX machines, assuming some level of POSIX compatibility is completely reasonable.


Fully agree, but he did write UNIX.

Sorry, I didn't notice that he used "UNIX" right after "POSIX."

> you'll still need to keep track of which data is on the disk and which data isn't

I'm not following this. As I see it, all you have to do is ensure that no more than 3GB are mapped for your process at a given time. You still let the kernel decide what's faulted in and when things are written out.

Maybe my perspective is skewed: I'm viewing it from the perspective of real-time search, with large posting lists being frequently read and updated, and lazily written out to disk. Is there a different situation that is harder?

> In short, i386 is not dead. And suggesting that it cannot handle "gigabytes of data" is absurd

I didn't mean to imply it couldn't, or that it wasn't in common use. Rather I was wondering why I as an open source software author (http://wiki.apache.org/incubator/LucyProposal) would choose to make my life more complicated by targeting new software at a 32-bit platform at the expense of code simplicity.

Our current plan is to optimize for 64-bit systems, and allow 32-bit compatibility for smaller data sets (tens of GB) by sprinkling in mmap()/munmap() calls as necessary: http://www.mail-archive.com/lucy-dev@lucene.apache.org/msg00...

Edit to add: I'm not trying to claim I'm definitively right. Rather, my point is that I care about the answers to these questions in practice as well as in theory. I worry about taking the wrong approach.

Admittedly, at this point you could probably consider that 32b machines are legacy and unsupported (FWIW there was a Steam stats thing posted a few months back indicating that more than half of their Windows 7 users were on 64b... I can only expect that those ratios are much higher for unices)

Basically he's just saying that, hey, if you want to do a caching proxy server - doing it the way squid did it was all fine and good, but it actually doesn't take into account how VM paging will interact with things - so you end up with potentially duplicated data in pagefiles and on disk.

By making use of the mmap() method, he just ensures that when data he wants stored on disk somehow needs to be paged out of ram, it won't be duplicated.

These systems seem to be, in practice, relying on having working sets that mostly fit in RAM. It's a nice world when you can stack 64GB in a box and never worry about disk, but having worked on systems that need more working set than this, I don't think this idea actually scales up as well as is claimed.

Two issues: 1. Small objects aren't batched by a VM, so they can't be combined into long streaming writes and reads. Doing 4k writes and reads is useless on a modern disk. 128k to 1MB is more appropriate if you can organize things that way.

2. If your workload "mostly" fits in RAM, and then as your application scales, the working set exceeds RAM, you wind up doing some queries that are instant, and some that are a million times slower (a real, uncached disk seek). Unless you plan for this, your app can fall over and you just won't know what happened. You need to know the fraction of requests that will go to disk and plan to have spindles (or RAM) to handle it. I don't like systems that make measuring what's going on at this level any harder to do.
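
Planning for point 2 can be as simple as the back-of-the-envelope below, sketched in Python. Every number in it is a placeholder I picked for illustration (0.1 ms for a cached request, 10 ms for one that seeks, 100 random reads per second per spindle), not a measurement; plug in your own.

```python
import math


def mean_latency_ms(miss_fraction, hit_ms=0.1, miss_ms=10.0):
    """Average request latency when a fraction of requests go to disk."""
    return (1.0 - miss_fraction) * hit_ms + miss_fraction * miss_ms


def spindles_needed(requests_per_sec, miss_fraction, iops_per_spindle=100):
    """Disks required to absorb the reads that miss RAM."""
    disk_reads_per_sec = requests_per_sec * miss_fraction
    return math.ceil(disk_reads_per_sec / iops_per_spindle)


for miss in (0.0, 0.05, 0.25):
    print(f"miss={miss:.0%} "
          f"mean={mean_latency_ms(miss):.3f} ms "
          f"spindles={spindles_needed(5000, miss)}")
```

Even a 5% miss rate dominates the average with these placeholder numbers, which is why the fraction needs to be measurable.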

The author's claim is that you get to share RAM with the VM, which is great, but in practice a lot of systems use kernel-supported sendfile() to do work (i.e. have a small streaming buffer per network connection), or cache super-hot objects in a relatively small amount of RAM. The assumption that all user-mode caches try to allocate as much RAM as possible is not true.

Alternately, larger systems separate RAM cache entirely from disk-bound boxes (e.g., dedicated memcached).

I think a more complete treatment of this problem would explain scaling over working set more aggressively, and it would explain using instrumentation how the system degrades at scale.

Currently serving up billions of images per day from memory images that are more than 100 times the available physical memory (and there's plenty of that).

Small objects can be batched into a single page at the application level; the VM will then move these in and out of the resident pool as one unit.
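
One possible way to do that batching, sketched in Python. This is a naive in-order packer of my own, assuming 4 KiB pages and objects no larger than a page; a real system would probably group objects by expiry or access pattern rather than arrival order.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


def pack_into_pages(objects, page_size=PAGE_SIZE):
    """Place objects back to back, never letting one straddle a page.

    `objects` is an iterable of (key, bytes); the result maps each key to
    (page_number, offset_in_page, length).
    """
    index = {}
    page_no, offset = 0, 0
    for key, blob in objects:
        if len(blob) > page_size:
            raise ValueError("object larger than a page")
        if offset + len(blob) > page_size:  # would cross a page boundary
            page_no, offset = page_no + 1, 0
        index[key] = (page_no, offset, len(blob))
        offset += len(blob)
    return index


layout = pack_into_pages([
    ("a", b"x" * 3000),
    ("b", b"y" * 1000),
    ("c", b"z" * 200),
    ("d", b"w" * 10),
])
for key, place in layout.items():
    print(key, place)
```

Objects that share a page get faulted in and evicted together, so one page fault can serve several of them.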

The operating system uses elevator sorting and knows enough about the disk that it will attempt to sequentially invalidate pages.

The assumption is not that the working set will fit (mostly) into RAM; the assumption is that a page fault is a relatively rare occurrence and that other threads will not be stalled by the IO done for one. It is the changes to the working set that determine page faults, not the size of the total set.

On my boxes I solved one issue with this (and the maximum number of systemwide sockets) by running a varnish instance for every physical CPU in the machines.

These "1975" programming concepts are essentially the same ideas taught in the CS computer architecture course I took (circa 2006). The main focus was on the memory hierarchy of modern computers, which was fine, but there was not much advice about letting virtual memory do its job - the course covered mostly how things work under the hood, but did not give practical programming recommendations. I can see how budding programmers could get the idea that "disks are slow, so I should try to manage when data gets copied to and from RAM" without realizing that the system software is already taking care of it and that attempting to do it manually will generally just make matters worse.

As an aside, I found it really hard to read this article because of the grammar errors. Nearly every paragraph has a run-on sentence, and there's plenty of missing punctuation. The style also seemed more like off-the-cuff rambling (especially the long-winded attempts at humor near the beginning) than carefully-written advice, which almost made me stop reading before getting to the real content. The message is good, though, even obscured by these problems.

It seems to me that the first point made (on the performance of Squid's LRU caching) is really just that Squid is doing a bad job at it, and Varnish doesn't try at all (lets the VM system take care of it). In theory, managing memory at the object/application level should give you some advantages over doing it at the page/kernel level. I can imagine, for example, cases where Squid might perform better by moving entire objects to/from disk, rather than a page at a time in response to faults. In this case "1975 programming" really means trying to manage the memory hierarchy in the interest of performance, which is timeless. Indeed, the author later states that Varnish tries hard to "reuse memory which is likely to be in the caches", which sounds like the same idea applied to a different level.

The kernel VM system has a lot more ("global") information, though.

I know infinitely less about caching and VM than the author of the linked article, but I was surprised by this part:

  Varnish also only has a single file on the disk whereas 
  squid puts one object in its own separate file. The HTTP 
  objects are not needed as filesystem objects, so there is 
  no point in wasting time in the filesystem name space 
  (directories, filenames and all that) for each object, all 
  we need to have in Varnish is a pointer into virtual 
  memory and a length, the kernel does the rest.

I've had more than one systems person give me the opposite advice, that yes, using the OS's caching layer to do your disk/RAM balancing is good, but you should write into files that are divided on logical boundaries that correlate with how you use the data. Their argument was that this gives the caching layer more information, e.g. it can consolidate all your tiny objects into one part of the cache to avoid your small objects unnecessarily pinning a ton of VM pages, and can do things like prefetch pages when you start to read a big object, or even choose not to load a very large object into the cache at all if you're reading it sequentially (keeping it from clobbering the cache). When evicting pages it can also take small-versus-big-object and these-pages-go-together issues into account, as opposed to all pages looking alike.

That's all hearsay, though, and I have no idea if it actually improves things in practice on current OSs or with which kinds of workloads.

Both arguments might be correct. If you are dealing with a language that makes a strong distinction between the object and the memory layout of that object, you might be better off handling the serialization yourself. But Varnish is in C, and data on disk is mapped to memory then used directly as a struct.

Thus there are no small objects scattered around--- each 'object' is contiguous, and most likely on a single page. Prefetches happen automatically --- in Linux at least, disk caching and the VM are essentially synonymous. The benefit of using the VM directly with mmap() is that you have more control over the details and less overhead.

> in Linux at least, disk caching and the VM are essentially synonymous

Ah that could explain it. The people I know seem to be working on fs-level caches that operate at least in part on file granularity, so maybe they assume Linux/FBSD do fs-level stuff as well. They seem to try to do things like deciding whether to cache or not based in part on how big the file being read is, and what its historical usage patterns are.

Windows, IIRC, has a filesystem caching system which is distinct from VM.

In this kind of case I find it helpful to distinguish between what optimizations the system could implement and what it does implement. There are so many theoretical optimizations that don't work in practice that you can spend forever debating them (especially since no one can be proven wrong in a theoretical debate). I actually like the abstraction to leak a little when it comes to performance. And keep in mind that PHK develops the FreeBSD kernel that he's using, so he knows what it's going to do.

Varnish tries to limit the filesystem overhead as much as it can.

With only the one file, the number of system calls for a read or a write is 1, and there is no need to juggle file descriptors (which you really, really need for your sockets, not for the file system, in most Varnish setups).

Having 'small' objects clustered together is a good idea anyway, but you can do that in a big file just as easily as you could do it in multiple smaller files.

I prefer the bit in this article: http://www.varnish-cache.org/wiki/ArchitectureInlineC

Where he says: "It is a particular common kind of hubris for IT architects, to think that they know better than 100% of everybody else, this is less of a sin in Open Source than in Closed Source, but a sin nontheless."

Fighting with the OS - such as the VM system - is one of the primary arguments for exokernel-style OSs; see



(In fact, I think one of their use cases showing an order of magnitude or 2 improvement specifically involves pairing an app with a custom VM algorithm.)

MongoDB guys outsourced the entire caching/memory management to OS VM and on my (modest) workloads it performs really well. It probably also means that OS is a bit more relevant now and the choice b/w BSD/Linux isn't just about personal preference anymore, I can imagine that their VM characteristics are quite different and "2006 style" software like mongod won't work the same.

Anyone with low-level experience with BSD/Linux VMs?

Hmm... I thought the scourge of 1975 programming was ignoring the vast and growing gap between the speed of the processor and the speed of RAM.

2010 programming has to deal with the fact that chasing a (non-cached) pointer can consume hundreds of processor cycles. So much for trees...

There is one interesting argument for not using the OS virtual memory system. By using VM you have just turned disk errors into RAM errors. Many programs can potentially handle disk corruption, almost none can cope with bad RAM.

This Squid vs. Varnish comparison is quite similar to the sync vs. async debate for network programming. In both cases, the question is: do I use some OS abstraction (Virtual Memory or threads, respectively) as my application's primary scheduling mechanism, or do I handle scheduling more explicitly at the application level?

Of course the OS guys like PHK or Linus think you should use the OS mechanisms. Linus hates O_DIRECT (http://kerneltrap.org/node/7563) and PHK is taking a similar tack with this article. Just let the OS handle it.

But there are real downsides to this approach. One is that it makes you far more dependent on the quality of the OS implementation. I'm sure PHK trusts FreeBSD and Linus trusts Linux, but if you're writing for cross-platform you might end up on a bad VM implementation. The last thing you want to tell your customers is that they have to upgrade or change their OS to get decent performance.

Also, the OS is by design a more static and less flexible piece of software than anything you put in user-space. What if you need something that your VM system doesn't currently provide? For example, how are you going to measure (from user-space) the percentage of requests that incurred a disk read? Disk reads are invisible with mmap'd VM. What if you need to prioritize your I/O so that some requests get their disk reads serviced even if there are lots of low-priority reads already queued? If you've bought in whole-hog into an OS-based approach and your OS doesn't support features like this, you don't have a lot of options.
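
To be fair to the measurement question, there is a coarse handle on this from user space on Unix: the process-wide major fault counter. The Python sketch below uses resource.getrusage and its ru_majflt field, which are real; the wrapper names are mine. It cannot attribute a fault to a particular request once several threads are running, so it only partly answers the complaint.

```python
import resource


def major_faults():
    """Page faults that required I/O, for this process so far (Unix only)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt


def faults_during(func, *args):
    """Run func and report how many major faults the process took meanwhile."""
    before = major_faults()
    result = func(*args)
    return result, major_faults() - before


result, faults = faults_during(sum, range(1000))
print(f"result={result} major_faults={faults}")
```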

And while it's great in lots of cases that the page cache can be shared across processes, OS's don't have great isolation between processes using the page cache. If you run some giant "cp" and completely trash the page cache, your Varnish process is likely to take a latency hit. In a shared server environment, you want to be able to draw walls of isolation so that each user gets the resources that he/she was promised. A shared page cache is hard to fit within an isolation model like this, whereas an explicit cache in user-space works fine.

Think about the microkernel vs. monolithic kernel debate. Maybe monolithic kernels won, but it's still a good principle that if it can be left out of the kernel without loss of performance, it should. Why is it better to use an interface like VM than to use some user-space library that manages disk I/O? The kernel's one advantage is that it can handle page faults (and so can make a memory reference into an I/O operation), but that's also the property that makes it difficult to do good accounting of when you're actually incurring I/O operations.

One final thing to mention: if you're using VM in this way, things degenerate badly in low-memory situations. Since the pages of data are competing with pages of the program itself, the program can get swapped out to service data I/O. If you've ever seen a Linux box thrash with its HDD light flashing like mad, you know how bad things can get when memory is temporarily too scarce to even let programs stay resident. Using vast amounts of VM exacerbates this, because it makes your programs and your data compete for the same RAM.

No matter what the quality of your VM, if you're going to have several IO hits versus only the one where the VM (even a crappy one) pages the data in or out just once you will always be faster in a scenario like phk describes.

Doubling or even quadrupling your IO operations is very expensive.

In a situation such as the one for which this article is meant, you set things up in advance to never get into a situation where you start thrashing your disk; programs are allocated a fixed amount of memory, and if a program does not abide by that it is considered faulty.

The thrashing situation you describe can happen on machines that are run with less rigid setups, but on a production server that you count on serving up a few billion files every day you can't afford the luxury of random scripts firing off CRON and other niceties like that.

Custom kernel, very limited set of processes that you know are 'well behaved', as predictable as possible.

> No matter what the quality of your VM, if you're going to have several IO hits versus only the one where the VM (even a crappy one) pages the data in or out just once you will always be faster in a scenario like phk describes.

You can keep your own cache explicitly in user-space, and get multiple hits with a single load into RAM.
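
For concreteness, the kind of explicit user-space cache I mean, as a small Python sketch. The class and its loader callback are illustrative only (the loader stands in for a disk read); the hit and miss counters are there because this approach makes them trivial to keep.

```python
from collections import OrderedDict


class LRUCache:
    """Explicit user-space cache with least-recently-used eviction."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # called on a miss, e.g. a disk read
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = self.items[key] = self.loader(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
        return value


cache = LRUCache(capacity=2, loader=lambda name: name.upper())
for name in ["a", "a", "b", "c", "a"]:
    cache.get(name)
print(f"hits={cache.hits} misses={cache.misses}")
```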

> but on a production server that you count on serving up a few billion files every day you can't afford the luxury of random scripts firing off CRON and other niceties like that.

In a data center where you have tens of thousands of heterogeneous jobs competing for thousands of machines, you can't afford the luxury of giving out exclusive access to a machine. You have to have good enough isolation that multiple jobs can run on the same machine without impacting each other negatively. As CPUs get more cores this will become even more important.

The whole point of this article - and it is a very good point - is that keeping your cache in user-space is not the right way to approach the problem. And you can get multiple hits anyway if you make sure that data that will expire together will end up in the same page.

Your other description does not match the use case of a production web server running varnish instances as the front-end.

The whole point of my comment is that PHK's analysis leaves out many downsides of leaving it all to the kernel.

His main argument against doing it in user-space is that you will "fight with the kernel's elaborate memory management." But if you just turn off swap completely and read files with read/write instead of mmap(), there is no fight. Everything happens in user-space.

Leaving it all to the kernel has many disadvantages as I spent many paragraphs explaining.

I missed the 'if you just turn off swap completely' bit in the paragraphs above.

Edit: even on re-reading it all I can't find it.

> do I use some OS abstraction (Virtual Memory or threads, respectively)

Bzzt! You got your analogy backwards.

Using a thread pool is doing it yourself from cross-platform primitives that work on even the shittiest UNIX-wars-era platform, and an event loop using epoll/kqueue is the modern pure OS abstraction.

Read the second half of the sentence you quoted: "as my application's primary scheduling mechanism." Using O(requests) threads leaves the OS in charge of scheduling CPU tasks, just as using VM leaves the OS in charge of scheduling I/O.

> Using a thread pool is doing it yourself from cross-platform primitives that work on even the shittiest UNIX-wars-era platform, and an event loop using epoll/kqueue is the modern pure OS abstraction.

You are very confused. First of all, epoll/kqueue are just optimizations of select(2), which first appeared in 4.2BSD (released in 1983). No standard interface for threading on UNIX appeared until pthreads was standardized in 1995.

But all of this assumes that my argument has anything at all to do with history. It does not. The question is whether you are leaving the OS in charge of scheduling decisions or not.

With select/poll/epoll/kqueue/etc, the OS wakes you up and says "here is a set of events that are ready." It does not actually schedule any work. The application gets to decide, at that moment, what work to do next.
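A small sketch of that division of labor, using Python's `selectors` module (which wraps epoll, kqueue or select depending on the platform). `serve_ready_events` and the "highest id first" policy are hypothetical, chosen only to show that the ordering is the application's call; it assumes a POSIX system where `socket.socketpair()` delivers data immediately:

```python
import selectors
import socket


def serve_ready_events():
    sel = selectors.DefaultSelector()   # epoll, kqueue, or select underneath
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        for i, (reader, writer) in enumerate(pairs):
            sel.register(reader, selectors.EVENT_READ, data=i)
            writer.sendall(b"req%d" % i)     # make every reader ready

        ready = sel.select(timeout=1.0)      # the OS only reports readiness
        # The application picks the order; here, highest id first (arbitrary policy).
        ready.sort(key=lambda item: item[0].data, reverse=True)
        handled = []
        for key, _events in ready:
            handled.append(key.fileobj.recv(16))
        return handled
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

The kernel hands back a set of ready descriptors and nothing more. Which request gets served first is decided by the `sort` line, not by the OS.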

Contrast this with O(requests) threads or VM. If several threads are available to run, the OS chooses which one will run based on its internal scheduling policy. Likewise with VM, the OS is responsible for scheduling pages of RAM and when they will be evicted, based on its own internal logic and policy. This is what makes them higher-level primitives.
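For comparison, a thread-per-request sketch in Python. `handle_all` is a hypothetical helper and `upper()` is a stand-in for real request handling; in CPython the interpreter lock also has a say in which thread runs, but the application still expresses no ordering of its own:

```python
import threading


def handle_all(requests):
    results = []
    lock = threading.Lock()

    def handle(req):
        reply = req.upper()          # stand-in for real request handling
        with lock:
            results.append(reply)    # completion order is not chosen by the app

    # One thread per request: O(requests) threads.
    threads = [threading.Thread(target=handle, args=(r,)) for r in requests]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

All the replies come back, but the order in `results` depends on how the threads happened to be scheduled.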

> the program can get swapped out to service data I/O

mlock() and friends can help for server applications.
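A rough sketch of calling mlock() from Python through `ctypes`, assuming a POSIX libc can be loaded with `ctypes.CDLL(None)`. `lock_one_page` is a hypothetical helper; a real server would more likely lock its working set from C, and the call can legitimately fail when RLIMIT_MEMLOCK is too low:

```python
import ctypes
import mmap


def lock_one_page():
    """Try to pin one anonymous page in RAM; returns True on success."""
    libc = ctypes.CDLL(None, use_errno=True)    # assumes a POSIX libc is loadable
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    buf = mmap.mmap(-1, mmap.PAGESIZE)          # anonymous, page-aligned mapping
    view = ctypes.c_char.from_buffer(buf)
    try:
        addr = ctypes.addressof(view)
        ok = libc.mlock(addr, mmap.PAGESIZE) == 0   # can fail under RLIMIT_MEMLOCK
        if ok:
            libc.munlock(addr, mmap.PAGESIZE)
        return ok
    finally:
        del view        # release the buffer export so the mapping can be closed
        buf.close()
```

While the page is locked the kernel will not swap it out, which is the property you want for the hot parts of a server.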

I was taught that the purpose of virtual memory was originally for running multiple programs without their memory spaces overlapping rather than to use more memory than the system actually had. Is this incorrect?

You're correct in a sense - virtual memory gives each process its own completely independent memory space that's addressed linearly.

In order to preserve the illusion of independence, the OS has to deal with the possibility that the sum of the sizes of all the memory that each process wants to use might be greater than the amount of physical memory available. So rather than aggressively limit the amount of virtual address space that each process can use, it simply only keeps a subset of that memory in physical memory at any one time.
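Here is a small sketch of that behavior, assuming a typical demand-paged OS: reserve a large anonymous mapping but write to only a handful of pages. `sparse_touch` and the 64 MiB size are hypothetical choices for illustration:

```python
import mmap


def sparse_touch(total_bytes=64 * 1024 * 1024, pages_to_touch=4):
    """Reserve a large range of address space but write to only a few pages."""
    region = mmap.mmap(-1, total_bytes)      # anonymous mapping, zero-filled
    try:
        stride = total_bytes // pages_to_touch
        for i in range(pages_to_touch):
            region[i * stride] = i + 1       # write to one byte in a distinct page
        touched = [region[i * stride] for i in range(pages_to_touch)]
        return len(region), touched
    finally:
        region.close()
```

The process sees the full 64 MiB of address space, while the OS typically only has to back the few pages that were actually written.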

You can have virtual memory without paging, but then each process has to compete for a very limited resource. You can also have paging without virtual memory - process A's copy of physical address X can be swapped out to be replaced with process B's copy. However, processes are then still limited by having only as large an address space as physical memory in the machine, and virtual memory is such a huge win for hiding the layout of physical memory from processes as well as isolating them from one another (so no corruption possibilities) that it's pretty much unheard of to do this.

You get such a big advantage from having a layer of indirection sometimes...

Does anyone know a good book that explains these things (CPU caches, RAM, virtual memory, etc.) in more detail?

I've found "Inside the Machine" by Jon Stokes to be quite a good read, though it's a bit dated by now (published 2006).

For programmers I would recommend a real computer architecture textbook (e.g. http://books.google.com/books?id=57UIPoLt3tkC) instead of Stokes's analogy-laden book.

Professional Linux Kernel Architecture has a good introduction to the Linux/OS nomenclature.

There is so much cheap RAM and so many idle CPU cycles that you can even build data stores or programming languages on top of the ridiculously inefficient JVM. ^_^

The point is that there is a kernel to manage system resources, and it is a very good one.

I wish that this explanation could be expanded upon a little. Perhaps with some annotations of selected sections of the source code.
