
What's wrong with 2006 programming? (2010) - baotiao
http://oldblog.antirez.com/post/what-is-wrong-with-2006-programming.html
======
antirez
Hello, a few things have changed in the last 6-7 years:

1\. The Redis project abandoned attempts to have a mixed memory-disk approach,
at least for the near future. I want to focus on trying to do at least one
thing well, and it is already hard ;-) You know, the no-need-to-conquer-the-
world approach. Otherwise the project per se is interesting. Redis Labs has a
commercial fork that works that way, for instance (which _I believe_ was
initially based on the Redis "diskstore" branch I was working on in order to
replace the former "virtual memory" Redis feature), but not the OSS side.
Maybe I'll change my mind in the future, but so far I can't see signs of a
mind change ;-)

2\. About threads, we are now a bit more threaded: Redis 4.0 is able to
perform deletion of keys in the background, Redis Modules have explicit
support for blocking operations that use threads, and so forth. However, my
goal in the next 1-2 years is to finally have threading in the I/O, in order
to scale syscalls and protocol parsing to multiple threads, but _not data
access_. So regarding the 2006 programming, things will stay the same.

Basically I still believe that doing application-side paging, now that disks
are also faster (relative to RAM), is an interesting approach. I still think
that using the kernel VM to do so is a bad idea in general, but it could work
for certain apps.

~~~
eternalban
> Basically I still believe that doing application-side paging, now that
> disks are also faster (relative to RAM), is an interesting approach. I
> still think that using the kernel VM to do so is a bad idea in general, but
> it could work for certain apps.

Please elaborate. If disk/block-device performance is improving, wouldn't the
VM benefit as well?

Also the last sentence seems to make more sense the other way around: VM in
the general case, user-land memory management for "certain apps".

~~~
antirez
The OS VM would benefit; the problem is with using the OS VM to implement
paging in certain applications like Redis. It does not work well because
there is a tension between the flexibility of the in-memory representation
and data locality, and the OS VM needs data locality: it has no information
about the content, so it requires logically grouped data to live in nearby
pages.

About VM in the general case: yes, if by the general case you mean a random
process that is running and runs out of memory. If we are talking about
in-memory systems wanting to off-load data to disk, IMHO the default is that
the VM does not work well.
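
The locality tension antirez describes can be sketched with a toy
calculation (hypothetical addresses and sizes, not real Redis internals):
when an allocator scatters the nodes of one logical value across the heap, a
single "cold" key spans many OS pages, so the kernel cannot swap it out as a
unit.

```python
# Sketch: how many 4 KiB pages does one logical object touch?
# Addresses are made up for illustration, not real Redis data structures.
PAGE = 4096

def pages_touched(addresses):
    """Count the distinct OS pages covered by a set of allocation addresses."""
    return len({addr // PAGE for addr in addresses})

# A "key" whose value is a 10-node structure, allocated contiguously:
packed = [0x1000 + i * 64 for i in range(10)]
# The same structure after the allocator scattered the nodes:
scattered = [0x1000 + i * 5000 for i in range(10)]

print(pages_touched(packed))     # 1 page: swappable as a unit
print(pages_touched(scattered))  # 10 pages: one cold key pins ten pages
```

An application doing its own paging can serialize the whole value into one
page before writing it out, which is exactly the content knowledge the OS VM
lacks.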

------
camtarn
Since the article doesn't mention it until quite far in: Redis is apparently
single-threaded, which is why the blocking nature of OS page swapping is so
disastrous. Presumably for a more traditional server with lots of worker
threads this would be less true.

~~~
tyingq
It does make the conversation interesting, as the "varnish guy" sort of
subtly suggests that single-threaded is subpar. Which seems odd, given that
nginx is single-threaded and in a somewhat similar space to Varnish... and it
seems to enjoy a good reputation for performance.

~~~
mdasen
[http://www.cs.princeton.edu/~vivek/pubs/pai_flash_99.pdf](http://www.cs.princeton.edu/~vivek/pubs/pai_flash_99.pdf)

That's a reasonably good paper on the trade-offs between event-driven,
multi-threaded, and hybrid approaches to file serving.

I don't know that much about nginx in particular, but it seems like they've
implemented thread pools for blocking operations:
[https://www.nginx.com/blog/thread-pools-boost-performance-9x/](https://www.nginx.com/blog/thread-pools-boost-performance-9x/).
"Hard drives are slow (especially the spinning ones), and while the other
requests waiting in the queue might not need access to the drive, they are
forced to wait anyway." So, if you're blocking reading a file from the hard
drive, all the other requests are queued up behind it.

The thread-pool approach noted in the nginx blog sounds pretty much the same
as the approach in the linked paper.

nginx does have a good reputation for performance, but I think a lot of that
reputation comes as a front-end for web applications rather than serving lots
of hard-to-cache files.

Anyway, the nginx blog article as well as the academic paper note that
single-threaded event-driven servers have drawbacks around file I/O, and that
using a worker pool of threads or processes to offload blocking operations
can help mitigate that.
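
The pattern both sources describe can be sketched in a few lines (a
hypothetical illustration, not nginx's actual worker code): the event loop
hands each blocking file read to a small thread pool and keeps serving other
requests in the meantime.

```python
# Sketch of the event-loop + thread-pool pattern described above.
# Handler names and pool size are illustrative, not nginx internals.
import asyncio
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)  # workers absorb blocking disk reads

def blocking_read(path):
    with open(path, "rb") as f:
        return f.read()  # may block on a disk seek; runs off the event loop

async def handle_request(path):
    loop = asyncio.get_running_loop()
    # The event loop keeps serving other requests while a worker blocks here.
    data = await loop.run_in_executor(pool, blocking_read, path)
    return len(data)

async def main():
    # These requests are not queued up behind one slow disk read.
    sizes = await asyncio.gather(*(handle_request(__file__) for _ in range(3)))
    print(sizes)

asyncio.run(main())
```

This is the hybrid architecture from the Flash paper: event-driven for the
network, a helper pool for operations that can fault or seek.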

~~~
tyingq
The thread pools are optional and, per the link you posted, not recommended
unless specific conditions exist. They use streaming media as a good use case
for the thread pools.

Nginx is commonly used as a caching proxy, and called out as being high
performance in those cases. I can't speak as to whether what's being cached is
"hard-to-cache" files.

------
dvirsky
It is important to note that in the many years since this post, while Redis
has remained single-threaded, it also removed the entire concept of VM and
now works fully in memory.

~~~
baotiao
However, Redis transferred this work to jemalloc. Now jemalloc controls the
entire virtual memory.

~~~
dvirsky
In the past you could tune Redis to hold a dataset larger than the memory you
had, and it would swap pages on its own. About a year after this 2010 post,
antirez decided to remove this completely (in Redis 2.6 or 2.8, I don't
remember) and focus entirely on fully in-memory scenarios. VM in the Redis
sense used to be Redis itself swapping stuff to disk with multiple threads.

Here are the redis configuration notes on VM from redis 2.2:

    # Virtual Memory allows Redis to work with datasets bigger than the actual
    # amount of RAM needed to hold the whole dataset in memory.
    # In order to do so very used keys are taken in memory while the other
    # keys are swapped into a swap file, similarly to what operating systems
    # do with memory pages.
    ....
    # vm-max-memory configures the VM to use at max the specified amount of
    # RAM. Everything that does not fit will be swapped on disk _if_ possible,
    # that is, if there is still enough contiguous space in the swap file.
    ...
    # The Redis swap file is split into pages. An object can be saved using
    # multiple contiguous pages, but pages can't be shared between different
    # objects. So if your page is too big, small objects swapped out on disk
    # will waste a lot of space. If your page is too small, there is less
    # space in the swap file (assuming you configured the same number of
    # total swap file pages).
    # If you use a lot of small objects, use a page size of 64 or 32 bytes.
    ....
    # Max number of VM I/O threads running at the same time.
    # These threads are used to read/write data from/to the swap file. Since
    # they also encode and decode objects from disk to memory or the reverse,
    # a bigger number of threads can help with big objects even if they can't
    # help with I/O itself, as the physical device may not be able to cope
    # with many read/write operations at the same time.
    # The special value of 0 turns off threaded I/O and enables the blocking
    # Virtual Memory implementation.

    vm-max-threads 4
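
The page-size trade-off those comments describe is simple arithmetic
(hypothetical object sizes; the 64-byte page figure comes from the config
text above): since pages aren't shared between objects, every object rounds
up to whole pages.

```python
# Sketch of the swap-page trade-off from the config comments above.
# Object sizes are made up; pages are never shared between objects.
import math

def swap_usage(object_sizes, page_size):
    """Return (pages_used, wasted_bytes) for a given swap page size."""
    pages = [math.ceil(size / page_size) for size in object_sizes]
    wasted = sum(p * page_size - s for p, s in zip(pages, object_sizes))
    return sum(pages), wasted

# A thousand small 40-byte objects:
small_objects = [40] * 1000
print(swap_usage(small_objects, 64))    # (1000, 24000): modest waste
print(swap_usage(small_objects, 4096))  # (1000, 4056000): ~4 MB wasted
```

Hence the advice to drop the page size to 64 or 32 bytes when the dataset is
mostly small objects.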

~~~
baotiao
Got it, thank you.

------
geerlingguy
The comments on this post are enlightening. I use both Varnish and Redis, and
the architecture discussion is great!

------
koverstreet
One thing that would really help is if we had buffered asynchronous IO.

------
trevyn
(2010)

------
ploxiln
PHK's post, which inspired this, assumes that the process is swapping. It
describes writing a page to disk to free up that page, then reading in the
anonymous page of data that needs to be used for the write() system call the
process uses to manually cache the data to disk. For the stuff that I use and
work on, if the system is swapping anonymous pages, the situation is dire and
it's time to kill (processes).

Let me back up and try to explain a bit:

While OS kernel developers have put a huge amount of effort into virtual
memory management and paging, which was and is a good and necessary thing, the
definition of "interactive" and "low latency" has changed. Long ago, half-
second latency at a virtual terminal connected to a mainframe with hundreds or
thousands of users was fantastic, compared with dropping off your stack of
punch-cards and coming back 12 hours later.

For most of the software I use and work on today, I want low sub-second
latency. It's often only achievable with reasonable direct control of what is
in memory and what is on disk. If I click a menu in a GUI program that I
haven't clicked in weeks, I don't want to wait half a second for a few
scattered pages to be paged in/out of swap. Same goes for requests to web or
api servers - I don't want less-common requests to take a half second longer
than the typical 50ms or so. For desktop environments, GUIs, databases,
caches, services: no swap.

Certainly, _data_ , multimedia files, dictionaries, etc will need to be read
from disk. The processes can arrange for separate threads to do that. We can
have responsive progress bars, cancel buttons, priorities, timeouts before
hitting an alternative data source - but only if the process itself is in RAM,
not in swap.

Now that desktop and server systems measure DRAM in 10s of gigabytes, this
really should not be hard to achieve!

I've struggled with swap and out-of-memory situations on Linux many times.
The Linux kernel never seems to OOM-kill processes fast enough for me. If I
have no swap, then when memory pressure sets in, the kernel struggles to
shrink buffers, practically freezing most processes, for _a few minutes_
before finally killing the obvious culprit. (I've also tried memory-limiting
containers, and they suffer from the same problem - they freeze up for a few
minutes instead of immediately killing when OOM.)

I used to enable plenty of swap, more than RAM, because that was the common
wisdom, but it causes the same problem when the system comes under memory
pressure: everything freezes for a few minutes. It also has the additional
problem that, despite setting swappiness to 1 or 0, some strange
services/applications will cause the kernel to put some anonymous pages in
swap, even when there's _plenty_ of free physical memory. I never want that!
I need to periodically swapoff and swapon to correct it.

So, at each company I work for, I end up writing a bash script, run by cron
each minute, which checks for low system memory, looks among the application
services for an obvious culprit, and sends it SIGTERM. In practice, this
solves the problem pretty much every time, in the most graceful way. It's
extremely rare that a critical system process is the problem or looks like the
problem. (Except dockerd a couple times ;)
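
A minimal sketch of that kind of watchdog (hypothetical threshold and
process-selection policy; a real one would exempt critical system services
and read live data from /proc):

```python
# Sketch of a cron-run memory watchdog like the one described above.
# The threshold and selection policy are illustrative assumptions.
import os
import signal

LOW_MEM_KB = 512 * 1024  # act when less than ~512 MiB is available

def mem_available_kb(meminfo_text):
    """Parse the MemAvailable figure (in kB) out of /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None

def biggest_process(rss_by_pid):
    """Pick the PID with the largest resident set size: the obvious culprit."""
    return max(rss_by_pid, key=rss_by_pid.get)

def check(meminfo_text, rss_by_pid, kill=os.kill):
    """SIGTERM the biggest process if memory is low; return its PID or None."""
    avail = mem_available_kb(meminfo_text)
    if avail is not None and avail < LOW_MEM_KB:
        pid = biggest_process(rss_by_pid)
        kill(pid, signal.SIGTERM)  # graceful: the service can clean up
        return pid
    return None
```

In real use the inputs would come from open("/proc/meminfo").read() and the
per-process /proc/&lt;pid&gt;/status files, with cron invoking the script every
minute.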

(This is not to bash Linux in particular; Windows and macOS use way more RAM
and swap in general. I've heard the BSDs have been good at particular things
at particular times, but driver support has always been more of a struggle.
Besides the swap/OOM behavior, I'm pretty happy with Linux.)

Letting the OS manage disk and RAM makes perfect sense for bulk data
processing - hadoop, spark, or other map-reduce or stream-processing where a
few seconds pause here and there is no problem if throughput is maximized. But
I personally don't work much on those things - and I'm not a rare case.

------
smegel
> OS paging is blocking as hell

No, Linux is rubbish. Seriously. FreeBSD does this properly.

 _Edit: FreeBSD, Windows, OSX, Solaris, AIX, HP-UX(?)..._

~~~
trungaczne
Do you have any articles that talk about how FreeBSD does memory management
differently?

~~~
smegel
[https://people.freebsd.org/~jlemon/papers/kqueue.pdf](https://people.freebsd.org/~jlemon/papers/kqueue.pdf)

~~~
wmf
Admittedly I didn't read it in detail, but I don't see anything about page-ins
there. Can you explain the connection between kqueue and paging?

~~~
smegel
Sure, it would be my pleasure. Kqueue allows a read request to be scheduled
that is non-blocking on a page fault. Linux always blocks the thread
executing read() on a page fault. This is still true with aio_read(), as all
that does is run another thread to call read(), which blocks. That is fine
for small numbers of read requests but scales poorly.

And the bit from the paper that is relevant:

> A non kqueue-aware application using the asynchronous I/O (aio) facility
> starts an I/O request by issuing aio_read() or aio_write(). The request
> then proceeds independently of the application, which must call aio_error()
> repeatedly to check whether the request has completed, and then eventually
> call aio_return() to collect the completion status of the request. The AIO
> filter replaces this polling model by allowing the user to register the aio
> request with a specified kqueue at the time the I/O request is issued, and
> an event is returned under the same conditions when aio_error() would
> successfully return. This allows the application to issue an aio_read()
> call, proceed with the main event loop, and then call aio_return() when the
> kevent corresponding to the aio is returned from the kqueue, saving several
> system calls in the process.

~~~
wmf
OK, so you're talking about AIO and other people here are talking about mmap.
If you have working AIO then you can indeed write a fully async server at the
cost of extra memory copies.

~~~
smegel
Sadly mmap is also blocking on page faults on Linux :(

~~~
mfukar
I would be very interested in how you envision a system that does NOT block on
page faults.

