1. The Redis project abandoned attempts to have a mixed memory-disk approach, at least for the near future. I want to focus on trying to do at least one thing well, and it is already hard ;-) You know, the no-need-to-conquer-the-world approach. Otherwise the project per se is interesting. Redis Labs has a commercial fork that works that way, for instance (which I believe was initially based on the Redis "diskstore" branch I was working on in order to replace the former "virtual memory" Redis feature), but not the OSS side. Maybe I'll change my mind in the future, but so far I see no signs of changing it ;-)
2. About threads, we are now a bit more threaded: Redis 4.0 is able to perform deletion of keys in the background, Redis Modules have explicit support for blocking operations that use threads, and so forth. However my goal in the next 1-2 years is to finally have threading in the I/O path, in order to scale syscalls and protocol parsing across multiple threads, but not data access. So regarding the 2006 programming model, things will stay the same.
Basically I still believe that application-side paging is an interesting approach, now that disks are also faster relative to RAM. I still think that using the kernel VM to do so is a bad idea in general, but it could work for certain apps.
Please elaborate. If disk/block-device performance is improving, wouldn't the VM benefit as well?
Also the last sentence seems to make more sense the other way around: VM in the general case, user-land memory management for "certain apps".
About VM in the general case: yes, if by "general case" you mean a random process that is running and runs out of memory. If we are talking about in-memory systems wanting to off-load data to disk, IMHO the default is that VM does not work well.
That's a reasonably good paper on the trade-offs between event-driven, multi-threaded, and hybrid approaches to file serving.
I don't know that much about nginx in particular, but it seems like they've implemented thread pools for blocking operations: https://www.nginx.com/blog/thread-pools-boost-performance-9x.... "Hard drives are slow (especially the spinning ones), and while the other requests waiting in the queue might not need access to the drive, they are forced to wait anyway." So, if you're blocking reading a file from the hard drive, all the other requests are queued up behind it.
The thread-pool approach noted in the nginx blog sounds pretty much the same as the approach in the linked paper.
nginx does have a good reputation for performance, but I think a lot of that reputation comes as a front-end for web applications rather than serving lots of hard-to-cache files.
Anyway, both the nginx blog article and the academic paper note that single-threaded event-driven servers have drawbacks around file I/O, and that offloading blocking operations onto a worker pool of threads or processes can help mitigate that.
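As a rough illustration of that offload pattern (a hedged sketch using POSIX threads and a notification pipe, not nginx's actual thread-pool implementation), the event loop hands the blocking read to a worker and keeps serving other clients:

    /* Sketch: offload a blocking file read to a worker thread so the event
     * loop never stalls on disk.  Completion is reported back through a pipe,
     * which a real server would watch with poll()/epoll like any other fd. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    struct read_task {
        const char *path;   /* file to read                     */
        int notify_fd;      /* write end of the completion pipe */
    };

    static void *worker(void *arg)
    {
        struct read_task *t = arg;
        char buf[4096];
        int fd = open(t->path, O_RDONLY);
        ssize_t n = (fd >= 0) ? read(fd, buf, sizeof buf) : -1;  /* may block on disk */
        if (fd >= 0)
            close(fd);
        /* Tell the event loop we're done (here: just the byte count). */
        write(t->notify_fd, &n, sizeof n);
        return NULL;
    }

    int main(void)
    {
        int pipefd[2];
        pipe(pipefd);

        struct read_task task = { "/etc/hostname", pipefd[1] };  /* any readable file */
        pthread_t tid;
        pthread_create(&tid, NULL, worker, &task);

        /* A real server would poll() pipefd[0] alongside its client sockets.
         * Here we just block on it to keep the sketch short. */
        ssize_t n;
        read(pipefd[0], &n, sizeof n);
        printf("worker read %zd bytes without blocking the event loop\n", n);

        pthread_join(tid, NULL);
        return 0;
    }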
Nginx is commonly used as a caching proxy, and called out as being high performance in those cases. I can't speak as to whether what's being cached is "hard-to-cache" files.
Netflix uses nginx to serve hard-to-cache files (using aio sendfile under FreeBSD).
Let's start with Redis being single threaded, the path we are taking is to build "Redis Cluster"...This means that Redis will run 48 instances in a 48 core CPU...
nginx runs multiple worker processes by default, and (I believe) the number can be tuned with a couple of config options.
To run multiple redis instances as a part of the same cluster, you need a way to shard your data (which you have to reason about client-side), you need separate config files, data directories, etc. for each instance. It's a huge pain.
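For a sense of what "reasoning about your sharding client-side" looks like: with Redis Cluster, each key maps to one of 16384 hash slots via a CRC16 of the key, and a client that wants to route commands itself computes roughly the following (a bitwise CRC16-CCITT sketch rather than the table-driven version Redis actually ships; hash-tag handling of keys like {user}:1 is omitted):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* CRC16-CCITT (XMODEM variant), the checksum Redis Cluster uses for
     * key -> slot mapping.  Bitwise version for brevity. */
    static uint16_t crc16(const char *buf, size_t len)
    {
        uint16_t crc = 0;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)((unsigned char)buf[i]) << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    int main(void)
    {
        const char *key = "user:1000";
        /* 16384 slots, spread across however many instances you run. */
        unsigned slot = crc16(key, strlen(key)) % 16384;
        printf("key %s -> slot %u\n", key, slot);
        return 0;
    }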
So the single-threaded model works well for Redis, but it doesn't work well for nginx: if a request in nginx blocks for about 10 seconds, people can't tolerate that.
Here are the redis configuration notes on VM from redis 2.2:
# Virtual Memory allows Redis to work with datasets bigger than the actual
# amount of RAM needed to hold the whole dataset in memory.
# In order to do so, frequently used keys are kept in memory while the other
# keys are swapped into a swap file, similarly to what operating systems do
# with memory pages.
# vm-max-memory configures the VM to use at most the specified amount of
# RAM. Everything that does not fit will be swapped to disk if possible, that
# is, if there is still enough contiguous space in the swap file.
# The Redis swap file is split into pages. An object can be saved using multiple
# contiguous pages, but pages can't be shared between different objects.
# So if your page is too big, small objects swapped out to disk will waste
# a lot of space. If your page is too small, there is less space in the swap
# file (assuming you configured the same number of total swap file pages).
# If you use a lot of small objects, use a page size of 64 or 32 bytes.
# Max number of VM I/O threads running at the same time.
# These threads are used to read/write data from/to the swap file, and since
# they also encode and decode objects from disk to memory or the reverse, a
# bigger number of threads can help with big objects, even if they can't help
# with I/O itself as the physical device may not be able to cope with many
# read/write operations at the same time.
# The special value of 0 turns off threaded I/O and enables the blocking
# Virtual Memory implementation.
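For reference, the directives those comments document looked roughly like this in a Redis 2.2 config file (the values shown are the shipped defaults as best I recall, so treat them as approximate):

vm-enabled no
vm-swap-file /tmp/redis.swap
vm-max-memory 0
vm-page-size 32
vm-pages 134217728
vm-max-threads 4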
Let me back up and try to explain a bit:
While OS kernel developers have put a huge amount of effort into virtual memory management and paging, which was and is a good and necessary thing, the definition of "interactive" and "low latency" has changed. Long ago, half-second latency at a virtual terminal connected to a mainframe with hundreds or thousands of users was fantastic, compared with dropping off your stack of punch-cards and coming back 12 hours later.
For most of the software I use and work on today, I want low sub-second latency. It's often only achievable with reasonable direct control of what is in memory and what is on disk. If I click a menu in a GUI program that I haven't clicked in weeks, I don't want to wait half a second for a few scattered pages to be paged in/out of swap. Same goes for requests to web or api servers - I don't want less-common requests to take a half second longer than the typical 50ms or so. For desktop environments, GUIs, databases, caches, services: no swap.
Certainly, data, multimedia files, dictionaries, etc will need to be read from disk. The processes can arrange for separate threads to do that. We can have responsive progress bars, cancel buttons, priorities, timeouts before hitting an alternative data source - but only if the process itself is in RAM, not in swap.
Now that desktop and server systems measure DRAM in 10s of gigabytes, this really should not be hard to achieve!
I've struggled with swap and out-of-memory situations on Linux many times. The Linux kernel never seems to OOM-kill processes fast enough for me. If I have no swap, then when memory pressure sets in, the kernel struggles to shrink buffers, practically freezing most processes for a few minutes before finally killing the obvious culprit. (I've also tried memory-limiting containers, and they suffer the same problem: they freeze up for a few minutes instead of immediately killing when OOM.) I used to enable plenty of swap, more than RAM, because that was the common wisdom, but it causes the same problem: when the system comes under memory pressure, everything freezes for a few minutes. It also has an additional problem: despite setting swappiness to 1 or 0, some strange services/applications will cause the kernel to put some anonymous pages in swap, even when there's plenty of free physical memory. I never want that! I have to periodically swapoff and swapon to correct it.
So, at each company I work for, I end up writing a bash script, run by cron each minute, which checks for low system memory, looks among the application services for an obvious culprit, and sends it SIGTERM. In practice, this solves the problem pretty much every time, in the most graceful way. It's extremely rare that a critical system process is the problem or looks like the problem. (Except dockerd a couple times ;)
(This is not to bash Linux in particular; Windows and macOS use way more RAM and swap in general. I've heard the BSDs have been good at particular things at particular times, but driver support has always been more of a struggle. Besides the swap/OOM behavior, I'm pretty happy with Linux.)
Letting the OS manage disk and RAM makes perfect sense for bulk data processing - Hadoop, Spark, or other map-reduce or stream-processing workloads where a few seconds' pause here and there is no problem if throughput is maximized. But I personally don't work much on those things - and I'm not a rare case.
No, Linux is rubbish. Seriously. FreeBSD does this properly.
Edit: FreeBSD, Windows, OSX, Solaris, AIX, HP-UX(?)...
FreeBSD does page faults just like everybody else. Your process blocks until the memory is read in.
And the bit from the paper that is relevant:
> A non kqueue-aware application using the asynchronous I/O (aio) facility starts an I/O request by issuing aio_read() or aio_write(). The request then proceeds independently of the application, which must call aio_error() repeatedly to check whether the request has completed, and then eventually call aio_return() to collect the completion status of the request. The AIO filter replaces this polling model by allowing the user to register the aio request with a specified kqueue at the time the I/O request is issued, and an event is returned under the same conditions when aio_error() would successfully return. This allows the application to issue an aio_read() call, proceed with the main event loop, and then call aio_return() when the kevent corresponding to the aio is returned from the kqueue, saving several system calls in the process.
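A minimal sketch of what that looks like on FreeBSD, assuming the interfaces described in the aio(4) and kevent(2) man pages, with all error handling stripped out:

    /* FreeBSD: issue an async read and collect the completion via kqueue
     * instead of polling aio_error().  Illustrative sketch only. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int kq = kqueue();
        int fd = open("/etc/motd", O_RDONLY);

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;
        /* Deliver the completion as a kevent on kq rather than a signal. */
        cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
        cb.aio_sigevent.sigev_notify_kqueue = kq;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        aio_read(&cb);                        /* returns immediately */

        /* ... the main event loop would service other fds here ... */

        struct kevent ev;
        kevent(kq, NULL, 0, &ev, 1, NULL);    /* wakes up when the read is done */
        ssize_t n = aio_return((struct aiocb *)ev.ident);
        printf("read %zd bytes asynchronously\n", n);

        close(fd);
        close(kq);
        return 0;
    }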
Also, using a worker pool for I/O scales just fine and has no real disadvantages when talking to a disk or SSD.
Pretty simple stuff really.
2. kernel: hey thread, that page is still on SLOOOOW spinning disk, why don't you go off and do something else while I get it for you. I'll let you know with an event notification, so be sure to check in with kqueue regularly.
3. thread: OK then, I'll go off and serve these other people while you do that for me. kthanx.
4. kernel: hey DMA controller, I'd like you to get sectors 4, 5 and 6 from platter 3 on HDD 2 and load them into memory address 0x4fe6bb. Please send me an interrupt when done.
5. DMA controller: OK servo, please adjust read head to this offset. Read head, read me those bytes. Memory controller, please store bytes at address 0x4fe6bb. Hey CPU, here's an interrupt to store in your interrupt table, please wake up the kernel guy and let him know.
6. kernel: wow I just got interrupted. The interrupt seems to map to a request for data from this particular thread. Better let him know. (sends event up thread's kqueue).
7. thread: hey, I just got a kqueue notification that the file is now ready to be read. That means it must be in memory...cool!
Make sense now?
Yes, but the one thing they don't give you is notification the data is in memory. Do you really want to spin a hot loop calling mincore?
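For context, the "hot loop calling mincore" alternative would look something like this hedged sketch: mincore(2) reports one residency byte per page, but it only answers "is it resident right now?" and gives no notification when a swapped-out page comes back, so you would have to keep re-asking.

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns 1 if every page in [addr, addr+len) is resident in RAM,
     * 0 if not, -1 on error.  addr is assumed page-aligned (e.g. from mmap).
     * Note the vec type differs slightly between Linux (unsigned char *)
     * and the BSDs (char *). */
    static int range_is_resident(void *addr, size_t len)
    {
        long psz = sysconf(_SC_PAGESIZE);
        size_t pages = (len + psz - 1) / psz;
        unsigned char *vec = malloc(pages);   /* one residency byte per page */
        if (vec == NULL || mincore(addr, len, vec) != 0) {
            free(vec);
            return -1;
        }
        int resident = 1;
        for (size_t i = 0; i < pages; i++)
            if (!(vec[i] & 1))                /* low bit set = page in core */
                resident = 0;
        free(vec);
        return resident;
    }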
> However, once you get an actual page fault, the only option is to block the process.
See steps 1-3 again.
2. while it is being paged into memory, <DO OTHER STUFF>
3. now your data is in memory and you can update it (writes are async by default even on Linux: the write just goes into memory and will get synced out by the kernel's page sync mechanism, but you can override that by setting the O_SYNC flag).
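A small sketch of that difference: an ordinary write() returns once the data is in the page cache, while opening with O_SYNC makes each write wait until the data reaches stable storage.

    /* Default: write() copies into the page cache and returns; the kernel
     * writes it back later.  With O_SYNC, each write() blocks until the
     * data (and required metadata) is on stable storage. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "hello\n";

        int fast = open("async.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        write(fast, msg, sizeof msg - 1);    /* returns as soon as it's in RAM   */
        close(fast);

        int durable = open("sync.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        write(durable, msg, sizeof msg - 1); /* returns only after hitting disk */
        close(durable);
        return 0;
    }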
Writing different code is not the question I asked. That's not an OS feature. Do the OSes you like have the latter ability, or are you blowing smoke?
The POSIX semantics, on the other hand, are simply that the read won't block due to lack of input, so it does as much as it can straight away and then returns. If there's data but the buffer is paged out, the non-blocking read call has to take more time, because the problem is a paged-out buffer and not lack of input.
(The Windows equivalents of man page section 1 are so awful that most POSIX fans just run a mile and install cygwin. More fool them; once you get to man page section 2, it's a lot better. MAXIMUM_WAIT_OBJECTS is lame, though.)
Well you are asking about doing something non-blocking. That means you must be in some kind of event-loop scenario, otherwise you would be happy with synchronous operations. And yes, you would need to add a read event on the file, and then trigger the socket.send() once the data is in memory.
I never said it was free. But at least it is viable.
The Redis article is talking about in-memory data structures. The page fault happens from following a pointer to another location in memory, for example, a linked list or a skip list. There is no read call to replace with an async read call. Your code evaluates the pointer to a page that has been swapped out, a page fault occurs, and the OS has to swap it back in for you to be able to read that memory.
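To make that concrete, here is a hedged sketch of the situation being described: the "read" is just a pointer dereference, so there is no syscall you could swap for an asynchronous one.

    /* Traversing an in-memory linked list: if a node's page has been
     * swapped out, the page fault happens at the pointer dereference
     * itself.  There is no read() here to replace with aio_read(); the
     * kernel must block this thread until the page is back in RAM. */
    #include <stddef.h>

    struct node {
        long value;
        struct node *next;
    };

    long sum_list(struct node *head)
    {
        long total = 0;
        for (struct node *n = head; n != NULL; n = n->next) {
            total += n->value;   /* <- possible page fault right here */
        }
        return total;
    }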
You could potentially create a new signal, for page faults (maybe there is one I've never heard of), but that would still not let you continue executing from the previous location.