A "50/50" rule such as the one described is a clear signal that the memory manager's cache eviction policy is an ad-hoc hack. This sort of problem would benefit from some old-fashioned operations research-style statistical and mathematical modeling. What's the probability that a page will be needed? What's the cost of keeping it around? A proper model would be able to answer these questions. A "recency cache" and "frequency cache" is missing the bigger picture.
> What's the probability that a page will be needed? What's the cost of keeping it around?
The answer is pretty simple: It depends. What's your workload? There's no general answer.
So, Linux picked a heuristic. They could probably have picked a better one, but short of some sort of all-knowing machine learning algorithm that can predict future workloads, any heuristic they pick will be suboptimal in some regard.
> The answer is pretty simple: It depends. What's your workload? There's no general answer.
Which is, ironically, an argument for microkernels. In some microkernel designs you could, in principle, allow for applications to provide their own disk and memory managers.
To a degree. Both memory and disk bandwidth are shared resources. It's hard to prevent one algorithm from stomping on another without giving it exclusive control. So while you may be able to swap in a memory manager that doesn't have this pathological case, it might negatively affect other applications.
To give an extreme example, imagine that an application sets up a custom memory manager that takes all available RAM, and makes it exclusive to that application. In reality, it would probably be less insane, but you would still get regressions with multiple memory managers. And if you decide you have to have one true memory manager... you're back to the situation that exists today, only potentially slightly easier to configure away from bad behavior.
Definitely, but my argument is that while OSes must account for the shared-server case, in practice many servers are dedicated to a single service.
Suppose you built a memory manager exclusively tuned for PostgreSQL, with the assumption that it has the machine to itself. That memory manager would have a very different performance profile from one designed behind the "veil of ignorance" that a general-purpose memory manager has to cope with.
There is nothing stopping tunable parameters from being added to the memory manager. Making core functions pluggable doesn't seem to be the way the Linux developers want to do things, and I think that's probably justified (see the CFS debacle with Con Kolivas and his ideas around pluggable schedulers in the Linux kernel [1]).
Certainly it looks like there are some ideas that Johannes Weiner is considering for the future, and is actively working on with the set of patches he has merged:
> Future
>
> Right now we have a fixed ratio (50:50) between inactive and active list but we already have complaints about working sets exceeding half of memory being pushed out of the cache by simple streaming in the background. Ultimately, we want to adjust this ratio and allow for a much smaller inactive list. These patches are an essential step in this direction because they decouple the VMs ability to detect working set changes from the inactive list size. This would allow us to base the inactive list size on the combined readahead window size for example and potentially protect a much bigger working set.
>
> Another possibility of having thrashing information would be to revisit the idea of local reclaim in the form of zero-config memory control groups. Instead of having allocating tasks go straight to global reclaim, they could try to reclaim the pages in the memcg they are part of first, as long as the group is not thrashing. This would allow a user to drop e.g. a back-up job in an otherwise unconfigured memcg and it would only inflate (and possibly do global reclaim) until it has enough memory to do proper readahead. But once it reaches that point and stops thrashing it would just recycle its own used-once pages without kicking out the cache of any other tasks in the system more than necessary. [2]
Applications can already manage their own memory in Linux. O_DIRECT bypasses the buffer cache when doing I/O, and there are plenty of ways that applications can influence or take control of their own memory using various system calls, many of which overlap or are redundant to some degree. It doesn't require a microkernel, and it's not new.
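For illustration, here's a minimal sketch of the O_DIRECT route (the file name is a placeholder, error handling is pared down, and the 4096-byte alignment is an assumption about the device's block size):

    /* Sketch: read a file while bypassing the kernel's buffer cache.
     * O_DIRECT requires aligned buffers, hence posix_memalign. */
    #define _GNU_SOURCE        /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_RDONLY | O_DIRECT);  /* placeholder name */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        ssize_t n = read(fd, buf, 4096);   /* served from disk, not page cache */
        printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }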
Right. There are lots of state-of-the-art algorithms in Linux. If there were a published solution that was obviously much better, you can bet it would be implemented.
I don't follow the kernel discussions, but since scheduling algorithms always create winners and losers, you can imagine why there'd be resistance to change.
In the case of scheduling we have the bizarre situation where we have pluggable I/O schedulers for different workloads, but a blank refusal to allow the same for CPU scheduling.
Well don't just post this here. Post on the Linux Kernel Mailing List. If you can explain to them how to improve their ad-hoc hack, I'm sure they'd love to hear from you. It would almost certainly be hugely valuable to companies like IBM or Google to have better memory management on their servers. In fact, it could probably boost Linux performance on the Supercomputer list!
Keep in mind that never, ever having really bad things happen is the first priority for an algorithm of this sort; only then can it think about being efficient in a really clever way.
I'd imagine that from the statistical-analysis perspective, all memory managers look like "ad-hoc hacks", but I believe what actually happens is that they are really smart hacks, tested against multiple use cases.
Creating a statistical analysis/machine-learning/etc program that can deal quickly with heterogeneous data robustly in real-time would be an incredible achievement.
If anyone knows of a situation where serious machine learning has been applied at such a low level, I would love to hear about it (the only thing vaguely similar I can recall is the use of neural networks for branch prediction, something that's been studied but not implemented).
"Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to protect frequently used
pages and the ability to detect frequently used pages. This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching."
I wonder if they could have worked around the issue by doing a mmap() on the first file followed by reading it in, then letting the workload on it finish, followed by calling madvise(..., ..., MADV_DONTNEED) and then doing the same for the second file afterwards.
Obviously the kernel fix is the right thing to do but until that's vetted may be something like the above can help.
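Something like this sketch, say (file names are placeholders, and error handling is stripped to the bone):

    /* Sketch of the suggested workaround: map a file, consume it, then
     * advise the kernel that its pages can be dropped before moving on
     * to the next file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static long count_lines(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (buf[i] == '\n')
                lines++;

        madvise(buf, st.st_size, MADV_DONTNEED);  /* let go of these pages */
        munmap(buf, st.st_size);
        close(fd);
        return lines;
    }

    int main(void)
    {
        printf("file1: %ld lines\n", count_lines("file1"));
        printf("file2: %ld lines\n", count_lines("file2"));
        return 0;
    }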
For the simple wc example we used in the blog post, that could have worked. (We used this wc example to reduce the problem to its simplest form.)
In practice though, we use PostgreSQL, and we don't have any control over how Postgres reads its pages. So for our customers, we'd still have this problem.
Well, with coreutils like wc etc. you get low-level control over such things, but with postgres there would be less flexibility. For illustration, one could use GNU dd to take advantage of the page cache only for readahead purposes, like:
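(The command itself was missing from the comment; this is one plausible reconstruction, using GNU dd's nocache flag, available since coreutils 8.11.)

    # readahead still works, but the cached pages are dropped once consumed
    # ("bigfile" is a placeholder name)
    dd if=bigfile of=/dev/null bs=1M iflag=nocache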
A more common technique is to bypass the page cache altogether; it is often used to avoid the many unfortunate characteristics of the current Linux VM. This is usually done with directIO:
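(Again, the example command was missing; presumably something along these lines, using dd's direct flag, which opens the file with O_DIRECT.)

    # bypass the page cache entirely; note O_DIRECT's alignment requirements
    dd if=bigfile of=/dev/null bs=1M iflag=direct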
Now postgres might be able to use directIO as an option?
Another related problem, with too much caching when writing to a slow device, can be seen in this thread:
http://thread.gmane.org/gmane.linux.kernel.mm/108708
That thread actually describes two problems.
1. That Linux waits too long before writing
2. When it does write large amounts to a slow device it locks out everything else
Do you know how madvise(MADV_DONTNEED) is treated if the pages were already in cache from another process? I can't see any easy solution for how this should be handled.
Also, there seems to be confusion as to whether MADV_DONTNEED can be 'destructive'. I think this is a difference between anonymous and file-backed mmap(). Do you know what the actual case is?
Another process caching the same file follows the UNIX permission model. If both processes have permission to the file, then I think the other process will be able to madvise() and drop it from the page cache.
If the app tells the kernel it is done with the range, it is saying that it doesn't care about the data in that range anymore. So MADV_DONTNEED will not flush dirty pages to the backing file store without msync(); if you access that range again, it will reload the pages from the backing file, or zero-filled ones in the anon case.
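The anonymous case is easy to check with a toy program like this (Linux-specific, illustrative only):

    /* After MADV_DONTNEED on an anonymous mapping, the next access
     * gets freshly zero-filled pages. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        p[0] = 42;
        madvise(p, len, MADV_DONTNEED);   /* discard the page's contents */
        printf("p[0] after MADV_DONTNEED: %d\n", p[0]);  /* prints 0 */
        munmap(p, len);
        return 0;
    }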
That doesn't sound right at all. If both processes have permission to the mapping, then sure. But permissions on the file itself have nothing to do with it.
There is no such thing as "permission to the mapping" - you can only MAP_SHARED or MAP_PRIVATE - in the latter case the updates are not visible to other processes mapping the same file. Whether you can mmap() a file solely depends on permissions of the file - the one you pass in to mmap using the "int fd" parameter. So if two processes both have READ access to a file, both can mmap with PROT_READ and if both can madvise() then the kernel zaps the page ranges.
Do you see anything in the Linux kernel code that says otherwise?
MAP_SHARED and MAP_PRIVATE are indeed permissions (not particularly fine-grained) to the mapping. Perhaps they don't come into play here, but saying "there is no such thing as 'permission to the mapping'" is false.
I really doubt that if two processes are mapping the same file, and one calls madvise(MADV_DONTNEED), it'll drop the pages from memory entirely. That seems like a great way to let one process DoS another. If the other process has marked it MADV_WILLNEED, that would be especially bad.
If they've both mapped it MAP_PRIVATE, then the mappings should be entirely separate anyway (though copy-on-write semantics are presumably used), and a madvise() on one shouldn't affect the other.
If ARC could be implemented in an efficient manner, it would be. I'm guessing integrating these algorithms into existing kernel code is no small feat, though.
Prior to 2.6.31, the kernel simply used two lists. New pages were brought into the recency list, and if they were referenced for a second time, these pages were promoted to the frequency list.
Similarly, if the recency list needed more memory, pages in the frequency list were demoted to it (giving these pages a second chance), and eventually removed from the cache.
In 2.6.31, a change was checked in to better handle cases such as nightly backups evicting interactive user program data. With this change, the frequency list got a minimum size: once it dropped to 50% of cache memory, pages were no longer demoted from it.
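As a toy illustration of that two-list scheme (entirely made-up structure, nothing like the real mm/vmscan.c code):

    /* Toy model of the pre-2.6.31 two-list scheme: first touch puts a
     * page on the recency list, a second touch promotes it to the
     * frequency list, and reclaim demotes/evicts from the tails. */
    #include <stdio.h>
    #include <string.h>

    #define LIST_MAX 4

    struct lru { int pages[LIST_MAX]; int n; };   /* index 0 = most recent */

    static struct lru recency, frequency;

    static int find(struct lru *l, int page)
    {
        for (int i = 0; i < l->n; i++)
            if (l->pages[i] == page)
                return i;
        return -1;
    }

    static void remove_at(struct lru *l, int i)
    {
        memmove(&l->pages[i], &l->pages[i + 1],
                (size_t)(l->n - i - 1) * sizeof(int));
        l->n--;
    }

    static void push_front(struct lru *l, int page)
    {
        memmove(&l->pages[1], &l->pages[0], (size_t)l->n * sizeof(int));
        l->pages[0] = page;
        l->n++;
    }

    static void touch(int page)
    {
        int i = find(&recency, page);
        if (i >= 0) {                             /* second reference */
            remove_at(&recency, i);
            if (frequency.n == LIST_MAX) {        /* demote: second chance */
                push_front(&recency, frequency.pages[frequency.n - 1]);
                frequency.n--;
            }
            push_front(&frequency, page);
        } else if (find(&frequency, page) < 0) {  /* first reference */
            if (recency.n == LIST_MAX)
                recency.n--;                      /* evict coldest page */
            push_front(&recency, page);
        }
    }

    int main(void)
    {
        for (int p = 1; p <= 4; p++) touch(p);    /* pages 1-4 enter recency */
        touch(2); touch(3);                       /* promoted to frequency */
        printf("recency: %d pages, frequency: %d pages\n",
               recency.n, frequency.n);           /* -> 2 and 2 */
        return 0;
    }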
Here's my high level understanding of how Windows handles this.
Windows uses a set of LRU lists to manage its page cache, where each LRU list corresponds to a priority. In its simplest form then, your pages are evicted according to an LRU eviction policy.
Now, the user can also set priorities on their processes. Let's say they set priority #2. The pages referenced by this process are then inserted into the corresponding LRU cache #2. When a page needs to be evicted from the cache, the LRU list with the lowest priority is consulted first (LRU cache #0), and entries are evicted from it.
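In toy form, the scheme I'm describing looks something like this (all details invented for illustration, not actual Windows internals):

    /* Toy prioritized-LRU cache: one LRU queue per priority level;
     * eviction scans from the lowest-priority level upward. */
    #include <stdio.h>

    #define LEVELS 8
    #define SLOTS  4

    /* each level is ordered oldest -> newest */
    static int cache[LEVELS][SLOTS];
    static int count[LEVELS];

    static void insert(int level, int page)
    {
        if (count[level] == SLOTS) {                 /* level full: drop its LRU */
            for (int i = 1; i < SLOTS; i++)
                cache[level][i - 1] = cache[level][i];
            count[level]--;
        }
        cache[level][count[level]++] = page;
    }

    static int evict(void)
    {
        for (int lv = 0; lv < LEVELS; lv++)          /* lowest priority first */
            if (count[lv] > 0) {
                int victim = cache[lv][0];           /* LRU entry of that level */
                for (int i = 1; i < count[lv]; i++)
                    cache[lv][i - 1] = cache[lv][i];
                count[lv]--;
                return victim;
            }
        return -1;                                   /* cache empty */
    }

    int main(void)
    {
        insert(2, 100);   /* page from a priority-2 process */
        insert(0, 200);   /* background/low-priority page   */
        printf("evicted page %d\n", evict());  /* -> 200: level 0 goes first */
        return 0;
    }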
So let's get this right. We want to use wc -l to count how many lines there are in a file? And we're using this to benchmark the Linux VM?
The only thing I will say is: stop beating up your hard drives by asking them to read the whole file and count how many lines there are in it. Use what the filesystems provide already.
stat -c%s filename
Benchmarking with wc -l is filled with problems, and this article unfortunately has more flaws than this, but I'll stop now.
The article wasn't about how to get the size of a file. It was about how Linux caches large files. The article just used wc -l as a simple way of loading the entire file into memory.
Unless you disable it, ext filesystems normally store this metadata, and it can be pulled out of 'stat' easily. Sure, you can read the whole file and count every line; or you can trust what the filesystem's metadata says.