Hacker News new | past | comments | ask | show | jobs | submit login
Linux memory manager and your big data (citusdata.com)
121 points by ozgune on Nov 27, 2013 | hide | past | web | favorite | 41 comments

A "50/50" rule such as the one described is a clear signal that the memory manager's cache eviction policy is an ad-hoc hack. This sort of problem would benefit from some old-fashioned operations research-style statistical and mathematical modeling. What's the probability that a page will be needed? What's the cost of keeping it around? A proper model would be able to answer these questions. A "recency cache" and "frequency cache" is missing the bigger picture.

> What's the probability that a page will be needed? What's the cost of keeping it around?

The answer is pretty simple: It depends. What's your workload? There's no general answer.

So, Linux picked a heuristic. They could probably have picked a better one, but short of some sort all knowing of machine learning algorithm which can predict future workloads, the heuristic they pick will be suboptimal in some regard.

> The answer is pretty simple: It depends. What's your workload? There's no general answer.

Which is, ironically, an argument for microkernels. In some microkernel designs you could, in principle, allow for applications to provide their own disk and memory managers.

To a degree. Both memory and disk bandwidth are shared resources. It's hard to prevent one algorithm from stomping on another without giving it exclusive control. So while you may be able to swap in a memory manager that doesn't have this pathological case, it might negatively affect other applications.

To give an extreme example, imagine that an application sets up a custom memory manager that takes all available RAM, and makes it exclusive to that application. In reality, it would probably be less insane, but you would still get regressions with multiple memory managers. And if you decide you have to have one true memory manager... you're back to the situation that exists today, only potentially slightly easier to configure away from bad behavior.

Definitely, but my argument is that OSes must account for the shared-server case, but in practice many servers are dedicated to a single service.

Supposing you built a memory manager exclusively tuned for PostgreSQL, with the assumption that it has the machine to itself. That memory manager would have a different performance profile from the kind of "veil of ignorance" that a general-purpose memory manager has to cope with.

There is nothing stopping the memory manager having tunable parameters from being added. Making core functions pluggable doesn't seem to be the way that the Linux developers want to do things, and I think it's probably justified (see the CFS debacle with Con Kolivas and his ideas around pluggable schedulers in the Linux kernel [1]).

Certainly it looks like there is some ideas that Johannes Weiner is considering for the future, and is actively working on with the set of patches he has merged:


  Right now we have a fixed ratio (50:50) between inactive and active
  list but we already have complaints about working sets exceeding half
  of memory being pushed out of the cache by simple streaming in the
  background.  Ultimately, we want to adjust this ratio and allow for a
  much smaller inactive list.  These patches are an essential step in
  this direction because they decouple the VMs ability to detect working
  set changes from the inactive list size.  This would allow us to base
  the inactive list size on the combined readahead window size for
  example and potentially protect a much bigger working set.

  Another possibility of having thrashing information would be to
  revisit the idea of local reclaim in the form of zero-config memory
  control groups.  Instead of having allocating tasks go straight to
  global reclaim, they could try to reclaim the pages in the memcg they
  are part of first, as long as the group is not thrashing.  This would
  allow a user to drop e.g. a back-up job in an otherwise unconfigured
  memcg and it would only inflate (and possibly do global reclaim) until
  it has enough memory to do proper readahead.  But once it reaches that
  point and stops thrashing it would just recycle its own used-once
  pages without kicking out the cache of any other tasks in the system
  more than necessary. [2]
1. http://apcmag.com/why_i_quit_kernel_developer_con_kolivas.ht...

2. https://lkml.org/lkml/2013/11/24/133

Applications can already manage their own memory in Linux. O_DIRECT bypasses the buffer cache when doing I/O, and there are plenty of ways that applications can influence or take control of their own memory using various system calls, many of which overlap or are redundant to some degree. It doesn't require a microkernel, and it's not new.

Right. There are lots of state of the art algorithms in Linux. If there was a published solution that was obviously much better, you can bet it would be implemented.

You'd like to think so, but some of the intransigence around scheduling and tracing would suggest that's not always the case.

I don't follow the kernel discussions, but since scheduling algorithms always create winners and losers, you can imagine why there'd be resistance to change.

In the case of scheduling we have the bizarre situtation where we have pluggable IO schedulers for different workloads, but a blank refusal to allow the same for CPU scheduling.

Well don't just post this here. Post on the Linux Kernel Mailing List. If you can explain to them how to improve their ad-hoc hack, I'm sure they'd love to hear from you. It would almost certainly be hugely valuable to companies like IBM or Google to have better memory management on their servers. In fact, it could probably boost Linux performance on the Supercomputer list!

And they'll be super friendly about it.

I would pay money to watch it.


Keep in mind that never, ever having really bad things happen is first priority for an algorithm of that sort and only then can it think about being efficient in a really clever way.

I'd imagine that from the statistical analysis perspective, all memory managers seem like "ad-hoc hacks" but I believe what happens is they are really smart hacks tested with multiple use cases.

Creating a statistical analysis/machine-learning/etc program that can deal quickly with heterogeneous data robustly in real-time would be an incredible achievement.

If anyone knows of a situation where serious machine learning has been applied at such a low level, I would love to hear about it (the only thing vaguely similar I recall is that use of neural networks for branch prediction, something that's been studied but not implemented).

You should look up some links in the article (proposed solution https://lkml.org/lkml/2013/11/24/133)

"Currently, the VM aims for a 1:1 ratio between the lists, which is the "perfect" trade-off between the ability to protect frequently used pages and the ability to detect frequently used pages. This means that working set changes bigger than half of cache memory go undetected and thrash indefinitely, whereas working sets bigger than half of cache memory are unprotected against used-once streams that don't even need caching."

I was under the impression that this code has to be really fast. I wonder how much computation they can get away with.

I wonder if they could have worked around the issue by doing a mmap() on the first file followed by reading it in, then letting the workload on it finish, followed by calling madvise(..., ..., MADV_DONTNEED) and then doing the same for the second file afterwards.

Obviously the kernel fix is the right thing to do but until that's vetted may be something like the above can help.

For the simple wc example we used in the blog post, that could have worked. (We used this wc example to reduce the problem to its simplest form.)

In practice tough, we use PostgreSQL, and we don't have any control over how Postgres reads its pages. So for our customers, we'd still have this problem.

Ah. I had a feeling that would be the case! It's never that simple in The Real Life(TM) :)

Postgres could do this though if they detect broken kernel version and the right workload and many users might auto benefit from that.

Well with coreutils like wc etc. you get low level control over such things, but with postgres there would be less flexibility. For illustration one could use GNU dd to take advantage of the page cache only for readahead purposes like:

    dd if=clickstream.csv.1 iflag=nocache bs=1M | wc -l
A more common technique is to bypass the page cache altogether and is often use to avoid the many unfortunate characteristics of the current Linux VM. This is done usually with directIO:

    dd if=clickstream.csv.1 iflag=direct bs=1M | wc -l
Now postgres might be able to use directIO as an option?

Another related problem with too much caching when writing to slow device can be seen in this thread: http://thread.gmane.org/gmane.linux.kernel.mm/108708 That thread actually describes two problems. 1. That Linux waits too long before writing 2. When it does write large amounts to a slow device it locks out everything else

Do you know how madvise(MADV_DONTNEED) is treated if the pages were already in cache from another process? I can't see any easy solution for how this should be handled.

Also, there seems to be confusion as to whether MADV_DONTNEED can be 'destructive'. I think this is a difference between anonymous and file-backed mmap(). Do you know what the actual case is?

Another process caching the same file - it follows UNIX permission model. If both processes have permission to the file - then I think the other process will be able to madvise() and drop it from page cache.

If the app tells the kernel it is done with the range - it is telling that it doesn't care about the data in that range anymore. So MADV_DONTNEED will not flush dirty pages to backing file store without msync() - if you access that range again it will reload the pages from the backing file or zero-filled ones for the anon case.

That doesn't sound right at all. If both process have permission to the mapping, then sure. But permissions on the file skids have nothing to do with it.

There is no such thing as "permission to the mapping" - you can only MAP_SHARED or MAP_PRIVATE - in the latter case the updates are not visible to other processes mapping the same file. Whether you can mmap() a file solely depends on permissions of the file - the one you pass in to mmap using the "int fd" parameter. So if two processes both have READ access to a file, both can mmap with PROT_READ and if both can madvise() then the kernel zaps the page ranges.

Do you see anything in the Linux kernel code that says otherwise?

MAP_SHARED and MAP_PRIVATE are indeed permissions (not particularly fine-grained) to the mapping. Perhaps they don't come into play here, but saying "there is no such thing as 'permission to the mapping'" is false.

I really doubt that if two processes are mapping the same file, and one calls madvise(MADV_DONTNEED), it'll drop the pages from memory entirely. That seems like a great way to let one process DoS another. If the other process has marked it MADV_WILLNEED, that would be especially bad.

If they've both mapped it MAP_PRIVATE, then the mappings should be entirely separate anyway (though copy-on-write semantics are presumably used), and a madvise() on one shouldn't affect the other.

Would something like an adaptive replacement cache, where there are two caching mechanisms (LRU & LFU) but one list, help mitigate this?


If ARC could be implemented in an efficient manner, it would. I'm guessing integrating these algorithms into existing kernel code is no small feat though.

Also, ARC may need to be tweaked so that its cost doesn't become prohibitive. That's one reason why the current memory manager isn't using LRU, but an approximation of it: http://en.wikipedia.org/wiki/Page_replacement_algorithm#Leas...

ZFS uses an ARC.

Indeed, and ZFS has quite a high CPU overhead. There's always tradeoffs.

The post says kernel versions newer than 2.6.31. Did this happen on older versions as well?

My understanding is that it didn't.

Prior to 2.6.31, the kernel simply used two lists. New pages were brought into the recency list, and if they were referenced for a second time, these pages were promoted to the frequency list.

Similarly, if the recency list needed more memory, pages in the frequency list were demoted to it (giving these pages a second chance), and eventually removed from the cache.

In 2.6.31, a change was checked in to better handle cases such as nightly backups evicting interactive user program data. With this change, the frequency list's minimum size was fixed, and if the size dropped down to 50%, pages were no longer demoted from it.

I think an assessment of how Windows would perform here is appropriate.

Here's my high level understanding of how Windows handles this.

Windows uses a set of LRU lists to manage its page cache, where each LRU list corresponds to a priority. In its simplest form then, your pages are evicted according to an LRU eviction policy.

Now, the user can also set priorities on their processes. Let's say they set priority #2. The pages referenced by this process are then inserted into the corresponding LRU cache #2. When a page needs to be evicted from the cache, the LRU list with the lowest priority is consulted first (LRU cache #0), and entries are evicted from it.

So let's get this right. We want to use wc -l to count how many lines there are in a file? and we're using this to benchmark the Linux VM?

The only thing I will say is; stop beating up your hard drives asking them to read the whole file and count how many lines there are in it. Use what the filesystems provide already.

stat -c%s filename

Benchmarking with wc -l is filled with problems; and this article is unfortunately has more flaws than this but I'll stop now.

The article wasn't about how to get the size of a file. It was about how linux caches large files. The article just used wc -l as a simple way of loading the entire file into memory.

Ignoring that, stat -c%s gives the size of the file in bytes, not a count of newlines.

It's always funny to see how the wrong are often so cocksure.

Actually, I'm a little baffled how someone who knows about stat has such an odd idea about how files work.

How could you possibly know how many newline characters are in a file without looking in the file?

Unless you disable this; ext filesystems normally store this and can be pulled out of 'stat' easily. Sure you can read the whole file and count every line; or you can trust what the filesystem's metadata says it is.

What version of coreutils are we talking about here? What filesystem mount options do you have?

I would think cat would do the same no? or reading the file in $lang?

Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact