
Linux memory manager and your big data - ozgune
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
======
EvanMiller
A "50/50" rule such as the one described is a clear signal that the memory
manager's cache eviction policy is an ad-hoc hack. This sort of problem would
benefit from some old-fashioned operations research-style statistical and
mathematical modeling. What's the probability that a page will be needed?
What's the cost of keeping it around? A proper model would be able to answer
these questions. A "recency cache" and a "frequency cache" are missing the
bigger picture.
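
As a toy illustration of the kind of model I mean (all names and numbers here
are hypothetical, just to make the idea concrete): keep a page cached only if
its expected refetch cost outweighs the cost of holding it in RAM.

    /* Toy sketch of an expected-cost eviction criterion. All names are
     * hypothetical illustrations, not an actual kernel interface. */
    static int worth_keeping(double p_reuse,      /* probability the page is needed again */
                             double refetch_cost, /* cost of reading it back from disk    */
                             double holding_cost) /* cost of the RAM it occupies          */
    {
        return p_reuse * refetch_cost > holding_cost;
    }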

~~~
ori_b
> What's the probability that a page will be needed? What's the cost of
> keeping it around?

The answer is pretty simple: It depends. What's your workload? There's no
general answer.

So, Linux picked a heuristic. They could probably have picked a better one,
but short of some sort of all-knowing machine learning algorithm that can
predict future workloads, whatever heuristic they pick will be suboptimal in
some regard.

~~~
jacques_chester
> _The answer is pretty simple: It depends. What's your workload? There's no
> general answer._

Which is, ironically, an argument for microkernels. In some microkernel
designs you could, in principle, allow for applications to provide their own
disk and memory managers.

~~~
ori_b
To a degree. Both memory and disk bandwidth are shared resources. It's hard to
prevent one algorithm from stomping on another without giving it exclusive
control. So while you may be able to swap in a memory manager that doesn't
have this pathological case, it might negatively affect other applications.

To give an extreme example, imagine that an application sets up a custom
memory manager that takes all available RAM, and makes it exclusive to that
application. In reality, it would probably be less insane, but you would still
get regressions with multiple memory managers. And if you decide you have to
have one true memory manager... you're back to the situation that exists
today, only potentially slightly easier to configure away from bad behavior.

~~~
jacques_chester
Definitely, but my argument is that while OSes must account for the
shared-server case, in practice many servers are dedicated to a single
service.

Suppose you built a memory manager tuned exclusively for PostgreSQL, with the
assumption that it has the machine to itself. That memory manager would have a
different performance profile from a general-purpose one, which has to operate
behind a kind of "veil of ignorance" about its workload.

------
blinkingled
I wonder if they could have worked around the issue by doing an mmap() on the
first file, reading it in, letting the workload on it finish, then calling
madvise(..., ..., MADV_DONTNEED), and doing the same for the second file
afterwards.

Obviously the kernel fix is the right thing to do, but until that's vetted,
maybe something like the above can help.
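
Something like this is what I have in mind for the simple wc-style case (a
rough sketch with error handling trimmed; whether madvise(MADV_DONTNEED) alone
also releases the shared page-cache copy, rather than just this process's
mapping, is a separate question):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a file, scan it once (count newlines), then hint to the kernel
     * that its pages are no longer needed before moving on to the next
     * file. A rough sketch of the workaround described above. */
    static long count_lines_and_drop(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) {
            close(fd);
            return -1;
        }

        char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) {
            close(fd);
            return -1;
        }

        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (buf[i] == '\n')
                lines++;

        /* Tell the kernel we're done with these pages so they don't crowd
         * out the pages the second file will need. */
        madvise(buf, st.st_size, MADV_DONTNEED);

        munmap(buf, st.st_size);
        close(fd);
        return lines;
    }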

~~~
ozgune
For the simple wc example we used in the blog post, that could have worked.
(We used this wc example to reduce the problem to its simplest form.)

In practice, though, we use PostgreSQL, and we don't have any control over how
Postgres reads its pages. So for our customers, we'd still have this problem.

~~~
blinkingled
Ah. I had a feeling that would be the case! It's never that simple in Real
Life(TM) :)

Postgres could do this, though, if it detects a broken kernel version and the
right workload, and many users might automatically benefit from that.

------
fiatmoney
Would something like an adaptive replacement cache, where there are two
caching mechanisms (LRU & LFU) but one list, help mitigate this?

[http://en.wikipedia.org/wiki/Adaptive_replacement_cache](http://en.wikipedia.org/wiki/Adaptive_replacement_cache)

~~~
ozgune
If ARC could be implemented in an efficient manner, it would. I'm guessing
integrating these algorithms into existing kernel code is no small feat
though.

Also, ARC may need to be tweaked so that its cost doesn't become prohibitive.
That's one reason why the current memory manager isn't using LRU, but an
approximation of it:
[http://en.wikipedia.org/wiki/Page_replacement_algorithm#Leas...](http://en.wikipedia.org/wiki/Page_replacement_algorithm#Least_recently_used)
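
For illustration, the classic textbook approximation is the clock /
second-chance scheme, which replaces exact access ordering with a referenced
bit (just a toy sketch of the idea, not the kernel's actual implementation):

    #include <stdbool.h>
    #include <stddef.h>

    struct page {
        bool referenced;    /* set when the page is accessed */
        /* ... mapping info, data, etc. ... */
    };

    /* Sweep a clock hand over the pages: a page whose referenced bit is
     * set gets a second chance (the bit is cleared); the first page found
     * with the bit already clear is the eviction victim. */
    static size_t clock_evict(struct page *pages, size_t npages, size_t *hand)
    {
        for (;;) {
            struct page *p = &pages[*hand];
            size_t idx = *hand;
            *hand = (*hand + 1) % npages;

            if (!p->referenced)
                return idx;         /* evict this one */
            p->referenced = false;  /* second chance */
        }
    }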

~~~
dap
ZFS uses an ARC.

~~~
rodgerd
Indeed, and ZFS has quite a high CPU overhead. There are always tradeoffs.

------
Nikiforos79
The post says kernel versions newer than 2.6.31. Did this happen on older
versions as well?

~~~
ozgune
My understanding is that it didn't.

Prior to 2.6.31, the kernel simply used two lists. New pages were brought into
the recency list, and if they were referenced for a second time, these pages
were promoted to the frequency list.

Similarly, if the recency list needed more memory, pages in the frequency list
were demoted to it (giving these pages a second chance), and eventually
removed from the cache.

In 2.6.31, a change was checked in to better handle cases such as nightly
backups evicting interactive user program data. With this change, the
frequency list was given a minimum size: once it dropped to 50% of the page
cache, pages were no longer demoted from it.
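
Roughly, the demotion decision looks like this (a simplified model of the
behavior described above, not the kernel's actual data structures or code):

    #include <stdbool.h>
    #include <stddef.h>

    struct cache {
        size_t recency_pages;    /* "inactive" list size */
        size_t frequency_pages;  /* "active" list size   */
    };

    /* May we demote a page from the frequency list to refill the recency
     * list? */
    static bool may_demote(const struct cache *c, bool post_2_6_31)
    {
        /* Pre-2.6.31: demotion is always allowed; pages get a second
         * chance on the recency list and are eventually evicted. */
        if (!post_2_6_31)
            return true;

        /* Post-2.6.31: protect the frequency list once it has shrunk to
         * half the page cache or less. */
        return c->frequency_pages > c->recency_pages;
    }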

------
wfunction
I think an assessment of how Windows would perform here is appropriate.

~~~
ozgune
Here's my high-level understanding of how Windows handles this.

Windows uses a set of LRU lists to manage its page cache, where each LRU list
corresponds to a priority. In its simplest form then, your pages are evicted
according to an LRU eviction policy.

Now, the user can also set priorities on their processes. Let's say they set
priority #2. The pages referenced by this process are then inserted into the
corresponding LRU cache #2. When a page needs to be evicted from the cache,
the LRU list with the lowest priority is consulted first (LRU cache #0), and
entries are evicted from it.
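
In sketch form, eviction under that scheme would look something like this (a
model of the explanation above, not actual Windows internals):

    #include <stddef.h>

    #define NUM_PRIORITIES 8     /* illustrative; the real count may differ */

    struct page_node {
        struct page_node *next;  /* toward the most recently used end */
        /* ... page data ... */
    };

    struct lru_list {
        struct page_node *lru;   /* least recently used page in this bucket */
        size_t count;
    };

    /* One LRU list per priority level. */
    static struct lru_list buckets[NUM_PRIORITIES];

    /* Pick a victim: scan from the lowest priority upward and take the
     * least recently used page from the first non-empty list. */
    static struct page_node *pick_victim(void)
    {
        for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
            if (buckets[prio].count > 0)
                return buckets[prio].lru;
        }
        return NULL;  /* nothing cached */
    }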

------
damm
So let's get this right. We want to use wc -l to count how many lines there
are in a file? And we're using this to benchmark the Linux VM?

The only thing I will say is: stop beating up your hard drives asking them to
read the whole file and count how many lines there are in it. Use what the
filesystems provide already.

stat -c%s filename

Benchmarking with wc -l is filled with problems, and this article
unfortunately has more flaws than this, but I'll stop now.

~~~
axblount
The article wasn't about how to get the size of a file. It was about how Linux
caches large files. The article just used wc -l as a simple way of loading the
entire file into memory.

~~~
owenmarshall
Ignoring that, stat -c%s gives the size of the file in bytes, not a count of
newlines.

It's always funny to see how the wrong are often so cocksure.

~~~
phaemon
Actually, I'm a little baffled how someone who knows about stat has such an
odd idea about how files work.

How could you possibly know how many newline characters are in a file without
looking in the file?

~~~
damm
Unless you disable it, ext filesystems normally store this, and it can be
pulled out of 'stat' easily. Sure, you can read the whole file and count every
line, or you can trust what the filesystem's metadata says it is.

