The answer is pretty simple: It depends. What's your workload? There's no general answer.
So, Linux picked a heuristic. They could probably have picked a better one, but short of some all-knowing machine-learning algorithm that can predict future workloads, any heuristic they pick will be suboptimal in some regard.
Which is, ironically, an argument for microkernels. In some microkernel designs you could, in principle, allow for applications to provide their own disk and memory managers.
To give an extreme example, imagine that an application sets up a custom memory manager that takes all available RAM, and makes it exclusive to that application. In reality, it would probably be less insane, but you would still get regressions with multiple memory managers. And if you decide you have to have one true memory manager... you're back to the situation that exists today, only potentially slightly easier to configure away from bad behavior.
Suppose you built a memory manager tuned exclusively for PostgreSQL, on the assumption that it has the machine to itself. That memory manager would have a very different performance profile from one designed behind the kind of "veil of ignorance" that a general-purpose memory manager has to operate under.
Certainly it looks like there are some ideas that Johannes Weiner is considering for the future, and is actively working on with the set of patches he has merged:
Right now we have a fixed ratio (50:50) between inactive and active
list but we already have complaints about working sets exceeding half
of memory being pushed out of the cache by simple streaming in the
background. Ultimately, we want to adjust this ratio and allow for a
much smaller inactive list. These patches are an essential step in
this direction because they decouple the VMs ability to detect working
set changes from the inactive list size. This would allow us to base
the inactive list size on the combined readahead window size for
example and potentially protect a much bigger working set.
Another possibility of having thrashing information would be to
revisit the idea of local reclaim in the form of zero-config memory
control groups. Instead of having allocating tasks go straight to
global reclaim, they could try to reclaim the pages in the memcg they
are part of first, as long as the group is not thrashing. This would
allow a user to drop e.g. a back-up job in an otherwise unconfigured
memcg and it would only inflate (and possibly do global reclaim) until
it has enough memory to do proper readahead. But once it reaches that
point and stops thrashing it would just recycle its own used-once
pages without kicking out the cache of any other tasks in the system
more than necessary. 
Keep in mind that never, ever letting really bad things happen is the first priority for an algorithm of that sort; only then can it think about being efficient in a really clever way.
I'd imagine that from a statistical-analysis perspective all memory managers look like "ad-hoc hacks", but I believe they are really smart hacks, tested against multiple use cases.
Creating a statistical analysis/machine-learning/etc program that can deal quickly with heterogeneous data robustly in real-time would be an incredible achievement.
If anyone knows of a situation where serious machine learning has been applied at such a low level, I would love to hear about it (the only vaguely similar thing I recall is the use of neural networks for branch prediction, something that has been studied but, as far as I know, not implemented).
"Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to protect frequently used
pages and the ability to detect frequently used pages. This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching."
Obviously the kernel fix is the right thing to do, but until that's vetted, maybe something like the above can help.
In practice though, we use PostgreSQL, and we don't have any control over how Postgres reads its pages. So for our customers, we'd still have this problem.
Postgres could do this itself, though: if it detected a broken kernel version and the right workload, many users might benefit automatically.
dd if=clickstream.csv.1 iflag=nocache bs=1M | wc -l
dd if=clickstream.csv.1 iflag=direct bs=1M | wc -l
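The `iflag=nocache` variant above boils down to advising the kernel with POSIX_FADV_DONTNEED so a big streaming scan doesn't evict the working set. A minimal Python sketch of the same idea, assuming Linux (the function name and chunk size are mine, not from the article):

```python
import os

def count_lines_without_caching(path, chunk_size=1 << 20):
    """Stream a file and count newlines, then tell the kernel to drop
    the pages we touched so the scan does not push out hotter data.
    Rough equivalent of `dd iflag=nocache ... | wc -l` (Linux-specific)."""
    lines = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
        # Advise the kernel to discard the cached pages for the whole file
        # (offset 0, length 0 means "to end of file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return lines
```

Note this is advisory: dirty pages are written back first, and the kernel is free to ignore the hint, which is exactly why it is safer than O_DIRECT for this kind of one-off scan.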
Another related problem, too much caching when writing to a slow device, can be seen in this thread:
That thread actually describes two problems.
1. That Linux waits too long before writing
2. That when it does write large amounts to a slow device, it locks out everything else
Also, there seems to be confusion as to whether MADV_DONTNEED can be 'destructive'. I think this is a difference between anonymous and file-backed mmap(). Do you know what the actual case is?
If the app tells the kernel it is done with the range, it is saying it no longer cares about the data there. So MADV_DONTNEED will not flush dirty pages to the backing file without msync(); if you access that range again, the pages are reloaded from the backing file, or come back zero-filled in the anonymous case.
Do you see anything in the Linux kernel code that says otherwise?
I really doubt that if two processes are mapping the same file, and one calls madvise(MADV_DONTNEED), it'll drop the pages from memory entirely. That seems like a great way to let one process DoS another. If the other process has marked it MADV_WILLNEED, that would be especially bad.
If they've both mapped it MAP_PRIVATE, then the mappings should be entirely separate anyway (though copy-on-write semantics are presumably used), and a madvise() on one shouldn't affect the other.
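The anonymous-mapping behavior described above can be demonstrated directly with Python's `mmap` module (Linux, Python 3.8+ for the `madvise` method). For a private anonymous mapping, the next access after MADV_DONTNEED gets fresh zero-filled pages:

```python
import mmap

# Private anonymous mapping: after MADV_DONTNEED the kernel may free the
# pages, and the next access is satisfied with zero-fill-on-demand pages.
buf = mmap.mmap(-1, mmap.PAGESIZE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf[:5] = b"hello"
buf.madvise(mmap.MADV_DONTNEED)
print(buf[:5])  # b'\x00\x00\x00\x00\x00' on Linux -- the data is gone
```

For shared file-backed (or shared anonymous) mappings the story differs: subsequent accesses repopulate from the up-to-date backing object, which is consistent with the claim above that one process's MADV_DONTNEED doesn't destroy another's view of a shared file.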
Also, ARC may need to be tweaked so that its cost doesn't become prohibitive. That's one reason why the current memory manager isn't using LRU, but an approximation of it: http://en.wikipedia.org/wiki/Page_replacement_algorithm#Leas...
Prior to 2.6.31, the kernel simply used two lists. New pages were brought into the recency list, and if they were referenced for a second time, these pages were promoted to the frequency list.
Similarly, if the recency list needed more memory, pages in the frequency list were demoted to it (giving these pages a second chance), and eventually removed from the cache.
In 2.6.31, a change was checked in to better handle cases such as nightly backups evicting interactive user program data. With this change, the frequency list's minimum size was fixed, and if the size dropped down to 50%, pages were no longer demoted from it.
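The pre-change two-list scheme described above can be sketched as a toy model (this is an illustration, not kernel code; the class and method names are invented, and demotion is simplified to happen only when the inactive list is empty):

```python
from collections import OrderedDict

class TwoListCache:
    """Toy model of the two-list page cache: a first touch lands on the
    inactive (recency) list, a second touch promotes the page to the
    active (frequency) list, and eviction drains the inactive tail."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()  # most recently added at the end
        self.active = OrderedDict()

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)      # refresh recency
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True           # second touch: promote
        else:
            self.inactive[page] = True         # first touch
            self._evict_if_needed()

    def _evict_if_needed(self):
        while len(self.inactive) + len(self.active) > self.capacity:
            if self.inactive:
                self.inactive.popitem(last=False)  # drop the coldest page
            else:
                # Demote the coldest active page back to the inactive
                # list, giving it a "second chance" before removal.
                page, _ = self.active.popitem(last=False)
                self.inactive[page] = True
```

The failure mode the quoted patch text describes falls out of this model: a used-once stream keeps filling the inactive list, and pages that were referenced only once, however recently, get pushed out.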
Windows uses a set of LRU lists to manage its page cache, where each LRU list corresponds to a priority. In its simplest form then, your pages are evicted according to an LRU eviction policy.
Now, the user can also set priorities on their processes. Let's say they set priority #2. The pages referenced by this process are then inserted into the corresponding LRU cache #2. When a page needs to be evicted from the cache, the LRU list with the lowest priority is consulted first (LRU cache #0), and entries are evicted from it.
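That prioritized scheme can likewise be sketched as a toy model (again invented names, not Windows internals; Windows actually uses eight priority levels, which is assumed here):

```python
from collections import OrderedDict

class PriorityLRUCache:
    """Toy model of a multi-priority page cache: one LRU list per
    priority level; eviction consults the lowest-priority list first."""

    def __init__(self, capacity, levels=8):
        self.capacity = capacity
        self.lists = [OrderedDict() for _ in range(levels)]

    def touch(self, page, priority):
        # Remove any existing entry so a re-touch refreshes recency.
        for lru in self.lists:
            lru.pop(page, None)
        self.lists[priority][page] = True  # most recent at the end
        self._evict_if_needed()

    def _evict_if_needed(self):
        total = sum(len(lru) for lru in self.lists)
        while total > self.capacity:
            for lru in self.lists:            # lowest priority first
                if lru:
                    lru.popitem(last=False)   # evict that list's LRU page
                    total -= 1
                    break
```

The design choice is that priority dominates recency: a hot page in a low-priority list is still evicted before a cold page in a high-priority list, which is exactly what lets a user fence off a backup job from the interactive working set.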
The only thing I will say is: stop beating up your hard drives by asking them to read the whole file just to count how many lines are in it. Use what the filesystem already provides.
stat -c%s filename
Benchmarking with wc -l is filled with problems, and unfortunately this article has more flaws than that one, but I'll stop now.
It's always funny to see how the wrong are often so cocksure.
How could you possibly know how many newline characters are in a file without looking in the file?