

Virtual memory overhead for trees - zippie
https://github.com/johnj/llds

======
epistasis
I was playing around with some stuff that required a 48GB hash table and, to
the very best of my ability to understand this stuff, the run time was
completely dominated by TLB misses. I say this because, based on my
throughput, every lookup was taking the time of about 3 memory accesses on
average; i.e.
there were page table lookups for every single memory access I made. I don't
know the tools that would let me actually monitor the true number of TLB
misses.
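
(For anyone wanting to check this on their own workload: on reasonably recent
kernels, `perf stat -e dTLB-load-misses,dTLB-loads ./app` reports the counters
from outside the process. Below is a minimal sketch of reading the dTLB
load-miss counter in-process via perf_event_open, assuming a kernel with perf
events and a CPU that exposes the dTLB cache events:)

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* thin wrapper: glibc does not export a perf_event_open() symbol */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_DTLB |                /* dTLB...   */
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |      /* ...loads  */
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);  /* ...missed */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the hash table lookups here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("dTLB load misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }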

Had I pursued it further, it seems that using a hugepages interface could
alleviate this, but hugepages are a royal pain in the ass to get going as they
require kernel parameters, rebooting, special memory allocation routines, and
praying that your memory doesn't get fragmented. Of course I was doing
this in C, and if my application had been in any other language it may have
been extremely difficult to get this to work.
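
(For reference, the special-allocation part is really just a flag to mmap once
the pages exist; the kernel-parameter and fragmentation pain is all on the
reservation side. A rough sketch, assuming 2MB hugepages have already been
reserved through /proc/sys/vm/nr_hugepages:)

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #define TABLE_SIZE (1UL << 30)   /* 1GB, backed by 2MB hugepages */

    int main(void)
    {
        /* needs hugepages reserved up front, e.g.
         *   echo 512 > /proc/sys/vm/nr_hugepages
         * and fails outright if the pool is empty or too fragmented */
        void *table = mmap(NULL, TABLE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (table == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }

        /* ... build the hash table inside `table` ... */

        munmap(table, TABLE_SIZE);
        return 0;
    }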

My use case may have been unusual, but as we store more and more data in RAM
it's going to become less unusual. When we care deeply about latency, it seems
that virtual memory page size is going to be a big problem, and already there
seem to be few use cases where 4KB pages are large enough.

~~~
neopallium
You could try to turn on transparent hugepages:

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo always >/sys/kernel/mm/transparent_hugepage/enabled

For more details see: <http://lwn.net/Articles/423584/>
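
(If you'd rather not flip the global knob, setting it to "madvise" and opting
in per-mapping works too; a small sketch of the opt-in side, assuming a
THP-capable kernel:)

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* allocate an anonymous region and hint that it should be backed by
     * transparent hugepages; honored when "enabled" is always or madvise,
     * and it is only a hint, not a guarantee */
    static void *thp_alloc(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            madvise(p, len, MADV_HUGEPAGE);
        return p;
    }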

~~~
Tuna-Fish
The transparent huge page code still only gets you 2M/4M pages, which will
still miss in the TLB most of the time when you handle multi-gigabyte
in-memory structures. To avoid the misses, you really need to use the 1G
pages. And no one is crazy enough to build support for making _them_
transparent. :)

~~~
Andys
We are not really asking for them to be transparent - TFA is talking about
low-level kernel fiddling to achieve what a couple of lines requesting a
custom hugepage size would get you?

~~~
Tuna-Fish
I was specifically replying to neopallium, who suggested turning on
transparent hugepages, which, however, would not really do any good at all in
this situation. Using (non-transparent) 1G large pages is the correct approach
here.
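
(For completeness, explicit 1G pages can be mapped from userspace with
MAP_HUGETLB; a rough sketch, assuming the kernel was booted with something
like hugepagesz=1G hugepages=N and is new enough to know MAP_HUGE_1GB -
otherwise a hugetlbfs mount with pagesize=1G gets you the same thing:)

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << 26)   /* log2(1GB) << MAP_HUGE_SHIFT */
    #endif

    #define SIZE (4UL << 30)          /* four 1GB pages */

    int main(void)
    {
        /* requires pages reserved at boot, e.g. hugepagesz=1G hugepages=4
         * on the kernel command line */
        void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) { perror("mmap 1G pages"); return 1; }

        /* ... */

        munmap(p, SIZE);
        return 0;
    }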

------
derefr
I was pondering, a while ago, an operating system that--as well as exposing a
raw "allocate me a block of memory" function--exposed a managed, typed key-
value representation of virtual memory (picture, say, a Redis kernel module),
from which one could allocate hashes, trees, linked-lists, and so forth. Given
a NUMA architecture, this K-V store could then just be _clustered_ between the
memory pools in the same system in exactly the same way (save optimizations)
that one would cluster it across remote systems.
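
(Purely to make the idea concrete - a hypothetical interface sketch; none of
these calls exist anywhere, the names are invented for illustration:)

    #include <stddef.h>      /* size_t */
    #include <sys/types.h>   /* ssize_t */

    /* Hypothetical, illustration only: a typed, kernel-managed K-V facility
     * alongside plain allocation, with placement steered per NUMA node. */
    typedef struct km_hash km_hash_t;   /* opaque, kernel-owned */

    km_hash_t *km_hash_create(size_t size_hint, int numa_node);
    int        km_hash_set(km_hash_t *h, const void *key, size_t klen,
                           const void *val, size_t vlen);
    ssize_t    km_hash_get(km_hash_t *h, const void *key, size_t klen,
                           void *buf, size_t buflen);
    void       km_hash_destroy(km_hash_t *h);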

------
zippie
Just some background - the solution/benchmark spawned from an index lookup
latency issue. In our search engine, we generate enormous b-tree indexes and
store them in memory (rsync from master then mmap). After adding more logic,
intersects, and unions, the search engine started to miss its SLA.

Eventually, we traced the problem back to the additional latency in the
vmalloc code path. The get_free_page* API code path had much lower latency and
llds was born (llds uses k*alloc, which is a wrapper around the GFP page
allocator).
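
(For anyone who hasn't stared at that part of the kernel, roughly the two
paths being compared - not the actual llds code, just a sketch of the
difference:)

    #include <linux/module.h>
    #include <linux/vmalloc.h>
    #include <linux/slab.h>

    static int __init alloc_demo_init(void)
    {
        void *v, *k;

        /* vmalloc(): virtually contiguous, physically scattered pages
         * stitched together with fresh page-table entries -> extra mapping
         * work and more TLB pressure when the structure is walked */
        v = vmalloc(64 * 1024);

        /* kmalloc(): physically contiguous memory from the slab allocator,
         * which sits on top of the page allocator (the GFP / get_free_pages
         * path) -> served out of the kernel's existing linear mapping */
        k = kmalloc(64 * 1024, GFP_KERNEL);

        if (v)
            vfree(v);
        if (k)
            kfree(k);
        return 0;
    }

    static void __exit alloc_demo_exit(void) { }

    module_init(alloc_demo_init);
    module_exit(alloc_demo_exit);
    MODULE_LICENSE("GPL");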

llds is also being used in low-energy compute environments (like SeaMicro
machines) where every CPU cycle is expensive due to increased hardware
latency.

------
justincormack
I remember when there was a webserver in the Linux kernel. However, it was
considered a bug that you could not get equal performance from userspace, and
eventually it was removed.

It should also be possible to close that gap for this type of case. Making a
kernel module is the easy solution and gives a benchmark, though.

------
gwern
Reminds me of exokernels. Being able to freely roll or adapt your own virtual
memory management system tuned to your application was one of the signature
uses.

------
chimmy
i am not able to fully understand what it is shooting for. The README says it
avoids the VM layer (which seems impossible in a pure software solution). The
code suggests it's merely doing a kmem_cache_zalloc. am i missing something?
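
(for reference, the pattern in question is just slab allocation of fixed-size
nodes - a simplified sketch, not the actual llds code:)

    #include <linux/module.h>
    #include <linux/slab.h>

    struct node {
        unsigned long key;
        void         *value;
        struct node  *next;
    };

    static struct kmem_cache *node_cache;

    static int __init node_demo_init(void)
    {
        struct node *n;

        /* one slab cache for all fixed-size nodes */
        node_cache = kmem_cache_create("node_cache", sizeof(struct node),
                                       0, 0, NULL);
        if (!node_cache)
            return -ENOMEM;

        /* a zeroed node straight from the slab; no vmalloc mapping involved */
        n = kmem_cache_zalloc(node_cache, GFP_KERNEL);
        if (n)
            kmem_cache_free(node_cache, n);
        return 0;
    }

    static void __exit node_demo_exit(void)
    {
        kmem_cache_destroy(node_cache);
    }

    module_init(node_demo_init);
    module_exit(node_demo_exit);
    MODULE_LICENSE("GPL");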

it's true that VM is an overhead now; with infinite/very large memory, the
concept of virtual memory is outdated. TLB misses are too high and huge pages
just don't cut it. this has been repeated over and over, but we need to
redesign the VM/hardware to support TLB-less access for a portion of memory
the size of your primary application's working set.

