
Disable transparent hugepages - wheresvic3
https://blog.nelhage.com/post/transparent-hugepages/
======
markjdb
Please be aware that the article describes a problem with a specific
implementation of THP. Other operating systems implement it differently and
don't suffer from the same caveats (though any implementation will of course
have its own disadvantages, since THP support requires making various
tradeoffs and policy decisions). FreeBSD's implementation (based on [1]) is
more conservative and works by opportunistically reserving physically
contiguous ranges of memory in a way that allows THP promotion if the
application (or kernel) actually makes use of all the pages backed by the
large mapping. It's tied in to the page allocator in a way that avoids the
"leaks" described in the article, and doesn't make use of expensive scans.
Moreover, the reservation system enables other optimizations in the memory
management subsystem.

[1]
[https://www.cs.rice.edu/~druschel/publications/superpages.pd...](https://www.cs.rice.edu/~druschel/publications/superpages.pdf)

~~~
loeg
It's worth pointing out that the FreeBSD implementation (on AMD64) only
promotes 4kB pages to 2MB pages and doesn't transparently promote to 1GB
pages.

Given alc@ was an author on the paper (and the paper's FreeBSD 4.x
implementation supported multiple superpage sizes), I'm not really sure why
FreeBSD's pmap doesn't have support for 1GB page promotions.

~~~
kev009
F5/LineRate did it but it got NACKed in a fairly underhanded and unfortunate
way on the mailing lists :/
[https://github.com/Seb-LineRate/freebsd/commits/seb/stable-1...](https://github.com/Seb-LineRate/freebsd/commits/seb/stable-10/1-gig-pages)

~~~
markjdb
That patch set does not implement transparent creation of 1GB mappings. It
also contains dubious things like this, which make me think the branch was a
WIP: [https://github.com/Seb-LineRate/freebsd/commit/66a8d3474d410...](https://github.com/Seb-LineRate/freebsd/commit/66a8d3474d41030d4da5bfa2042aa573ff1b281f)

The only mailing list thread I see regarding this is here, and it doesn't seem
particularly underhanded to me: [https://lists.freebsd.org/pipermail/freebsd-hackers/2014-Nov...](https://lists.freebsd.org/pipermail/freebsd-hackers/2014-November/046541.html)

~~~
kev009
All the technical critique seems fair, but they (both as an individual and as a
company) were a first-time contributor, and no outreach was really done to pull
them in further. I guess LineRate imploded within F5, so there could have been
structural problems inside that prevented them from doing a fully baked
contribution anyway.

------
lorenzhs
I've had a really bad run-in with transparent hugepage defragmentation. In a
workload consisting of many small-ish reductions, my programme spent over 80%
of its total running time in _pageblock_pfn_to_page_ (this was on a 4.4
kernel,
[https://github.com/torvalds/linux/blob/v4.4/mm/compaction.c#...](https://github.com/torvalds/linux/blob/v4.4/mm/compaction.c#L74-L115))
and 97% of its total time in hugepage compaction kernel code overall.
Disabling hugepage defrag with _echo never >
/sys/kernel/mm/transparent_hugepage/defrag_ led to an instant 30x performance
improvement.

There's been some work to improve performance (e.g.
[https://github.com/torvalds/linux/commit/7cf91a98e607c2f935d...](https://github.com/torvalds/linux/commit/7cf91a98e607c2f935dbcc177d70011e95b8faff)
in 4.6), but I haven't checked whether it fixes my workload.

~~~
mlrtime
Did you try allocating hugepages statically at startup? This will also remove
the fragmentation.
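
For reference, a minimal sketch of what a static reservation can look like on
Linux, using mmap with MAP_HUGETLB against pages reserved up front (e.g. via
vm.nr_hugepages); the size here is made up for illustration:

    /* Sketch: back a buffer with pre-reserved huge pages instead of THP.
     * Assumes the admin reserved pages first, e.g. sysctl vm.nr_hugepages=512. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 512UL * 2 * 1024 * 1024;      /* 512 x 2MB huge pages */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {                   /* fails if nothing was reserved */
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        /* ... hand buf to the application's own allocator ... */
        munmap(buf, len);
        return 0;
    }

The trade-off is that the reservation is fixed up front and can't be used for
ordinary 4kB allocations.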

~~~
lorenzhs
The algorithm was implemented in a big data framework that handles the
allocations, so I would have needed to significantly adapt its memory
subsystem to change this. I've talked to the authors, though, and it's not
easy to change. Easier to disable transparent hugepage defrag, especially when
there's a paper deadline to meet :)

------
xchaotic
So glad this is on the front page of HN. A good 30% of perf problems for our
clients are low-level misconfigurations such as this. For databases: huge
pages - good, THP - bad.

------
reza_n
Not to mention that there was a race condition in the implementation which
would cause random memory corruption under high memory load. Varnish Cache
would consistently hit this. Recently fixed:

[https://access.redhat.com/documentation/en-us/red_hat_enterp...](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.2_release_notes/index#kernel)

------
mnw21cam
Agreed. Found this to be a problem and fixed it by switching it off three
years ago. Seems to be a bigger problem on larger systems than small systems.
We had a 64-core server with 384GB RAM, and running too many JVMs made
khugepaged go into overdrive and basically cripple the server entirely -
unresponsive, getting 1% of the work done, etc.

------
fps_doug
I stumbled upon this feature when some Windows VMs running 3D accelerated
programs exhibited freezes of multiple seconds every now and then. We quickly
discovered khugepaged would hog the CPU completely during these hangs.
Disabling THP solved the performance issues.

~~~
jlgaddis
KVM?

~~~
fps_doug
VMware 12.5.x

~~~
awalton
<on-the-clock>Do you mind opening a support ticket for this with VMware? You
can't be the only person seeing this, and it'd be great for us to check for
this specifically when dealing with mystery-meat "bad perf in XYZ VM"
bugs.</on-the-clock>

------
mwolff
Bad advice... The following article is much better at actually measuring the
impact:

[https://alexandrnikitin.github.io/blog/transparent-hugepages...](https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/)

The conclusion is especially noteworthy:

> Do not blindly follow any recommendation on the Internet, please! Measure,
> measure and measure again!

~~~
antirez
I do not agree much with this conclusion. If you can't measure very well, the
safe bet is to disable THP, because it can improve performance by some
percentage in _certain_ use cases but can totally destroy other use cases. So
when there is not enough information, the potential gain/loss ratio is
terrible... I would say "blindly disable THP", unless you can really afford
use-case-specific, costly measurement activities and are able to prove to
yourself that in your use case THP is beneficial.

~~~
mfukar
If you can't measure (very well?), how would you know whether the improvement
in a certain use case exists or not?

~~~
userbinator
Indeed. If you can't really measure the difference, then I'd say setting it
either way probably doesn't matter anyway.

~~~
dboreham
More like: if you can't measure the difference, then definitely turn it off,
because if it is on there is a non-zero chance of significant instability
events in your future.

------
lunixbochs
Transparent hugepages cause a massive slowdown on one of my systems. It has
64GB of RAM, but it seems the kernel allocator fragments under my workload
after a couple of days, resulting in very few >2MB regions free (as per
/proc/buddyinfo) even with >30GB of free RAM. This slowed down my KVM boots
dramatically (10s -> minutes), and perf top looked like the allocator was
spending a lot of cycles repeatedly trying and failing to allocate huge pages.

(I don't want to preallocate hugepages because KVM is only a small part of my
workload.)

------
phkahler
Shouldn't huge pages be used automatically if you malloc() large amounts of
memory at once? Wouldn't that cover some of the applications that benefit from
it?

~~~
zaarn
malloc() is higher level than what the kernel does.

At the lower syscall level you move the brk address (the program break), which
marks the end of the process's heap segment. By default this sits just after
the statically initialized memory.

malloc() is just a library that manages this memory for you.

Linux has no idea whether you will actually use the memory you just allocated;
backing usually happens lazily: when you access a memory region for the first
time, it is allocated in physical memory for real.
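
To make that concrete, here is a minimal sketch (illustrative only, not what
any particular malloc implementation does) of a program opting a large region
in to THP explicitly with madvise(MADV_HUGEPAGE), which works even when the
system-wide policy is "madvise" rather than "always":

    /* Sketch: explicitly request THP for a large anonymous mapping. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;                    /* 1 GiB of address space */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* Hint that this range should be backed by huge pages when possible. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");
        ((char *)p)[0] = 1;    /* pages are still only allocated on first touch */
        munmap(p, len);
        return 0;
    }

An allocator could apply such a hint only to sufficiently large requests;
whether it actually helps is, as elsewhere in this thread, workload-dependent.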

~~~
myst
Modern allocators use mmap(2).

~~~
aduitsis
I'd add that if you mmap(2) a memory segment (even a very large one, say 1TB),
nothing happens with regard to page mapping. It is not uncommon to see Java
Tomcat processes allocating north of 80GB, but only a small percentage of that
is actually used.
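
A small sketch of that effect, with a made-up size: the virtual reservation
itself is essentially free, and physical memory only shows up as pages are
first touched:

    /* Sketch: a huge virtual mapping costs (almost) nothing until touched. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 30;    /* 64 GiB of address space (64-bit system) */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(p, 0, 1UL << 20);    /* fault in only the first 1 MiB */
        getchar();                  /* pause: VSZ is huge, RSS is tiny */
        munmap(p, len);
        return 0;
    }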

~~~
loeg
Well, the page tables are initialized. Those aren't totally free, especially
if a large mapping uses 4k pages.

------
brazzledazzle
Brendan Gregg's presentation at re:Invent today reflected this advice. Netflix
saw good and bad perf so switched back to madvise.

------
vectorEQ
good article, though as other posters suggest, just use it if you absolutely
must, and measure/test the results for any issues!

------
hossbeast
What's the recommendation on a desktop for gaming / browsing / compiling with
32GB of RAM?

~~~
takeda
Leave it on (or whatever the default is).

The issue happens on specific workloads (databases, Hadoop, etc.) and (this is
often not mentioned) after the system has been running uninterrupted for quite
a while. The slowdown comes about because those workloads cause memory to
become fragmented, and the kernel then tries (unsuccessfully) to defragment
the memory on each allocation.

Since the workload you mentioned looks like a workstation that won't be
running a database 24/7 over months/years, you are very unlikely to run
into it.

