
Everything I Ever Learned About JVM Performance Tuning At Twitter - koevet
http://www.slideshare.net/aszegedi/everything-i-ever-learned-about-jvm-performance-tuning-twitter
======
sehugg
A couple other things:

-XX:+UseCompressedOops converts many 64-bit references into 32-bit, trading a bit of CPU for a lot of memory

If you use a lot of threads, lower your stack size (-Xss) (no real method
here, just drive some trucks over the bridge until it breaks, and that's your
weight limit :P )

You don't want to swap! A good rule of thumb is to make your max heap (-Xmx)
about 60% of total RAM if you only have 1 JVM running

Adaptive GC tuning algorithms can become unstable in long-running processes. I
like to use -XX:GCTimeRatio for the throughput collector, which effectively
turns off the adaptive stuff.
CMSInitiatingOccupancyFraction/UseCMSInitiatingOccupancyOnly does the same
thing for the concurrent mark-sweep collector. Keep the ice pick handy :P
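Putting the flags above together (heap and stack sizes here are illustrative guesses, not recommendations -- drive your own trucks over the bridge first), an invocation might look like one of these:

```shell
# Throughput collector: fix the GC time target so adaptive sizing stays put.
# -Xmx4800m assumes ~60% of an 8 GB box with a single JVM.
java -Xmx4800m -Xss256k -XX:+UseCompressedOops \
     -XX:+UseParallelGC -XX:GCTimeRatio=19 \
     -jar app.jar

# CMS: only start collecting at a fixed old-gen occupancy, no adaptive guessing.
java -Xmx4800m -Xss256k -XX:+UseCompressedOops \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar app.jar
```

The two collectors are mutually exclusive, which is why the flags are shown as alternatives rather than combined.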

~~~
VladRussian
>You don't want to swap! A good rule of thumb is to make your max heap (-Xmx)
about 60% of total RAM if you only have 1 JVM running

I'd prepend that advice with "use large pages, Luke". Set -Xmx to almost 100%
of the large pages, with the total of large pages set to 60% of total RAM in
your case, or whatever your application needs. For a non-filesystem-IO
application (middleware between a DB and client applications), in my case 14G
on a 16G machine works fine.

~~~
wolf550e
Have you tried transparent hugepages in the newer kernel?

~~~
VladRussian
nope. Enterprise software, not on the bleeding edge.

------
xentronium
Law of leaky abstractions at its best (or worst): you decide to use a vm and
gc to abstract from manual memory management only to deal with gory details of
memory padding and internals of jvm later. Ouch.

~~~
gchpaco
The stock JVM is exceptionally bizarre about this because of the insane
decision to pad all fields to memory boundaries (which means on a 32-bit
machine a class with 10 individual bit fields consumes 40 bytes of memory
instead of something sane, like, say, 2). It also has an enormous amount of
header overhead; a smallint on a competent Lisp or Smalltalk image takes up,
say, one 4-byte word and can use 30 bits of that for data. Java's Integer
class, which is similar in intent, is (in the stock JVM) somewhere between 12
and 32 bytes depending on VM settings. I've never seen another VM quite like
it. In a
previous job we got burned pretty hard by the amount of sheer overhead in this
and I had to spend about six months converting everything over to use raw int
arrays and copy-on-write objects so we could fit a reasonable but not really
huge dataset into memory.
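The conversion described above can be sketched roughly like this: instead of N small objects (each paying header plus field padding), keep one primitive array per field. This is a hypothetical illustration, not the original code; all names are invented.

```java
// Hypothetical sketch of the raw-int-array technique: a "structure of
// arrays" holding N points in two int arrays instead of N Point objects.
final class Points {
    private int[] xs;
    private int[] ys;
    private int size;

    Points(int capacity) {
        xs = new int[capacity];
        ys = new int[capacity];
    }

    void add(int x, int y) {
        if (size == xs.length) {                  // grow both arrays together
            xs = java.util.Arrays.copyOf(xs, size * 2);
            ys = java.util.Arrays.copyOf(ys, size * 2);
        }
        xs[size] = x;
        ys[size] = y;
        size++;
    }

    int x(int i) { return xs[i]; }
    int y(int i) { return ys[i]; }
    int size()   { return size; }
}
```

Two int arrays cost roughly 8 bytes per point plus two array headers, versus an object header, padding, and a reference for every boxed element.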

Despite this, Hotspot is very good and the stock JVM GC is very useful and it
is reasonably straightforward to get visibility into the state of the heap and
what is going on with the GC. I've never worked in a professional capacity
that has anything comparable to VisualVM, although I gather Smalltalk and
commercial Lisps have comparable offerings, and you only have to program in
Ruby for a while to learn to appreciate the GC quality.

~~~
jules
The bits can't be in the same memory word for concurrency reasons. Setting a
bit is not an atomic operation on most processors.
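In Java terms, the point is that if ten flags shared one word, a plain write would clobber concurrent updates to neighboring bits, so a shared word needs a CAS loop. A minimal sketch (class and method names invented for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Packing flags into one shared word means every update must go through
// compare-and-swap; updateAndGet runs a CAS retry loop under the hood.
class PackedFlags {
    private final AtomicInteger bits = new AtomicInteger();

    void set(int flag) {                      // flag in [0, 31]
        int mask = 1 << flag;
        bits.updateAndGet(v -> v | mask);
    }

    boolean get(int flag) {
        return (bits.get() & (1 << flag)) != 0;
    }
}
```

Giving each field its own memory location lets the JVM use plain stores instead, which is the concurrency rationale for the padding.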

------
moonboots
Good presentation - I hate Slideshare though; it randomly skips slides,
requires Flash, needs a Facebook login to download, etc.

~~~
koevet
I have watched the presentation on my iPad, so no, Slideshare does not
require Flash. I do agree on the suckiness of the FB marriage.

------
bfrog
Funnily enough, Erlang does not have these horrible GC issues people are
seeing in .NET and the HotSpot JVM.

~~~
wglb
You have my curiosity piqued. Why is that?

~~~
derleth
Apparently, the secret of Erlang garbage collection is that it's done per-
process, and the whole Erlang model is based on spawning many processes. As a
result, each process only has a few KB of heap to collect, as opposed to the
single big shared heap a JVM has to collect.

<http://prog21.dadgum.com/16.html>

~~~
simcop2387
I imagine that some of it is also the almost entirely functional design of the
language. Some of the ways it makes you write things should make tasks like
that easier for the compiler: it should almost always be known at compile time
when it'll be able to reap things.

~~~
derleth
Right. For one thing, nothing in Erlang is mutable: To change something, you
copy it and change the copy. This sounds like it should make generational
garbage collection easier (to a first approximation, nothing in the older
generation is being used) but I don't know.

------
ananthrk
Does anyone have a PDF copy of this? The slideshare download is broken for
this presentation.

~~~
js2
dropbox: [http://dl.dropbox.com/u/2138120/twitter-
jvmperformancetuning...](http://dl.dropbox.com/u/2138120/twitter-
jvmperformancetuning.pdf)

google docs: <http://goo.gl/g7y02>

------
cshesse
My understanding is that some people just restart the JVM periodically to
avoid GC pauses. How does that compare for web applications?

~~~
wolf550e
While the JVM is down because you have restarted it, presumably your load
balancer is routing requests to another JVM/machine. Then the new JVM starts
up and warms up until it's ready to receive requests.

A restart transforms a JVM with a fragmented old gen into a JVM with a fresh
old gen, using X1 wall-clock seconds and Y1 CPU time/watts.

A full GC transforms a JVM with a fragmented old gen into a JVM with a fresh
old gen, using X2 wall-clock seconds and Y2 CPU time/watts.

Have you ever compared X1,Y1 vs. X2,Y2?
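One hedged way to put numbers on X2/Y2: the JVM's GC MXBeans expose cumulative collection counts and wall-clock time, so you can bracket a forced collection. This is a sketch, and note that System.gc() is only a request (and a no-op under -XX:+DisableExplicitGC):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcCost {
    // Sum cumulative GC wall-clock millis across all collectors
    // (getCollectionTime returns -1 if a collector doesn't report it).
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        System.gc();                       // request a full collection
        long spent = totalGcMillis() - before;
        System.out.println("full GC cost ~" + spent + " ms");
    }
}
```

Measuring X1 is just wall-clock time from kill to "warmed up and back in the load balancer", so the two are directly comparable.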

~~~
jbri
Does the JVM finish up outstanding requests and ask the load balancer to
please-don't-send-me-anything-for-a-bit before it does a full GC?

~~~
wolf550e
If you trigger a full GC "from outside", the way you would trigger a restart,
you'd perform whatever you do for a restart (stop accepting requests, finish
accepted requests, be unavailable, start accepting requests).

My point was that a full GC does not need to re-profile the code, re-JIT the
code and warm data caches, like a newly restarted JVM does. Sending a user
request to a JVM to be run in an interpreter with cold caches is not good for
user satisfaction, so a newly restarted JVM needs to be given mock requests to
warm it up (like a script for PGO static compilation). A JVM doing a full GC
does not need all that to become fully ready, and if it has enough free RAM to
defragment quickly, the process should be much more efficient.

------
jlarocco
"A research project had to load the full follower graph in memory"

Really? It _had_ to? There was no possible alternative?

As often as I get the "fail whale" page on Twitter, I'm always skeptical
seeing them present stuff like this and release code.

Maybe I'm just not grasping how large and popular Twitter is, but of the
popular web services I use, Twitter fails more than all the others combined.

Though it was an interesting read, it seems suspect that they're having so
many more problems than everybody else.

~~~
jQueryIsAwesome
> Though it was an interesting read, it seems suspect that they're having so
> many more problems than everybody else.

So true; I don't even know what a "Facebook overload" looks like because I
have never had one.

------
rubashov
If you're fighting with the JVM this much, why not just use C++? The problems
were garbage collection, lack of control over memory layout, and bloated
types. Tool for the job, no?

~~~
strlen
> If you're fighting with the JVM this much, why not just use C++? The
> problems were garbage collection, lack of control over memory layout, and
> bloated types. Tool for the job, no?

Here's why. Compare these two:

[http://hg.openjdk.java.net/jdk7/hotspot/jdk/file/9b8c96f96a0...](http://hg.openjdk.java.net/jdk7/hotspot/jdk/file/9b8c96f96a0f/src/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java)
<-- Doug Lea's java.util.concurrent.ConcurrentLinkedQueue

[https://github.com/afeinberg/lockfree/blob/master/src/lockfr...](https://github.com/afeinberg/lockfree/blob/master/src/lockfree_queue.cc)
<-- my port of the above to C++0x

[https://github.com/afeinberg/lockfree/blob/master/src/hazard...](https://github.com/afeinberg/lockfree/blob/master/src/hazard_ptr_rec.cc)
<-- essentially an implementation of garbage collection that's needed to
work around the ABA problem

tl;dr Shared memory concurrency is surprisingly hard with manual memory
management. Not impossible, not infeasible, not impractical. Just _hard_.

C++ is still a valid choice for many products: the JVM isolates you from the
underlying OS and its VM subsystem, and there are cases where the cost of
garbage collection is prohibitive.

However, I'll argue that the vast majority of a site like Twitter (or
LinkedIn, another high-scale JVM-powered property) is best served by a runtime
like the JVM or CLR. Erlang is another great option, but it's more of
something you'll have _along with_ the JVM/CLR and C/C++: Erlang's model is
(highly efficient, well abstracted) concurrency with message passing -- which
is awesome, but not a full substitute for shared-memory concurrency -- i.e.,
it's a great tool for some jobs, but not others.

C++ makes more sense for things like a B+Tree implementation: I've been using
a pure-Java B+Tree implementation -- BerkeleyDB Java Edition -- and can
certainly mention the negatives of that approach.

On the other hand, look at something like the routing layer of Voldemort, the
multi-Paxos implementation in ZooKeeper, or (as examples of other high-scale,
Java-based services) the modified Paxos implementation in Google's MegaStore
and the non-storage parts of Amazon's SimpleDB (written in Java at one point,
in Erlang at another -- not sure what it's written in now).

I'll also argue that I'd rather use a less verbose and more functional
language than Java -- and honestly, in some cases C++ far outdoes Java in
terms of expressiveness. Go, Scala, languages in the ML family, and Erlang
(especially with tools like Dialyzer and QuickCheck, to get around the dynamic
typing) are the right way forward for building highly concurrent user-land
"systems-y" software (think more databases or distributed middleware than an
OS) -- the parts where memory layout is critical and which tend to produce a
lot of garbage can always be implemented in C/C++.

tl;dr Programming language choice involves trade-offs. Java is too verbose,
but garbage-collected languages/runtimes have their role in building highly
scalable applications and services.

~~~
saurik
To be fair, the ConcurrentLinkedQueue class you linked to was apparently
already so complex to implement that the developers had to go off the edge of
type safety into the land of sun.misc.Unsafe; given how much research you
already need to understand to feel confident that this algorithm works at all,
going the extra distance to understand hazard pointers (which I, at least,
find refreshingly easy to understand in juxtaposition), seems minuscule. ;P

That said, since I've now taken the time to read your code, I have a
suggestion that would drastically clean up the parts that are making the
manual memory allocation feel "overtly manual": rather than allocating memory
and then using the default placement new to construct your object, you could
define a placement new/delete that operates over the allocator.

This not only would remove all of the reinterpret_cast<>s and sizeof()s from
the queue code, but it would also allow a future modification where the queue
was able to store values other than void *, while retaining exception safety
(as when you template the Node by way of the Queue itself, its more complex
constructor would suddenly be allowed to throw, and the memory would need to
be collected).

~~~
strlen
Well, you could also perform CAS using AtomicReference -- the examples in
Maurice Herlihy's The Art of Multiprocessor Programming [1] and Brian Goetz'
Java Concurrency In Practice [2] do that. So you don't _really_ need to use
sun.misc.Unsafe in your own code (of course, you need CAS to implement
AtomicReference in the first place).
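The AtomicReference approach from Herlihy and Goetz looks roughly like this Treiber-style stack (a sketch, not code from either book): compareAndSet replaces the sun.misc.Unsafe CAS, and the GC quietly sidesteps the ABA problem that the C++ port needs hazard pointers for.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal lock-free (Treiber) stack: all synchronization is a CAS on the
// top-of-stack reference. Because nodes are garbage collected, a node
// can't be recycled while another thread still holds it -- no ABA hazard.
class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T value) {
        Node<T> node = new Node<>(value);
        do {
            node.next = top.get();
        } while (!top.compareAndSet(node.next, node));  // retry on contention
    }

    T pop() {
        Node<T> head;
        do {
            head = top.get();
            if (head == null) return null;              // empty stack
        } while (!top.compareAndSet(head, head.next));
        return head.value;
    }
}
```

In C++ the same pop() is unsafe as written -- head could be freed and a new node allocated at the same address between the read and the CAS -- which is exactly what the hazard-pointer machinery linked above exists to prevent.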

You're also completely correct about placement new: I am working on a cleaned-
up version of this class; this was essentially a first pass to get myself more
familiar with concurrency in C++0x. What complicates things a bit is that
allocators are (much like all else in the STL) meant to be used as class
template arguments, which makes separate compilation impossible -- hence the
need for an adapter from an STL-style allocator to a pure virtual class.
Separate compilation is also why I made a void* version of this initially.

I have a _much cleaned up_ version in the works that will handle more than
void*. There's an implementation I call ConcurrentLinkedQueueImpl that handles
just void* and is compiled separately -- and there is a generic version,
ConcurrentLinkedQueue, that is specialized for void* (it ends up just proxying
the calls to ConcurrentLinkedQueueImpl), with the generic version (in turn)
using ConcurrentLinkedQueue<void *> and placement new to hold any type.

Once again, the version posted there was a rough pass to get myself familiar
with 0x concurrency constructs and hazard pointers -- the code is fairly
messy.

[1] [http://www.amazon.com/Art-Multiprocessor-Programming-
Maurice...](http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-
Herlihy/dp/0123705916) [2] Everyone should read this book cover to cover --
<http://jcip.net/>

