

How I sped up my server by 6x (with 1 linux cmd) - zinxq
http://mailinator.blogspot.com/2010/02/how-i-sped-up-my-server-by-factor-of-6.html

======
gizmo
Article is better than expected. Because it looks like the application itself
isn't doing much at all (receive message over socket, touch some memory),
you're probably better off with a simple thread pool and some lock-free data
structures if you're really going for raw performance. On the other hand, it
does serve as another data point that a Java solution can be fast enough.
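
To make that concrete, something like this (a toy sketch, not the author's
code) is roughly the shape I mean:

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Fixed pool of one worker per core draining a lock-free
    // (Michael-Scott style) queue; no synchronized blocks on the hot path.
    public class PoolSketch {
        public static void main(String[] args) {
            final ConcurrentLinkedQueue<String> inbox = new ConcurrentLinkedQueue<>();
            for (int i = 0; i < 100000; i++) inbox.offer("message " + i);

            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            for (int i = 0; i < cores; i++) {
                pool.execute(() -> {
                    String msg;
                    while ((msg = inbox.poll()) != null) {
                        // "receive message, touch some memory" goes here
                    }
                });
            }
            pool.shutdown();
        }
    }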

Anyway, I think that thinking in terms of "message throughput" is really
harmful. The author starts with a throughput of 120,000 messages per second
and ends up with 800,000 messages per second, which gives the advertised 6x
speedup. But essentially, at 120,000 messages per second you have a
"process overhead" of a mere 0.008 milliseconds. That's 8 microseconds.
Microseconds!
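
Spelled out:

    1 s / 120,000 msgs ≈ 8.3 µs/msg   (the "slow" case)
    1 s / 800,000 msgs ≈ 1.25 µs/msg  (the "fast" case)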

I'll just repeat that: an overhead of 0.008 milliseconds per message in the
SLOW case. So if you want to write the message to a file, do any kind of hash
table lookup, or do any kind of back-end processing, whether you have
0.008 milliseconds or 0.002 milliseconds of overhead per message is not going
to be that big of a deal.

~~~
alextingle
Just because you don't need to process hundreds-of-thousands of messages per
second doesn't mean that everybody is so fortunate.

~~~
gridspy
Usually the 'real work' required to process the message would drown this
overhead in the noise. The moment the message requires you to establish an
outgoing connection, open a file, hit the database, or do something similar,
it is going to completely blow this throughput figure away.

Also, it seems that the overhead here comes from poor use of shared memory or
threading semantics, something we cannot debug without the source (and the
required time).

~~~
viraptor
> _it seems that the overhead here is poor use of shared memory or threading
> semantics_

Worth noting on the graphs - even though CPUs aren't maxed out, they add up to
(roughly) just above 100%. That just shouts "lock contention / thread
switching".

------
larsberg
My advice is to find the bottleneck first. Watch performance counters. How do
your cache hit and miss rates vary at different CPU counts? For example,
poorly designed GCs don't scale well on multiple processors. You could very
well be paying for a single shared allocator lock or have a young generation
shared across multiple packages.

When are your threads idle? Can you produce log events for when your threads
block on locks to see how often your threads are live versus sitting around?
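
If you control the locks, a crude way to get those events is to wrap them; a
hypothetical sketch:

    import java.util.concurrent.locks.ReentrantLock;

    // A lock that logs how long each acquire had to wait, so you can see
    // how much of your threads' time is spent blocked vs. live.
    class TimedLock extends ReentrantLock {
        public void lockTimed(String site) {
            long t0 = System.nanoTime();
            lock(); // blocks here under contention
            long waitedMs = (System.nanoTime() - t0) / 1_000_000;
            if (waitedMs > 1) // only report waits over 1 ms
                System.err.println(site + " blocked for " + waitedMs + " ms");
        }
    }

Or, cheaper: take a handful of jstack dumps and count how many threads show
up as BLOCKED.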

Do you have bad thread migration? Even if your GC is careful and you only
allocate thread-local data, if the OS keeps moving your threads between
packages every timeslice, you'll pay a huge hit.
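
On Linux, pinning is a one-liner with taskset, which is what the article
boils down to (the jar name here is made up):

    # pin a running process (and its threads) to CPUs 0-3:
    taskset -cp 0-3 <pid>

    # or launch already pinned to a single core:
    taskset -c 0 java -jar server.jar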

Analyzing performance data for multi-core program execution is hard,
particularly in a virtual machine environment. It's even worse with some of
the latest 4 and 6 core processors, where even measuring cache miss rates can
have a noticeable impact on performance. (shameless plug: these issues and
parallel language features are what our research group works on --
<http://manticore.cs.uchicago.edu>)

~~~
runT1ME
> _For example, poorly designed GCs don't scale well on multiple processors.
> You could very well be paying for a single shared allocator lock or have a
> young generation shared across multiple packages._

Java's GC is anything but poorly designed, and it definitely scales in spades.

> _When are your threads idle? Can you produce log events for when your
> threads block on locks to see how often your threads are live versus
> sitting around?_

A good profiler should be able to show this. I've had a lot of luck with
YourKit, but others may work just as well.

~~~
larsberg
"Poorly designed" was a harsh choice of words. A better way of putting it
would be "not yet optimized for single-program execution at processor counts
greater than 6 or so." The allocation wall
(<http://portal.acm.org/citation.cfm?id=1639949.1640116>) is something a lot
of folks are seeing in practice across a variety of virtual machine-based
languages (including Java) when scaling to about 8 cores.

~~~
runT1ME
Hrm, interesting. I would have thought that the way allocation is done with
thread-local allocation buffers would negate a lot of the scaling problems
for allocs, but I could be wrong. Do you have a direct link to the PDF, or
would you mind putting it somewhere? I don't have an ACM login.

~~~
larsberg
Sorry, but unfortunately I can't. The authors are permitted to put it online
(where scholar.google and CiteSeerX usually immediately find them) but in this
case they don't appear to have done so.

------
spudlyo
While htop is a cool program with some nice features, the standard Linux top
can show you everything you need to know. You can hit 'H' to show all the
threads in a process. You can add the 'j' column, which shows each task's
last-known CPU assignment. You already get %CPU utilization per process, and
you can cut down what top shows you by hitting 'i' to see only
processes/threads in the run queue. I like to add 'z' for cool color, and '1'
to see per-CPU SMP details. While standard top doesn't show you cool CPU
graphs, it's still pretty great.
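
For a quick non-interactive snapshot, procps top can do roughly the same
thing from the command line:

    top -H -b -n 1 | head -30   # -H per-thread, -b batch mode, -n one iteration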

------
jacquesm
It's a pity there is no source with the article. At a guess, I would say
there is some resource contention going on that drops away when the task is
assigned to a single core (effectively, that stops the task from real
multithreading; the threads still run in separate timeslices, but no longer
concurrently).

Chances are that if you were to really inspect the code, you'd be able to
pinpoint the single resource that's causing the threads to block.

Once a single core is dealing with that process, all the threads will be able
to run to the completion of their slice, and so more messages will be handled
per second.

On another note, I'm impressed with the author's focus on speed, but I'd work
on releasing something rather than optimizing something that already does an
insane number of messages per second. Unless there is a really important
reason for it, this is just premature optimization.

~~~
ntoshev
The article does link source you can use to reproduce the problem:
<http://www.mailinator.com/VariableStorage.java>

------
zinxq
It's a 64-bit VM.

The interesting part (to me) is that, regardless of the cause of this issue,
it is likely that there are other (maybe many other) applications out there
that could "benefit" from running on only a single core.

The server favors synchronized blocks over the use of volatiles (i.e.,
Atomics) at the moment. I tend to play things conservative until I find a
problem or see an obvious opportunity. And I hope I can say I'm not
"over-synchronizing," but I did synchronize as often as I determined it was
needed (modulo the normal issue of humans not being very good at that).
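
For the curious, the two styles in question look like this in miniature (a
sketch, not the actual server code):

    import java.util.concurrent.atomic.AtomicLong;

    class Counters {
        // the conservative style: a synchronized block
        private long a;
        synchronized void incA() { a++; }

        // the volatile/Atomics style: lock-free compare-and-swap
        private final AtomicLong b = new AtomicLong();
        void incB() { b.incrementAndGet(); }
    }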

Again, however, even if the server is written in the silliest manner you can
imagine, the idea still seems to expose an issue (and roundabout solution)
I've never heard of before.

~~~
rbanffy
I've heard a lot about the processor affinity tools under HP-UX. Not sure how
they compare with taskset and tuna, but I've heard high praise for them.

------
kmod
This seems to be a cache performance issue:

- The difference in performance is too big to be a memory bus issue.

- Since the number of threads doesn't change, just the number of processors,
it's not a scalability problem in the program itself.

- It's most likely not an issue of which CPUs he chose; on a hyperthreaded
system there can be a pretty large performance hit if you pin the server to
logical CPUs that are on the same physical core. I don't know how Intel
numbers its cores on hyperthreaded processors, but their general method is
"make the cores that are closest to each other have the biggest difference in
CPU ids", which would mean that, e.g., CPUs 1 and 5 are on the same core.

- I would guess that it's not a thread migration issue, since I wouldn't
expect the Linux scheduler to be that bad.

Poor cache performance is the most likely cause. As a minimal example of what
the author concludes: if you have two threads that sit in a loop incrementing
the same counter, it is much faster to run those two threads on the same CPU
than on their own CPUs. This is because when they run on separate CPUs, that
particular cache line is continually ping-ponged between the two. Even if
threads don't access the same addresses, if they access addresses that fall
in the same cache line, you'll still get ping-ponging ("false sharing"). This
explanation is also supported by the fact that the problem disappears on i7
processors; one of the largest improvements in the Nehalem line is the
improved cache architecture.
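
You can see the effect with a toy benchmark (a sketch; exact numbers will
vary by JIT and hardware):

    // Two threads incrementing longs that share a cache line vs. longs
    // padded onto separate lines (assuming 64-byte cache lines).
    public class PingPong {
        static final long ITERS = 200_000_000L;

        static void run(int otherIndex) throws InterruptedException {
            final long[] a = new long[16];
            Thread t1 = new Thread(() -> { for (long i = 0; i < ITERS; i++) a[0]++; });
            Thread t2 = new Thread(() -> { for (long i = 0; i < ITERS; i++) a[otherIndex]++; });
            long start = System.nanoTime();
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.printf("slot %d: %d ms%n", otherIndex,
                              (System.nanoTime() - start) / 1_000_000);
        }

        public static void main(String[] args) throws InterruptedException {
            run(1); // a[0] and a[1] share a line: ping-pong, slow
            run(8); // a[0] and a[8] are 64 bytes apart: fast
        }
    }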

Poor cache behavior is usually the culprit in non-cpu-bound and non-io-bound
systems.

------
vicaya
He should really be using dstat (or something like that) to show the number
of context switches per second.
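
For example:

    vmstat 1    # watch the "cs" column
    dstat -y 1  # interrupt and context-switch counts per second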

His benchmark is mostly a measurement of Linux kernel thread context
switching, which is a lot slower across cores/CPUs. I suspect that if he
tweaked the Linux kernel HZ to 1000, he'd see another 3-4x improvement.

In most real-world servers, the threads are actually doing something real,
like updating some complex data structure or doing some transformation
(encoding/decoding, etc.), which would make context-switching overhead less
of a problem.

However, the benchmark does illustrate the importance of context-switching
performance for certain workloads.

------
pilif
Does anybody have any idea what's going on here?

If I had to guess, I'd say the code is doing so much synchronization between
threads that, in the end, only one thread really gets to run at a time
anyway, at which point moving the process from CPU to CPU is the only thing
going on besides running that one thread, hence it's way faster if it doesn't
have to do any moving around.

IMHO this is an indication of a bug in the code and not generic advice. Or
does somebody see another issue that might cause this?

~~~
lincolnq
Sort of agree, although my intuition tells me it's not an excessively coarse
locking problem, but most likely a contention problem.

The essence of a contention problem, for those who don't know, is that some
shared structure (maybe a counter, or the head of a queue) is polled and
updated for every message that needs to be processed. Since processors don't
(usually) share their L1 caches, updating a value in shared memory
invalidates that cache line on the other processors, so the next time another
processor needs to poll that value it has to retrieve it from memory, which
takes a lot of extra time compared to getting it off its own L1.

This problem often arises with shared work queues and counters. For work
queues you can solve it by giving each processor its own work queue and having
it steal work from other processors when it finishes its own queue.
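
In Java-land this is exactly what the fork/join framework does (Doug Lea's
jsr166y, slated for java.util.concurrent in JDK 7): each worker owns a deque
and steals from the others when its own runs dry. A sketch, using the JDK 7
package names:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Summing a range recursively; forked subtasks go on the forking
    // worker's own deque, and idle workers steal from the other end.
    public class StealDemo extends RecursiveTask<Long> {
        final long lo, hi;
        StealDemo(long lo, long hi) { this.lo = lo; this.hi = hi; }

        protected Long compute() {
            if (hi - lo < 10_000) {            // small chunk: just do it
                long sum = 0;
                for (long i = lo; i < hi; i++) sum += i;
                return sum;
            }
            long mid = (lo + hi) / 2;          // big chunk: split in two
            StealDemo left = new StealDemo(lo, mid);
            left.fork();                       // left half: stealable
            long right = new StealDemo(mid, hi).compute();
            return left.join() + right;
        }

        public static void main(String[] args) {
            System.out.println(new ForkJoinPool().invoke(new StealDemo(0, 100_000_000)));
        }
    }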

A textbook by Herlihy and Shavit, The Art of Multiprocessor Programming,
discusses this stuff in detail.

~~~
gipyro
This is pretty much correct. Another solution to this is to have an
array[#CPUs] of counters/structures, where each CPU only updates its own
entry.

~~~
kmod
Not quite -- you need to make sure that the counters are on separate cache
lines, otherwise you'll still get the same issue.
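
A sketch of the padded version (the spare longs just push each slot onto its
own ~64-byte line; how threads pick their slot is left out):

    // One counter slot per thread/CPU; padding keeps slots on separate
    // cache lines so increments don't invalidate each other.
    public class StripedCounter {
        static final class PaddedLong {
            volatile long value;
            long p1, p2, p3, p4, p5, p6, p7; // padding to ~64 bytes
        }

        private final PaddedLong[] slots;

        public StripedCounter(int nSlots) {
            slots = new PaddedLong[nSlots];
            for (int i = 0; i < nSlots; i++) slots[i] = new PaddedLong();
        }

        // single writer per slot, so a plain volatile write is enough
        public void increment(int slot) {
            slots[slot].value++;
        }

        // readers sum across all slots
        public long sum() {
            long total = 0;
            for (PaddedLong s : slots) total += s.value;
            return total;
        }
    }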

------
nailer
As an alternative to taskset, you can use 'tuna' (kernel.org git), which:

- doesn't require bitmask calculations

- handles IRQs as well as processes

- can also set realtime priorities of those processes if you'd like.

------
raggi
Guys, he's only pushing 120mb/s over the loopback. That's totally lame.
Sorry, but there's something wrong in the software stack. All the comments
about hardware, caches, blah blah are just FUD. Locked to a single CPU, he
should be pushing well over this (nearing an order of magnitude more) if the
software stack were doing the correct thing. Do your math, or read my comment
when it arrives on the article.

You can reproduce this bandwidth usage in slow-ass languages like Ruby, and
I've done so many times. Btw, it's ~120 MB/s he's peaked at. DDR2 peaks at,
what, in the region of 10 GB/s? The i7's architecture should totally murder
that, iirc.

    raggi@mbk: ~/volatile % time dd if=/dev/zero of=bigfile bs=1048576 count=500
    500+0 records in
    500+0 records out
    524288000 bytes transferred in 1.527235 secs (343292283 bytes/sec)
    real    0m1.532s
    user    0m0.003s
    sys     0m0.796s

    raggi@mbk: ~/volatile % time sh -c "cat bigfile | cat - > bigfile2"
    real    0m2.846s
    user    0m0.117s
    sys     0m2.185s

    raggi@mbk: ~/volatile % time sh -c "cat bigfile | cat - > bigfile2"
    real    0m2.363s
    user    0m0.116s
    sys     0m2.161s

------
zephjc
sudo killall -9 emacs?

 _ducks_ :-)

~~~
rbanffy
The up-to-date version would be "sudo kill -9 eclipse" or something to that
effect, although a "sudo killall firefox-bin" will do just fine in most cases.

It's not impossible that emacs takes up less space and fewer cycles than
gedit...

