
How to get C like performance in Java  - Garbage
http://vanillajava.blogspot.com/2011/05/how-to-get-c-like-performance-in-java.html
======
nvarsj
"One way to get C-like performance is to use C via JNI for key sections of
code. If you want to avoid using C or JNI there are still ways you can get the
performance you want."

JNI has huge latency. Write code in Java to get "C-like" performance.

"One area Java can be slower is array access. This is because it implicitly
does bounds checking."

Which gets optimized away by hotspot.

"However this doesn't mean you can't pre-allocate your objects, use Direct
ByteBuffers and Object recycling techniques to minimise your object creation."

Please don't write your own object pools, this isn't 1995.
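
(For context, the sort of thing the article means by "pre-allocate and recycle"
is sketched below - the record layout and class name are made up for
illustration - and it's exactly the kind of hand-rolled pooling I'm objecting
to.)

    import java.nio.ByteBuffer;

    // Store fixed-size "trade" records in one pre-allocated direct buffer
    // instead of allocating an object per record; the GC never sees them.
    public class TradeStore {
        private static final int RECORD_SIZE = 16; // long timestamp + double price
        private final ByteBuffer buffer;

        public TradeStore(int maxRecords) {
            buffer = ByteBuffer.allocateDirect(maxRecords * RECORD_SIZE);
        }

        public void put(int index, long timestamp, double price) {
            int offset = index * RECORD_SIZE;
            buffer.putLong(offset, timestamp);
            buffer.putDouble(offset + 8, price);
        }

        public long timestamp(int index) { return buffer.getLong(index * RECORD_SIZE); }
        public double price(int index)   { return buffer.getDouble(index * RECORD_SIZE + 8); }
    }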

~~~
srean
> Which gets optimized away by hotspot.

I believe it would do so automatically only when it can prove that the bound
will not be violated. That wouldn't be very often in a program written in an
imperative style.

Lest it be mistaken for a swipe at imperative style, functional languages
don't have it much easier either. I think the problem is that for the most
part, array use patterns are not very amenable to this kind of analysis. So
even in functional languages you have to explicitly ask the runtime or the
compiler to avoid bounds checks.

If I remember correctly, there were a few languages designed to handle this
case, i.e. use compile-time type-checking to ensure that all the arrays are of
the right shape and that none are accessed unsafely. Fish, I think, is what it
was called. Don't know what happened to it or the idea. Seemed pretty handy to
have. I think the first place to check now would be SAC (Single Assignment C).
But I have digressed far from the topic of this thread!

> Please don't write your own object pools, this isn't 1995.

The snark was unnecessary. It is frowned upon here.

~~~
davidtgoldblatt
On the contrary, for almost all types of common array code the compiler is
really good at optimizing away the bounds checks (see
[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.96...](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.9658)
). Things only get tricky when you have, for instance, arrays of indices into
other arrays, or when the next array position to examine depends on the
contents of the one being examined. Even there though (and even in array-heavy
code), bounds checks don't typically take up a ton of program execution time.
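
To make the two cases concrete (a toy sketch, not from the paper):

    // The JIT can prove i stays within [0, a.length), so the per-access
    // bounds check can be hoisted out of the loop.
    static long sum(long[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Here the index comes from data (an array of indices into another
    // array), so the check on values[idx[i]] generally cannot be removed.
    static long gatherSum(long[] values, int[] idx) {
        long s = 0;
        for (int i = 0; i < idx.length; i++) {
            s += values[idx[i]];
        }
        return s;
    }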

~~~
srean
For sparse matrix operations that _is_ the most common pattern of access
though. For linear algebraic operations on dense matrices, BLAS and LAPACK
take care of most of the issues and are super fast. I think the times I do not
use BLAS/LAPACK (directly or indirectly) are when I have to apply operations
on matrices that are not the typical linear algebraic operations. This comes
up fairly often too in my line of work, which is machine learning.

------
jshen
"Most systems can handle 1K-10K threads efficiently. If you need more
connections than that, buy another server, a cheap one cost about $500."

Shouldn't the title be, How to NOT get C like performance in Java ;)

~~~
wladimir
I also found that particular sentence really strange. "How to get C-like
performance in java? Buy more servers!". Yeah, sure, but that approach will
work for any language. With async you could handle even more connections with
that larger amount of servers...

~~~
pkl
You can handle a much larger number of connections provided they don't do
anything useful. However if you expect to do anything useful, you are likely
to need some horizontal scalability once you have thousands of connections.

AdWords charges about $1 per click. Let's say your hardware budget is only $1
per connection. With 10K connections you still have enough money for 10
servers if you wanted.

~~~
jshen
"You can handle a much larger number of connections provided they don't do
anything useful."

Define useful? I think you mean computationally complex, but that isn't the
same thing as useful. At my last job I worked on a web crawler, and fetching a
page, then persisting it in our grid, was useful work. It was useful to have
more than 10k connections in that context.

One of my pet peeves is this idea of throwing servers at problems. We
currently have enormous problems with power and heat dissipation in data
centers. At my last job, where I did the web crawler, we couldn't put any more
servers in our data center because we didn't have enough power. Doing more
with fewer servers can be very important.

I also wonder how much larger the carbon footprint of computing is because of
such philosophies.

------
astine
"Note: Most of these suggestions only work for standalone applications rather
than applets."

Whoa! That made me double-take. Applets have been an effectively abandoned
technology (by the browser makers, not necessarily Sun) for what, almost 10
years now? It's amazing this guy would feel the need to specify this.

Then again, had applets ever taken off... we might be coding client-side code
in Lisp now (or your favorite JVM language :)). Crazy.

~~~
gruseom
You can code client-side in Lisp now (or your favorite JS-targeted language).
I do it every day.

~~~
astine
Parenscript isn't quite the same thing.

~~~
gruseom
That's true. You're not encapsulated from JS, so you have to be aware of JS
issues like null-undefined-false-0 and all the rest of them. So it feels like
you're writing in two languages at once. On the other hand, that has some
benefits too.

------
speckledjim
"Most systems can handle 1K-10K threads efficiently. If you need more
connections than that, buy another server, a cheap one cost about $500."

Terrible advice. Use NIO and do it properly. Creating 1 thread per connection
is for unskilled programmers.

In my experience, once you get up to about 1k threads in Java you'll be
wasting most of the CPU switching between them.

For comparison, it's trivial for any skilled programmer to use NIO to get up
to 100k+ connections without any real CPU usage.
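
The skeleton for that is small - a bare-bones sketch (the port, buffer size
and partial-write handling are all glossed over here):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    // One thread, one Selector, many non-blocking connections.
    public class EchoServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.socket().bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocateDirect(4096);
            while (true) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        if (client.read(buf) < 0) { client.close(); continue; }
                        buf.flip();
                        client.write(buf); // echo back
                    }
                }
            }
        }
    }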

~~~
papaf
_Creating 1 thread per connection is for unskilled programmers._

It's not as clear-cut or obvious as it first seems:

[http://mailinator.blogspot.com/2008/02/kill-myth-please-
nio-...](http://mailinator.blogspot.com/2008/02/kill-myth-please-nio-is-not-
faster-than.html)

Edit: The original presentation is more interesting:
<http://www.mailinator.com/tymaPaulMultithreaded.pdf>

~~~
speckledjim
The article is from back in 2008, so I'd first dispute its accuracy with
modern JVMs.

It's not just speed (the speed obviously changes as you change the number of
connections - 100k threads isn't going to beat 100k NIO connections). It's
also that it's more scalable, better code, more maintainable, less prone to
bugs, and you don't have to worry as much about concurrency issues. I'd argue
that the lines of code are likely to be fewer as well, and memory usage lower.

~~~
gojomo
For better or worse, JVMs don't change that fast. (In 2008, the latest
official release was JDK6, reaching 'update 12' in December. In July 2011,
until the official release of JDK7 next week, the latest official JVM is still
JDK6, 'update 26'.)

Or do you know of specific NIO performance improvements between JDK6 in 2008
and JDK6u26, or in JDK7?

 _100k threads isn't going to beat 100k NIO connections_

That may be true but I wouldn't assume it with certainty without evidence.
Threading has been improving, too. Tyma's 2008 presentation mentioned a JVM
limit at the time of 16000 threads, but that may have changed as well.

A single-threaded NIO implementation, using only one core, could very
plausibly lose to a 100K-threaded implementation on a multicore machine. And
once you split your NIO over a few threads to use cores effectively, you have
most of the same concurrency worries as with connection-per-thread approaches.

------
fauigerzigerk
I've been working very hard to make Java use less memory using techniques like
these and others. The problem is, if you really want to specify exactly how
non-trivial structured objects are packed into memory, you end up writing
your own garbage collector and memory allocator, which is a rather large
project all by itself. Also, I have found that cutting down on memory usage
like that hurts performance a lot.
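
Even the trivial fixed-size case already looks like a hand-rolled allocator
(a toy sketch, not my actual code):

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;

    // Fixed-size slots carved out of one direct buffer, with a free list
    // for recycling. Variable-sized or nested structures make this far
    // more involved - at which point you are writing a real allocator.
    public class SlotAllocator {
        private final ByteBuffer arena;
        private final ArrayDeque<Integer> free = new ArrayDeque<Integer>();

        public SlotAllocator(int slots, int slotSize) {
            arena = ByteBuffer.allocateDirect(slots * slotSize);
            for (int i = 0; i < slots; i++) free.push(i * slotSize);
        }

        public int alloc() {                 // returns an offset into the arena
            Integer off = free.poll();
            if (off == null) throw new OutOfMemoryError("arena full");
            return off;
        }

        public void release(int offset) { free.push(offset); }

        public void putLong(int offset, int field, long value) {
            arena.putLong(offset + field, value);
        }

        public long getLong(int offset, int field) {
            return arena.getLong(offset + field);
        }
    }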

At the end of the day, going down that road makes you less productive than
working in C++ and the result will still be subpar compared to a C or C++
solution. So I'm back with C++ right now even though I'm painfully aware of
its weaknesses.

~~~
sliverstorm
So what you're saying is, if you want C-like performance in Java, switch to C?

~~~
fauigerzigerk
Exactly, because otherwise you get assembler-like productivity. Specifying
memory layouts in Java is much more low-level than C or C++ programming.

------
smcl
The title is a little misleading: there are only a few techniques, and one of
the more interesting ones (the "Unsafe" class) is only available in some JVMs,
one doesn't help with speed in many cases (compressed strings), and a couple
of the others boil down to "use DirectByteBuffers".

~~~
peter_lawrey
Does your comment suggest that it can't be that simple?

------
sehugg
For servers, memory is usually the long pole in the tent. For me it boils down to:

* Use less memory.

* Don't use a lot of threads.

* Use fewer objects.

Worst-case GC is a big issue. For just a 2 GB heap you are looking at up to 7
second GC pauses with the Java 6 collector. You can use concurrent GC but it
adds significant CPU and memory overhead. Direct buffers don't live in the
heap so they don't contribute to GC time.

Threads in Java are _extremely_ heavyweight in memory, if not in CPU. A rule
of thumb is about 1 MB per thread. You can make it better by decreasing the
stack size and/or using 32-bit addresses.
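
For example (flag values and the class name are illustrative, not a
recommendation):

    # Default stacks are ~1 MB each; -Xss shrinks them, and compressed oops
    # give you the "32-bit addresses" on a 64-bit JVM.
    java -Xss256k -XX:+UseCompressedOops -Xmx2g MyServer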

~~~
peter_lawrey
If you create little enough garbage and your eden size is large enough, you
can avoid having any minor collections at all. I have a full GC which is
triggered once per night as maintenance.
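
(Scheduling-wise that can be as simple as the sketch below - illustrative
only; the large eden itself comes from -Xmn plus a low-garbage coding style.)

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Size eden large enough that minor collections never trigger during
    // the day, then force one full collection off-peak.
    public class NightlyGc {
        public static void main(String[] args) {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(new Runnable() {
                public void run() { System.gc(); }
            }, 24, 24, TimeUnit.HOURS);
            // ... start the rest of the application ...
        }
    }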

------
Roboprog
Most of these tips seem to be variations of the "flyweight" design pattern
(<http://en.wikipedia.org/wiki/Flyweight_pattern>), to avoid having many
objects for the GC to keep checking, as well as all the other overhead for
each individual object.
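
The classic form is something like this (a sketch; same idea as
Integer.valueOf() caching small values):

    import java.util.HashMap;
    import java.util.Map;

    // Flyweight: hand out one shared immutable instance per distinct value
    // instead of allocating a new object every time.
    public final class Currency {
        private static final Map<String, Currency> CACHE = new HashMap<String, Currency>();
        private final String code;

        private Currency(String code) { this.code = code; }

        public static synchronized Currency of(String code) {
            Currency c = CACHE.get(code);
            if (c == null) {
                c = new Currency(code);
                CACHE.put(code, c);
            }
            return c;
        }

        public String code() { return code; }
    }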

------
peter_lawrey
The answer to why you would still use Java is here: "Java makes high level
programming simpler and easier at the cost of making low level programming
much harder. Fortunately most applications follow the rule of thumb that you
spend 90% of your time in 10% of the code. This implies you are better off 90%
of the time, worse off 10% of the time. ;)" Why would you develop more than a
small portion of your application in a low level language unless every part of
your application is performance critical?

------
jyperion
Where does one find a 16 GB server for $1k?

~~~
krakensden
I don't know, but I believe it - Fry's has an ad out now for 16GB for $100
(after a $50 mail-in rebate).

Sure you might want ECC, or a brand name, but the floor is pretty damn low.

------
known
no pointers = no speed

~~~
peter_lawrey
Can you provide any benchmarks which support this as I can provide plenty
which don't?

~~~
known
[http://shootout.alioth.debian.org/u32/benchmark.php?test=all...](http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=all&box=1)

~~~
peter_lawrey
That's a good link. The n-body and spectral-norm benchmarks were fastest in
Fortran. The fannkuch benchmark was fastest in Ada. The Mandelbrot benchmark
was fastest in ATS.

In each case, the language has pointers, however they were not used in the
core part of the benchmark. i.e. they didn't get their speed advantage from
pointers.

Haskell was the fastest for the thread-ring benchmark, and it doesn't have
pointers.

If Java had pointers, as it does via the sun.misc.Unsafe utility class, some
code would be faster, but I don't believe this explains why Java is not the
fastest in each benchmark. ;)
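
For reference, "pointers" via Unsafe look something like the sketch below
(non-standard, HotSpot-specific, and easy to crash the JVM with):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class UnsafeDemo {
        public static void main(String[] args) throws Exception {
            // Unsafe.getUnsafe() is restricted, so grab the instance by reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long address = unsafe.allocateMemory(8); // 8 raw off-heap bytes
            unsafe.putLong(address, 42L);            // write through the "pointer"
            System.out.println(unsafe.getLong(address));
            unsafe.freeMemory(address);
        }
    }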

Object-oriented programming is used in a significant portion of all programs;
it would be interesting to see how C/C++ performs in such benchmarks. ;)

~~~
igouy
>>That's a good link.<<

Those measurements are for programs forced to use just one core on a quad-core
machine; you may also be interested in the measurements made when the programs
are allowed to use all the cores -

[http://shootout.alioth.debian.org/u32q/which-programming-
lan...](http://shootout.alioth.debian.org/u32q/which-programming-languages-
are-fastest.php)

and also measurements made on x64

[http://shootout.alioth.debian.org/u64q/which-programming-
lan...](http://shootout.alioth.debian.org/u64q/which-programming-languages-
are-fastest.php)

>>The n-body and spectral-norm benchmarks were fastest in Fortran.<<

Maybe we should wonder if the Intel C++ compiler would produce programs
comparable to those from the Intel Fortran compiler ;-)

>>Object-oriented programming is used in a significant portion of all
programs; it would be interesting to see how C/C++ performs in such benchmarks.<<

Maybe you're suggesting that OO C++ programs wouldn't perform as well as Java
programs? But doesn't that highlight the advantage of being able to write
programs in different styles?

~~~
peter_lawrey
I agree. I am suggesting that "no pointers = no speed" is not demonstrated by
the benchmarks provided. The developers who wrote the benchmarks which were
faster than the C/C++ ones didn't appear to agree either, i.e. when they had
the option to use pointers, they didn't, and were still faster.

------
gaius
If you have to fight the language like this, would you not be better off with
a different language? Scala maybe.

~~~
jshen
For most of his points you'd have to do the same things in Scala, right? How
do you do arbitrary memory access in Scala? How do you use space-efficient
strings in Scala? Etc., etc.

~~~
lucian1900
While the parent is generally wrong, it has been shown that Scala programs
have the tendency to generate slightly (5-10%) less garbage. This is
especially interesting on Dalvik.

