I write a lot of hand-optimized Java code, and have found that while Java as a language can be pretty fast (and its HotSpot team is to be commended), Sun's libraries are often atrocious. For example, until Java 1.5 (I think, maybe 1.4) ArrayList's get(), set(), and add() methods were not inline-able, and a one-line fix to the problem languished for years on Sun's bug forums. Sun's decision to go with type erasure was also a huge mistake in my opinion: as a result we have hilariously inefficient boxing and unboxing operations in poorly written Java code. Sun's push towards iterators hasn't helped things either.
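To illustrate the boxing point (a sketch of mine, not the commenter's code): with erased generics, a `List<Integer>` must box every element, where a primitive `int[]` does not:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: with type erasure, List<Integer> stores boxed Integer objects,
// so every add/get pays a boxing or unboxing cost that a primitive int[]
// avoids entirely.
public class BoxingCost {
    public static long sumBoxed(int n) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(i);   // autoboxes each int
        long sum = 0;
        for (Integer v : list) sum += v;           // unboxes each element
        return sum;
    }

    public static long sumPrimitive(int n) {
        int[] arr = new int[n];
        for (int i = 0; i < n; i++) arr[i] = i;    // no boxing at all
        long sum = 0;
        for (int v : arr) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1000));      // 499500
        System.out.println(sumPrimitive(1000));  // 499500
    }
}
```

Both loops compute the same sum; the boxed version simply does far more allocation and indirection to get there.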
If you eschew all this and write strongly optimized code, I think the single biggest spot where Java is unquestionably slower than C/C++ is in array accesses. In Java, to set a slot in a two-dimensional array, the runtime must first test that the array is non-null, then that the X index is in bounds, then that the appropriate Y subarray is non-null, then that the Y index is in bounds, and only then set the value.
In C the compiler does a multiply and an add and sets the slot.
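The usual Java-side workaround (my sketch, not from the original comments) is to flatten the grid into a single 1D array indexed as row * width + col, so each access pays one null check and one bounds check instead of two of each:

```java
// Sketch: store a logical 2D grid in a flat 1D array so each access
// needs only one null check and one bounds check, instead of the two of
// each that Java's array-of-arrays layout requires.
public class FlatGrid {
    public final int width;
    public final double[] cells; // row-major: cells[y * width + x]

    public FlatGrid(int width, int height) {
        this.width = width;
        this.cells = new double[width * height];
    }

    public void set(int x, int y, double v) { cells[y * width + x] = v; }
    public double get(int x, int y)         { return cells[y * width + x]; }

    public static void main(String[] args) {
        FlatGrid g = new FlatGrid(4, 3);
        g.set(2, 1, 7.5);
        System.out.println(g.get(2, 1)); // 7.5
    }
}
```

The multiply-and-add index computation is exactly what the C compiler emits; the JIT still checks bounds once, but the per-row null and bounds checks disappear.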
It's worth noting that proposals, like JSR 83 (http://jcp.org/en/jsr/detail?id=83), to address Java's multi-dimensional array shortcomings have been around for quite some time. Of course, JSR 83 just sat around for five years before being abandoned; maybe Java folks don't think efficient multi-dimensional arrays are useful and/or common.
I didn't read through the link fully, but it sounds to me like, after changing the way you program, you can get C++ to deliver the kind of performance a JVM gives you without doing anything special. Or are you saying that with data-oriented programming C++ becomes significantly faster than Java?
Given that most people aren't going to change their programming methodology just to avoid Java, and that C++ still has its niche uses, I don't think the article is naive in any way.
Data-oriented programming produces programs that are an order of magnitude faster than object-oriented programs. It's the same difference as stateful vs. stateless code, and the same difference as RESTful APIs vs. RPC.
If you care about performance and scalability then you write stateless code and use RESTful interfaces. You also choose to write data-oriented code rather than object-oriented code.
Data-oriented code is not possible in Java because you can't create complex value types and you can't control when and where memory gets allocated and deallocated.
There are obviously techniques in data-oriented code that aren't possible in Java, but a lot of the key insight is applicable in just about every language. Structures-of-arrays, defining the data in objects based on usage patterns instead of responsibilities and "model-the-world" categorisation...
Java doesn't have a `sizeof` operator, objects don't store their object-typed member variables by value, and it's not always obvious which function calls cost how much... Problems, to be sure, but if you really want data-oriented code you can usually contort yourself far enough to get it.
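As a sketch of the structure-of-arrays idea in Java (my example, not the commenter's): keep one primitive array per field instead of an array of objects, so a hot loop over one field touches contiguous memory.

```java
// Sketch: structure-of-arrays in Java. Instead of an array of Particle
// objects scattered across the heap, keep one primitive array per field
// so the integration pass reads only the contiguous arrays it needs.
public class Particles {
    public final float[] x, y;   // positions
    public final float[] vx, vy; // velocities
    public int count;

    public Particles(int capacity) {
        x = new float[capacity];  y = new float[capacity];
        vx = new float[capacity]; vy = new float[capacity];
    }

    public int add(float px, float py, float pvx, float pvy) {
        int i = count++;
        x[i] = px; y[i] = py; vx[i] = pvx; vy[i] = pvy;
        return i;
    }

    // Advance every particle by dt; touches only x, y, vx, vy.
    public void step(float dt) {
        for (int i = 0; i < count; i++) {
            x[i] += vx[i] * dt;
            y[i] += vy[i] * dt;
        }
    }

    public static void main(String[] args) {
        Particles p = new Particles(8);
        int i = p.add(0f, 0f, 1f, 2f);
        p.step(0.5f);
        System.out.println(p.x[i] + " " + p.y[i]); // 0.5 1.0
    }
}
```

No `sizeof` or value types required; the layout is dictated by usage patterns rather than by a "model-the-world" Particle class.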
That's a fascinating article, and you may be right that little more need be said, but the author wanders WAY off the subject when he gets to the benefits of Java. Claiming that you can use the time saved by having garbage collection to make your program more efficient in other ways does not make a convincing argument when you're discussing fine points of performance.
In case that wasn't clear, he claims not that garbage collection is more efficient, but that the saved programmer time can be used in some hand-waving way to improve efficiency. Likewise he wanders off into woolly territory with the claim that very large Java programs can be written more quickly. These would be perfectly fine if he were posting the reasons he thinks Java is better, but in a technical discussion of all-out performance it's just off-topic.
Well, note how the article said "naive" memory allocation and even then it was specific to a microbenchmark. I've written very high performing memory allocation routines for multi-threaded scenarios and if you know the types of allocations you will be making you can leave any GC or standard memory allocation routines in the dust!
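The same idea shows up in Java as object pooling (my sketch, not the commenter's code): when you know exactly which type a hot loop will allocate, recycling instances sidesteps both allocator work and GC pressure.

```java
import java.util.ArrayDeque;

// Sketch: an object pool. When the allocation pattern is known in
// advance, acquiring from a free list avoids both the allocator and the
// garbage collector on the hot path.
public class VecPool {
    public static final class Vec { public double x, y; }

    private final ArrayDeque<Vec> free = new ArrayDeque<>();

    public Vec acquire() {
        Vec v = free.poll();
        return v != null ? v : new Vec(); // allocate only when pool is cold
    }

    public void release(Vec v) { free.push(v); }

    public static void main(String[] args) {
        VecPool pool = new VecPool();
        Vec a = pool.acquire();
        pool.release(a);
        Vec b = pool.acquire();     // same instance, recycled
        System.out.println(a == b); // true
    }
}
```

This is the managed-language cousin of a custom C allocator: it trades generality for knowing the type and lifetime of what you allocate.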
In C or C++ you can even allocate objects on the stack, which has only the cost of increasing the stack pointer (meaning it is very cheap), and also you have precise control over the object's lifetime.
Note that by using a technique known as escape analysis, a virtual machine like the JVM can detect that the lifetime of an object is such that it doesn't leave the context of a particular method and then stack-allocate the object implicitly. That may seem like a limited optimization, but when you combine it with method inlining it gets a lot more useful. I believe the optimization is turned off by default in the Sun/Oracle JVM but can be enabled via the -XX:+DoEscapeAnalysis option. For programs that allocate large numbers of short-lived objects, it can make a significant performance difference.
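Here is a sketch (mine, not the commenter's) of the kind of allocation escape analysis can eliminate: the `Point` objects below never leave `distSq()`, so after inlining the JIT can replace them with scalar locals instead of heap allocations.

```java
// Sketch: the two Point allocations never escape distSq(), so escape
// analysis can turn them into four plain double locals, eliminating the
// heap allocations entirely.
public class EscapeDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    public static double distSq(double x1, double y1, double x2, double y2) {
        Point a = new Point(x1, y1); // never leaves this method
        Point b = new Point(x2, y2);
        double dx = a.x - b.x, dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        System.out.println(distSq(0, 0, 3, 4)); // 25.0
    }
}
```

Whether the allocation is actually elided depends on the JVM and its settings; the point is that the code can be written in the natural object style and still, in the best case, allocate nothing.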
Often that allocation should be free - when you do a "proper" function call you'll be increasing that pointer already, and the total size of stack-bound objects in many functions can be determined at compile time.
Exactly. What a lot of dynamic VM languages provide are very clean semantics and the ability to prototype very quickly. A competently tuned GC on a modern VM will get you beyond the ability of a bad or mediocre C coder, in less calendar time, though at the price of more machine resource use.
I'm waiting for technology that makes this trade-off obsolete. One should be able to transition from fast prototyping to solid and optimal production code -- incrementally and without great pain. We as an industry are just about ready for this.
You can easily get generational GC to be an order of magnitude slower than it should be by changing just one or two settings.
Is there some way to get the GC to never run?
Yes, this actually came up at Smalltalk Solutions many years ago. One presenter was using Squeak as an advanced debugger at a company producing an FPS game. If all your functionality happens between frames, you can rig it so you never GC. You just throw away all of your memory outside of "perm" space every time.
With VisualWorks Smalltalk, you can change the settings so that the bulk of your GC work happens using incremental GC. It's not uncommon to get to the point where GC never takes up more than a few milliseconds. That's plenty good for most people. Admittedly that's not so good if your "light" request load is well over 1000 transactions per server per second.
And even if you contrive your app to use memory very, very carefully, on a runtime shared with 100 other apps (e.g. on a server) not everybody is as nice, and GC still runs and stalls the system.
When you need enough virtual hosts to be running 100 server processes -- that's likely when you need to be transitioning out of "rapid prototyping" mode and onto processes with just a little more rigor.
What I'm advocating is a language where both rapid prototyping and efficiently running optimized, mature code are possible. Not only possible, but easy to transition between.
The default malloc and free, depending on platform, can be ridiculously slow operations (in the range of 50-100µs). Those numbers are old; they are from some testing I did nearly a decade ago. I'd hope things have improved by now.
But at that time, I wrote a memory manager whose malloc and free calls were at least an order of magnitude faster.
When you need it to be fast, it can be fast. If you don't want it to be fast, or you have to resort to tricks like preventing the compiler from inlining (wtf?), then you really are doing it wrong.
It is clearly visible where Java HotSpot shines, but in most cases C++ wins by a factor of two. It also mentions a technique for doing profile-guided analysis with GCC that enables optimizations similar to what HotSpot does.
>> "Value Types, such as a 'Complex' type require a full object in Java. This has both code speed and memory overheads."
I have programmed in C# and C++ (never Java), but I found the above to be the key issue why my programs would run significantly slower with C#. Here's the C# language discussion thread where I posted details about my issue:
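In Java terms (my sketch, not from the linked thread), the usual workaround for the missing value types is to avoid a `Complex` object per element and carry the real/imaginary parts in parallel primitive arrays:

```java
// Sketch: since Java lacks user-defined value types, a Complex class
// costs a heap object (header plus pointer) per element. The workaround
// is to store the components in parallel primitive arrays.
public class ComplexArray {
    public final double[] re, im;

    public ComplexArray(int n) { re = new double[n]; im = new double[n]; }

    // c[i] = a[i] * b[i]: element-wise complex multiply, no allocation.
    public static void mul(ComplexArray a, ComplexArray b, ComplexArray c) {
        for (int i = 0; i < c.re.length; i++) {
            c.re[i] = a.re[i] * b.re[i] - a.im[i] * b.im[i];
            c.im[i] = a.re[i] * b.im[i] + a.im[i] * b.re[i];
        }
    }

    public static void main(String[] args) {
        ComplexArray a = new ComplexArray(1), b = new ComplexArray(1),
                     c = new ComplexArray(1);
        a.re[0] = 1; a.im[0] = 2;   // 1 + 2i
        b.re[0] = 3; b.im[0] = 4;   // 3 + 4i
        mul(a, b, c);
        System.out.println(c.re[0] + " " + c.im[0]); // -5.0 10.0
    }
}
```

It works, but it is exactly the kind of contortion a C# struct or a C value type makes unnecessary.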
Is the test in the article a valid test? In the Java case, wouldn't the program quit before the garbage collector gets a chance to run? Wouldn't that be kind of like running the C program without free()?
Incremental GC is yet another JVM trade-off. It trashes the CPU cache all the time and with 100ns slowdown for each cache miss the gains I've seen on real life code are not even remotely close to what's advertised.
And Java objects in real life tend to be big and not very local, from what I've seen.
Actually, OCaml never generates objects like this on the stack - value creation is always taken literally. The OCaml compiler is really allocating all those records. It's just that their lifetime is almost zero, so the cost to clean them up is zero (as none get promoted from the minor heap to the major heap). And the cost to create them is almost zero, just a += and a compare.
This is not true. The runtime environment for Java is capable of changing the program on the fly in response to dynamically determined bottlenecks. To implement such a system in C would be to re-implement Java.
What you cannot do in C/C++ that you can in Java is respond to program inefficiencies at runtime, with runtime knowledge.
That's true, but most programs don't encounter situations where a whole new class of unforeseen optimizations are needed right now, /and/ it's possible to make a significant improvement without changing any algorithms.
For everything else, there is Profile Guided Optimization.
Are you talking about profile-guided optimization, or a new Intel compiler that produces an application capable of dynamically rebuilding its own binary code on the fly? Presumably in the event that it can detect a better way to execute a chunk of code.
There are JIT libraries for C++ for building applications which dynamically compile stuff, I'm not aware of any that dynamically recompile the C++ though.
Seriously though, this is misguided. PGO is clearly a step in the right direction, but it's simply deferred static optimization. Either you optimize in C at compile time, or with PGO by running a few statistically representative executions, but neither is dynamic.
The only way to make truly dynamic optimizations is by having a runtime environment and code that is interpreted.
The argument that it's not often necessary is a different argument, and with only a few moments' thought it can be seen to be untrue. Take, for example, a strstr-like operation that you naively implement by walking through the source string. Now let's say you do this a lot. All is good because the input is short; then (in a parser, for example) you receive something much, much bigger. A JIT has privileged runtime information and can determine that building an index on the source string, to make repeated finds faster, is worth it; C cannot, even with PGO, because you may never have run a test that encounters this situation.
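The adaptive idea can be sketched in library form (all names and the warm-up threshold here are illustrative, not a real JIT mechanism): scan naively at first, then build a character index once the same haystack has been searched repeatedly, the way a runtime could decide to based on observed behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: scan naively for the first few searches, then build a
// first-character index over the haystack so repeated finds only probe
// candidate start positions. The threshold of 4 is arbitrary.
public class AdaptiveFinder {
    private final String haystack;
    private int searches;
    private Map<Character, List<Integer>> index; // built lazily

    public AdaptiveFinder(String haystack) { this.haystack = haystack; }

    public int find(String needle) {
        if (++searches > 4 && index == null) buildIndex();
        if (index == null) return haystack.indexOf(needle); // naive path
        List<Integer> starts = index.get(needle.charAt(0));
        if (starts == null) return -1;
        for (int s : starts)
            if (haystack.startsWith(needle, s)) return s;
        return -1;
    }

    private void buildIndex() {
        index = new HashMap<>();
        for (int i = 0; i < haystack.length(); i++)
            index.computeIfAbsent(haystack.charAt(i),
                                  k -> new ArrayList<>()).add(i);
    }

    public static void main(String[] args) {
        AdaptiveFinder f = new AdaptiveFinder("abracadabra");
        for (int i = 0; i < 10; i++)
            System.out.println(f.find("cad")); // 4 each time; indexed after warm-up
    }
}
```

A JIT can make this kind of switch invisibly and with real profile data; in C you would have to anticipate the case and hand-code both paths.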
I saw benchmarks posted comparing a Java supercompiler vs C, and Java + supercompiler actually beat raw C for quite a few test cases. I can't seem to find the study I saw, but if you search for "supercompilers" there's lots of information available.