

Why git is so fast (or why Java is not as fast as C) - ssp
http://marc.info/?l=git&m=124111702609723&w=2

======
DarkShikari
This has always been true and will continue to be true.

Is it possible to build a high level language that's as fast as C? Sure--but
only if you restrict the programmer to the same amount of _effort_ in both
languages. If your application is one where it's worthwhile applying a great
deal of extra programmer time in order to improve performance, a low level
language will _always win_ because it exposes more of the native machine.

The goal of a high-performance high-level language should be to provide C-like
performance _for a reasonably unoptimized application_. Once one starts to
optimize a program down to the last instruction, staying on par with a low
level language becomes simply impossible for a high level language.

In my experience, every level of abstraction you create away from the machine
limits your performance by some amount. This can be demonstrated without even
leaving assembly language!

Fastest possible: raw assembly code in a NASM-like assembler. With this, you
can write basically any code possible with no limitations, at the cost of
extremely high programmer time costs.

Shortcut: Use inline assembler instead of NASM to simplify calling convention
and other niceties.

Cost: There's now a whole bunch of stuff, like calling convention optimization
and computed jumps, which you can no longer do.

Shortcut: Use compiler intrinsics instead of raw assembly.

Cost: You can no longer tweak your algorithm to minimize register spills
because you aren't directly controlling spills anymore.

Shortcut: Use a set of macros (like my project does) for handling calling
convention, MMX/SSE abstraction, and other such simplifications.

Cost: You tend to overlook optimizations that apply to one possible output of
the abstraction and not others, resulting in either messes of ifdefs or
suboptimal code--the former of which is of course violating the abstraction.

Shortcut: Use a framework like liboil to write SIMD assembly instead of native
code.

Cost: By using generic SIMD operators, you lose access to specialized
architecture-specific operations, along with the aforementioned issue of
register spills.

Here we _haven't even gotten beyond assembler_ and we're already losing
performance. Now scale this up to C and beyond: abstraction inherently comes
at a performance cost. It isn't even merely a function of language:
abstractions within a language reduce performance as well.

~~~
Locke1689
Couldn't agree more. One thing to remember though: "Premature optimization is
the root of all evil." C Git is optimized and fast. Mercurial, with a
combination of Python and C, is almost as fast. Maybe not quite as fast, but
fast enough for me. A lot of the time those optimizations just don't pay off
in terms of programmer time.

~~~
DarkShikari
Slightly tangential, but I think "Premature optimization is the root of all
evil." is the most misquoted statement in the history of computer science.
Most importantly, people tend to misinterpret "premature" to mean "any time
before the software is done" or similar.

I can't even count the number of times where I've worked on or watched a
project where performance is important and the following series of events
occurs:

1\. Module is designed with that quote in mind: it must be simple to implement
and performance is unimportant. Note that this isn't a prototype: this is a
plan for the actual final product.

2\. Module is built and finished, then submitted for a review. The reviewers,
knowing the importance of performance for this module, point out a number of
ways that the module is extremely suboptimal.

3\. Because of the assumptions so heavily ingrained into the module,
implementing these performance improvements--which could have been foreseen as
necessary far earlier--requires a near-complete rewrite of a large part of the
module.

It's exactly as if we finished our software and our customer decided that his
requirements were actually totally different--except in this situation, it's
entirely our fault, because performance _was_ a requirement and we ignored it
because we didn't want to optimize "prematurely". If you want performance, you
have to design with performance in mind; if you don't, and then decide later
you want performance, the cost of your earlier decision is magnified a
hundredfold.

I have almost never seen that quote used correctly.

~~~
JulianMorrison
I always thought that what that one quote _meant_ was: measure, then optimize.
It's no use tuning the hell out of a function that hardly features in your
profiler results.
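
A minimal sketch of "measure, then optimize", assuming Python 3.9+ (for `pstats.Stats.get_stats_profile`). The functions `hot_path`, `cold_path` and `workload` are hypothetical stand-ins, not anything from Git or JGit; the point is only that the profiler, not intuition, says which function deserves tuning.

```python
import cProfile
import pstats


def hot_path(n):
    # Deliberately expensive: a tight loop that dominates the run time.
    total = 0
    for i in range(n):
        total += i * i
    return total


def cold_path(n):
    # Deliberately cheap: tuning this would not change the overall run time.
    return n + 1


def workload():
    hot_path(2_000_000)
    cold_path(2_000_000)


def profile_by_function(func):
    """Run func under cProfile and return {function name: cumulative seconds}."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    profile = pstats.Stats(profiler).get_stats_profile()
    return {name: fp.cumtime for name, fp in profile.func_profiles.items()}


if __name__ == "__main__":
    times = profile_by_function(workload)
    # List the functions by cumulative time, most expensive first.
    for name, seconds in sorted(times.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{seconds:8.3f}s  {name}")
```

Only the entries near the top of that listing are worth optimizing; everything else "hardly features" in the results.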

~~~
silentbicycle
Agreed, though it would be clearer if the quote went, " _Uninformed_
optimization..." or "Doing optimization _without measuring first_...".

------
barrkel
I find it interesting to read that every single one of his criticisms of JGit
would not need to apply to a C# implementation. Unsigned types, unmanaged
memory allocation (Marshal.Alloc*), unsafe code with pointers, native
P/Invoke access, value types (e.g. fixed arrays for his SHA-1 point), and an
IDisposable convention for managing all the unmanaged trickery, C# has pretty
much all the needed features.

When optimizing C# for space use in particular (which ends up being a time
optimization for I/O heavy loads), I've leaned heavily on arrays of structs,
and even bit-packed arrays (e.g. an Int32 array storing 14-bit integers,
packed so that they don't align on index boundaries).
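
The bit-packing idea can be sketched roughly as follows. This is in Python rather than C#, and the class name `Packed14Array` is made up for illustration; it assumes `array("I")` uses 4-byte items, which holds on common platforms. Values are 14 bits wide, so some of them straddle two 32-bit words.

```python
from array import array

BITS = 14                  # width of each stored value
MASK = (1 << BITS) - 1     # 0x3FFF
WORD_BITS = 32             # each backing word is treated as 32 bits
WORD_MASK = 0xFFFFFFFF


class Packed14Array:
    """Fixed-size array of 14-bit unsigned integers packed into 32-bit words."""

    def __init__(self, count):
        n_words = (count * BITS + WORD_BITS - 1) // WORD_BITS
        self.count = count
        self.words = array("I", [0] * n_words)

    def _locate(self, index):
        if not 0 <= index < self.count:
            raise IndexError(index)
        # Word holding the lowest bit, and the bit offset inside that word.
        return divmod(index * BITS, WORD_BITS)

    def __getitem__(self, index):
        word, offset = self._locate(index)
        value = self.words[word] >> offset
        if offset + BITS > WORD_BITS:
            # The value straddles a word boundary: fetch the high bits too.
            value |= self.words[word + 1] << (WORD_BITS - offset)
        return value & MASK

    def __setitem__(self, index, value):
        if not 0 <= value <= MASK:
            raise ValueError("value does not fit in 14 bits")
        word, offset = self._locate(index)
        keep = ~(MASK << offset) & WORD_MASK
        low = (value << offset) & WORD_MASK
        self.words[word] = (self.words[word] & keep) | low
        spill = offset + BITS - WORD_BITS
        if spill > 0:
            # Store the bits that did not fit into the next word.
            high_mask = (1 << spill) - 1
            high = value >> (BITS - spill)
            self.words[word + 1] = (self.words[word + 1] & ~high_mask & WORD_MASK) | high
```

Ten values take five 32-bit words here instead of ten, at the cost of a few shifts and masks per access.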

Programming at the level of C in C# loses most of the benefits, of course, but
at least you can hide the optimizations behind pretty APIs, and write the non-
critical parts in terms of an easier to work with lower layer.

~~~
j_baker
I actually believe they're accessible in Java as well through JNI and JNA. But
I'm hardly a Java expert.

~~~
barrkel
Not at the same level of abstraction, IMO. To take a representative example,
pointer arithmetic simply isn't really available to you from Java, unless you
pretend your integers are pointers and go through JNI functions for
indirection, etc., which is going to be slow enough to defeat the purpose.

------
thibaut_barrere
From the last line:

"But, JGit performs reasonably well; well enough that we use internally at
Google as a git server."

Rings the 'good enough' bell to me.

~~~
axod
Agree. If your bottleneck is your source control tool, you're doing something
wrong.

~~~
Deestan
_Anything_ that is part of the programming routine should be severely
optimized.

Every little bit not only adds up, it often _multiplies_ in big enough
projects.

In one of our big products, we have several build steps that touch upon the
source control systems. A 10-second VCS delay x 6 build steps, a 1-second lazily
written C++ header slowdown x 300 files, etc., and before you know it something
that _should_ take 10-15 minutes for a clean build now takes ~1 hour. True
story. :(

Luckily, enough of us got so annoyed about this that it got fixed after a few
weeks of dedicated effort. We now have a constant conscious effort to keep
build times down.
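
As a rough check of the arithmetic, using only the two figures quoted in this comment (the "etc." items are unknown and left out; the function name is just for illustration):

```python
def overhead_seconds(vcs_delay=10, build_steps=6, header_slowdown=1, header_files=300):
    """Sum the two per-build delays quoted above, in seconds."""
    return vcs_delay * build_steps + header_slowdown * header_files


if __name__ == "__main__":
    total = overhead_seconds()
    print(f"{total} s = {total / 60:.0f} min of avoidable delay per clean build")
```

Those two items alone come to 6 minutes per clean build, so it takes only a handful of similar costs to turn a 10-15 minute build into a much longer one.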

~~~
berntb
Thank you for reminding me about compilation time.

Now I won't be tempted to go back from scripting language country for a few
more years... :-)

Edit: I should add that I have very fond memories about my youth and C, too.

~~~
mrkurt
Most sane projects I've experienced do "builds" even with scripting languages.
It's every bit as much of an annoyance to have a bunch of needlessly slow
tests as it is to deal with slow compile times.

~~~
berntb
OK, fair point; the pain with compiled languages is the compile step in the
cycle where you edit and run just the local tests, not the whole system.

------
mhansen
Good article, but it seems to ignore the elephant in the room: the JVM's cold
start times.

For me, git's use case is being called from the command line, often
interactively, between code editing sessions. It needs to start fast and
finish fast, to not interrupt my workflow.

The JVM's cold start times are huge. On the order of a second. Noticeably
slow. So slow, that in the common case of committing a few files, the JVM
startup time would totally dwarf the time spent actually doing work.
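
A small sketch of how one might measure this. The helper names are hypothetical, and the usage example times the Python interpreter itself via `sys.executable` as a stand-in, since a JVM may not be installed; substituting a `java` command line is left to the reader. The 0.1 s of "actual work" is an assumed figure, not a measurement.

```python
import subprocess
import sys
import time


def cold_start_seconds(command, runs=3):
    """Best-of-N wall-clock time to launch a process and wait for it to exit."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def startup_share(startup_seconds, work_seconds):
    """Fraction of total run time spent on startup rather than on real work."""
    return startup_seconds / (startup_seconds + work_seconds)


if __name__ == "__main__":
    launch = cold_start_seconds([sys.executable, "-c", "pass"])
    print(f"interpreter launch: {launch:.3f} s")
    # With a 1 s startup (the figure from the comment) and an assumed 0.1 s of work:
    print(f"startup share: {startup_share(1.0, 0.1):.0%}")
```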

~~~
jorgeortiz85
JGit isn't meant to be a replacement for git on the command-line. It's meant
to be used inside IDEs and web servers that 1) are already running on a JVM,
2) need to interact with git, and 3) don't want to pay JNI costs. JVM start-up
times are a non-issue for these use cases.

~~~
DarkShikari
Is the cost of JNI really that high? I'm curious--I haven't used it myself,
but a client of mine is using it in a very latency-sensitive Java applet that
uses JNI to call libavcodec to decode a video stream and then immediately
display it through Java's own native libraries, so I would think the cost
isn't _that_ high.

~~~
mey
The real cost of JNI is when you crash HotSpot and take out the entire JVM
instance (say, your servlet and your whole Tomcat instance as well).

I don't like anything calling out to JNI unless it's the only sane way and
it's been heavily tested.

------
j_baker
In strictly language terms, C will always be faster than Java. However, I
would dispute that one could say on a general basis that C programs will
always be faster than Java programs.

In fact, I would argue that _all things being equal_ , the Java program is
more likely to be faster (and yes, I realize there are a metric ton of
potential caveats to what I just said). Why? Because Java allows you to focus
on the "big picture" optimizations that really make all the difference.

On the other hand, given an infinite amount of development time and
experienced developers, the C program will likely be much faster. In some
cases this is necessary. But for most cases, I personally would rather just
_ship_ something than try to squeeze every ounce of efficiency out of it.

------
jrockway
They use the term "high-level languages" and then use Java as the example.
This is going to lead to some very wrong conclusions; the problems listed in
the article are problems with Java, not problems with high-level languages.

I imagine a Git-in-Haskell would be very close in performance to the C git.
(Then why is Darcs so slow? Because it uses an icky imperative, mutable model,
whereas git uses an immutable functional model.)

------
fauigerzigerk
Of the three issues mentioned in the post, Java's lack of value types is the
important one in my view. That's what causes Java programs to use hugely more
memory than C, C++, C# or Go programs. Using more memory translates into an
orders-of-magnitude drop in performance for memory-intensive applications.

~~~
bad_user
Not only memory, but there's also a performance penalty related to
boxing/unboxing of primitives. That's why the author of that email described
how he hasn't used the standard data-structures from Java, like HashMap ...
preferring to write his own specialized implementation.
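
The memory side of boxing can be illustrated by analogy in CPython, where a `list` of ints holds pointers to separately allocated int objects, much like a Java collection of `Long`, while `array("q")` stores raw 64-bit values. This is only an analogy, not a measurement of Java, and the helper names are made up.

```python
import sys
from array import array


def boxed_bytes(values):
    """Approximate size of a list of ints: pointer slots plus one object per element."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def packed_bytes(values):
    """Size of the same values stored unboxed as signed 64-bit integers."""
    return sys.getsizeof(array("q", values))


if __name__ == "__main__":
    values = [2**40 + i for i in range(10_000)]
    print("boxed :", boxed_bytes(values), "bytes")
    print("packed:", packed_bytes(values), "bytes")
```

A specialized container that avoids per-element objects, as the email describes, is the same trade: less generality in exchange for a smaller and more cache-friendly layout.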

------
richardw
While I'd never claim Java is comparable to C, I'd like to note that the C
codebase has, by their admission, 4 years of work in it. I'd be surprised if
the JGit codebase isn't much faster in 3/4 years.

~~~
bad_user
The Git codebase has been heavily-optimized from the start. Its speed when
branching/merging was its main selling point, ever since Linus started talking
about it.

The only way JGit would ever be equal-to or faster than Git is by using better
algorithms. And somehow, I don't see this happening ;)

~~~
richardw
Nobody's claiming equal-to or faster - currently JGit is twice as slow so
there's a lot of room for improvement. Faster than now, not faster than C.

~~~
ajross
The linked article actually shows quite a few non-trivial optimizations
already being done in JGit (e.g. storing an SHA-1 in 5 ints to avoid the extra
heap block). I'd be surprised if they got much farther than they are,
honestly.
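
For readers wondering what "an SHA-1 in 5 ints" amounts to: a SHA-1 digest is 20 bytes, which is exactly five 32-bit words. The sketch below shows only the packing itself, in Python; the function names are hypothetical, and JGit's actual field layout and byte order are not shown in this thread, so big-endian is an assumption.

```python
import hashlib
import struct


def sha1_to_words(digest):
    """Split a 20-byte SHA-1 digest into five unsigned 32-bit integers (big-endian)."""
    if len(digest) != 20:
        raise ValueError("a SHA-1 digest is exactly 20 bytes")
    return struct.unpack(">5I", digest)


def words_to_sha1(words):
    """Reassemble the original 20-byte digest from its five 32-bit words."""
    return struct.pack(">5I", *words)


if __name__ == "__main__":
    digest = hashlib.sha1(b"hello").digest()
    words = sha1_to_words(digest)
    print(words)
    print(words_to_sha1(words) == digest)
```

In Java the benefit is that five `int` fields live inside the owning object, whereas a `byte[20]` is a second heap allocation with its own header.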

------
houseabsolute
One thing I have been wondering as a non-git user: is git really CPU-bound in
most operations? Maybe someone who is a regular git user can answer me this
question. If it's not, I wouldn't expect Java to matter too much to its
performance.

------
ErrantX
I'm not wholly convinced that version control is an area where microsecond
speed advantages matter. Rather versatility, portability and compatibility :)

------
axod
>> "why Java is not as fast as C"

Not actually true in reality. Modern JVMs can make on the fly optimizations
based on the runtime profile, which would need to be done by hand in C.

As said elsewhere though, Java excels when used for long running tasks -
servers - backends etc where it can optimize for the long term. It doesn't
excel when you try and start up the jvm loads of times for quick individual
jobs.

I don't understand why people are optimizing source control :/ Fast vs. fast,
meh. I don't think it's an issue really.

~~~
jerf
You're bringing theory to a shootout based on facts. Hypothesizing that Java
might be faster with a strong enough headwind doesn't make the actual, factual
JGit run faster or C-git run slower.

I mention this because programmers seem prone to this, and letting theory
trump fact is a great way to make sure that you never learn anything.

~~~
axod
I was countering the headline, which wasn't correct. Java isn't slower than C
in general. It may be in specific cases, for specific programmers, etc but
that's not what the headline stated.

The title takes one specific application in a particular niche, used in a
certain way which a particular programmer can't get to run fast enough, and
concludes that Java isn't fast enough. Faulty logic.

The argument about JGit vs C-git for me isn't something I feel you can learn
anything from.

~~~
wvenable
> Java isn't slower than C in general.

Really? In general? That statement seems worse than the original headline. Is
Java faster than C for _anything_?

~~~
axod
It completely depends. Both are silly generalizations.

Yes Java can be faster than C in some situations.

Can C ever be faster than assembly? Well, in theory no. But in practice,
sometimes.

Depends if you're a master assembly programmer who hand optimizes everything.

~~~
rbanffy
"Yes Java can be faster than C in some situations."

Unless the person writing the C implementation is grossly incompetent, I would
like to see even one example for that.

~~~
axod
You can't see a situation where the JVM can optimize things, inline code, etc
at runtime and beat a general C implementation? :/

~~~
rbanffy
I would say that any C programmer worth his salt (unlike me) knows how to
profile the code under a meaningful workload and how to optimize it to be
faster.

I remember I used the shortest possible integers and the register keyword in
the late 80's far too many times to count.

The point of somewhat higher-level languages like Java (I refuse to say Java
is a high-level language - it would be one in the late 80s, but not today) is
not to make programs run faster, but to make it easier to make them run
correctly.

