Why git is so fast (or why Java is not as fast as C) (marc.info)
139 points by ssp 2702 days ago | 107 comments



I find it interesting to read that every single one of his criticisms of JGit would not need to apply to a C# implementation. Unsigned types, unmanaged memory allocation (Marshal.Alloc), unsafe code with pointers, native P/Invoke access, value types (e.g. fixed arrays for his SHA-1 point), and an IDisposable convention for managing all the unmanaged trickery: C# has pretty much all the needed features.

When optimizing C# for space use in particular (which ends up being a time optimization for I/O heavy loads), I've leaned heavily on arrays of structs, and even bit-packed arrays (e.g. an Int32 array storing 14-bit integers, packed so that they don't align on index boundaries).
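To make that concrete, here is a minimal sketch of the packing arithmetic (class and field names are made up for illustration); it's written as Java here, but the same shifts and masks apply to a C# int[]:

    // Sketch: pack 14-bit unsigned values into an int[] so they straddle word
    // boundaries instead of wasting 18 bits per element.
    final class Packed14 {
        private static final int BITS = 14;
        private static final int MASK = (1 << BITS) - 1;
        private final int[] words;

        Packed14(int count) {
            words = new int[(count * BITS + 31) / 32];
        }

        void set(int index, int value) {
            long bit = (long) index * BITS;
            int word = (int) (bit >>> 5);      // which int holds the low bits
            int shift = (int) (bit & 31);      // offset within that int
            long window = (long) words[word] & 0xFFFFFFFFL;
            if (word + 1 < words.length) {
                window |= ((long) words[word + 1] & 0xFFFFFFFFL) << 32;
            }
            window = (window & ~((long) MASK << shift)) | ((long) (value & MASK) << shift);
            words[word] = (int) window;
            if (word + 1 < words.length) {
                words[word + 1] = (int) (window >>> 32);
            }
        }

        int get(int index) {
            long bit = (long) index * BITS;
            int word = (int) (bit >>> 5);
            int shift = (int) (bit & 31);
            long window = (long) words[word] & 0xFFFFFFFFL;
            if (word + 1 < words.length) {
                window |= ((long) words[word + 1] & 0xFFFFFFFFL) << 32;
            }
            return (int) ((window >>> shift) & MASK);
        }
    }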

Programming at the level of C in C# loses most of the benefits, of course, but at least you can hide the optimizations behind pretty APIs, and write the non-critical parts in terms of an easier-to-work-with lower layer.


I actually believe they're accessible in Java as well through JNI and JNA. But I'm hardly a Java expert.


Not at the same level of abstraction, IMO. To take a representative example, pointer arithmetic simply isn't really available to you from Java, unless you pretend your integers are pointers and go through JNI functions for indirection, etc., which is going to be slow enough to defeat the purpose.
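For completeness: the closest thing to raw pointers on a stock JVM, short of JNI, is sun.misc.Unsafe, an internal and unsupported API. A minimal sketch, assuming a HotSpot-style VM that exposes the theUnsafe field:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    // Sketch only: sun.misc.Unsafe is internal and may not exist on every VM.
    public final class RawMemoryDemo {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long base = unsafe.allocateMemory(16);   // untracked native allocation
            unsafe.putInt(base, 0xCAFEBABE);         // "pointer arithmetic" by hand
            unsafe.putInt(base + 4, 42);
            System.out.printf("%x %d%n", unsafe.getInt(base), unsafe.getInt(base + 4));
            unsafe.freeMemory(base);                 // the GC will not do this for you
        }
    }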


Good article, but it seems to ignore the elephant in the room: the JVM's cold start times.

For me, git's use case is being called from the command line, often interactively, between code editing sessions. It needs to start fast and finish fast, to not interrupt my workflow.

The JVM's cold start times are huge. On the order of a second. Noticeably slow. So slow, that in the common case of committing a few files, the JVM startup time would totally dwarf the time spent actually doing work.


The point of JGit was to provide a Git plugin for Eclipse. In this case the start-time is "outsourced" to Eclipse.

Seems like there are many other potential uses for this: other Java IDEs, web applications, integration into Java SCM tools - places where Git would be very useful and start-time can be neglected.


Run the Java app as a server.

I did this with javac, and went from 5000ms (ant) to 200ms (averages). It gets faster each time it runs: an enormous speedup after the first run, still very significant for the next ten or so, and continuing to improve at a slower rate from then on.

The client needs to be non-Java, to avoid the startup cost there. I run it as a console, so that is the "client" (wrapped in rlwrap, for filename completion/editing etc.). My workflow is to compile in a separate shell; that's why I did it this way.
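For the curious, the server half of this can be tiny. Below is a toy sketch (the port, protocol, and class name are invented for illustration): one resident JVM that runs the system Java compiler on whatever filenames a client sends, one whitespace-separated line per request.

    import java.io.*;
    import java.net.*;
    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    // Toy compile server: keeps the JVM (and the warmed-up compiler) resident,
    // so each request pays only the compile cost, not JVM start-up.
    public class CompileServer {
        public static void main(String[] args) throws IOException {
            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();  // needs a JDK, not a bare JRE
            ServerSocket server = new ServerSocket(7777);               // port is arbitrary
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String line = in.readLine();                        // e.g. "Foo.java Bar.java"
                    if (line == null) continue;
                    int rc = javac.run(null, null, null, line.trim().split("\\s+"));
                    out.println(rc == 0 ? "OK" : "FAILED " + rc);
                }
            }
        }
    }

A non-Java client can then be as simple as piping a line of filenames to that socket with netcat.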


JGit isn't meant to be a replacement for git on the command-line. It's meant to be used inside IDEs and web servers that 1) are already running on a JVM, 2) need to interact with git, and 3) don't want to pay JNI costs. JVM start-up times are a non-issue for these use cases.


Is the cost of JNI really that high? I'm curious--I haven't used it myself, but a client of mine is using it in a very latency-sensitive Java applet that uses JNI to call libavcodec to decode a video stream and then immediately display it through Java's own native libraries, so I would think the cost isn't that high.


The real cost of JNI is when you hit a HotSpot crash and take out the entire JVM instance (say, your servlet and your whole Tomcat instance along with it).

I don't like anything calling out to JNI unless it's the only sane way and it's been heavily tested.


Check out nailgun if you're interested in using the JVM as a command-line tool without a cold-start.

http://martiansoftware.com/nailgun/index.html


I was looking at this the other day and am concerned that it hasn't been updated since February, 2005. That's practically ancient in terms of software.


This looks unofficial, but Github has a version that has updates over the past few years:

http://github.com/ato/nailgun/tree/master/nailgun/


And yet it functions perfectly, with no problems, when I use it in the VimClojure plugin. TeX hasn't been "updated" in ages either, yet it's fully functional. Sometimes software is just done and no further activity is needed. Judge it on whether it works, not on the date of the last commit.


Well, as the GitHub repo posted above you attests, there are (unsurprisingly) some non-trivial bugs in the last official release.

Unmaintained code is just not something I would ever use in a production capacity or rely on for my daily work, especially considering the changes which have come to Java and the JVM since 2005.


Disclaimer: I'm very much not a Java/C# guy, so this might be totally barking up the wrong tree. :-)

Would it be a speed win to keep some JVM processes "half-started" when load and memory use allow? The next Java app to start would then use one of the already half-started JVMs.

You wouldn't need to do anything fancy to keep the JVMs around; if a half-started JVM sits unused for so many seconds that it might get swapped out, you half-start another JVM and throw away the timed-out one.

(Hmm, I don't remember: could another process take over stdin/stdout/stderr? Or you could just limit this to non-console applications.)

How much of the work time is in initializing the JVM and how much is specific for initializing the application? Could the JVM data structures (JIT etc) for an app be cached between runs?

Edit: Made a little clearer. Moved disclaimer to the top, so people know they can stop reading if I'm in the wrong forest..

Edit 2: I probably have read and unconsciously copied a similar idea that some Perl guy implemented?


Apache in pre-forking mode might have inspired you. High availability Java might be relevant.


This has always been true and will continue to be true.

Is it possible to build a high-level language that's as fast as C? Sure, but only if you restrict the programmer to the same amount of effort in both languages. If your application is one where it's worthwhile applying a great deal of extra programmer time in order to improve performance, a low-level language will always win, because it exposes more of the native machine.

The goal of a high-performance high-level language should be to provide C-like performance for a reasonably unoptimized application. Once one starts to optimize a program down to the last instruction, staying on par with a low level language becomes simply impossible for a high level language.

In my experience, every level of abstraction one creates away from the machine limits your performance by some amount. This can be demonstrated without even leaving assembly language!

Fastest possible: raw assembly code in a NASM-like assembler. With this, you can write basically any code possible, with no limitations, at the cost of extremely high programmer time.

Shortcut: Use inline assembler instead of NASM to simplify calling convention and other niceties.

Cost: There's now a whole bunch of stuff, like calling convention optimization and computed jumps, which you can no longer do.

Shortcut: Use compiler intrinsics instead of raw assembly.

Cost: You can no longer tweak your algorithm to minimize register spills because you aren't directly controlling spills anymore.

Shortcut: Use a set of macros (like my project does) for handling calling convention, MMX/SSE abstraction, and other such simplifications.

Cost: You tend to overlook optimizations that apply to one possible output of the abstraction and not others, resulting in either messes of ifdefs or suboptimal code--the former of which is of course violating the abstraction.

Shortcut: Use a framework like liboil to write SIMD assembly instead of native code.

Cost: By using generic SIMD operators, you lose access to specialized architecture-specific operations, along with the aforementioned issue of register spills.

Here we haven't even gotten beyond assembler and we're already losing performance. Now scale this up to C and beyond: abstraction inherently comes at a performance cost. It isn't even merely a function of language: abstractions within a language reduce performance as well.


Of course, managed languages can buy you other things, such as inlining indirect method calls by rewriting the call site (polymorphic inline caching), dynamic profile-guided optimization, not to mention the fact that garbage collection is often faster than manual deallocation, depending on the nature of the problem and the pattern of allocations. (Of course, design of managed applications can benefit greatly from awareness of how the GC works).

WRT SIMD operators, I'll point you to Mono.Simd.


I found that I had to read your comment carefully to realize that I agreed with you, because I'd already flipped the bozo bit three words in when you unnecessarily used parochial Microsoft nomenclature.


What word would you recommend I use to cover everything from Ruby to Java?

FWIW, I think you'll make the same mistake pretty often if you continue to think of folks who play in the MS pond as bozos. There are reasons MS is a leader in many markets.


Anything but 'managed' -- while conceptually it covers things quite nicely, in practice it's used almost exclusively to mean "runs on the CLR", by those living solely within the Microsoft ecosystem whose knowledge of programming languages comes from Microsoft marketing materials. It's a loaded term.

There's plenty of nomenclature abuse in that ecosystem, like using 'assemblies' to refer to packages or 'blittable' for SHM.

'Garbage Collected' (or GCed) works fine instead of 'managed', especially since 'runs on a VM' is pretty nebulous, and the whole C++/CLI thing adds some confusion.


I don't use it exclusively to mean CLR-hosted, and to be frank, I think the meaning is pretty clear outside of MS contexts. I believe "GCed" focuses too much on the memory allocation strategy, which, while important to many of the practical advantages of managed languages, makes the term lopsided.

Finally, more people using a word in an alternative way is the path to controlling its meaning, rather than avoiding the word altogether, cf "queer" and the like.


What meaning does "managed" have to you? I wasn't aware of a generic meaning. A virtual-machine-based language that happens to include garbage collection?


Generally it means type-safe and memory-safe, at least by default; is not usually precompiled (in practice precompilation is avoided for linking flexibility); and has a runtime which is at least nominally independent of the programs that run on it, e.g. that would need separate installation if not packaged as part of a larger environment.

The combination of type and memory safety usually requires a garbage collector, though it's not strictly necessary.


> Anything but 'managed' -- while conceptually it covers things quite nicely, in practice it's used almost exclusively to mean "runs on the CLR"

Evidence? I know at least one Mozilla dev who calls JavaScript "managed". I call such languages "managed" too, and I write much more non-CLR code than CLR code.


Couldn't agree more. One thing to remember though: "Premature optimization is the root of all evil." C Git is optimized and fast. Mercurial, with a combination of Python and C, is almost as fast. Maybe not quite as fast, but fast enough for me. A lot of the time those optimizations just don't pay off in terms of programmer time.


Slightly tangential, but I think "Premature optimization is the root of all evil." is the most misquoted statement in the history of computer science. Most importantly, people tend to misinterpret "premature" to mean "any time before the software is done" or similar.

I can't even count the number of times where I've worked on or watched a project where performance is important and the following series of events occurs:

1. Module is designed with that quote in mind: it must be simple to implement and performance is unimportant. Note that this isn't a prototype: this is a plan for the actual final product.

2. Module is built and finished, then submitted for a review. The reviewers, knowing the importance of performance for this module, point out a number of ways that the module is extremely suboptimal.

3. Because of the assumptions so heavily ingrained into the module, implementing these performance improvements--which could have been foreseen as necessary far earlier--requires a near-complete rewrite of a large part of the module.

It's exactly as if we finished our software and our customer decided that his requirements were actually totally different--except in this situation, it's entirely our fault, because performance was a requirement and we ignored it because we didn't want to optimize "prematurely". If you want performance, you have to design with performance in mind; if you don't, and then decide later you want performance, the cost of your earlier decision is magnified a hundredfold.

I have almost never seen that quote used correctly.


Well, the fact that it says that something "premature" is causing a problem seems to me to mean that it's actually a tautology. It's like saying kids are immature; but of course, what is more of the nature of an immature person than a child?

So yes, optimizations that are made before they should have been made, should not be made. It tells us little about optimizations other than that there is a time for which they are premature, and doing them at that time is premature.


I think the important bit is "the root of all evil": it also tells us that doing optimisations prematurely has the potential to destroy your project. Compare with something like, "premature optimisation will take a little longer to perform, and perhaps be not quite as optimal, but isn't that big a deal in the long run, really."


I always thought that what that quote meant was: measure, then optimize. It's no use tuning the hell out of a function that hardly features in your profiler results.


Agreed, though it would be clearer if the quote went, "Uninformed optimization..." or "Doing optimization without measuring first...".


That, and also optimizing code that later gets thrown out entirely because requirements or other factors change. There's no benefit to optimizing code that goes in the bit bucket.


Yes, or even just figure out HOW (you're going) to measure it.


You have a good point, but I don't think I asserted that any of those points are correct. I mostly agree with JulianMorrison on this one, although I think sometimes you can see performance bottlenecks coming. If you know an inner loop needs to be faster than your interpreted language can normally do, you might as well write that segment in C or ASM from the beginning. Mercurial still seems to me like a good example of optimization done correctly.


Wow, did you ever hit the nail on the head. I've been saying this for a long time (edit: e.g. http://news.ycombinator.com/item?id=96186, http://news.ycombinator.com/item?id=173949), wondering why I'd never run across anybody making what seems an obvious point. At last I have!

The important decisions around performance have to do with design, not optimization. Knuth's original principle was quite explicit: forget about most small efficiencies. But people commonly use it to justify not thinking about large ones. "Don't micro-optimize prematurely" somehow got changed into "don't think about performance until the end of the project".


Maybe someone should come up with a clearer statement, "Don't optimize beyond need", for example. At least it's suggesting that there can be a need for optimization, and that too much optimization can be a problem.


There is an intermediate state where you make it "just work" but design it so that it's easy to speed up the algorithms later. Premature optimization makes you waste time on dumb stuff, but leaving yourself a way out is a lot less costly than optimizing everything and lets you avoid the problem you mentioned.


That is why the rewording I use is "First make it right, then make it fast". If you make it fast first, then that is premature optimization.


But that's just saying the same thing. The point being made above is that working "correctly" simply isn't enough if performance is a design goal too (or rather, correct operation requires that the software meet the performance goals at design time). So if you start out working on only half the problem, you'll still fail.


My experience has shown me that it is very difficult to know where the performance bottlenecks are without measurement. You can't measure it if you haven't built it, and if it is not correct, why bother?

I have always taken this to mean what the original "premature optimization" quote is all about. I felt that making the statement a little simpler helps my focus.


...Maybe not quite as fast, but fast enough for me...

Given how fast computers are these days, the powerful standard libraries high level languages give you and the fact that bugs per line is constant regardless of language - high level languages start to look pretty good. Programs don't need to be fast as possible, they just need to be fast enough.


"Computers these days" have always been fast; if that were an argument for anything, performance would not matter at all anymore.

If computing power is fungible, then a 10% performance improvement translates into a 10% saving on the hardware budget. This matters for data centers, and it matters for embedded systems where using a slightly cheaper component in a million devices might mean saving a million dollars.

I think the fallacy of the fast, modern computer comes from the fact that PCs are not very fungible at all; they come in discrete performance steps. And since your application is pretty unlikely to be the most demanding one, it will probably run just fine even without much optimization.


> the fact that bugs per line is constant regardless of language

"fact"?

More like an observation that must be re-tested and re-established with each change in tooling.

Is Java bugs per line the same for IntelliJ IDEA and Emacs?

(And of course let's remember - bugs per line when? During development? In product?)


That is what the study said. It holds constant across languages and tools.

Now, if I could find a link for you, that'd be awesome. My google-fu is weak today.


Another thing to keep in mind before optimizing is the 80/20 rule. Usually only a small part of the program (20%) needs optimizing.

Of course, one needs actual performance numbers (via profiling, for example) before knowing what to optimize. There is nothing worse than faith-based performance assumptions. I have seen months being wasted rewriting code in a lower-level language to "make things faster", only to find out that it didn't make a difference: the code was disk- or network-bound, not CPU-bound.


This completely misses the point. The reason JGit cannot compete with Git is not that it is using an extra cycle or two to make function calls. The problem is that Java cripples you in the sort of data representations you can use -- only a limited form of mmap (a MappedByteBuffer you can never explicitly unmap), no value types -- forcing lots of extra copying. There's no reason a language can't provide these features, and I don't know that they require significant extra programmer effort. C# does it, etc.
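For precision: Java has exposed memory mapping via NIO since 1.4, but with real constraints; a single MappedByteBuffer is limited to 2 GB, and there is no supported way to unmap it explicitly. A rough sketch of what reading a mapped file looks like (the file path here is made up):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Map a region of a file read-only. Note the limitations: the mapping is
    // capped at Integer.MAX_VALUE bytes per buffer, and it is only released
    // when the buffer is eventually garbage collected.
    public class MmapDemo {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("objects/pack/pack-example.pack", "r");
            FileChannel ch = raf.getChannel();
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int firstWord = buf.getInt(0);   // read without copying into a byte[]
            System.out.printf("first word: %08x%n", firstWord);
            ch.close();
            raf.close();
        }
    }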


From the last line:

"But, JGit performs reasonably well; well enough that we use internally at Google as a git server."

Rings the 'good enough' bell to me.


I'm actually a bit curious as to why they're using JGit as a git server. Is there something lacking in git-daemon, etc.?


Most likely they can drop it into their existing systems for deploying Java apps.


As well as use it as a library API, and integrate it with their other libraries/services/etc.


Indeed. Git may be fast, but if you ever want to wrap a reliable program around it, good luck.


They use it in Gerrit — http://code.google.com/p/gerrit/ — which is an open-source code review system built to mimic Google Mondrian, but which uses Git rather than Perforce. Shawn Pearce, the author of the linked email, is also the primary maintainer of Gerrit.

-sq


Agree. If your bottleneck is your source control tool, you're doing something wrong.


Speed changes your workflow. With slow or fussy source control, you're more likely to do "branches" by just not checking in your changes for awhile. With near-instant source control, you don't feel burdened by making a feature branch, switching, editing and committing a few times, and merging back.


True, I don't really like using branching/merging as a workflow unless absolutely necessary. So sure, if you're constantly branching and merging all day, then I can see it being a problem. But I'm not convinced branching+merging is really good for your productivity.

The other point is that with a distributed system like git, all of the load is on your local machine. With a centralized system like Subversion, the server could do the heavy lifting (whether this happens in practice is another matter). My point is, if you want lightning-fast rev control for some reason, probably better to not use distributed.


The flip side is that, with most VC systems, queries about changes or repository history have network latency added. The feedback is near-instant from distributed systems, which keep their info locally. I find it makes a tremendous difference when, say, bisecting an error.


But I'm not convinced branching+merging is really good for your productivity.

Because branching and merging are slow? I can see this argument getting stuck in something of a rut.


No. I prefer to keep things simple, always make sure the source builds, and just develop linearly where possible. I think it makes for cleaner workflow.


The advantage of having lots of small branches is that you isolate work. You never have the problem of wanting to fix a bug or add some small feature, but having the code all in a mess halfway through a major reorganization.

Also it's a lot easier to try something experimental on a branch. If it works, merge it in. If it fails, abandon the branch.

Git makes branching and merging fast enough that using them as the default workflow makes sense.


Yeah I just don't work that way. I work linearly :/


What type of work do you do that you never have urgent bug fixes and/or never have features that take more than a day to complete?

Sure it's optimal to work linearly, but in the real world things come up. Now if you are looking for a reason to tell stakeholders to suck it up and wait, then poor branching support may help you, but that's a technical solution to a managerial problem. If you're interested in having the most flexibility to solve real world problems, then git's power is indispensable. Hell, you can work linearly all the time, and retroactively convert anything into a branch. Even if you don't need branches. Even if you are the sole developer.

Sorry if that's a little harsh, but statements like "My point is, if you want lightning-fast rev control for some reason, probably better to not use distributed." smack of willful ignorance. Go use git for a while and then come back and try to say that with a straight face.


I have many features that take more than a day to complete, but I prefer to do them in such a way that it doesn't break other stuff. That's just the way I prefer to work. If I'm working through a big arch change, I'll do it in sections, each time making sure everything builds and works and is sane. I checkin often, once I hit something that builds+works.

That's just the way I work.

Similarly, I don't use threads in programs unless I absolutely categorically have to. It's a similar sort of thing. Call me crazy :/

I did use git for a bit, and thought 'meh'. It didn't solve any real world problem I had.


What type of work do you do that you never have urgent bug fixes and/or never have features that take more than a day to complete?

I've always worked to constrain non-linear development to the absolute minimum required because the technical costs are easily dwarfed by the human communication overhead and inherent organizational complexity engendered in multiple disparate branches of development.


This still feels like the sort of statement that comes from having not worked with a (D)VCS that does branching well. There is no "inherent organizational complexity" in me having a private branch that I share with no one and which is only used for an hour or two while I work on a critical bug fix.

What you're talking about seems to be multiple, long-running development or release branches. Those are hard to manage and generally a bad idea in any VCS. But, with a system like Git, you don't tend to have long-lived divergent development branches. Typical branch lifetimes are more like hours or days, rather than weeks or months. There are exceptions, but by-and-large, branches are just used differently in git.


This still feels like the sort of statement that comes from having not worked with a (D)VCS that does branching well.

I've worked with both git and hg.

There is no "inherent organizational complexity" in me having a private branch that I share with no one and which is only used for an hour or two while I work on a critical bug fix.

What's the value of this private branch beyond simply committing to the actual branch?

If it's just a bug fix, how big can it be?

If it's more than a bug fix, why am I hiding this code from the team by implementing it on a private, local, non-backed up, non-code-reviewed, non-centralized branch?

Typical branch lifetimes are more like hours or days, rather than weeks or months. There are exceptions, but by-and-large, branches are just used differently in git.

Why wouldn't I just commit this work incrementally to the actual upstream branch, rather than hiding it for "hours or days" from the rest of the team?


Reading this comment and your other comments in this discussion, I think you have some issues which have nothing to do with what version control system is being used. While you keep talking about "hiding code" and "hindering communication" and "cowboy coding", I think about keeping the history of our codebase well-organized so that you can understand the evolution of a single feature, so that commits are logically ordered, so that unrelated changes don't get lumped together because it was the easy thing to do.

To quickly comment on a couple of your other concerns, my local hard-drive is backed up, so that's irrelevant; and I personally feel that code review at the level of individual commits has very little value, and you should instead be reviewing complete feature implementations.


Of course, but that's irrelevant to the utility of git branches. In subversion, yes, you don't go through the hassle of creating a branch for all these reasons. In git though, branches are most commonly used within one developer's workflow. The vast majority of branches are never seen by more than one developer, they are simply an organizational tool to be used at your discretion without imposing any overhead on anyone unless you have good reason to.


In subversion, yes, you don't go through the hassle of creating a branch for all these reasons.

    svn cp ^/trunk ^/branches/tentonova-bugfix-x

    svn co ^/branches/tentonova-bugfix-x ~/branch

I wouldn't call this a technical "hassle", and I'm not sure what organizational issues would arise here.

The vast majority of branches are never seen by more than one developer, they are simply an organizational tool to be used at your discretion without imposing any overhead on anyone unless you have good reason to.

Hiding your development branches on a shared codebase often incurs either communication overhead, or the costs of lack of communication.


I wouldn't call this a technical "hassle", and I'm not sure what organizational issues would arise here.

Merging.

Hiding your development branches on a shared codebase often incurs either communication overhead, or the costs of lack of communication.

If you're using the limitations of your VCS to manage team communication then you have bigger problems. The obvious analog to the "problem" you mention is people not checking in code because it's not ready yet. Maybe you think this is better because you only want deployable software in your main branch, but for large features that makes the history opaque and leaves your developer effectively without any of the benefits of version control while they are working on the large feature.

Frankly, a lot of the arguments against DVCS smack of the same sort of ignorance that the Java zealots were leveraging against Ruby back when Rails started picking up steam in 2005/2006. There's this fear that powerful features will lead to chaos and are in effect too powerful to be used safely. And the reality is that yes, in environments where truly incompetent programmers work, there's definitely a strong argument to be made for limiting the damage they can do. But I think the past few years have borne out the fact that mediocre and merely-competent programmers can make strong use of these tools without leading to disaster.


Merging.

Subversion 1.5, released in June of 2008, supports merge tracking.

If you're using the limitations of your VCS to manage team communication then you have bigger problems.

A simple but sufficiently powerful solution leads to simplified communication. If you're using the complexity of your VCS to hinder team communication and support cowboy coding, then you have bigger problems.

Frankly, a lot of the arguments against DVCS smack of the same sort of ignorance that the Java zealots were leveraging against Ruby back when Rails started picking up steam in 2005/2006 ... But I think the past few years have borne out the fact that mediocre and merely-competent programmers can make strong use of these tools without leading to disaster.

Nobody (intelligent) said there'd be disaster because of the "powerful features", just that operating in that manner would be more expensive than the much simpler alternatives.

Expending more effort with more powerful tools isn't actually an improvement, it's just busy-work -- constantly working on your muscle car instead of driving it.


If you're using the complexity of your VCS to hinder team communication and support cowboy coding, then you have bigger problems.

This is a blub attitude plain and simple.


We do that, and I will say that svnmerge.py doesn't work very well with older versions of svn. A common failure mode: when you merge from trunk to rebase a dev branch and hit a conflict, your edits to resolve it are automatically ignored and not reflected back to trunk, and any overlapping edits you make later will eventually cause new conflicts when you finally merge to trunk.

Apparently the new svn:mergeinfo property helps, but we haven't migrated at work so I don't know how well.


Anything that is part of the programming routine should be severely optimized.

Every little bit not only adds up, it often multiplies in big enough projects.

In one of our big products, we have several build steps that touch the source control system. A 10-second VCS delay x 6 build steps, a 1-second slowdown from a lazily written C++ header x 300 files, etc., and before you know it something that should take 10-15 minutes for a clean build now takes ~1 hour. True story. :(

Luckily, enough of us got so annoyed about this that it got fixed after a few weeks of dedicated effort. We now have a constant conscious effort to keep build times down.


As Deestan notes, delays have direct multipliers. Beyond that, delays often have indirect multipliers that are even worse.

XKCD "Compiling" http://xkcd.com/303/ is funny because we have all done it. If there is a noticeable lag in a development step (VCS operations, compiling, running the program, etc.), it is very likely that the developer will lose concentration and switch to reading email, browsing HN, reading /., BSing with co-workers, etc.

This takes the direct delays and turns them into exponential multipliers. The result is that 30 seconds of lag has a high risk of becoming 30 minutes of wasted time.


Thank you for reminding me about compilation time.

Now I won't be tempted to go back from scripting language country for a few more years... :-)

Edit: I should add that I have very fond memories about my youth and C, too.


Most sane projects I've experienced do "builds" even with scripting languages. It's every bit as much of an annoyance to have a bunch of needlessly slow tests as it is to deal with slow compile times.


OK, a point; the pain with compiled languages is the compile cycle where you edit and run just the local tests, not the whole system.


C++ compiles much slower than many other popular compiled languages, and the more of its "fancy" features you use the slower it compiles.

C and Java, for example, both compile much faster than C++.


Discussing bottlenecks oversimplifies; it assumes everything goes down one pipeline. Small changes can make big differences. If your tool feels a little bit snappier, you might make two commits instead of one, and then six months down the line someone saves half an hour tracking down a bug because the blame messages are more relevant.


In strictly language terms, C will always be faster than Java. However, I would dispute that one could say on a general basis that C programs will always be faster than Java programs.

In fact, I would argue that all things being equal, the Java program is more likely to be faster (and yes, I realize there are a metric ton of potential caveats to what I just said). Why? Because Java allows you to focus on the "big picture" optimizations that really make all the difference.

On the other hand, given an infinite amount of development time and experienced developers, the C program will likely be much faster. In some cases this is necessary. But for most cases, I personally would rather just ship something than try to squeeze every ounce of efficiency out of it.


Of the three issues mentioned in the post Java's lack of value types is the important one in my view. That's what causes Java programs to use hugely more memory than C, C++, C# or Go programs. Using more memory translates into an orders of magnitude drop in performance for memory intensive applications.


Not only memory, but there's also a performance penalty related to boxing/unboxing of primitives. That's why the author of that email described how he hasn't used the standard data-structures from Java, like HashMap ... preferring to write his own specialized implementation.
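As a rough illustration of the boxing cost (this is not JGit's code, just the general shape of the problem): every key and value in a HashMap<Integer, Integer> is a separate heap object with its own header, while a hand-rolled primitive structure keeps the same data in flat arrays.

    import java.util.HashMap;
    import java.util.Map;

    public class BoxingDemo {
        public static void main(String[] args) {
            // Boxed: each put allocates Integer objects plus a HashMap.Entry per key.
            Map<Integer, Integer> boxed = new HashMap<Integer, Integer>();
            for (int i = 0; i < 1000000; i++) {
                boxed.put(i, i * 2);
            }

            // Unboxed: the same data as one flat int[], no per-entry objects.
            // (Open addressing, resizing, etc. omitted; this shape only works
            // for dense keys, which is why specialized maps get written.)
            int[] values = new int[1000000];
            for (int i = 0; i < values.length; i++) {
                values[i] = i * 2;
            }
            System.out.println(boxed.get(123) + " vs " + values[123]);
        }
    }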


They use the word "high-level languages" and then use Java as the example. This is going to lead to some very wrong conclusions; the problems listed in the article are problems with Java, not problems with high-level languages.

I imagine a Git-in-Haskell would be very close in performance to the C git. (Then why is Darcs so slow? Because it uses an icky imperative, mutable model, whereas git uses an immutable, functional model.)


While I'd never claim Java is comparable to C, I'd like to note that the C codebase has, by their admission, 4 years of work in it. I'd be surprised if the JGit codebase isn't much faster in 3/4 years.


The Git codebase has been heavily-optimized from the start. Its speed when branching/merging was its main selling point, ever since Linus started talking about it.

The only way JGit would ever be equal-to or faster than Git is by using better algorithms. And somehow, I don't see this happening ;)


Nobody's claiming equal-to or faster - currently JGit is twice as slow so there's a lot of room for improvement. Faster than now, not faster than C.


The linked article actually shows quite a few non-trivial optimizations already being done in JGit (e.g. storing an SHA-1 in 5 ints to avoid the extra heap block). I'd be surprised if they got much farther than they are, honestly.
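The SHA-1 trick is easy to picture: a byte[20] is a separate heap object with its own header and alignment padding, so folding the 160 bits into five int fields lets an object id live inline in its owner. A rough sketch of the idea (not JGit's actual class):

    // Fold a 20-byte SHA-1 into five ints so an object id can sit in its
    // owner's fields instead of as an extra byte[20] block on the heap.
    public final class ObjectIdSketch {
        final int w1, w2, w3, w4, w5;

        ObjectIdSketch(byte[] sha1) {
            w1 = word(sha1, 0);
            w2 = word(sha1, 4);
            w3 = word(sha1, 8);
            w4 = word(sha1, 12);
            w5 = word(sha1, 16);
        }

        private static int word(byte[] b, int off) {
            return ((b[off] & 0xff) << 24)
                 | ((b[off + 1] & 0xff) << 16)
                 | ((b[off + 2] & 0xff) << 8)
                 |  (b[off + 3] & 0xff);
        }

        boolean sameAs(ObjectIdSketch o) {
            // Five int compares, no per-byte array bounds checks.
            return w1 == o.w1 && w2 == o.w2 && w3 == o.w3 && w4 == o.w4 && w5 == o.w5;
        }
    }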


One thing I have been wondering as a non-git user: is git really CPU-bound in most operations? Maybe someone who is a regular git user can answer me this question. If it's not, I wouldn't expect Java to matter too much to its performance.


I'm not wholly convinced that version control is an area where microsecond speed advantages matter. Rather versatility, portability and compatibility :)


>> "why Java is not as fast as C"

Not actually true in reality. Modern JVMs can make on the fly optimizations based on the runtime profile, which would need to be done by hand in C.

As said elsewhere though, Java excels when used for long running tasks - servers - backends etc where it can optimize for the long term. It doesn't excel when you try and start up the jvm loads of times for quick individual jobs.

I don't understand why people are optimizing source control :/ Fast vs. fast... meh, I don't think it's really an issue.


You're bringing theory to a shootout based on facts. Hypothesizing that Java might be faster with a strong enough headwind doesn't make the actual, factual JGit run faster or C-git run slower.

I mention this because programmers seem prone to this, and letting theory trump fact is a great way to make sure that you never learn anything.


It's a pretty specific shootout, of course; the reasons brought up for JGit's relative slowness are not issues present in most applications, they're specific to certain kinds of systems. The lack of unsigned number support in particular is a major headache in Java when doing low-level work.

I think it's also important to note that the slowness is relative to an incredibly fast C implementation - taking twice the time sounds painful, but in reality the pauses caused by Git are barely perceptible as-is, and double of that is not likely to be very annoying.


I was countering the headline, which wasn't correct. Java isn't slower than C in general. It may be in specific cases, for specific programmers, etc but that's not what the headline stated.

The title takes one specific application in a particular niche, used in a certain way which a particular programmer can't get to run fast enough, and concludes that Java isn't fast enough. Faulty logic.

The argument about JGit vs C-git for me isn't something I feel you can learn anything from.


> Java isn't slower than C in general.

Really? In general? That statement seems worse than the original headline. Is Java faster than C for anything?


It completely depends. Both are silly generalizations.

Yes Java can be faster than C in some situations.

Can C ever be faster than assembly? Well, in theory no. But in practice, sometimes.

Depends if you're a master assembly programmer who hand optimizes everything.


"Yes Java can be faster than C in some situations."

Unless the person writing the C implementation is grossly incompetent, I would like to see even one example for that.


You can't see a situation where the JVM can optimize things, inline code, etc at runtime and beat a general C implementation? :/


I would say that any C programmer worth his salt (unlike me) knows how to profile the code under a meaningful workload and how to optimize it to be faster.

I remember I used the shortest possible integers and the register keyword in the late 80's far too many times to count.

The point of somewhat higher-level languages like Java (I refuse to say Java is a high-level language - it would be one in the late 80s, but not today) is not to make programs run faster, but to make it easier to make them run correctly.


A Java application will invariably use much more memory than a C implementation of almost any non-trivial application. You can inline code all you want, but it'll never change that. The JVM isn't magical -- it's not going to solve all your performance problems. When you're allocating a million tiny little objects with all the necessary accounting, you can't optimize that away.

BTW, if it's so obvious why haven't you still provided at least one non-trivial concrete example of Java faster than C?


The post specifically mentions Java's implementation of memory-mapped files, Java's lack of value types and Java's lack of unsigned types. These are very different areas, but none of them can be solved by on-the-fly optimizations or longer-running processes.
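The unsigned-types point has a standard, if noisy, workaround in Java: mask on every read and compare through a wider type. A small sketch of what that looks like for unsigned bytes and 32-bit values:

    public class UnsignedDemo {
        public static void main(String[] args) {
            byte b = (byte) 0xC8;             // 200 as an unsigned byte
            int asUnsigned = b & 0xFF;        // mask, or Java treats it as -56
            System.out.println(b + " -> " + asUnsigned);

            // Comparing two "uint32" values stored in ints: widen to long first.
            int a = 0x80000000;               // 2147483648 unsigned, negative as an int
            int c = 1;
            boolean aGreater = (a & 0xFFFFFFFFL) > (c & 0xFFFFFFFFL);
            System.out.println("unsigned a > c: " + aGreater);
        }
    }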


WRT value types, shouldn't it be partially solved by escape analysis? Cf. http://weblogs.java.net/blog/forax/archive/2009/10/06/jdk7-d...

Not that it would matter much in this case anyway, I guess.


Yes, I think you're right that it should help a little in some cases. But the big issue with the lack of value types is memory usage (not in the case of JGit), which isn't solved by escape analysis.


'Modern JVMs can make on the fly optimizations based on the runtime profile'

That's something I've heard ever since hotspot came out, but it's not a silver bullet to beating C/C++ etc. Here are some server tests. Some are comparable, some are definitely not: http://shootout.alioth.debian.org/u64q/java.php


"Sometimes no-warmup is important (see comments above about pacemakers), but more often a short warmup period is irrelevant to the overall application. If I'm using an IDE, I expect a largish loading period... but then I'm using the IDE all day. I don't use an IDE for 0.1 sec. If warmup is NOT important to the application, then allow the JVM a warmup period before comparing performance numbers. Many of the benchmarks in the language shootout at http://shootout.alioth.debian.org fall into this camp: too short to measure something interesting. Nearly all complete in 1 minute or less." http://blogs.azulsystems.com/cliff/2009/09/java-vs-c-perform...


The replies I post never seem to appear on that blog so here they are:

http://www.reddit.com/r/programming/comments/9ic0x/java_vs_c...

And measurements:

http://shootout.alioth.debian.org/help.php#java


What you link to are tests using the server VM but without warmup. You need to select "Java 6 steady-state" to get the warmup.

[Edit] Actually the steady-state one averages over multiple restarts, including JITed and non-JITed runs.


See how little difference that "warmup" makes for these tiny programs, with the obvious exception of the binary-trees program:

http://shootout.alioth.debian.org/help.php#java



