At the same time, things that are expensive today may not be so in the near future. Haswell is supposed to make uncontended synchronization operations almost free. It'll make a lot of algorithms, particularly lock-free algorithms, much more practical than they are today. For example, C++ smart pointers are slow in multi-threaded situations because the reference count manipulation needs to be done under a lock. But the lock is almost never needed (except when it is). Cheap synchronization should make it much more practical to use reference counting more pervasively in C++ programs.
Several architectures support atomic increment and decrement operations, so there's no need for any lock.
Libraries like OpenSceneGraph use them for their reference counting.
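For illustration, a minimal sketch of what such atomic reference counting can look like with C++11 atomics (RefCounted is a hypothetical class here, not OpenSceneGraph's actual API):

    #include <atomic>

    class RefCounted {
        std::atomic<int> refs{1};
    public:
        void acquire() {
            // Relaxed is enough for the increment: nothing is ordered
            // against it, we only need to keep the object alive.
            refs.fetch_add(1, std::memory_order_relaxed);
        }
        void release() {
            // Acquire/release pairing so the deleting thread sees all
            // writes made before the last reference was dropped.
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete this;
        }
    protected:
        virtual ~RefCounted() = default;
    };

No lock anywhere: both operations compile down to the atomic increment/decrement instructions mentioned above.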
Also, Haskell isn't magical in this regard: its lock-free implementations will also presumably just use safe atomic operations.
When I said it was a "locked memory transaction of some sort" I was including the effect of a serializing instruction which prevents concurrent load/store operations even if they hit the cache. I wasn't trying to imply a full round trip to DRAM.
So relying heavily on them in multithreaded cases with heavy contention can actually reduce performance over a more complex lockless scheme.
Go to http://stackoverflow.com/questions, sort by votes, and read from the top if you like.
This particular post is a good example: I have seen this interesting SO tidbit before but learned a few things from the HN discussion.
Sometimes I wonder how old 'news' can be and still climb to the top of HN. Is there a search engine somewhere which can tell the author if something was on HN already? Or, if it is about good websites and not about news, why has nobody posted a link to google.com during the last weeks? ;-)
What value is there to the people who are finding this content interesting today in rejecting it because some other subset of the HN community saw it at some arbitrarily distant time in the past?
Reposted interesting technical content beats the hell out of Boring industry navel-gazing news story #82795 and Android vs Apple "news" intended to provoke a fanboy flamewar thread #762483101.
If you're spending so much time on HN that you not only recognize reposts but see them so often that it bothers you enough to complain in the comments, consider doing something more productive with your time.
This argument (that the VM knew best) was put forward by Bill Joy early on in the development of Java as the reason Java would eventually be faster than compiled C code. We discovered that in practice 'friction' between hot-spot compilation and instruction set architectures nearly always dominated this equation (leaving Java code slower).
Architectures where the friction was reduced (like the Jazelle instruction set on ARM) could swing the balance the other way.
Friction in this case is the cost to performance of the difference in expressibility between a bytecode instruction and the actual processor architecture.
A good example is iteration over non-trivial, multi-dimensional structures: if you iterate over a single dimension, some variables become actual constants, but this can only be known once you know the iterator parameters (i.e. even profile-guided optimization would not work).
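A rough sketch of the point, with hypothetical functions (none of this is from the post): once the runtime sees that the iteration is over a single dimension, the stride is effectively a constant, and the second loop below is what a JIT could specialize the first one into:

    #include <cstddef>

    // Generic strided sum: stride is unknown at compile time, so an
    // ahead-of-time compiler must keep the multiply and cannot
    // vectorize as aggressively.
    double sum_strided(const double *a, std::size_t n, std::size_t stride) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; i++)
            s += a[i * stride];
        return s;
    }

    // What a JIT could emit after observing stride == 1 at runtime:
    // unit stride, trivially vectorizable.
    double sum_unit(const double *a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }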
You could obviously implement a runtime-based optimizer in C, but then the assertion that you can never beat C is a tautology.
Another possibility is that with a higher-level language it is possible to write more complicated code, with smarter optimizations, because the programmer doesn't have to take care of all the details. For example, the PyPy project has a Python interpreter written in a combination of RPython and C that is faster than the standard Python interpreter, which is written directly in C.
You can also have a JIT compiler remove branches altogether based on observed runtime data, something an ahead-of-time C compiler cannot do.
Don't point at one thing and claim it makes a language class faster or slower.
However, that's not to say that upfront compilation is doomed never to compete.
Profiling C compilers can mimic this behaviour to an extent, although they can only profile the situations they experience during their training phase, so they can't adapt to unforeseen scenarios or dramatic shifts in the input stream.
C compilers could be extended to implement several/many variants of each function based on different possible code paths and, using information gained from on-the-fly timings, switch between the variants in order to optimise execution. Combinatorics would place a low limit on the percentage of possible variants that could be implemented, though.
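As a rough sketch of that idea (hypothetical names; the timing and switching machinery is elided):

    #include <cstddef>

    // Variant that is fast when the branch is predictable.
    static long sum_branchy(const int *d, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; i++)
            if (d[i] >= 128) s += d[i];
        return s;
    }

    // Variant that is fast when the branch is unpredictable.
    static long sum_branchless(const int *d, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; i++)
            s += (d[i] >= 128) ? d[i] : 0;
        return s;
    }

    // The runtime would time both on real input now and then, and
    // point this at whichever variant currently wins.
    static long (*sum_impl)(const int *, std::size_t) = sum_branchy;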
Finally, no-one ever said that C has to be a compiled language. There's nothing stopping someone writing a JIT C interpreter and getting the best of both worlds. (Ugh!)
C++/CLI probably comes close to "C with JIT"
There is nothing that stops you from JIT'ing C/C++.
LLVM includes a JIT that functions just fine on C/C++ programs.
Both the C compiler and the JVM JIT target machine code. I don't buy that one is automatically & universally superior to the other. You can equally argue that the JIT will always be superior because it's run-time aware.
But it's entirely another question whether you, as a mere mortal human, are able to do a better job at producing assembly code in practice, for non-trivial programs, than a compiler.
It's pretty much demonstrated by the update to the piece -- certain compilers & compiler flags did produce the right optimization, whereas the original author did not.
Consider a benchmark implementing a common situation in real programs -- a performance-intensive loop calling some code located in a plugin that isn't known until runtime. The JVM will wipe the floor with a C compiler in that situation, since the JVM can do inlining at runtime across module boundaries, while a C compiler cannot do inlining at runtime at all.
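A minimal sketch of that scenario (hypothetical types; assume the function pointer was resolved at runtime, e.g. via dlopen/dlsym):

    #include <cstddef>

    using plugin_fn = int (*)(int);

    // The callee behind 'f' lives in a plugin the compiler never saw,
    // so no ahead-of-time inlining is possible; every iteration pays
    // for an indirect call. A JIT that sees the concrete target at
    // runtime can inline it into the loop.
    long sum_via_plugin(const int *data, std::size_t n, plugin_fn f) {
        long sum = 0;
        for (std::size_t i = 0; i < n; i++)
            sum += f(data[i]);
        return sum;
    }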
Or, when memory management is involved, they always want to test situations where everything can be easily pool-allocated, instead of complex situations involving lots of dynamic allocation. But in situations where code is really allocating lots of short-lived objects (e.g. functional code for optimizing compiler trees), the GC in a JVM is going to wipe the floor with any malloc() implementation.
See "branch misprediction" vs. "fetch from main memory" on norvig's rough timings list: http://norvig.com/21-days.html#answers .
Even an arena allocator?
Specifically, for arena allocation to work, you have to have points at which you know all objects in an arena will become garbage. There are many situations where you know that most objects will quickly become garbage (say 98%), but can only guarantee at a very large scope that all objects will become garbage.
Consider something like a compiler that optimizes a tree by replacing sets of nodes with simpler ones. Where do you put the pool delete operation? You know lots of nodes will become garbage after each optimization pass, but you know some will survive. But you can't statically segregate which ones. So you can put the pool delete at the very end after code generation, when you know you don't need the tree anymore. But then your memory use skyrockets--you don't reclaim any memory until the very end of the process.
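To make the constraint concrete, here is a toy arena (hypothetical code, not GCC's obstacks): there is no per-object free, so the only reclamation point is resetting the whole arena, which is exactly what you can't do while 2% of the nodes are still live:

    #include <cstddef>
    #include <cstdlib>

    struct Arena {
        char *base, *cur, *end;
        explicit Arena(std::size_t cap)
            : base(static_cast<char *>(std::malloc(cap))),
              cur(base), end(base + cap) {}
        // Bump allocation: just advance a pointer.
        void *alloc(std::size_t sz) {
            if (cur + sz > end) return nullptr;  // sketch: no growth
            void *p = cur;
            cur += sz;
            return p;
        }
        // Frees everything at once; no way to free a single node.
        void reset() { cur = base; }
        ~Arena() { std::free(base); }
    };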
GCC, for example, used to use a pool-allocation facility called obstacks. They ended up switching to a home-rolled GC because that system became too hard to manage.
The general-purpose malloc(3) solves a very hard problem, and most allocations in most programs probably don't need it to work that hard.
Not exactly apples to apples.
Lower-level languages like C can include ASM code with little overhead, and they can also swap code around at run time just like a JIT. Granted, the vast majority of CRUD code has no need for such capabilities, but as JVM languages have more constraints and absolutely no new capabilities when it comes to the ASM they produce, they fall behind when performance is most critical.
That said, the vast majority of programmers never get anywhere near these limits.
I'm not sure why you'd want to get "all the juice" out of higher level languages, though. I thought the whole point was to abstract away from machine optimization? For example, to learn to write more efficient Perl I picked up "Mastering Algorithms with Perl" and it had loads of helpful tips.
In practice it's significantly more difficult to optimize all your code based on deep knowledge of architectures, compilers, and even compiler versions than it is to just let a JIT compiler optimize at runtime.
Looking at benchmarks, I can see this is factually incorrect.
Also... show me where C allows you to see the pipeline directly. It's hard to optimize something you can't see.
C is an abstraction layer. So is assembly. Both are relatively thick when you're talking about things like branch prediction and pipelining.
Total upvotes: 29.
Total discussion: 0.
No wonder people try to game the system and find optimal times to submit things.
No wonder people think HN is broken.
4167834: zmanji 105 days ago
4170972: VeXocide 104 days ago
4185226: moobirubi 101 days ago
4355548: ColinWright 63 days ago
Especially given your fight against duplicated stories:
"My intention in doing what I did was always to try to create value by reducing duplication (and thereby reducing wasted effort)"
My duplicate detector found the initial pair almost immediately. I did nothing, because there was no discussion, so I didn't know which one to point to. Then the third came. Still no discussion. I watched in considerable dismay as a submission that I thought was absolutely excellent, and deserving of significant discussion, sank without trace.
Was it just unlucky? Did no one read it? There were a few desultory upvotes - a few had read it. But was I so completely wrong about what HN would find interesting?
I couldn't believe it, so I submitted it again over a month later. Still no upvotes, still no discussion. It seemed I was wrong about the audience here.
And now I can see that I was right about HN being interested in deeply technical issues like this, and I'm happy. I'm less happy that a brilliant item like this took four submissions, sorry, five submissions before it got noticed.
That's why now I generally try to point out duplicates in two circumstances. One is when they come close together, and that's to help stop a split discussion. The other is when there was a significant discussion on an earlier occasion, and I want people to benefit from previous comments.
So no, I don't think it's a case of glass houses.
I don't think so.
Whether a good submission gets noticed and upvoted early depends quite a bit on what time of day it is submitted. I have formed the opinion that 0700 eastern might be the best time, as the east coast sees it, the UK sees it, and if something gets a few upvotes within a few minutes, it is likely to hit the front page, at which point it is much more likely to be evaluated on its merits.
Keep in mind that upvotes come from two sources. The first is someone viewing an HN page, perhaps /newest, liking the article, and clicking the uparrow. The other is another user submitting the same link, which counts as an automatic upvote. So if a popular tweet is noticed and submitted, the chance of a handful of submissions within a few minutes is higher.
And I have a suspicion that if one were to tally the number of times the uparrow was hit on the front page as opposed to the /newest page, the front page would be vastly higher.
This whole story generated a pretty great thread; why look for reasons to have a problem with that?
I don't have a problem with duplicates. I think a lot of things are worth reading/seeing/doing multiple times. Moreover, I do not think or expect anyone to be perfect.
I do have a problem with Colin's crusade against duplicates. I don't think it sets the right tone for HN, and it is a disincentive for people to submit things. I would rather have people feel free and comfortable submitting things and then let the upvoting system separate the wheat from the chaff. If you go through Colin's comments you can see hundreds of discussion-ending comments about duplicate stories. Given his distaste for duplicates, I found it disingenuous that he would bring up the problem of duplicates when his was the third duplicate in the list.
If it's a recent exact duplicate, the submission is merely another upvote, which is a neat function of the system (this happened to me recently with a story I submitted about correlation/causation that you had already submitted). If it's a duplicate and boring, it will be flagged or at the very least not upvoted. If it's a duplicate and people find it interesting, it will be upvoted. If a duplicate that I have read before is upvoted, I can choose to ignore it or read the discussion to see if there are new opinions. Duplicates are not a problem for me; reducing the likelihood of people contributing to HN is a problem.
You should re-read his comments about duplicates more carefully if you think he is on a crusade.
Here is an excerpt from his most recent comment:
Those who are interested in this subject and want to discuss it, you may be interested in seeing the points - good and bad - that have been made on previous occasions that this has been submitted. Here are a few previous
"Failure" is a perfectly serviceable word and the cutesy "fail" is just annoying in a technical discussion like this.
I do not believe the people who plod through a huge multi-page, multi-code-sample question at StackOverflow are just casual dabblers who are likely to leap away from the discussion when they discover the top-rated answer features the word "failure."
I am sorry, but you know what is REALLY annoying? Nit-picking at what must be the atomic level, considering the depth and detail of this perfectly brilliant reply.
Technical discussions need a LOT more people like Alexander Yee aka "Mysticial" - and technical discussions sure as hell do not need ANY people like you complaining about the choice of words or slang, as long as it is perfectly clear and understandable to the target audience. On that account he more than delivered.
Can't have it both ways. The more depth and detail a discussion tries to present, the more important it is to get the details right. The medium (StackExchange) even recognizes this and has editing, correcting, and collaborative fixing built into its DNA, for that very reason.
The more famous this response becomes, the more ESL readers there will be. Why trip them up this way? There's just no reason to use "fail" like this here.
"Branch prediction fail" is not actually a very clear or precise phrase. It could mean the failure of a single branch prediction attempt; it could mean the failure of the technique in a particular case, and it could refer to an author's opinion that it was a failure in general. Why muddy the waters?
You do not have a single reason to complain and the sooner you realize that, the better it will be for your own sake.
1) I am therefore worse than the Nazis
2) I'm the one who's being unreasonable
3) If something is popular it follows that it is also perfect and cannot be improved in any way.
Thanks for clearing that up.
It's definitely gaming the system, but it's not really suspicious (in the sense of "guy asks a question and immediately puts a response he had pre-written in a text editor" suspicious)
Sure, some people might be motivated only by receiving high scores, but it seems that the benefits outweigh the potential for abuse.
Stop splitting hairs.
In order to offset increasing CAS latencies (largely due to the increasing number of channels), modern CPUs are greedy (in a good sense) and fetch more pages per memory-controller read request than are strictly necessary.
If your data is laid out contiguously (in this case - sorted), the chances of the pages you need next being primed via prefetch are greatly increased, and further read requests to the memory controller may be reduced or avoided.
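A sketch of the difference in access patterns (hypothetical functions; next is assumed to hold a random permutation of the indices):

    #include <cstddef>

    // Stride-1 walk: the hardware prefetcher streams this.
    long sum_sequential(const int *a, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    // Pointer-chasing walk: the next address is unpredictable, so
    // the prefetcher cannot help and most loads miss.
    long sum_chased(const int *a, const std::size_t *next, std::size_t n) {
        long s = 0;
        std::size_t i = 0;
        for (std::size_t k = 0; k < n; k++) {
            s += a[i];
            i = next[i];
        }
        return s;
    }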
I benchmarked the effect of data prefetching on modern processors with large data structures (usually limited to L1d cache lines).
A plausible analogy for CPU prefetching would be "readahead" mechanisms implemented by modern filesystems and operating systems.
The main challenge for a computation of such a size is that both software and hardware are pushed beyond their limits. For such a long computation and with so much hardware, failure is not just a probability. It is a given.
Assuming the median value in the array is 128, then for the sorted case the first half of the array will have one branch miss and the second half will have one branch miss.
If the OP optimized their code -- sorted/filtered the array and then summed only those values >= 128 -- it would be obvious why sorting is faster.
Caching of the data may also play a part in similar cases (but not this one - or rather, the effect is negligible).
Here's a little test that can be tried: make the test result (inside the if) be true or false alternately (for example, sum only if the index is even), and see how long it takes.
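For example, a minimal sketch of that experiment (hypothetical function name):

    #include <cstddef>

    long long sum_even_indices(const int *data, std::size_t n) {
        long long sum = 0;
        for (std::size_t c = 0; c < n; ++c)
            if (c % 2 == 0)   // strictly alternating: T, F, T, F, ...
                sum += data[c];
        return sum;
    }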
Spoiler: modern branch predictors can detect and predict cyclic conditions.
Since the array is sorted, the test

    if (data[c] >= 128)

is false for the whole first part of the array and true for the rest, so the conditional jump† becomes almost perfectly predictable.
† (CMP is actually setting carry, overflow, sign, zero, &c; "truth" or "falsity" is decided by the specific conditional jump, here JL, which checks if sign != overflow).
"I was wrong, but not in the way that you(tptacek) understood me to be."
    if (data[c] >= 128)
        sum += data[c];
Branch mispredictions are the only likely cause of the runtime difference. It would be interesting to see whether using multiplexing (sum += data[c] >= 128 ? data[c] : 0) or conditional assignments would make any difference. Some CPUs have opcodes for these that don't require branching, but I'm not sure whether they use the branch prediction logic.
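For reference, a sketch of the branchless shift/mask version from the SO answer (it assumes 32-bit int and arithmetic right shift of negative values, which mainstream compilers provide; the ternary form above typically compiles to a conditional move):

    #include <cstddef>

    long long sum_ge_128(const int *data, std::size_t n) {
        long long sum = 0;
        for (std::size_t c = 0; c < n; ++c) {
            int t = (data[c] - 128) >> 31;  // all ones if data[c] < 128, else 0
            sum += ~t & data[c];            // adds data[c] only when >= 128
        }
        return sum;
    }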