Modern architectures really are quite complex, and it's important to keep your internal machine model up to date with what's actually happening. For example: on a 486, even correctly predicted taken branches had a two-cycle penalty, so it used to matter whether you arranged your code so the most commonly taken path was the fall-through path. Today, correctly predicted branches are free--the processor stitches together the current path and the new path so there are no bubbles in the pipeline. Virtual function calls also used to be more expensive back in the day, because CPUs didn't try to predict the target address of the branch. Today, CPUs have a buffer that maps from a program counter to a target address, so if your virtual function call predictably hits the same target each time, the only overhead is the extra memory accesses to indirect through the vtable.
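A minimal sketch of the virtual-call case (names are illustrative, not from the original discussion): the call compiles to an indirect branch through the vtable, and when the dynamic type is stable, the branch target buffer predicts the target, leaving only the extra loads as overhead.

```cpp
// Hypothetical example: a virtual call is an indirect branch through the
// vtable (load object -> load vtable pointer -> load function pointer -> call).
// When every call site hits the same concrete type, the CPU's branch target
// buffer predicts the target, so the indirect call itself is well predicted.
struct Shape {
    virtual ~Shape() {}
    virtual int area() const = 0;
};

struct Square : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

int sum_areas(const Shape* const* shapes, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += shapes[i]->area();  // same target every iteration -> predictable
    return total;
}
```

If the array mixed many concrete types unpredictably, the target buffer would mispredict and the indirect branch would cost far more.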
At the same time, things that are expensive today may not be so in the near future. Haswell is supposed to make uncontended synchronization operations almost free. That will make a lot of algorithms, particularly lock-free algorithms, much more practical than they are today. For example, C++ smart pointers are slow in multi-threaded situations because the reference count manipulation needs to be done under a lock--and the lock is almost never actually contended (except when it is). Cheap synchronization should make it much more practical to use reference counting pervasively in C++ programs.
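A rough sketch of the reference-counting pattern being discussed (names hypothetical): the count is maintained with single atomic increments/decrements, which are exactly the "tiny transactions" that cheap hardware synchronization would accelerate.

```cpp
#include <atomic>

// Sketch of an intrusive reference count. Each acquire/release is a single
// atomic read-modify-write; uncontended, the cache line usually stays owned
// by the local core, so no bus-wide lock is needed.
struct RefCounted {
    std::atomic<int> refs{1};  // starts with one owner
};

void acquire(RefCounted* p) {
    // relaxed is sufficient for an increment: no ordering is needed to
    // take an additional reference to an object you already hold.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

bool release(RefCounted* p) {
    // acq_rel so that the thread that drops the last reference sees all
    // prior writes before destroying the object. Returns true when this
    // was the final reference.
    return p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```

This is essentially what `std::shared_ptr` control blocks do; the per-operation atomic RMW is the cost that hardware transactional memory was hoped to reduce.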
Atomic increment/decrement operations still generate a locked memory transaction of some form or other at the processor level. Haswell is Intel's new CPU, which is supposed to implement some level of hardware transactional memory by leveraging the infrastructure used for cache coherency. In theory, this will make transactional operations (an atomic increment/decrement is just a tiny transaction) as cheap as a regular memory operation so long as there is no contention.
Note the classic 'bus lock' is a worst case nowadays: caches in SMP machines have their own protocols for invalidating and locking individual lines, so the entirety of the machine's memory isn't necessarily serialized, though I'm not sure when this kind of locking applies.
Sort of, but it's not really a "lock". If, as is very likely in performance-sensitive-but-uncontended multithreaded code, the CPU already has the cache line in "exclusive" (in the MESI sense) state, then the atomic operation need be no slower than a standard read/write sequence. On x86 it is serializing, however, which can have performance impacts for some workloads.
An atomic operation on a cacheline owned exclusively by the local CPU avoids any bus lock operations, but still serializes the pipeline to avoid conflicts caused by memory operations on the same CPU. This basically destroys your memory parallelism if you use it in a situation where the atomic instruction happens often (e.g. every time an object reference is loaded or stored).
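A hedged illustration of the distinction being made (the loops are contrived for clarity): both compute the same result, but the atomic version issues a lock-prefixed read-modify-write per iteration, which on x86 drains the store buffer each time even when the line stays in L1 — serializing the memory pipeline without any bus-wide lock.

```cpp
#include <atomic>
#include <cstdint>

// Plain increment: the compiler and CPU can pipeline these freely.
uint64_t plain_sum(int n) {
    uint64_t c = 0;
    for (int i = 0; i < n; ++i)
        c += 1;
    return c;
}

// Atomic increment: each fetch_add is a lock-prefixed RMW. Even with the
// cache line held exclusively (no bus lock, no trip to DRAM), the operation
// is serializing on x86, limiting memory-level parallelism in a hot loop.
uint64_t atomic_sum(int n) {
    std::atomic<uint64_t> c{0};
    for (int i = 0; i < n; ++i)
        c.fetch_add(1, std::memory_order_seq_cst);
    return c.load();
}
```

The results are identical; the difference the thread is debating is purely in how much concurrent memory traffic the core can keep in flight around each increment.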
Isn't that exactly what I said? Note that "destroy memory parallelism" is often not a high penalty in typical workloads where all in-flight accesses are hitting L1 cache only. It's not nearly as high as the "full round trip to DRAM" latency implied by calling it a "lock".
I'm not really disagreeing with you, but I think we're using the terminology differently. To me, "memory" is the whole memory pipeline--everything from the load/store units, through the load/store buffers, the caches, to DRAM.
When I said it was a "locked memory transaction of some sort" I was including the effect of a serializing instruction which prevents concurrent load/store operations even if they hit the cache. I wasn't trying to imply a full round trip to DRAM.
This is totally missing the point of a major part of HN's (and other communities') value: the actual links may or may not be important; it's the community discussion that has the greater value. In many cases, I find myself not even clicking on the links on HN (either the topic is too mundane or I've already seen it from another source) but going straight for the comments.
This particular post is a good example: I have seen this interesting SO tidbit before but learned a few things from the HN discussion.
The results show that the 4 other links received very little attention (3 links with 2 points, 1 with 21 points, all of them with no comments). Given that the HN audience is spread across the globe with a consistent interest in programming articles, I suspect posting time plays a role. Poster karma is not a big factor here, because ColinWright has 33711 karma. Anyone want to conduct such research?
I also saw it a few weeks ago here on HN and was wondering why it is here again... At least I wasn't the only one who thought so ;-)
Sometimes I wonder how old 'news' can be and still climb up to the top of HN. Is there somewhere a search engine which can tell the author if something was on HN already? Or, if it is about good websites and not about news, why has nobody posted a link to google.com during the last few weeks? ;-)
> Sometimes I wonder how old 'news' can be and still climb up to the top of HN. Is there somewhere a search engine which can tell the author if something was on HN already?
What value is there to the people who are finding this content interesting today in rejecting it because some other subset of the HN community saw it at some arbitrarily distant time in the past?
Reposted interesting technical content beats the hell out of boring industry navel-gazing news story #82795 and Android vs. Apple "news" intended to provoke fanboy flamewar thread #762483101.
If you're spending so much time on HN that you not only recognize reposts but that you see them so often it bothers you enough to complain in the comments, consider doing something more productive with your time.
One takeaway from this for non-C coders is that higher-level languages will always be significantly slower than C, because it's far easier to make these kinds of fiddly pipeline optimizations when you don't have another abstraction layer in between mucking things up. There are few Java programmers, for instance, who intuitively understand how both the machine architecture and the JVM compiler will affect the code they write. There are too many quirks. While in theory it's possible to write comparably fast Java code (assuming you're willing to do sketchy things for memory-manipulation tasks), in practice it's significantly more difficult to get all the juice out of higher-level languages.
The counter argument is that if you've compiled into a semantic description (bytecodes for a virtual machine) then the underlying VM implementation can optimize a semantically correct pipeline based on features available to it. Thus programmer time is saved (they wrote their code once) at the expense of a higher burden on the VM to do the right thing.
This argument (that the VM knew best) was put forward by Bill Joy early in the development of Java as the reason Java would eventually be faster than compiled C code. We discovered that in practice the 'friction' between hot-spot compilation and instruction set architectures nearly always dominated this equation (leaving Java code slower).
Architectures where the friction was reduced (like the Jazelle instruction set on ARM) could swing the balance the other way.
Friction in this case is the performance cost of the gap in expressiveness between a bytecoded instruction and the actual processor architecture.
What the JVM compiler does to your code isn't really that much harder to understand than what GCC or LLVM does to your code. Assuming equally skilled programmers, I think it's often easier to write fast code in a higher level language. The primary bottleneck for performance is usually developer time. If you can get the non-performance critical parts written in half the time, you have way more time to optimize. Higher level languages also generally have much better profiling and introspection tools, which make debugging much easier. Debugging tightly optimized code is, of course, one of the big time sucks in optimizing code.
This is not correct: there are things that can be faster than C because of the compilation aspect (the compiler can only optimize with the information it has, some of them can only be known at runtime).
A good example is iteration over non-trivial, multi-dimensional structures: if you iterate over a single dimension, then some variables become actual constants, but this can only be known once you know the iterator parameters (i.e., even profile-guided optimization would not help).
You could obviously implement a runtime-based optimizer in C, but then the assertion that you can never beat C is a tautology.
On the other hand, sometimes in higher-level languages the compiler can prove more constraints and then apply more aggressive optimizations. (I don't remember a specific case now; perhaps we still need a few years for an interesting example.)
Another possibility is that in a higher-level language it is possible to write more complicated code, with smarter optimizations, because the programmer doesn't have to take care of all the details. For example, the PyPy project has a Python interpreter written in a combination of RPython and C that is faster than the standard Python interpreter, which is written directly in C.
Fortran assumes by default that procedure arguments don't alias, while C compilers must assume pointers may alias unless told otherwise, leading to naive Fortran programs sometimes being optimized better than equivalent naive C programs. Or as some people say, "Fortran is faster than C".
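A small sketch of the aliasing point (the `__restrict__` keyword is a common GCC/Clang extension, not standard C++): promising the compiler that the pointers don't overlap lets it keep values in registers and vectorize, which is the guarantee Fortran gets for free.

```cpp
// Without aliasing information, the compiler must assume dst and src may
// overlap, forcing conservative code (reloading src on each iteration).
void scale_may_alias(float* dst, const float* src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

// With __restrict__ (compiler extension), the no-overlap promise matches
// Fortran's default assumption for arguments, so the loop vectorizes freely.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Both versions compute the same thing when the arrays don't overlap; the difference is only in what the optimizer is allowed to assume.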
Both the C compiler and the JVM JIT target machine code. I don't buy that one is automatically & universally superior to the other. You can equally argue that the JIT will always be superior because it's run-time aware.
What kind of benchmarks? The benchmarks C folks always want to use are the ones that favor static compilation (matrix multiplication, etc). The best any language can do in that situation is tie--it's basically just an exercise of how much money has gone into the compiler's low-level optimizer.
Consider a benchmark implementing a common situation in real programs--a performance intensive loop calling some code located in a plugin that isn't known until runtime. The JVM will wipe the floor with a C compiler in that situation--since the JVM can do inlining at runtime between module boundaries, while a C compiler cannot do inlining at runtime at all.
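A sketch of the C-side view of that scenario (names hypothetical): the hot loop calls through a function pointer resolved at runtime, say via `dlsym` from a plugin, so an ahead-of-time compiler cannot inline the callee, while a JIT can once it observes the actual target.

```cpp
// The callee behind this pointer is unknown until runtime (e.g. loaded from
// a plugin). An AOT C compiler must emit an opaque indirect call here on
// every iteration; a JIT that observes a stable target can inline it and
// then optimize across the call boundary.
typedef int (*transform_fn)(int);

int apply_all(transform_fn f, const int* xs, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += f(xs[i]);  // indirect call: no ahead-of-time inlining
    return total;
}
```

With inlining, a JIT could for instance hoist invariant work out of the callee or vectorize the loop body -- none of which is available to the AOT-compiled indirect call.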
Or, when memory management is involved, they always want to test situations where everything can be easily pool-allocated, instead of complex situations involving lots of dynamic allocation. But in situations where code is really allocating lots of short-lived objects (e.g. functional code for optimizing compiler trees), the GC in a JVM is going to wipe the floor with any malloc() implementation.
> But in situations where code is really allocating lots of short-lived objects (e.g. functional code for optimizing compiler trees), the GC in a JVM is going to wipe the floor with any malloc() implementation.
Arena allocation is feasible only for particular types of allocation patterns. If that weren't true, every malloc() implementation would just be based on an arena allocator.
Specifically, for arena allocation to work, you have to have points at which you know all objects in an arena will become garbage. There are many situations where you know that most objects will quickly become garbage (say 98%), but can only guarantee at a very large scope that all objects will become garbage.
Consider something like a compiler that optimizes a tree by replacing sets of nodes with simpler ones. Where do you put the pool delete operation? You know lots of nodes will become garbage after each optimization pass, but you know some will survive. But you can't statically segregate which ones. So you can put the pool delete at the very end after code generation, when you know you don't need the tree anymore. But then your memory use skyrockets--you don't reclaim any memory until the very end of the process.
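A minimal arena sketch (hypothetical, just to make the constraint concrete): allocation is a bump of a pointer, but the only reclamation point is `reset()`, which frees *everything* -- exactly the "pool delete" whose placement the compiler example makes awkward.

```cpp
#include <cstddef>
#include <vector>

// Minimal bump-pointer arena. Individual objects cannot be freed; memory is
// only reclaimed all at once via reset(). This is why arenas need a program
// point where every object in the arena is known to be garbage.
class Arena {
    std::vector<char> buf;
    std::size_t used = 0;
public:
    explicit Arena(std::size_t cap) : buf(cap) {}

    void* alloc(std::size_t n) {
        std::size_t aligned = (used + 7) & ~std::size_t(7);  // 8-byte align
        if (aligned + n > buf.size()) return nullptr;        // arena full
        used = aligned + n;
        return buf.data() + aligned;
    }

    void reset() { used = 0; }  // the single "pool delete" point
    std::size_t bytes_used() const { return used; }
};
```

In the compiler-tree scenario above, the surviving nodes would have to be copied out before `reset()`, or the reset deferred until after code generation -- which is the memory-use blowup described.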
GCC, for example, used to use a pool-allocation facility called obstacks. They ended up switching to a home-rolled GC because that system became too hard to manage.
This gets into a can of worms, because while much of the performance benefit of an arena is the fact that you don't need to do per-object bookkeeping, the other benefit is that it's fast to allocate, track, and free data when you know all the sizes in advance, and that is a benefit you can get from a pool allocator, which isn't much more complicated than an arena.
The general purpose malloc(3) has a very hard problem to solve that most allocations in most programs probably don't really need it to work so hard to solve.
I think the best benchmark for this is to look at the industries where performance matters and see what they are using. Take high-frequency trading where lowest latency makes the most money. I don't see many using the JVM. AFAIK it's all assembly, C and C++. If they could use the JVM and have faster code, they would be doing so.
Lower-level languages like C can include ASM code with little overhead, and they can also swap code around at run time just like a JIT. Granted, the vast majority of CRUD code has no need for such capabilities, but since JVM languages have more constraints and absolutely no new capabilities when it comes to the ASM they produce, they fall behind when performance is most critical.
That said, the vast majority of programmers never get anywhere near these limits.
Indeed, a JIT compiler will always be able to take advantage of something that isn't possible for something that is compiled once and set in stone.
However, that's not to say that upfront compilation is doomed never to compete.
Profiling C compilers can mimic this behaviour to an extent, although they can only profile the situations they experience during their training phase, so they can't adapt to unforeseen scenarios or dramatic shifts in the input stream.
C compilers could be extended to implement several/many variants of each function based on different possible code paths and, using information gained from on-the-fly timings, switch between variants in order to optimise execution. Combinatorics would place a low limit on the percentage of possible variants implemented, though.
Finally, no-one ever said that C has to be a compiled language. There's nothing stopping someone writing a JIT C interpreter and getting the best of both worlds. (Ugh!)
True, I should have been clearer that I meant only C under the mainstream implementation style. If you get creative there are so many wonderful options. I know there's already TCC and a variety of REPLs.
Yes, there are some critical algorithms which will always be faster in C.
In practice it's significantly more difficult to optimize all your code based on deep knowledge of architectures, compilers, and even compiler versions than it is to just let a JIT compiler optimize at runtime.
If you need to super-optimize execution you can build C or ASM code and call it via a library or inline it into your high-level language. Java can call functions in shared libraries (or use JNI) and Perl can inline straight C or ASM.
I'm not sure why you'd want to get "all the juice" out of higher level languages, though. I thought the whole point was to abstract away from machine optimization? For example, to learn to write more efficient Perl I picked up "Mastering Algorithms with Perl" and it had loads of helpful tips.
Several months old, but still worth a read - especially the end of Mysticial's comment, where he discusses the different optimizations performed by different compilers. Particularly interesting is the Intel compiler, which effectively optimizes out the benchmark itself: something to keep in mind when testing your own code, if you want your results to make sense.
Mysticial mentions that Intel's compiler uses loop interchange to gain an extraordinary speedup. There's actually a great answer further down the page that explains how this works; I wasn't aware of it before:
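As a rough illustration of the interchange idea (not the linked answer itself, just a sketch): both functions compute the same column sums over a row-major array, but swapping the loop order turns stride-256 accesses into sequential ones, which is what lets the hardware prefetcher and cache lines do their job.

```cpp
const int R = 256, C = 256;

// j outer: each inner step jumps C ints ahead in memory (bad locality).
void sum_cols_strided(int (*a)[C], int* out) {
    for (int j = 0; j < C; ++j)
        for (int i = 0; i < R; ++i)
            out[j] += a[i][j];
}

// i outer: the array is walked in memory order (good locality).
// This is the interchange a compiler like Intel's can apply automatically.
void sum_cols_interchanged(int (*a)[C], int* out) {
    for (int i = 0; i < R; ++i)
        for (int j = 0; j < C; ++j)
            out[j] += a[i][j];
}
```

The transformation is legal here because the loop bodies are independent across iterations; the results are bit-identical, only the memory access pattern changes.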
When I see colloquialisms used in technical discussions, I generally assume that the author is attempting to convey the topic in a more casual and approachable way so as to maximize the number of people who read and think about what is being said. You may not be aware of this, but pretentiousness does turn a lot of people off. This stops those people from inquiring further about things they might actually discover they care about. The downside of all this pretentiousness is that the world as a whole is less educated, because the people who possess the knowledge and know-how refuse to accommodate their audiences with simple word changes such as this.
There's nothing pretentious about the word "failure."
I do not believe the people who plod through a huge multi-page, multi-code-sample question at StackOverflow are just casual dabblers who are likely to leap away from the discussion when they discover the top-rated answer features the word "failure."
I suspect that you also believe that you are not being pretentious and pedantic by raising this point. I emphatically disagree. The latter half of the criticism raised in my comment was directed towards your initial comment and comments or explanations like yours that are prevalent in advanced fields. I apologize for not being clear.
> "fail" is just annoying in a technical discussion
I am sorry but you know what is REALLY annoying? Nit-picking at what must be the atomic level considering the depth and detail of this perfectly brilliant reply.
Technical discussions need a LOT more people like Alexander Yee aka "Mysticial" - and technical discussions sure as hell do not need ANY people like you complaining about the choice of words or slang as long as it is perfectly clear and understandable for the target audience and on that account he more than delivered.
> Nit-picking at what must be the atomic level considering the depth and detail of this perfectly brilliant reply.
Can't have it both ways. The more depth and detail a discussion tries to present, the more important it is to get the details right. The medium (StackExchange) even recognizes this and has editing, correcting, and collaborative fixing built into its DNA, for that very reason.
The more famous this response becomes, the more ESL readers there will be. Why trip them up this way? There's just no reason to use "fail" like this here.
"Branch prediction fail" is not actually a very clear or precise phrase. It could mean the failure of a single branch prediction attempt; it could mean the failure of the technique in a particular case, and it could refer to an author's opinion that it was a failure in general. Why muddy the waters?
You are being pedantic to an extreme I haven't even encountered amongst the worst "grammar nazis", and you are using nothing but FUD arguments. So his reply was "muddied up" by that? The number of upvotes, comments, and people linking to the reply very strongly begs to differ. It is a brilliant reply, and I am not a native speaker; even I understood it perfectly well, so he must have done something right.
You do not have a single reason to complain and the sooner you realize that, the better it will be for your own sake.
I was going to reply privately, but could find no contact details. I see you espouse HN Notify, so I'm hoping you'll see this response.
My duplicate detector found the initial pair almost immediately. I did nothing, because there was no discussion, so I didn't know which one to point to. Then the third came. Still no discussion. I watched in considerable dismay as a submission that I thought was absolutely excellent, and deserving of significant discussion, sank without trace.
Was it just unlucky? Did no one read it? There were a few desultory upvotes - a few had read it. But was I so completely wrong about what HN would find interesting?
I couldn't believe it, so I submitted it again over a month later. Still no upvotes, still no discussion. It seemed I was wrong about the audience here.
And now I can see that I was right about HN being interested in deeply technical issues like this, and I'm happy. I'm less happy that a brilliant item like this took four submissions, sorry, five submissions before it got noticed.
That's why now I generally try to point out duplicates in two circumstances. One is when they come close together, and that's to help stop a split discussion. The other is when there was a significant discussion on an earlier occasion, and I want people to benefit from previous comments.
> But was I so completely wrong about what HN would find interesting?
I don't think so.
Whether a good submission gets noticed and upvoted early depends quite a bit on what time of day it is submitted. I have formed the opinion that 0700 Eastern might be the best time: the east coast sees it, the UK sees it, and if something gets a few upvotes within a few minutes, it is likely to hit the front page, at which point it is much more likely to be evaluated on its merits.
Keep in mind that upvotes come from two sources. The first is someone viewing an HN page, perhaps /newest, who likes the article and clicks the up arrow. The other is another user submitting the same link, which counts as an automatic upvote. So if a popular tweet is noticed and submitted, the chance of a handful of submissions within a few minutes is higher.
And I have a suspicion that if one were to tally the number of times the uparrow was hit on the front page as opposed to the /newest page, the front page would be vastly higher.
I realize that your comment was mainly rhetorical but...
I don't have a problem with duplicates. I think a lot of things are worth reading/seeing/doing multiple times. Moreover, I do not think or expect anyone to be perfect.
I do have a problem with Colin's crusade against duplicates. I don't think it sets the right tone for HN, and it is a disincentive for people to submit things. I would rather have people feel free/comfortable submitting things and then let the upvoting system separate the chaff. If you go through Colin's comments you can see hundreds of discussion-ending comments about duplicate stories. Given his distaste for duplicates, I found it disingenuous that he would bring up the problem of duplicates when his was the third duplicate in the list.
If it's a recent exact duplicate, the submission is merely another upvote, which is a neat function of the system (this happened to me recently with a story I submitted about correlation/causation that you had already submitted). If it's a duplicate and boring, it will be flagged or at the very least not upvoted. If it's a duplicate and people find it interesting, it will be upvoted. If a duplicate that I have read before is upvoted, I can choose to ignore it or read the discussion to see if there are new opinions. Duplicates are not a problem for me; reducing the likelihood of people contributing to HN is a problem.
> I do have a problem with Colin's crusade against duplicates.
You should re-read his comments about duplicates more carefully if you think he is on a crusade.
Here is an excerpt from his most recent comment:
> Those who are interested in this subject and want to discuss it, you may be interested in seeing the points - good and bad - that have been made on previous occasions that this has been submitted. Here are a few previous
That doesn't look like a crusade against duplicates to me. It looks helpful and I'm glad it's there.
Not really, a very common pattern on SO is to jump on a question to which you know you have a good answer, put something very short (but correct) and then edit in a better, longer response. This avoids "losing" initial votes from people looking at responses while you're fleshing out the description, adding benchmarks, etc...
It's definitely gaming the system, but it's not really suspicious (in the sense of "guy asks a question and immediately puts a response he had pre-written in a text editor" suspicious)
The "suspicious" case you mention is explicitly supported by the system; you can actually write the question and answer at the same time if you want. The normal content rules still apply, of course; if the question isn't about an actual problem, you're more likely to win a bundle of downvotes.
I think without this ability to "game" the system SO wouldn't be half as good as it is now. You end up with brilliant and above all complete answers that can be kept up to date, and the moderation policy reduces duplication and clutter, keeping the SNR high.
Sure, some people might be motivated only by receiving high scores, but it seems that the benefits outweigh the potential for abuse
Doesn't matter. You can answer your own question immediately after posting it if you want. It's designed to be a collections of questions and answers, not a forum, and so it doesn't matter who answers what question.
I often do this. There is one answer which I've quite literally been editing for about three years. Every few months I add a bit more information or a few more references or links to it. It's become a kind of personal (but public) repository of information on the topic.
If you're looking for a good online course on how computer systems work, from the AND-gate scale up to data-center size, the iTunes U course "Computer Science 61C" (UC Berkeley, with Daniel Garcia) is really informative and fun to watch.
The answer focuses on branch misprediction but there is also another factor - CPU data prefetch.
In order to offset increasing CAS latencies (largely due to the increasing number of channels), modern CPUs are greedy (in a good sense) and fetch more pages per memory-controller read request than are necessary.
If your data is laid out contiguously (in this case, sorted), the chances of the pages you need next being primed via prefetch are greatly increased, and further read requests to the memory controller may be reduced or avoided.
I benchmarked the effect of data prefetching on modern processors with large data structures (usually limited to L1d cache lines):
While prefetch is an important factor, it is not the cause here, since the data array is accessed sequentially in both the sorted and unsorted scenarios. Also, the initial number-generation loop already primes the data in memory for either scenario.