I've actually had a similar question re: JS for a while now [with Chromium 34]... Consider these two pieces of code, which do exactly the same thing. One is a standalone function:
var makeKeyCodepoint = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return String.fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};
The other is a method:
var MakeKeyCodepoint = function() {};

MakeKeyCodepoint.prototype.makeKey = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return String.fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};

var makeKeyCodepointObj = new MakeKeyCodepoint();
Now, why does the standalone function run at over 6.3M ops/sec, while the method runs at 710M ops/sec (on my computer)?
I could be wrong (and if so, pie my face), but I believe it's mostly due to one of the many inline cache optimizations that v8 employs.
Let's consider the receiver (i.e. the `this` value) in Examples 1 and 2. The receiver in Example 1 is Benchmark, if invoked normally. The receiver in Example 2 is an instance of the empty constructor function(){}.
When you call makeKeyCodepointObj.makeKey() - the VM looks up the object's prototype chain and finds the function. This call site is cached (think of it as a K:V store, where the key is "makeKeyCodepointObj.makeKey" and the value is the call site of the function.)
When you call makeKeyCodepoint(), the VM has to, on each call, look up the scope chain (and ultimately the global object) until it finds the variable, which is then resolved to the function to call. Because of scoping issues in JS, I don't think this is cached (or if it is cached, it'd be invalidated a lot), so a lookup has to happen every time. (I know because in my own JS engine I tried to implement caching for global-object property lookups and gave up.)
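To make that concrete, here's a very loose sketch of the caching idea (conceptual only, nothing like V8's real data structures; the prototype check below is just a stand-in for a hidden-class/shape check):

function makeCallSiteCache() {
    var cachedShape = null;
    var cachedTarget = null;
    return function invoke(receiver, prop, arg) {
        var shape = Object.getPrototypeOf(receiver); // stand-in for a hidden class
        if (shape !== cachedShape) {                 // cache miss: look up and remember
            cachedShape = shape;
            cachedTarget = receiver[prop];           // the prototype-chain walk
        }
        return cachedTarget.call(receiver, arg);     // later calls reuse the cached target
    };
}

var callMakeKey = makeCallSiteCache();
callMakeKey(makeKeyCodepointObj, "makeKey", "example"); // first call: miss, fills the cache
callMakeKey(makeKeyCodepointObj, "makeKey", "example"); // same receiver shape: hit, no lookup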
TL;DR: when the function is a property of the global object, the lookup happens on every call. When the function is a method of an object, the lookup is cached.
If I am talking out of my arse, please feel free to correct me.
That is true, but I don't see how it explains the difference either. The function version and method version both reference a local variable, named makeKeyCodepoint or makeKeyCodepointObj respectively. The method version doesn't appear to have any fewer variable references than the function version.
For what it's worth, as far as I can tell SpiderMonkey is still more or less optimizing away the makeKeyConcat / makeKeyConcatObj testcases in Firefox 29, and all of them on trunk. I bet it's inlining the functions, discovering the arguments are constant strings and hence the return values are constant, constant-folding the if conditions, etc...
Microbenchmarking a decent optimizing compiler is hard; it will typically be able to figure out that the ubench is pointless and optimize it all away...
Any time you see jsperf numbers over about 1e9, that means the testcase was completely optimized out by the JIT (presumably because it has no detectable side effects, so it got dead-code eliminated). 1.7e9 iterations per second on typical modern 3GHz hardware means 2 clock ticks or less per iteration (3e9 cycles/s ÷ 1.7e9 iterations/s ≈ 1.8 cycles). That's about enough time to increment the loop counter and compare it to the loop limit, and nothing else.
Indeed, if you add a completely empty test case, it is only a tiny bit faster than the method version that appears to have such high performance. (jsperf doesn't allow a completely empty test, but you can use // to fool it.)
As bzbarsky points out, tests in the realm of 1e9 ops/sec look like the entire function is being optimized away because there are no side effects. Something about the method allows it to do this, while it doesn't think it's okay for the function version.
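One way to sanity-check this is to make the result observable: fold every return value into an accumulator and log it, so the calls have a visible effect and (hopefully) can't just be thrown away. A rough sketch:

var sink = 0;
for (var k = 0; k < 1e6; k++) {
    // using the return value keeps the loop body "live"
    sink += makeKeyCodepointObj.makeKey("example").charCodeAt(0);
}
console.log(sink);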
One thing I found out is that dropping `String.fromCharCode` in favor of a local function (`var fromCharCode = String.fromCharCode.bind(String);`) causes neither of them to be optimizable. See http://jsperf.com/makekey-concat-vs-join/9
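Concretely, that variant looks something like this (makeKeyCodepoint2 is just an illustrative name for this sketch):

var fromCharCode = String.fromCharCode.bind(String);

var makeKeyCodepoint2 = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return fromCharCode(                       // local bound copy instead of String.fromCharCode
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};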
Try Chrome 34 – the difference is massive. What's interesting is that the best result in Chrome is 45% slower than the worst result in Firefox 29 so it's probably a question of why v8 is failing to JIT the first 3 versions.
The "two pieces of code" I am referring to are obviously the body of the function and method. (Following your comment I had to look again, I thought I missed something).
I wasn't surprised to see it had to do with register allocation, since I've encountered some extremely odd compiler output with similar issues before. "Why would it ever decide this was a good idea?" is the thought that often comes to mind when looking through the generated code.
Register allocation is one of those areas where I think compilers are pretty horrible compared to a good or even mid-level Asm programmer, and I've never understood why graph colouring is often the only approach that gets taught: it's clearly not the way an Asm programmer does it, and it's completely unintuitive to me. It seems to assume that variables are allocated in a fixed fashion and a load-store architecture, which is overly restrictive for real architectures like x86.

There's also no interaction between RA and instruction selection, despite them both influencing each other, whereas a human programmer will essentially combine those two steps. The bulk of research appears to be stuck on "how do we improve graph colouring", when IMHO a completely new, more intuitive approach would make more sense. At least it would make odd behaviour like this one less of a problem, I think.
Register allocation is an NP-complete problem. Graph coloring works because it can be done with information available to a compiler from live variable analysis.
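To sketch what that looks like in practice (a toy version only, with hypothetical live ranges and nothing like a production allocator): build an interference graph from the live ranges, then greedily hand out registers, spilling whatever doesn't fit.

function allocateRegisters(liveRanges, k) {
    // liveRanges: { name: [start, end] } in instruction indices (hypothetical IR)
    var names = Object.keys(liveRanges);

    // Two variables interfere if their live ranges overlap.
    var interferes = {};
    names.forEach(function(n) { interferes[n] = []; });
    for (var a = 0; a < names.length; a++) {
        for (var b = a + 1; b < names.length; b++) {
            var ra = liveRanges[names[a]], rb = liveRanges[names[b]];
            if (ra[0] < rb[1] && rb[0] < ra[1]) {
                interferes[names[a]].push(names[b]);
                interferes[names[b]].push(names[a]);
            }
        }
    }

    // Greedy colouring: give each variable the lowest register not used by an
    // interfering neighbour; if all k registers are taken, mark it as spilled.
    var assignment = {};
    names.forEach(function(n) {
        var taken = {};
        interferes[n].forEach(function(m) {
            if (assignment[m] !== undefined) { taken[assignment[m]] = true; }
        });
        var reg = 0;
        while (reg < k && taken[reg]) { reg++; }
        assignment[n] = (reg < k) ? reg : "spill";
    });
    return assignment;
}

allocateRegisters({ x: [0, 4], y: [2, 6], z: [5, 9] }, 2);
// -> { x: 0, y: 1, z: 0 }   (x and y overlap, y and z overlap, x and z don't)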
Problems are classified as NP-complete based on the Turing machine model. Since the human brain is not a Turing machine, it may be better suited for solving NP-complete problems than a computer. Solutions to NP-complete problems commonly employ heuristics to strike a balance between completeness and efficiency. The human brain seems (at least to me) to be better at solving problems involving heuristics. Chess is an obvious example.
To me, whether it's NP-complete is of little concern, since humans have been allocating registers (and beating compilers) with little difficulty. On the contrary, I feel that being labeled as NP-complete has somehow slowed the development of better RA algorithms, on the basis that it's "too hard". There's a saying that one of the first steps to accomplishing something is believing that it's possible, and if people believe that RA is a more difficult problem than it really is, then that has a discouraging effect. NP-completeness only describes how the cost grows as the problem size increases, but in practice the problem sizes aren't that big --- e.g. within a function, having several dozen live variables at a time is probably extremely uncommon, and machines don't have that many registers - a few dozen at most.
I think the human brain is probably Turing-equivalent, but it also likely doesn't matter -- if I can describe the algorithm that I, as a human, take to perform compiler-beating register allocation and instruction selection, then a machine can probably do it just as well if not faster than me since it can consider many more alternatives and at a much faster rate.
I agree that heuristic approaches are the way to go, but in a "too far abstracted" model like graph colouring, some heuristics just can't be easily used; e.g. the technique of "clearing the table" --- setting up all the registers prior to a tight loop, so that the instructions within do not have to access memory at all. Using push/pop (x86) for "very ephemerally-spilled" values is another one.
> To me, whether it's NP-complete is of little concern, since humans have been allocating registers (and beating compilers) with little difficulty.
Following that logic, one would conclude that writing an AI for Go is trivial as well [1].
> if I can describe the algorithm that I, as a human, take ... then a machine can probably do it just as well if not faster
This is pretty easy to disprove. You can probably look at a program's source code and tell whether or not it halts. But it's been proven that a Turing machine cannot [2]. The halting problem is one of many undecidable problems in computer science [3]. If any one of the undecidable problems can be solved by a human, that proves that the human brain is not Turing equivalent.
I think you misunderstand what the halting problem is. It's being able to tell whether a program will halt or not, for all conceivable programs. A human brain certainly can't do that.
For example, does this program halt? (Let's assume infinite-precision numbers, for simplicity. After all, a Turing machine can access an infinitely long tape.)
for (int n = 3; ; n++)
    for (int a = 1; a < n; a++)
        for (int b = 1; b < n; b++)
            for (int c = 1; c < n; c++)
                for (int m = 3; m < n; m++)
                    if (pow(a, m) + pow(b, m) == pow(c, m)) exit(1);
Show me that this program never halts, and you just proved Fermat's last theorem.
AIs have been made for Go and they aren't extraordinarily complicated. They can beat most humans.
It's pretty widely believed that humans are Turing equivalent, and there is no evidence at all to suggest we aren't. Certainly humans can't determine halting for all possible programs and inputs. And we run on physics which as far as I know is Turing equivalent.
The simple explanation for why they're done separately is that both problems are very hard to solve even in isolation. Combining them makes it even harder.
That said, of course it is better to try to solve the real problem and not only the decomposition. I think that the focus on cheap compiler passes is a bit obsolete now. A cheap debug-build with some optimizations is good, but for a real production build where speed is needed I am willing to wait quite some extra time to get the last few drops of performance.
I've seen some researchers looking into solving the combined problem which feels promising. That kind of interest combined with the nice modularity of LLVM makes me quite optimistic that there will be nice results that are relevant both academically and industrially.
I always thought the advent of RISC was a side effect of the (very) low register count and how hard it is to actually make optimal use of them without manually writing Assembler.
The flat memory model on i386 at least made it somewhat easier compared to the segmented addressing mode that came with the 8086.
One of the nice things about this question (apart from the serious in-depth answers) is that Eric Lippert himself comes in with an answer after discussing it directly with the people who can actually provide the proper fix. Q&A at its best!
Jon Skeet's SO reputation is only as modest as it is because of integer overflow (SQL Server does not have a datatype large enough) and When Jon Skeet points to null, null quakes in fear.
Notice that this question is 2 years old. I would imagine that a lot has happened with Roslyn since then. (They even talk about some of what they were working on.) Nevertheless, quite interesting.
Maybe try compiling stuff with GCC 2.96 until you get that out of your system. I'm personally very happy that it's been years since some weird behavior turned out to be a compiler bug.
Having found way too many bugs in the OS and compiler, it's amusing at first, but then gets tiresome. It's way easier to fix a bug when you're the one at fault, so any time something looks like it might be in the compiler or OS, I end up fervently hoping that it's not.
I don't code in C#, but it would be interesting to surround the code in just a plain block instead of a try-catch block and see if the same behavior is evident. If a plain block is still slow, then maybe branch prediction gets overwhelmed by having to consider unwinding the stack all the way to main and dealing with open streams and objects on the stack.
edit: I don't know bytecode or machine code well, so my description of what happens when unwinding the stack is probably wrong, but my point is just that it's simpler for the CPU when there's no possibility of unwinding the stack beyond the code section the OP called out.
Puzzling that so many people still run in i386 mode. I haven't used a 32-bit system since shortly after x86 hardware went 64-bit, 10+ years ago. I guess in the Windows world it's because of XP?
I code in C# mostly, and we end up doing almost all of our builds in x86 only mode because of dependencies. If you build in x64 or AnyCPU and any of your dependent DLLs aren't x64-compatible, then you'll crash.
Also, a lot of our customers apparently use XP 32bit and have no intention of updating anytime soon. Sigh...
Also, there are often cases where the memory in your computer is either non-expandable or already maxed out. I maxed out my laptop at 8 GB and I don't feel like I have memory to waste with the usual office tasks, on Ubuntu. Being conservative, and remembering that Ubuntu switched to "recommending" its 64-bit variant as the default on its start page less than a year ago, I naturally chose 32-bit the last time I reinstalled the system. But I'll choose 64-bit next time, because I see mainstream software support shifting to 64-bit for some applications, and because I've read Phoronix tests where 64-bit shows a noticeable performance advantage.
Have you used it, or do you know people who have? I am very curious about the experience and the details. Are there distributions that ship with libraries compiled for the x32 ABI?
Maybe some embedded-systems distros have tried, but as far as I know no major distro has. That would probably uncover quite a lot of bugs and require adaptations, since a lot of code assumes Intel and compatible chips are either x86 or x86_64 and addresses memory accordingly.
There are some people trying to port Archlinux to x32 but I really don't know the status.
While a nice feat (32-bit pointers on <32GB heaps), 64-bit pointers allow for storing metadata (more than 3 bits*) in the pointer itself. Having the class id alongside the pointer can be quite profitable. Other than that, 32-bit pointers are a clear win with small heaps.
* If the hardware ignores the lower 3 bits, they can be utilized for various needs (e.g. biased locking [0]).
Generally, the hardware doesn't ignore those bits, so you'll need to mask them off, and that costs some performance. All in all, it's probably a win, but it's not entirely free either. It really depends on what you're going to stick in those extra bits.
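As a loose illustration of that trade-off (plain integers standing in for pointers; this isn't how any particular VM actually does it): 8-byte alignment leaves the low 3 bits free for a tag, but every access pays for a mask first.

var TAG_BITS = 3;
var TAG_MASK = (1 << TAG_BITS) - 1;   // 0b111, free because of 8-byte alignment

function tag(addr, t)  { return addr | (t & TAG_MASK); }  // stash metadata in the low bits
function untag(p)      { return p & ~TAG_MASK; }          // the extra mask on every access
function tagOf(p)      { return p & TAG_MASK; }

var p = tag(0x1000, 5);   // hypothetical aligned address, tag value 5
untag(p);                 // 0x1000 again
tagOf(p);                 // 5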
Java's assumption that you'll use those bits to address more memory is not a bad one; after all, for workloads between 4 and 32GB there's probably no optimization better than dramatically increasing memory density.
Try it: http://jsperf.com/makekey-concat-vs-join/3