I've actually had a similar question re: JS for a while now [with Chromium 34]... Consider these two pieces of code, which do exactly the same thing. One is a standalone function:
var makeKeyCodepoint = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return String.fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};
The other is a method:
var MakeKeyCodepoint = function() {};

MakeKeyCodepoint.prototype.makeKey = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return String.fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};

var makeKeyCodepointObj = new MakeKeyCodepoint();
Now, why does the standalone function run at over 6.3M ops/sec, while the method runs at 710M ops/sec (on my computer)?
I could be wrong (and if so, pie my face), but I believe it's mostly due to one of the many inline cache optimizations that v8 employs.
Let's consider the receiver (i.e. the `this` value) in Examples 1 and 2. The receiver in Example 1 is Benchmark, if invoked normally. The receiver in Example 2 is an instance of the empty constructor function(){}.
When you call makeKeyCodepointObj.makeKey() - the VM looks up the object's prototype chain and finds the function. This call site is cached (think of it as a K:V store, where the key is "makeKeyCodepointObj.makeKey" and the value is the call site of the function.)
When you call makeKeyCodepoint(), the VM has to, on each call, look up the scope chain (and ultimately the global object) until it finds the variable, which is then resolved to the function to call. Because of scoping issues in JS, I don't think this is cached (or if it is cached, it'd be invalidated a lot), so a lookup has to happen every time. (I know because in my own JS engine I tried to implement caching for global-object property lookups and gave up.)
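To make that concrete, here's a very loose sketch of the caching idea (conceptual only, nothing like V8's real data structures; the prototype check below is just a stand-in for a hidden-class/shape check):

function makeCallSiteCache() {
    var cachedShape = null;
    var cachedTarget = null;
    return function invoke(receiver, prop, arg) {
        var shape = Object.getPrototypeOf(receiver); // stand-in for a hidden class
        if (shape !== cachedShape) {                 // cache miss: look up and remember
            cachedShape = shape;
            cachedTarget = receiver[prop];           // the prototype-chain walk
        }
        return cachedTarget.call(receiver, arg);     // later calls reuse the cached target
    };
}

var callMakeKey = makeCallSiteCache();
callMakeKey(makeKeyCodepointObj, "makeKey", "example"); // first call: miss, fills the cache
callMakeKey(makeKeyCodepointObj, "makeKey", "example"); // same receiver shape: hit, no lookup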
TL;DR: when the function is a property of the global object, the lookup happens on every call. When the function is a method of an object, the lookup is cached.
If I am talking out of my arse, please feel free to correct me.
That is true, but I don't see how it explains the difference either. The function version and method version both reference a local variable, named makeKeyCodepoint or makeKeyCodepointObj respectively. The method version doesn't appear to have any fewer variable references than the function version.
For what it's worth, as far as I can tell SpiderMonkey is still more or less optimizing away the makeKeyConcat / makeKeyConcatObj testcases in Firefox 29, and all of them on trunk. I bet it's inlining the functions, discovering the arguments are constant strings and hence the return values are constant, constant-folding the if conditions, etc...
Microbenchmarking a decent optimizing compiler is hard; it will typically be able to figure out that the ubench is pointless and optimize it all away...
Any time you see jsperf numbers over about 1e9, that means the testcase was completely optimized out by the JIT (presumably because it has no detectable side effects, so it got dead-code eliminated). 1.7e9 iterations per second on typical modern 3GHz hardware means 2 clock ticks or less per iteration (3e9 cycles/s ÷ 1.7e9 iterations/s ≈ 1.8 cycles). That's about enough time to increment the loop counter and compare it to the loop limit, and nothing else.
Indeed, if you add a completely empty test case, it is only a tiny bit faster than the method version that appears to have such high performance. (jsperf doesn't allow a completely empty test, but you can use // to fool it.)
As bzbarsky points out, tests in the realm of 1e9 ops/sec look like the entire function is being optimized away because there are no side effects. Something about the method allows it to do this, while it doesn't think it's okay for the function version.
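One way to sanity-check this is to make the result observable: fold every return value into an accumulator and log it, so the calls have a visible effect and (hopefully) can't just be thrown away. A rough sketch:

var sink = 0;
for (var k = 0; k < 1e6; k++) {
    // using the return value keeps the loop body "live"
    sink += makeKeyCodepointObj.makeKey("example").charCodeAt(0);
}
console.log(sink);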
One thing I found out is that dropping `String.fromCharCode` in favor of a local function (`var fromCharCode = String.fromCharCode.bind(String);`) causes neither of them to be optimizable. See http://jsperf.com/makekey-concat-vs-join/9
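Concretely, that variant looks something like this (makeKeyCodepoint2 is just an illustrative name for this sketch):

var fromCharCode = String.fromCharCode.bind(String);

var makeKeyCodepoint2 = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return fromCharCode(                       // local bound copy instead of String.fromCharCode
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};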
Try Chrome 34 – the difference is massive. What's interesting is that the best result in Chrome is 45% slower than the worst result in Firefox 29 so it's probably a question of why v8 is failing to JIT the first 3 versions.
The "two pieces of code" I am referring to are obviously the body of the function and method. (Following your comment I had to look again, I thought I missed something).
I wasn't surprised to see it had to do with register allocation, since I've encountered some extremely odd compiler output with similar issues before. "Why would it ever decide this was a good idea?" is the thought that often comes to mind when looking through the generated code.
Register allocation is one of those areas where I think compilers are pretty horrible compared to a good or even mid-level Asm programmer, and I've never understood why graph colouring is often the only approach that gets taught: it's clearly not the way an Asm programmer does it, and it's completely unintuitive to me. It seems to assume that variables are allocated in a fixed fashion and a load-store architecture, which is overly restrictive for real architectures like x86.

There's also no interaction between RA and instruction selection, despite them both influencing each other, whereas a human programmer will essentially combine those two steps. The bulk of research appears to be stuck on "how do we improve graph colouring", when IMHO a completely new, more intuitive approach would make more sense. At least it would make odd behaviour like this one less of a problem, I think.
Register allocation is an NP-complete problem. Graph coloring works because it can be done with information available to a compiler from live variable analysis.
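To sketch what that looks like in practice (a toy version only, with hypothetical live ranges and nothing like a production allocator): build an interference graph from the live ranges, then greedily hand out registers, spilling whatever doesn't fit.

function allocateRegisters(liveRanges, k) {
    // liveRanges: { name: [start, end] } in instruction indices (hypothetical IR)
    var names = Object.keys(liveRanges);

    // Two variables interfere if their live ranges overlap.
    var interferes = {};
    names.forEach(function(n) { interferes[n] = []; });
    for (var a = 0; a < names.length; a++) {
        for (var b = a + 1; b < names.length; b++) {
            var ra = liveRanges[names[a]], rb = liveRanges[names[b]];
            if (ra[0] < rb[1] && rb[0] < ra[1]) {
                interferes[names[a]].push(names[b]);
                interferes[names[b]].push(names[a]);
            }
        }
    }

    // Greedy colouring: give each variable the lowest register not used by an
    // interfering neighbour; if all k registers are taken, mark it as spilled.
    var assignment = {};
    names.forEach(function(n) {
        var taken = {};
        interferes[n].forEach(function(m) {
            if (assignment[m] !== undefined) { taken[assignment[m]] = true; }
        });
        var reg = 0;
        while (reg < k && taken[reg]) { reg++; }
        assignment[n] = (reg < k) ? reg : "spill";
    });
    return assignment;
}

allocateRegisters({ x: [0, 4], y: [2, 6], z: [5, 9] }, 2);
// -> { x: 0, y: 1, z: 0 }   (x and y overlap, y and z overlap, x and z don't)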
Problems are classified as NP-complete based on the Turing machine model. Since the human brain is not a Turing machine, it may be better suited for solving NP-complete problems than a computer. Solutions to NP-complete problems commonly employ heuristics to strike a balance between completeness and efficiency. The human brain seems (at least to me) to be better at solving problems involving heuristics. Chess is an obvious example.
To me, whether it's NP-complete is of little concern, since humans have been allocating registers (and beating compilers) with little difficulty. On the contrary, I feel that being labeled as NP-complete has somehow slowed the development of better RA algorithms, on the basis that it's "too hard". There's a saying that one of the first steps to accomplishing something is believing that it's possible, and if people believe that RA is a more difficult problem than it really is, then that has a discouraging effect. NP-completeness only describes how the cost grows as the problem size increases, but in practice the problem sizes aren't that big --- e.g. within a function, having several dozen live variables at a time is probably extremely uncommon, and machines don't have that many registers - a few dozen at most.
I think the human brain is probably Turing-equivalent, but it also likely doesn't matter -- if I can describe the algorithm that I, as a human, take to perform compiler-beating register allocation and instruction selection, then a machine can probably do it just as well if not faster than me since it can consider many more alternatives and at a much faster rate.
I agree that heuristic approaches are the way to go, but in a "too far abstracted" model like graph colouring, some heuristics just can't be easily used; e.g. the technique of "clearing the table" --- setting up all the registers prior to a tight loop, so that the instructions within do not have to access memory at all. Using push/pop (x86) for "very ephemerally-spilled" values is another one.
> To me, whether it's NP-complete is of little concern, since humans have been allocating registers (and beating compilers) with little difficulty.
Following that logic, one would conclude that writing an AI for Go is trivial as well [1].
> if I can describe the algorithm that I, as a human, take ... then a machine can probably do it just as well if not faster
This is pretty easy to disprove. You can probably look at a program's source code and tell whether or not it halts. But it's been proven that a Turing machine cannot [2]. The halting problem is one of many undecidable problems in computer science [3]. If any one of the undecidable problems can be solved by a human, that proves that the human brain is not Turing equivalent.
I think you misunderstand what the halting problem is. It's being able to tell whether a program will halt or not, for all conceivable programs. A human brain certainly can't do that.
For example, does this program halt? (Let's assume infinite-precision numbers, for simplicity. After all, a Turing machine can access an infinitely long tape.)
for (int n = 3; ; n++)
    for (int a = 1; a < n; a++)
        for (int b = 1; b < n; b++)
            for (int c = 1; c < n; c++)
                for (int m = 3; m < n; m++)
                    if (pow(a, m) + pow(b, m) == pow(c, m)) exit(1);
Show me that this program never halts, and you just proved Fermat's last theorem.
AIs have been made for Go and they aren't extraordinarily complicated. They can beat most humans.
It's pretty widely believed that humans are Turing equivalent, and there is no evidence at all to suggest we aren't. Certainly humans can't determine halting for all possible programs and inputs. And we run on physics which as far as I know is Turing equivalent.
The simple explanation for why they're done separately is that both problems are very hard to solve even in isolation. Combining them makes it even harder.
That said, of course it is better to try to solve the real problem and not only the decomposition. I think that the focus on cheap compiler passes is a bit obsolete now. A cheap debug-build with some optimizations is good, but for a real production build where speed is needed I am willing to wait quite some extra time to get the last few drops of performance.
I've seen some researchers looking into solving the combined problem which feels promising. That kind of interest combined with the nice modularity of LLVM makes me quite optimistic that there will be nice results that are relevant both academically and industrially.
I always thought the advent of RISC was a side effect of the (very) low register count and how hard it is to actually make optimal use of them without manually writing Assembler.
The flat memory model on i386 at least made it somewhat easier compared to the segmented addressing mode that came with the 8086.
One of the nice things about this question (apart from the serious in-depth answers) is that Eric Lippert himself comes in with an answer after discussing it directly with the people who can actually provide the proper fix. Q&A at its best!
Jon Skeet's SO reputation is only as modest as it is because of integer overflow (SQL Server does not have a datatype large enough) and When Jon Skeet points to null, null quakes in fear.
Notice that this question is 2 years old. I would imagine that a lot has happened with Roslyn since then. (They even talk about some of what they were working on.) Nevertheless, quite interesting.
Maybe try compiling stuff with GCC 2.96 until you get that out of your system. I'm personally very happy that it's been years since some weird behavior turned out to be a compiler bug.
Having found way too many bugs in the OS and compiler, it's amusing at first, but then gets tiresome. It's way easier to fix a bug when you're the one at fault, so any time something looks like it might be in the compiler or OS, I end up fervently hoping that it's not.
I don't code in C#, but it would be interesting to surround the code in just a plain block instead of a try-catch block and see if the same behavior is evident. If a plain block is still slow, then maybe branch prediction gets overwhelmed by having to consider unwinding the stack all the way to main and dealing with open streams and objects on the stack.
edit: I don't know bytecode or machine code well, so my description of what happens when unwinding the stack is probably wrong, but my point is just that it's simpler for the CPU when there's no possibility of unwinding the stack beyond the code section the OP called out.
Puzzling that so many people still run in i386 mode. I haven't used a 32-bit system since shortly after x86 hardware went 64-bit, 10+ years ago. I guess in the Windows world it's because of XP?
I code in C# mostly, and we end up doing almost all of our builds in x86 only mode because of dependencies. If you build in x64 or AnyCPU and any of your dependent DLLs aren't x64-compatible, then you'll crash.
Also, a lot of our customers apparently use XP 32bit and have no intention of updating anytime soon. Sigh...
Also, there are often cases where the memory in your computer is either non-expandable or already maxed out. I maxed out my laptop at 8 GB and I don't feel like I have memory to waste with the usual office tasks, on Ubuntu. Being conservative, and remembering that Ubuntu switched to "recommending" its 64-bit variant as the default on its start page less than a year ago, I naturally chose 32-bit the last time I reinstalled the system. But I'll choose 64-bit next time, because I see mainstream software support shifting to 64-bit for some applications, and because I've read Phoronix tests where 64-bit shows a noticeable performance advantage.
Have you used it, or do you know people who have? I am very curious about the experience and the details. Are there distributions that ship with libraries compiled for the x32 ABI?
Maybe some embedded-systems distros have tried, but as far as I know no major distro has. That would probably uncover quite a lot of bugs and require adaptations, since a lot of code assumes Intel and compatible chips are either x86 or x86_64 and addresses memory accordingly.
There are some people trying to port Archlinux to x32 but I really don't know the status.
While a nice feat (32-bit pointers on <32GB heaps), 64-bit pointers allow for storing metadata (more than 3 bits*) in the pointer itself. Having the class id alongside the pointer can be quite profitable. Other than that, 32-bit pointers are a clear win with small heaps.
* If the hardware ignores the lower 3 bits, they can be utilized for various needs (e.g. biased locking [0]).
Generally, the hardware doesn't ignore those bits, so you'll need to mask them off, and that costs some performance. All in all, it's probably a win, but it's not entirely free either. It really depends on what you're going to stick in those extra bits.
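As a loose illustration of that trade-off (plain integers standing in for pointers; this isn't how any particular VM actually does it): 8-byte alignment leaves the low 3 bits free for a tag, but every access pays for a mask first.

var TAG_BITS = 3;
var TAG_MASK = (1 << TAG_BITS) - 1;   // 0b111, free because of 8-byte alignment

function tag(addr, t)  { return addr | (t & TAG_MASK); }  // stash metadata in the low bits
function untag(p)      { return p & ~TAG_MASK; }          // the extra mask on every access
function tagOf(p)      { return p & TAG_MASK; }

var p = tag(0x1000, 5);   // hypothetical aligned address, tag value 5
untag(p);                 // 0x1000 again
tagOf(p);                 // 5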
Java's assumption that you'll use those bits to address more memory is not a bad one; after all, for workloads between 4 and 32GB there's probably no optimization better than dramatically increasing memory density.
Try it: http://jsperf.com/makekey-concat-vs-join/3