
Reporting a bug on a fragile analysis - lordgilman
http://blog.mozilla.com/rob-sayre/2010/11/16/reporting-a-bug-on-a-fragile-analysis/
======
lordgilman
I know that my submission's title is not the same as the blog post's title and
that I will get some hate for it. However, the two diffs linked in the post
give pretty convincing evidence that IE is picking up on the exact SunSpider
test. Furthermore, if you read the last sentence of the blog post the author
is more or less beating around the "You're cheating, we've caught you red-
handed" bush.

~~~
sounddust
I don't think it's ever a good idea to make a headline more sensational. We
should wait for a response from Microsoft before declaring them
liars/cheaters. It's entirely possible that Microsoft has a valid explanation.

Personally, on at least two occasions I've been accused of writing code
specifically to cause grief/problems for another person, only to have to
explain that it was a bug and that their personal test case isn't the only
place where it fails.

* And even if the author did directly accuse MS of cheating, that doesn't mean we can't be more correct and rewrite it to be neutral.

~~~
awakeasleep
I totally agree with you on minimizing the controversy until incontrovertible
facts have been found.

In my personal experience, assuming bad will on another person's part kills
all chance of civil discussion and severely hurts your chances of finding more
facts, because people are on the defensive. Not to mention how hard it is to
remove that sort of egg stain from your face.

that'll be 2¢ please!

------
runjake
This submission demonstrates why you should just stick to the original article
title you're linking, instead of coming up with your own flamebait/trolly
title.

The issue appears to be a SunSpider bug, not an IE9 bug or "cheat". See
<http://news.ycombinator.com/item?id=1913368> for more information.

lordgilman, I hope you now realize it would've been wise to wait before
passing judgment (especially in a public forum).

 _Edit:_ I don't know what's with the downvotes. I'm just going by the HN
Guidelines, posted at <http://ycombinator.com/newsguidelines.html>. If you
have a problem, don't downvote me, take it up with pg.

~~~
bjg
The post you link to has edited its previous findings.

They now state that IE9 is certainly cheating on SunSpider.

I hope you now realize it would've been wise to wait before passing judgment
on lordgilman (especially in a public forum).

~~~
runjake
He simply should've followed the HN guidelines
(<http://ycombinator.com/newsguidelines.html>) and linked to the article and
stuck to the original title.

~~~
lordgilman
If I may be pedantic for a moment, the HN guidelines only warn against
"gratuitous editorial spin." I don't feel the changes were gratuitous at all
because (if you look at the last sentence in the blog post) my title is
clearly the point Mr. Sayrer is trying to make.

~~~
runjake
The changes were gratuitous. That's why the author didn't directly say "You're
cheating".

He wanted a direct answer while at the same time giving them the benefit of
the doubt.

But whatever, your mind is made up.

------
paulirish
Pretty much all browser vendors agree SunSpider is a bad benchmark, yet it
keeps getting used and abused. All vendors have tweaked their JS engines for
SunSpider itself.

Dromaeo is a much better benchmark suite in that it tests actual DOM things
rather than pure language stuff. Kraken (also by Moz) also attempts to focus
on webapp usecases rather than doing billions of regexes per second.

~~~
yatsyk
I believe that we need both DOM+JS and pure JS benchmarks.

~~~
rimantas
<http://dromaeo.com/?dom|jslib|cssquery>

------
julian37
In case you don't have IE9 installed, the benchmark results (quoted in the
previous blog post) are:

    
    
      cordic: 1ms +/- 0.0%
      cordic-with-return: 22.6ms +/- 2.7%
      cordic-with-true: 22.5ms +/- 2.7%
    

(Taken from [http://blog.mozilla.com/rob-sayre/2010/09/09/js-
benchmarks-c...](http://blog.mozilla.com/rob-sayre/2010/09/09/js-benchmarks-
closing-in/) )

~~~
seldo
Simplifying for my tiny brain: does this mean that IE9 takes more than 22x
longer to complete a "real" test instead of the "optimized" one?

~~~
cheald
Edit (yet again): My initial conclusions were wrong, and it's nearly certainly
cheating. Dammit. I hate being wrong in front of people smarter than me. :<

\----

I'm running the same benchmark independently right now. Core i7 in a Win7
64-bit install.

For each test, I did 5 runs and averaged them. I increased the number of loops
in each test from 25,000 to 250,000 as well.

    
    
      Chrome 9.0.576.0
            Stock: 105.28ms
      With "true": 104.44ms
    
      MS IE 9.0.7930.16406
            Stock: 10.98ms
      With "true": 181.16ms
    

That's a pretty interesting jump.

Here's a fun little observation:

    
    
      Chrome: 8.6ms
          IE: 10.9ms
    
      --- tests/sunspider-0.9.1/math-cordic.js        2010-11-17 00:55:29.000000000 -0700
      +++ tests/sunspider-0.9.1-deadcode/math-cordic.js       2010-11-17 01:10:36.000000000 -0700
      @@ -63,6 +63,7 @@
           TargetAngle = FIXED(28.027);
           CurrAngle = 0;
           for (Step = 0; Step < 12; Step++) {
      +        return;
               var NewX;
               if (TargetAngle > CurrAngle) {
                   NewX = X - (Y >> Step);
    

By returning immediately out of the loop, Chrome's time drops by a factor of
12.1, whereas IE's stays pretty much constant.

I suspect what's happening here is that the IE engine is somehow marking that
entire function as deadcode, and thus, not running it; the ~10ms accounts for
the time it takes to run that for loop 250k times, but the cordicsincos() code
is not being run at all. Ironically, dead code somewhere in the function
causes the engine to NOT throw it all away, and it gets run.

In fact, if we just kill that for loop altogether:

    
    
      --- tests/sunspider-0.9.1/math-cordic.js        2010-11-17 00:55:29.000000000 -0700
      +++ tests/sunspider-0.9.1-deadcode/math-cordic.js       2010-11-17 01:16:36.000000000 -0700
      @@ -62,20 +62,6 @@
    
           TargetAngle = FIXED(28.027);
           CurrAngle = 0;
      -    for (Step = 0; Step < 12; Step++) {
      - *snip for brevity*
      -    }
       }
    

We get the following times:

    
    
      Chrome: 8.9ms
          IE: 10.6ms
    

What I suspect is that the IE engine is seeing "Okay, nothing is returned, and
nothing outside of the scope of this function is ever altered", so once it
steps into it, it just immediately returns. This is arguably correct behavior!
That code is, for all practical purposes, worthless.
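
A minimal sketch of that reasoning (my own illustration, not IE9's actual
implementation): an engine may legally skip a function whose writes never
escape, but must run one whose effects are observable.

```javascript
// Hypothetical illustration of legitimate dead code elimination.
// All of pureBusywork's writes stay in locals and nothing is returned,
// so a sufficiently smart engine could skip the call entirely.
function pureBusywork() {
    var x = 0;
    for (var i = 0; i < 1000; i++) {
        x = x + i; // x dies here; no observable effect
    }
}

// One write escapes to an outer variable, so this loop must actually run.
var escaped = 0;
function observableWork() {
    for (var i = 0; i < 1000; i++) {
        escaped = escaped + i;
    }
}

pureBusywork();
observableWork();
console.log(escaped); // 499500: observable, therefore not eliminable
```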

If we just move one of the variable references out of function scope (or just
remove the _var_ , making it effectively a global variable), IE takes the
extra time to run:

    
    
      --- tests/sunspider-0.9.1/math-cordic.js        2010-11-17 00:55:29.000000000 -0700
      +++ tests/sunspider-0.9.1-deadcode/math-cordic.js       2010-11-17 01:22:49.000000000 -0700
      @@ -50,11 +50,11 @@
      
      +var CurrAngle;
       function cordicsincos() {
           var X;
           var Y;
           var TargetAngle;
      -    var CurrAngle;
           var Step;
    
           X = FIXED(AG_CONST);         /* AG_CONST * cos(0) */
    
      Chrome: 99.9ms
          IE: 217.1ms
    

Sorry, guys. I like a good IE bash-fest (Hey, it's still slower than V8 when
it actually runs the code!) as much as anyone, but I think it's legit here.
The benchmark is poorly-conceived, and IE does the right thing with it, though
it obviously distorts the scores in their favor. That's a problem with the
benchmark, though, not IE.

Edit (like...#14): It could well just be cheating on analysis in this
particular case, which I stupidly overlooked. For example, this diff:

    
    
      --- tests/sunspider-0.9.1/math-cordic.js        2010-11-17 00:55:29.000000000 -0700
      +++ tests/sunspider-0.9.1-deadcode/math-cordic.js       2010-11-17 02:09:34.000000000 -0700
      @@ -62,7 +62,7 @@
    
           TargetAngle = FIXED(28.027);
           CurrAngle = 0;
      -    for (Step = 0; Step < 12; Step++) {
      +    for (Step = 12; Step > 0; Step--) {
               var NewX;
               if (TargetAngle > CurrAngle) {
                   NewX = X - (Y >> Step);
    

Results in runtimes of:

    
    
      Chrome: 246.8ms
          IE: 956.5ms
    

Replacing the for with a while also results in long runtimes:

    
    
      --- tests/sunspider-0.9.1/math-cordic.js        2010-11-17 00:55:29.000000000 -0700
      +++ tests/sunspider-0.9.1-deadcode/math-cordic.js       2010-11-17 02:12:03.000000000 -0700
      @@ -59,10 +59,11 @@
    
           X = FIXED(AG_CONST);         /* AG_CONST * cos(0) */
           Y = 0;                       /* AG_CONST * sin(0) */
      +    Step = 0;
    
           TargetAngle = FIXED(28.027);
           CurrAngle = 0;
      -    for (Step = 0; Step < 12; Step++) {
      +    while(Step < 12) {
               var NewX;
               if (TargetAngle > CurrAngle) {
                   NewX = X - (Y >> Step);
      @@ -75,6 +76,7 @@
                   X = NewX;
                   CurrAngle -= Angles[Step];
               }
      +        Step++;
           }
       }
    
      Chrome: 103.4ms
          IE: 190.0ms
    

So, my initial conclusions were wrong. Its dead code analysis is either
incredibly narrow, or it was hand-crafted to optimize out that part of the
benchmark. Either way it's rubbish.

~~~
JoachimSchipper
You missed the point. Yes, dead code analysis may make sense (although a
benchmark that includes dead code is probably a bad benchmark...), but this
"dead code analysis" fails on utterly trivial variations of the benchmark
(<http://people.mozilla.com/~sayrer/2010/sunspider/diff1.html>,
<http://people.mozilla.com/~sayrer/2010/sunspider/diff2.html> \- adding a
"true" in the middle breaks it!).

The most likely conclusion is that IE _doesn't_ do any "real" dead code
analysis; it just recognizes this particular snippet.
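
To make the distinction concrete, here's a toy sketch (pure speculation, not
a claim about Chakra's internals) of what "recognizing this particular
snippet" would look like, as opposed to doing real analysis:

```javascript
// A snippet-matching "optimizer": it keys on the exact benchmark source,
// so any trivial variation falls off the fast path.
function cheatingOptimizer(src) {
    var knownSnippet = "for (Step = 0; Step < 12; Step++)";
    if (src.indexOf(knownSnippet) !== -1) {
        return "skipped";  // "optimized" without any real analysis
    }
    return "executed";
}

console.log(cheatingOptimizer("for (Step = 0; Step < 12; Step++) { work(); }")); // "skipped"
console.log(cheatingOptimizer("for (Step = 12; Step > 0; Step--) { work(); }")); // "executed"
```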

~~~
cheald
I think that regrettably, you might be right. It's obviously not just checking
for a bytecode match (see my var foo example), but it's doing something hinky.
I did a simple pow-and-modulo test with the same assumptions and it didn't
optimize it away.

~~~
Andrewski
This isn't "regrettable" it's simply par for the course from everybody's
favorite tech company.

The only thing they will regret is getting caught, much like any sociopath.

~~~
cheald
It's absolutely regrettable. If this was legit, it would mean that the browser
would be faster, the user experience would be better, and developers would be
another tiny step closer to having an easier time of things when working in
IE. I don't feel sorry for Microsoft here, but I'm a web developer, and I want
fast, continually-improving browsers to code against.

------
kenjackson
A better test to see if IE9 is cheating is to remove/rearrange code and rename
variables. I'd avoid changing operators. Adding a 'true;' or 'return;' may
seem harmless, but if their analysis is fragile they may just flag those
statements as "may have side-effects", or (in the case of the 'return;') it
may not do liveness analysis on the other side of the block.

This code (taken from this thread) seems like a good test:

    
    
      function numNumNum() {
          var I;
          var num = 10;
          for (I = 0; I < 10; I++) {
              num = num * num * num * num * num % num;
          }
      }
    

Except it uses two new operators: '*' and '%'. Test the same code using '+'
and '-'.
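
For instance, the variant might look like this (my transcription; untested
against the actual SunSpider harness):

```javascript
// Same shape as numNumNum, but using only '+' and '-', operators the
// engine has already seen in math-cordic. The loop body is still dead:
// num never escapes the function.
function numNumNumAddSub() {
    var I;
    var num = 10;
    for (I = 0; I < 10; I++) {
        num = num + num - num + num - num;
    }
}
numNumNumAddSub();
```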

This will give a much better idea of whether the analysis is just fragile or
whether this code was being targeted.

~~~
JoachimSchipper
At what point does "fragile" become "targeted"? Seriously, if it's that
narrow...

~~~
kenjackson
Well there's really three words of interest here: fragile, targeted, and
cheating.

Cheating is really doing something like looking specifically for sunspider and
then doing DCE based on knowing the function.

Fragile is distinct from cheating in that there is actually a real analysis
framework in place, but the analysis can be invalidated easily. For example,
it's not uncommon to see analysis assume function calls may write to all
globals and modify all byref arguments. Looking at the code you can say, "with
interprocedural analysis it's obvious that this function has no side effects",
but the analysis may not be that smart. That's an example of fragility.
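
A sketch of that fragility (hypothetical engine behavior, my own example): a
human can see foo() is pure, but an analysis that assumes any call may write
all globals can no longer prove the loop dead.

```javascript
// foo is side-effect free, but proving that requires interprocedural analysis.
function foo() { return 0; }

function busywork() {
    var x = 0;
    for (var i = 0; i < 100; i++) {
        // A human sees this loop is dead (x never escapes, foo is pure),
        // but an analysis that assumes any call "may write all globals"
        // must keep the whole loop.
        x = x + i + foo();
    }
}
busywork();
```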

Now with this example, given that the browser is in Beta/CTP I wouldn't be at
all surprised if their framework was simply incomplete. That the 'return;'
statement causes a problem while renaming and reordering variables doesn't is
the clearest indication, IMO. It seems to indicate that they aren't doing any
liveness analysis on the backside, but they aren't doing simple pattern
matching on the text, nor the IR.

Targeting is really about how one brings up the framework. I actually wouldn't
be surprised to hear that they did target sunspider, and that sunspider is
probably part of their regression suite. With that said, this is EXTREMELY
common in the compiler industry.

Now the question you're arguing is does targeting == cheating? In most cases,
no. In fact my suspicion is that what we're seeing here is the result of
either an incomplete implementation where they did target sunspider, or a more
complete implementation that broke, but no one noticed because its main DCE
test was sunspider.

If IE9 can turn this around with a fix in their next CTP, it was probably not
cheating and just a case of targeting. The reason being that doing a static
analysis framework that is capable of being robust in these situations is non-
trivial, and not something you just add in post-beta.

And if someone could run the test I posted above with '+' '-' rather than '*'
'%' we'd have a first step in our answer. I would do it myself, but I don't
know the SunSpider harness and don't have IE9 installed (and getting a new VM
on this particular machine is a hassle).

------
nkurz
It certainly seems like Microsoft is 'cheating', but it also seems like an
excellent but warped example of Test Driven Development: they solved the
failing test by the simplest and most direct means available. If time and
budget hold out they will refactor later to generalize.

How do the TDD proponents feel about Microsoft's approach? How is it different
than the supposedly correct behaviour demonstrated here:
[http://thecleancoder.blogspot.com/2010/10/craftsman-62-dark-...](http://thecleancoder.blogspot.com/2010/10/craftsman-62-dark-
path.html)

~~~
stoney
Well, in this fantasy TDD scenario (as in, we don't know what happened inside
MS, so I'm just making stuff up), presumably the product requirement was "make
IE9 look very fast on the benchmarks without getting caught cheating".

So sure, they solved the first failing unit test (make IE9 look quick), but
don't seem to have written enough unit tests to make sure the second part of
the requirement works. So they would fail the acceptance tests and have to
keep working on it.

(Wouldn't count myself as a full on TDD proponent but do use it when the time
is right.)

------
chollida1
The actual blog post title is:

> Reporting a bug on a fragile analysis

~~~
catch23
I think the title is pretty accurate based on the diffs, though. In one
example, the only difference was adding "true;" in the middle of the code
somewhere -- essentially, adding a no-op instruction causes vastly different
benchmark results? Definitely fraud.

I wouldn't be surprised if microsoft added code like "if (isSunspiderTest)
{loadHandOptimizedAssembly()}"

~~~
ndl
Adding seemingly trivial things to code can sometimes throw off performance
entirely, by disaligning cache tables and such. It's not always cheating.

That said, if this is a pure bug, it seems pretty pathetic. For one, it proves
that the engine is not robust. For another, it probably means that someone
spent hours upon hours tweaking the code with only the sunspider benchmark as
test - analogous to over-fitting the training data. It's really tempting to do
this, but it's also a common enough amateur mistake that Microsoft should have
best practices to avoid it.

All this is speculative for now. Let's see what they say.

~~~
rbanffy
> disaligning cache tables and such

This is JavaScript we are talking about.

------
niyazpk
IIRC there was a Microsoft website which listed a few HTML demos in which IE9
was way faster than even Google Chrome. I wonder whether they used the same
'technique' there too.

~~~
danpker
I think that was mostly down to IE9 using hardware acceleration, which Chrome
doesn't have yet.

------
itissid
There's a paradigm in machine learning called overfitting: trying to do well
on a test dataset by cheating and seeing it first... I think the benchmark
should choose tests randomly from a large set of tests and calculate the
expected performance over a number of such random runs, not allowing anyone
to cheat...

~~~
thestoicattack
Over-fitting and peeking at the test set are completely different things.
Over-fitting may in fact _degrade_ performance on a test set, because it means
you are giving too much weight to idiosyncratic patterns in the training data.
Peeking at the test data, however, is right out, and should invalidate any
results you try to report.

------
pohl
This was revealed 68 days ago, but nobody seemed to be interested in it at the
time:

<http://news.ycombinator.com/item?id=1676827>

------
scottdw2
That's a pretty big conclusion to jump to (they are cheating the test) based
on a small amount of evidence. If they were "precompiling" the JavaScript for
the test, and had functionality to "precompile" JavaScript code in the cache,
would the fact that they precompiled the benchmark mean they were cheating?
No. It wouldn't.

Keep in mind that there is a lot of code, such as jQuery, that is identical
but distributed from many sources. It could benefit from similar matching and
pre-compilation.

If dead code analysis (and other optimizations) was part of an "offline"
compilation step (that's not efficient enough to do online), then changing the
code would result in a slower execution path. Once the method body changes,
the compiler wouldn't know it was dead without re-running the analysis (the
changes could introduce side effects).

Now, this doesn't mean they are not cheating, because there is no evidence
either way. But, what you are observing in this case doesn't imply cheating
either.

~~~
mdda
Did you look at the diffs? There's not much room for the kind of excuses
you're coming up with.

~~~
scottdw2
Yes. I'm not coming up with an excuse.

Other JavaScript engines, like the one in WebKit, minimize the amount of
analysis they do of JavaScript source, in order to avoid extra overhead.
Something like an optimizing compilation pass is generally too slow to be done
online. It would delay page load time considerably.

But, if it could be done offline, operating on cached, frequently used pages,
it could improve runtime considerably.

If one were to implement such a system for JS, it would make sense to use
file hashes as keys to the precompiled code index, and fall back on slower
methods for cache misses, until such time as the offline process could compile
the code. Small (non-whitespace) changes, like the ones in the diffs, would
trigger hash changes.

Given such a system, precompiling the benchmark is not cheating. My point is
that you are confusing necessary with sufficient conditions, and are making
damning conclusions without proper evidence.

~~~
mdda
Ok, so your hypothesis is that this benchmark is fairly frequently executed,
so that it's reasonable to think that a precompiled version is stored
somewhere?

In that case, to avoid the accusation of cheating, the choice of precompiled
code should have an algorithmic basis: for instance, something akin to the
Alexa rank of the .js files at various CDNs. That would make sure that jQuery
would be precompiled, which could well be rational.

But I seriously doubt that such an objective method would include this
benchmark code in the IE precompiled payload...

~~~
scottdw2
If they have the ability to precompile JS code, they would, of course,
precompile the benchmark. Why would you run a benchmark in "slow" mode if you
had a fast mode available? There's nothing wrong with precompiling the
benchmark.

I'm not saying that's what they are doing, because I don't know. I'm saying
that the conclusion of cheating is unfounded.

------
olalonde
Could anyone explain what is "dead code analysis"?

Update: I still don't get why "the SunSpider math-cordic benchmark is very
fast, presumably due to some sort of dead code analysis". Didn't the author
prove exactly the opposite by showing SunSpider is slower when adding dead
code to the benchmark? Sorry for the noob question.

~~~
ashish01
Finding code which is executed (so it's not unreachable) but whose results
are not used. It's a kind of optimization that speeds up program execution by
not doing unnecessary work.

<http://en.wikipedia.org/wiki/Dead_code>

vs

<http://en.wikipedia.org/wiki/Unreachable_code>
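
A one-function illustration of the difference (my own example):

```javascript
function example(n) {
    var wasted = n * 2;  // dead code: it runs, but the result is never used
    if (false) {
        n = -1;          // unreachable code: it can never run at all
    }
    return n + 1;
}
console.log(example(4)); // 5
```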

------
pers3us
What about this reply on the IE blog?

[http://blogs.msdn.com/b/ie/archive/2010/11/17/html5-and-
real...](http://blogs.msdn.com/b/ie/archive/2010/11/17/html5-and-real-world-
site-performance-seventh-ie9-platform-preview-available-for-
developers.aspx?PageIndex=4#comments)

