I know that my submission's title is not the same as the blog post's title and that I will get some hate for it. However, the two diffs linked in the post give pretty convincing evidence that IE is picking up on the exact SunSpider test. Furthermore, if you read the last sentence of the blog post the author is more or less beating around the "You're cheating, we've caught you red-handed" bush.
I don't think it's ever a good idea to make a headline more sensational. We should wait for a response from Microsoft before declaring them liars/cheaters. It's entirely possible that Microsoft has a valid explanation.
Personally, on at least two occasions I've been accused of writing code specifically to cause grief/problems for another person, only to have to explain that it was a bug and that their personal test case isn't the only place where it fails.
* And even if the author did directly accuse MS of cheating, that doesn't mean we can't be more correct and rewrite it to be neutral.
I totally agree with you on minimizing the controversy until incontrovertible facts have been found.
In my personal experience, assuming bad will on another person's part kills all chance of civil discussion and severely hurts your chances of finding more facts, because people are on the defensive. Not to mention how hard it is to remove that sort of egg stain from your face.
IMHO, you should not draw conclusions until you've corroborated the evidence the blog author provided. There's a reason he didn't title the blog post that way. It's you who are saying MS cheated, so it's on you to prove it with evidence. Also, "more or less" has never meant "conclusive." Just a thought: the title is misleading, and a question mark would have avoided this issue entirely.
This submission demonstrates why you should just stick to the original article title you're linking, instead of coming up with your own flamebait/trolly title.
lordgilman, I hope you now realize it would've been wise to wait before passing judgment (especially in a public forum).
Edit: I don't know what's with the downvotes. I'm just going by the HN Guidelines, posted at http://ycombinator.com/newsguidelines.html. If you have a problem, don't downvote me, take it up with pg.
If I may be pedantic for a moment, the HN guidelines only warn against "gratuitous editorial spin." I don't feel the changes were gratuitous at all because (if you look at the last sentence in the blog post) my title is clearly the point Mr. Sayrer is trying to make.
Pretty much all browser vendors agree SunSpider is a bad benchmark, yet it keeps getting used and abused. All vendors have tweaked their JS engines for SunSpider itself.
Dromaeo is a much better benchmark suite in that it tests actual DOM operations rather than pure language features. Kraken (also from Mozilla) likewise tries to focus on web-app use cases rather than doing billions of regexes per second.
> Pretty much all browser vendors agree SunSpider is a bad benchmark, yet it keeps getting used and abused. All vendors have tweaked their JS engines for SunSpider itself.
Still, there is a gap between tweaking the JS engine and running completely different code (a gap which most GPU makers jumped over without hesitation a few years ago; it's annoying to see the issue crop up again).
On SunSpider: "The WebKit SunSpider tests exercise less than 10% of the API’s available from JavaScript and many of the tests loop through the same code thousands of times. This approach is not representative of real world scenarios and favors some JavaScript engine architectures over others."
About the Dromaeo test: IE could call CAPICOM to handle AES, Base64, and RSA in the browser, which is super fast.
And personally I think all browsers could just expose an API for these kinds of encryption and computation-heavy tasks, secure random seeds, etc. Implementing those in JavaScript is just a temporary solution.
Edit (yet again): My initial conclusions were wrong, and it's nearly certainly cheating. Dammit. I hate being wrong in front of people smarter than me. :<
----
I'm running the same benchmark independently right now. Core i7 in a Win7 64-bit install.
For each test, I did 5 runs and averaged them. I increased the number of loops in each test from 25,000 to 250,000 as well.
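A wrapper along these lines is enough to reproduce that setup (a rough sketch, not the actual SunSpider driver; the function names are illustrative):

// Rough sketch of the methodology described above: time a function over
// many iterations, repeat several times, and average the results.
function timeIt(fn, loops) {
    var start = new Date();
    for (var i = 0; i < loops; i++) {
        fn();
    }
    return new Date() - start;   // elapsed milliseconds
}

function averageRuns(fn, loops, runs) {
    var total = 0;
    for (var r = 0; r < runs; r++) {
        total += timeIt(fn, loops);
    }
    return total / runs;
}

// e.g. averageRuns(cordicsincos, 250000, 5);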
Chrome 9.0.576.0
Stock: 105.28ms
With "true": 104.44ms
MS IE 9.0.7930.16406
Stock: 10.98ms
With "true": 181.16ms
When I return immediately out of the loop, Chrome's time drops by a factor of 12.1, whereas IE's stays pretty much constant.
I suspect what's happening here is that the IE engine is somehow marking that entire function as dead code, and thus not running it; the ~10ms accounts for the time it takes to run that for loop 250k times, but the cordicsincos() code is not being run at all. Ironically, dead code somewhere in the function causes the engine NOT to throw it all away, and it gets run.
In fact, if we just kill that for loop altogether:
What I suspect is that the IE engine is seeing "Okay, nothing is returned, and nothing outside of the scope of this function is ever altered", so once it steps into it, it just immediately returns. This is arguably correct behavior! That code is, for all practical purposes, worthless.
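To make the distinction concrete, here's a toy illustration (my own example, not the benchmark code and not a claim about the engine's internals): an engine could legitimately skip the first call below, but not the second.

// Illustrative only: nothing is returned and nothing outside the function is
// touched, so a conforming engine may skip this call entirely.
function pureBusyWork() {
    var total = 0;
    for (var i = 0; i < 250000; i++) {
        total += i * i;
    }
}

// Writes to a global, so the call cannot be skipped without changing behavior.
var lastTotal = 0;
function observableBusyWork() {
    var total = 0;
    for (var i = 0; i < 250000; i++) {
        total += i * i;
    }
    lastTotal = total;   // visible outside the function, so the work matters
}

pureBusyWork();        // eligible for dead code elimination
observableBusyWork();  // must actually run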
If we just move one of the variable references out of function scope (or just remove the var, making it effectively a global variable), IE takes the extra time to run:
--- tests/sunspider-0.9.1/math-cordic.js 2010-11-17 00:55:29.000000000 -0700
+++ tests/sunspider-0.9.1-deadcode/math-cordic.js 2010-11-17 01:22:49.000000000 -0700
@@ -50,11 +50,11 @@
+var CurrAngle;
function cordicsincos() {
var X;
var Y;
var TargetAngle;
- var CurrAngle;
var Step;
X = FIXED(AG_CONST); /* AG_CONST * cos(0) */
Chrome: 99.9ms
IE: 217.1ms
Sorry, guys. I like a good IE bash-fest (Hey, it's still slower than V8 when it actually runs the code!) as much as anyone, but I think it's legit here. The benchmark is poorly-conceived, and IE does the right thing with it, though it obviously distorts the scores in their favor. That's a problem with the benchmark, though, not IE.
Edit (like...#14): It could well just be cheating on analysis in this particular case, which I stupidly overlooked. For example, this diff:
So, my initial conclusions were wrong. Its dead code analysis is either incredibly narrow, or it was hand-crafted to optimize out that part of the benchmark. Either way it's rubbish.
I think that regrettably, you might be right. It's obviously not just checking for a bytecode match (see my var foo example), but it's doing something hinky. I did a simple pow-and-modulo test with the same assumptions and it didn't optimize it away.
It's absolutely regrettable. If this was legit, it would mean that the browser would be faster, the user experience would be better, and developers would be another tiny step closer to having an easier time of things when working in IE. I don't feel sorry for Microsoft here, but I'm a web developer, and I want fast, continually-improving browsers to code against.
The claim that IE is legit hinges on whether the diffs are exposing a genuine bug in IE's optimization code (e.g. dead code inside dead code preventing the outer block from being optimized out), versus foul play in the engine (a hard-coded case for this benchmark).
Can you find any variation on the benchmark code that still allows IE to optimize it, or does it only optimize the exact form of the code used in the benchmark suite?
I changed variable names and declaration order, the number of loops in that inner for loop, and other such things that could possibly change the bytecode (to what effect, I don't know - I'm not a JS VM engineer, obviously) without changing the operations actually performed.
I don't have an explanation for this, though (maybe variable initialization counts as "run up until here"?):
Runs fast (11ms):
--- tests/sunspider-0.9.1/math-cordic.js 2010-11-17 00:55:29.000000000 -0700
+++ tests/sunspider-0.9.1-deadcode/math-cordic.js 2010-11-17 01:42:43.000000000 -0700
@@ -56,12 +56,14 @@
var TargetAngle;
var CurrAngle;
var Step;
+ var foo;
X = FIXED(AG_CONST); /* AG_CONST * cos(0) */
Y = 0; /* AG_CONST * sin(0) */
TargetAngle = FIXED(28.027);
CurrAngle = 0;
+ foo = 1;
for (Step = 0; Step < 12; Step++) {
var NewX;
if (TargetAngle > CurrAngle) {
But if I assign foo after the for loop, it runs slow:
--- tests/sunspider-0.9.1/math-cordic.js 2010-11-17 00:55:29.000000000 -0700
+++ tests/sunspider-0.9.1-deadcode/math-cordic.js 2010-11-17 01:43:20.000000000 -0700
@@ -56,6 +56,7 @@
var TargetAngle;
var CurrAngle;
var Step;
+ var foo;
X = FIXED(AG_CONST); /* AG_CONST * cos(0) */
Y = 0; /* AG_CONST * sin(0) */
@@ -76,6 +77,7 @@
CurrAngle -= Angles[Step];
}
}
+ foo = 1;
}
If I assign foo inside the for loop, it runs fast:
--- tests/sunspider-0.9.1/math-cordic.js 2010-11-17 00:55:29.000000000 -0700
+++ tests/sunspider-0.9.1-deadcode/math-cordic.js 2010-11-17 01:44:41.000000000 -0700
@@ -56,6 +56,7 @@
var TargetAngle;
var CurrAngle;
var Step;
+ var foo;
X = FIXED(AG_CONST); /* AG_CONST * cos(0) */
Y = 0; /* AG_CONST * sin(0) */
@@ -63,6 +64,7 @@
TargetAngle = FIXED(28.027);
CurrAngle = 0;
for (Step = 0; Step < 12; Step++) {
+ foo = 1;
var NewX;
if (TargetAngle > CurrAngle) {
NewX = X - (Y >> Step);
The point is whether it does the right thing with all similar dead code, or whether a human being made the same analysis as you and added a shortcut that triggers when it sees this exact benchmark code.
The Mozilla guys clearly know this is dead code; do you really want them and every other JavaScript engine adding code targeted at this exact code snippet?
That's a good point; it appears to be doing legit dead code analysis, but the point still remains that if it's custom-tailored to detect that as dead code when it won't do it in the general case, it's cheating.
I may be eating my hat here, because I just replaced the cordicsincos() with the following:
function numNumNum() {
    var I;
    var num = 10;
    for (I = 0; I < 10; I++) {
        num = num * num * num * num * num % num;
    }
}
Using the same benchmarking framework, I get these times:
Chrome: 849.5ms
IE: 1226.4ms
That would seem to satisfy all the previous conditions - no leaked scope, no return, no external functions - but it doesn't get optimized away. I'd assumed that "cheating" would mean hot-swapping the benchmark's bytecode for optimized bytecode, or running a function in C or something, rather than just cheating on the dead code optimization. Bad assumptions make for bad benchmarks!
--- tests/sunspider-0.9.1/math-cordic.js 2010-11-17 00:55:29.000000000 -0700
+++ tests/sunspider-0.9.1-deadcode/math-cordic.js 2010-11-17 15:08:43.000000000 -0700
@@ -80,11 +80,15 @@
///// End CORDIC
+function numNumNum() { var I; var num = 10; for (I = 0; I < 10; I++) { num = num + num + num + num + num - num; } }
+
+///// End CORDIC
+
function cordic( runs ) {
var start = new Date();
for ( var i = 0 ; i < runs ; i++ ) {
- cordicsincos();
+ numNumNum();
}
var end = new Date();
Chrome: 19.2ms
IE: 1.0ms
I think this is just fragility, not cheating. I don't know JS super well, but in some languages there are rules tied to certain operations, like preserving overflow/underflow exceptions and such.
In any case I think a few things happened here:
1) For whatever reason the "true" statement caused the compiler to think there was a potential side effect. I suspect the compiler simply didn't know what to do with it, and that they hadn't handled 'true;' or 'false;' as standalone statements in their optimizer. I bet if you put 'true;' in the middle of that loop it will break the DCE (see the sketch after this list).
2) They probably don't do liveness analysis. So they can see that a block doesn't change global state, but they don't check whether the following blocks use any of the variables. So if there is any code after a block, they assume they can't DCE that block.
3) '*' and '%' causing problems may be very specific to those operations, and I'm guessing '/' too.
All in all, I'd say it's a targeted but incomplete implementation, not cheating, based on what I've seen so far.
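For concreteness, the experiment from point 1 would look roughly like this (a simplified stand-in for cordicsincos, not one of the published diffs):

// Simplified stand-in with a bare `true;` dropped into the loop body. The
// function still has no return value and no effect outside its own scope;
// if the no-op statement alone defeats the dead code elimination, that
// points to a fragile (or hand-targeted) optimizer rather than real analysis.
function deadLoopWithTrue() {
    var x = 0;
    var step;
    for (step = 0; step < 12; step++) {
        true;           // the no-op statement under test
        x = x + step;   // result is never used
    }
}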
This reminds me a lot of benchmarking Haskell, which does significant dead code elimination and was thus breaking benchmarks. The benchmarks were, of course, modified to do more with the looping code and ensure that all the code paths were run, but the interesting moral is that artificial benchmarks sometimes, unsurprisingly, don't test what you think they're testing.
I have to say, that is some damn good code analysis. The function executes, nothing external happens in the function, nothing is returned from the function: the function is dead code, don't run it.
So yes, while IE9 is technically slower than Chrome, it does good stuff with code analysis, which Chrome should do too. Impressive, by the way. :)
In the end the benchmark should be augmented to force the function to run.
It would be good code analysis if it applied in the general case. That was my initial assumption too - that IE is doing the right thing - but the fact that it fails to apply this same analysis in other cases where the same conditions apply (no external scope modified, no returns, etc) makes it feel awfully suspiciously like it's cheating on the analysis of that function in particular.
I don't have IE9 installed and therefore can't verify the benchmark results. If they are genuine, I'm having a hard time coming up with a different conclusion.
By the way, hilarious idea to make this into a bug report.
EDIT: of course I find it much easier to believe that someone at Microsoft optimized for the benchmark than that someone at Mozilla would fudge the timing results, especially when it's so easy to verify the claims.
Um... are you using the right version of IE9? The current version is IE9 PP7 (1.9.8023.6000), which produced a SunSpider result of 216ms. You are testing the IE9 beta, which was released back in September. Of course the old version of IE9 will be much slower. By the way, the SunSpider result for the IE9 beta was about 340ms.
A better test of whether IE9 is cheating is to remove/rearrange code and rename variables. I'd avoid changing operators. Adding a 'true;' or 'return;' may seem harmless, but if their analysis is fragile they may just flag those statements as "may have side effects", or (in the case of the 'return;') fail to do liveness analysis on the other side of the block.
This code (taken from this thread) seems like a good test:
function numNumNum() {
    var I;
    var num = 10;
    for (I = 0; I < 10; I++) {
        num = num * num * num * num * num % num;
    }
}
Except it uses two new operators: '*' and '%'. Test the same code using '+' and '-'.
This will give a much better idea of whether the analysis is just fragile or whether this code is being targeted.
Well, there are really three words of interest here: fragile, targeted, and cheating.
Cheating is really doing something like looking specifically for sunspider and then doing DCE based on knowing the function.
Fragile is distinct from cheating in that there is actually a real analysis framework in place, but the analysis can be invalidated easily. For example, it's not uncommon to see an analysis assume that function calls may write to all globals and modify all by-ref arguments. Looking at the code you can say, "with interprocedural analysis it's obvious that this function has no side effects", but the analysis may not be that smart. That's an example of fragility.
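As a toy illustration of that kind of fragility (my own example, not IE's actual behavior):

// A conservative analysis may assume any function call can touch global
// state, so the loop below no longer looks "obviously" dead to the
// optimizer, even though log10 here has no side effects at all.
function log10(x) {
    return Math.log(x) / Math.LN10;
}

function deadButHardToProve() {
    var total = 0;
    for (var i = 1; i < 1000; i++) {
        total += log10(i);   // the call forces the conservative assumption
    }
}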
Now, in the case at hand, given that the browser is in beta/CTP, I wouldn't be at all surprised if their framework were simply incomplete. The fact that a 'return;' statement causes a problem while renaming and reordering variables doesn't is the clearest indication, IMO. It seems to indicate that they aren't doing any liveness analysis on the back side, but that they aren't doing simple pattern matching on the text or the IR either.
Targeting is really about how one brings up the framework. I actually wouldn't be surprised to hear that they did target sunspider, and that sunspider is probably part of their regression suite. With that said, this is EXTREMELY common in the compiler industry.
Now the question you're arguing is: does targeting == cheating? In most cases, no. In fact my suspicion is that what we're seeing here is the result of either an incomplete implementation where they did target sunspider, or a more complete implementation that broke but no one noticed, because its main DCE test was sunspider.
If IE9 can turn this around with a fix in their next CTP, it was probably not cheating and just a case of targeting. The reason being that doing a static analysis framework that is capable of being robust in these situations is non-trivial, and not something you just add in post-beta.
And if someone could run the test I posted above with '+' and '-' rather than '*' and '%', we'd have a first step toward an answer. I would do it myself, but I don't know the sunspider harness and don't have IE9 installed (and getting a new VM on this particular machine is a hassle).
It certainly seems like Microsoft is 'cheating', but it also seems like an excellent but warped example of Test Driven Development: they solved the failing test by the simplest and most direct means available. If time and budget hold out they will refactor later to generalize.
Well in this fantasy TDD scenario (as in we don't know what happened in MS so I'm just making stuff up), presumably the product requirement was "make IE9 look very fast on the benchmarks without getting caught cheating".
So sure, they solved the first failing unit test (make IE9 look quick), but don't seem to have written enough unit tests to make sure the second part of the requirement works. So they would fail the acceptance tests and have to keep working on it.
(Wouldn't count myself as a full on TDD proponent but do use it when the time is right.)
I think the title is pretty accurate based on the diffs, though. In one example, the only difference was adding "true;" in the middle of the code somewhere -- essentially, adding a no-op instruction causes vastly different benchmark results? Definitely fraud.
I wouldn't be surprised if microsoft added code like "if (isSunspiderTest) {loadHandOptimizedAssembly()}"
Adding seemingly trivial things to code can sometimes throw off performance entirely, for example by changing how things line up in the cache. It's not always cheating.
That said, if this is a pure bug, it seems pretty pathetic. For one, it proves that the engine is not robust. For another, it probably means that someone spent hours upon hours tweaking the code with only the SunSpider benchmark as a test, analogous to over-fitting the training data. It's really tempting to do this, but it's also a common enough amateur mistake that Microsoft should have best practices to avoid it.
All this is speculative for now. Let's see what they say.
Noooooo! I can already imagine the pain writing HTML5 in 2014: "Well, IE9 is too slow to do common thing X reliably, so let's trigger the SunspiderTest optimizations using this hack..."
It's like finding out that IE9 only performs well on the subset of JS needed when you are drawing a fishtank..
The second example is even more of a smoking gun -- adding a "return;" to the end of a function shouldn't affect optimization within an earlier loop. Especially not that much!
IIRC there was a Microsoft website which listed a few HTML demos in which IE9 was way faster than even Google Chrome. I wonder whether they used the same 'technique' there too.
There's a paradigm in machine learning called overfitting.
Trying to do well on a test dataset by cheating and seeing it first...
I think the benchmark should choose tests randomly from a large set of tests and calculate the expected performance over a number of such random runs, not allowing anyone to cheat...
Over-fitting and peeking at the test set are completely different things. Over-fitting may in fact degrade performance on a test set, because it means you are giving too much weight to idiosyncratic patterns in the training data. Peeking at the test data, however, is right out, and should invalidate any results you try to report.
If I understand you correctly, what you are suggesting is that one way to improve deadcode analysis would be to start with known dead code and compare the results the deadcode analysis algorithm to the results achieved by "cheating."
Given that SunSpider is a known example of dead code and that using it is easier than writing a new dead-code benchmark, your explanation seems somewhat plausible (assuming I am understanding you correctly).
Edit: As a general case, there would seem to be a legitimate rationale for recognizing standard JavaScript snippets and loading pre-compiled routines to improve execution.
That's a pretty big conclusion to jump to (that they are cheating the test) based on a small amount of evidence. If they were precompiling the JavaScript for the test, and had functionality to precompile JavaScript code in the cache, would the fact that they precompiled the benchmark mean they were cheating? No, it wouldn't.
Keep in mind that there is a lot of code, such as jQuery, that is identical but distributed from many sources. It could benefit from similar matching and pre-compilation.
If dead code analysis (and other optimizations) was part of an "offline" compilation step (that's not efficient enough to do online), then changing the code would result in a slower execution path. Once the method body changes, the compiler wouldn't know it was dead without re-running the analysis (the changes could introduce side effects).
Now, this doesn't mean they are not cheating, because there is no evidence either way. But, what you are observing in this case doesn't imply cheating either.
Other JavaScript engines, like the one in WebKit, minimize the amount of analysis they do of JavaScript source in order to avoid extra overhead. Something like an optimizing compilation pass is generally too slow to be done online; it would delay page load time considerably.
But if it could be done offline, operating on cached, frequently used pages, it could improve runtime considerably.
If one were to implement such a system for JS, it would make sense to use file hashes as keys into the precompiled code index, and fall back on slower methods for cache misses until such time as the offline process could compile the code. Small, non-whitespace changes, like the ones in the diffs, would trigger hash changes.
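A minimal sketch of that scheme (every name here is hypothetical; this is not a claim about how any shipping engine works):

// Hypothetical precompiled-code cache keyed by a hash of the source text.
// Any edit to the source (like the diffs in the post) changes the key and
// causes a miss, which falls back to a quick baseline compile.
var precompiledCache = {};

function hashSource(source) {
    // Toy stand-in for a real content hash such as SHA-1.
    var h = 0;
    for (var i = 0; i < source.length; i++) {
        h = (h * 31 + source.charCodeAt(i)) | 0;
    }
    return h;
}

function getExecutableCode(source, compileBaseline) {
    var key = hashSource(source);
    if (precompiledCache[key]) {
        return precompiledCache[key];   // hit: reuse the offline-optimized build
    }
    // Miss: compile quickly now; an offline pass could populate the cache later.
    return compileBaseline(source);
}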
Given such a system, precompiling the benchmark is not cheating. My point is that you are confusing necessary with sufficient conditions, and are making damning conclusions without proper evidence.
Ok, so your hypothesis is that this benchmark is fairly frequently executed, so that it's reasonable to think that a precompiled version is stored somewhere?
In that case, to avoid the accusation of cheating, the choice of precompiled code should have an algorithmic basis: for instance, something akin to the Alexa rank of the .js files at various CDNs. That would make sure that jQuery would be precompiled, which could well be rational.
But I seriously doubt that such an objective method would include this benchmark code in the IE precompiled payload...
If they have the ability to precompile JS code, they would, of course, precompile the benchmark. Why would you run a benchmark in "slow" mode if you had a fast mode available? There's nothing wrong with precompiling the benchmark.
I'm not saying that's what they are doing, because I don't know. I'm saying that the conclusion of cheating is unfounded.
Could anyone explain what "dead code analysis" is?
Update: I still don't get why "the SunSpider math-cordic benchmark is very fast, presumably due to some sort of dead code analysis.". Didn't the author prove exactly the opposite by showing SunSpider is slower when adding dead code to the benchmark? Sorry for the noob question.
Finding code which is executed (so it's not unreachable) but whose results are not used. It's a kind of optimization that speeds up program execution by not doing unnecessary work.
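A tiny made-up example:

// The loop runs and computes `sum`, but the value is never returned, stored,
// or printed, and nothing outside the function is touched. An optimizer can
// delete the loop (or skip the whole function) without changing the program's
// observable behavior.
function wastedEffort() {
    var sum = 0;
    for (var i = 0; i < 100000; i++) {
        sum += i;
    }
}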
If you start by assuming that SunSpider is very fast because of dead code analysis, then adding more dead code shouldn't change anything. But it does. So their dead code analysis seems less like real analysis and more like "this looks like the SunSpider test, so we know we can ignore this piece of code". And if that is true, then they are "cheating", because the result no longer reflects normal behaviour; it is tailored exactly for this test.
It seems the hypothesis is that the benchmark originally goes very fast due to dead-code analysis (the function 'cordicsincos' has been marked as dead code, and is therefore not executed).
That the test goes much slower when more dead code is added (code that in no way 'undeadens' the 'cordicsincos' function, indeed code that does nothing at all) implies that the dead-code analysis being done is either not really dead-code analysis at all but simply looking for this specific function (this would be the 'cheating' hypothesis), or, more charitably, the dead-code analysis could merely be extremely fragile.
Dead code is a code segment that either is not reachable during execution or does not contribute anything to the result. If you detect such code and skip it, you save CPU cycles and get a better benchmark result.