I call these "weekend benchmarks" -- what you'd typically do when you have a block of free time, then spend time optimizing for said benchmark. Roll in on Monday with some staggering results, only to find that one (or many) of your variables was off.
Did the author try multiple instances on each provider? VM tenancy is a bitch. (Think of how annoyed you get at the noise when your neighbor in the next apartment throws a party.)
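One cheap way to spot a noisy neighbor on a Linux guest is to watch CPU steal time. A minimal sketch, assuming /proc/stat is available (the function name is just illustrative):

    import time

    def cpu_steal_fraction(interval=5.0):
        # Sample the aggregate "cpu" line of /proc/stat twice and return
        # the fraction of CPU time the hypervisor stole over the interval.
        # Field order after "cpu": user nice system idle iowait irq softirq steal ...
        def read_cpu():
            with open("/proc/stat") as f:
                values = [int(v) for v in f.readline().split()[1:]]
            return sum(values), values[7]
        total_a, steal_a = read_cpu()
        time.sleep(interval)
        total_b, steal_b = read_cpu()
        return (steal_b - steal_a) / float(total_b - total_a)

    # A persistently high steal fraction suggests a busy co-tenant;
    # launch several instances and compare before trusting any numbers.
    print("steal: %.1f%%" % (100 * cpu_steal_fraction()))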
Is the author's source benchmarking machine a physical machine or a virtualized guest? Does it have power savings turned off, so that the process runs at 100% speed instead of on a variably clocked-down core?
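(On Linux you can at least check the frequency-scaling governor before trusting the box; a quick sketch, assuming the usual sysfs cpufreq layout:)

    import glob

    # Anything other than "performance" here means the benchmark may be
    # running on a variably clocked-down core.
    for path in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
        with open(path) as f:
            print(path, "->", f.read().strip())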
Did the author enable or disable TCP TIME_WAIT recycling, so that he doesn't bump into the ephemeral-port ceiling when running such tests back to back?
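You can watch for this directly: on Linux, sockets in TIME_WAIT show up as state code "06" in /proc/net/tcp. A rough sketch:

    # Count sockets stuck in TIME_WAIT. Tens of thousands of these
    # between back-to-back runs means the ephemeral port range is
    # filling up and connection setup will stall.
    def count_time_wait():
        count = 0
        for table in ("/proc/net/tcp", "/proc/net/tcp6"):
            try:
                with open(table) as f:
                    next(f)  # skip the header line
                    count += sum(1 for line in f if line.split()[3] == "06")
            except IOError:
                pass
        return count

    print("TIME_WAIT sockets:", count_time_wait())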
Did the author run the tests back to back, or allow a cool-down period between them?
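Even a dumb harness helps here -- something like this sketch, where run_benchmark is a placeholder for whatever actually drives the load and returns a single number (say, requests/sec):

    import time

    def run_trials(run_benchmark, trials=10, warmup=2, cooldown=60):
        # Discard the first runs (caches, connection pools, JIT warming
        # up) and sleep between trials so one run's leftover load and
        # TIME_WAIT sockets don't bleed into the next.
        results = []
        for i in range(trials):
            value = run_benchmark()
            if i >= warmup:
                results.append(value)
            if i < trials - 1:
                time.sleep(cooldown)
        return results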
Where was the author's network upstream located when he ran said tests? Were there any network issues at the time of the test? Would the author even be aware of them?
The page being tested makes the same database call every time, which is presumably cached. Did he throw out the cached results? Can he even identify which results were cached?
Are we firing up ApacheBench with HTTP keepalives enabled? With parallel connections? How many parallel connections?
How many Apache worker processes (StartServers, MinSpareServers, etc.)? Which httpd modules were enabled, and which were disabled? Which PHP modules were enabled?
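At a minimum, pin those knobs down in a harness so every platform gets hit identically. A sketch using ApacheBench's actual flags (-n total requests, -c parallel connections, -k keep-alive):

    import subprocess

    def run_ab(url, requests=10000, concurrency=50, keepalive=True):
        # Identical ApacheBench invocation against every target.
        cmd = ["ab", "-n", str(requests), "-c", str(concurrency)]
        if keepalive:
            cmd.append("-k")
        cmd.append(url)
        return subprocess.check_output(cmd).decode()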
You're trying to benchmark CPU and I/O horsepower across three different platforms, but you're doing it through a narrow "straw" that consists of your "independent server", your upstream, your upstream's upstream, your upstream's upstream's peering connection with Amazon/Linode/DigitalOcean, your web server and its PHP module, your application, and MySQL.
If you're rolling your eyes at this, then you shouldn't be doing weekend benchmarks.
Do you have something better to point to? It's easy to complain about stuff, but at least he's out there trying to do something. Presumably it can be improved.
I'm particularly fond of the quote "lead, follow, or get the hell out of the way", though it's a bit harsh in this case, because a lot of your advice is good. It could be framed in a more constructive way - there's some Comic Book Guy tone in your comment.
My intended tone was not "don't try", but "try harder".
I've listed at least 5 ways to improve/normalize the testing, and linked to a document that does a pretty good job of explaining statistics (particularly how programmers do a bad job of statistics, baselines for benchmarks, etc.).
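Even the basics help: report the spread, not just a single headline number. A minimal sketch using Python's statistics module (the sample values are made up):

    import statistics

    def summarize(samples):
        # Mean, stdev, and a rough 95% interval. Assumes roughly normal
        # samples; use a t-distribution for small n if you want rigor.
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        half_width = 1.96 * stdev / (len(samples) ** 0.5)
        return mean, stdev, (mean - half_width, mean + half_width)

    mean, stdev, ci = summarize([412.0, 398.5, 405.2, 391.8, 420.3])
    print("mean=%.1f stdev=%.1f 95%% CI=(%.1f, %.1f)"
          % (mean, stdev, ci[0], ci[1]))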
"At least he's out there trying" -- with this not-so-great benchmarking, the author has just effectively SHITTED on 2/3 companies that have gone to great lengths to build amazing infrastructure AND managed to spread his FUD around the web, to the point where it reached the HN front page -- and you want credit for trying?
It's true: you don't say something like "Well, Amazon just sucks." without backing the statement up with something more credible. As someone a little less savvy on the topic, I'm glad to know that the test wasn't even close to the final word, and why. Thank you.
It's probably also true that your tone is more abrasive than it needs to be.
They probably have some faults, but the general conclusions smell right to me: I don't think they're in the "really screwed up and wildly misleading" category, but in the "ok, interesting, could use some work though" category.
There is nothing you should be more wary of than a benchmark that matches your pre-existing intuition. It'll lead you to ignore serious methodological issues without any sound scientific (or other epistemological) reason. https://speakerdeck.com/alex/benchmarking is a slide deck from a talk I gave at my office on how to do better benchmarking.
EDIT: I should probably mention I work at Rackspace, and thus everything I say on this subject should be taken with the appropriate grain of salt :)
This statement is exactly the problem he is describing. :) One metric for a specific use case or scenario is a terrible indicator of overall "quality". It is much more nuanced than that. The worst tickets I've gotten in my 10 years of sysadmining so far are the ones where a customer just states their app is "slow".
Yes, in simplistic terms, I'm sure other providers have better hardware than AWS for a specific metric. If that is all one wants to base their definition of "better" on, then so be it, but that is pretty naive.
Many argue that the AWS ecosystem (25 services at last count) and its extensive feature set outweigh the bare-bones "fast" metrics of other providers.
I think, like the poster above is mentioning, there is generally more to it than a simple metric or two sampled a few times from a single endpoint. But I guess it all rests on one's definition of what they consider valuable...
Honest question: Why not launch your app's infrastructure on both platforms, round-robin your traffic between the two for a billing cycle, and compare the results at the end?
If you are working towards "best practices" on AWS, you should be running multi-region (who wants to be the one left holding the bag when US-EAST goes down again?).
Well, if you've done all the heavy lifting to enable yourself to run in "pods" across multiple regions, why not treat Linode/DO/Rackspace as just another region and deploy a "pod" of servers there?
At the end of a month you should have enough statistics that are directly applicable to your own app and your specific customers, as well as some real operational experience dealing with the new provider.
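The split itself can be trivial. A toy sketch of weighted backend selection (the pool names and addresses are invented; in practice you'd probably use weighted DNS records instead, and tag your metrics per provider so the end-of-month comparison is easy):

    import random

    # Treat the new provider as just another "region" and send it a
    # slice of real traffic for a billing cycle.
    POOLS = {
        "aws-us-east": ["10.0.1.10", "10.0.1.11"],
        "linode-newark": ["10.1.1.10", "10.1.1.11"],
    }
    WEIGHTS = {"aws-us-east": 0.5, "linode-newark": 0.5}

    def pick_backend():
        pool = random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()))[0]
        return pool, random.choice(POOLS[pool])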
For example, maybe one of the other providers has really fast machines, and their major upstream has a great peering relationship with whatever test node you were using for these microbenchmarks, but perhaps those servers are really flaky and crash all the time, or perhaps the majority of your customers see really bad latency when hitting them. Maybe their API isn't just "immature"; maybe it crashes a lot and they have bad customer service.
Those are the sorts of things you aren't going to figure out by simply running a few load tests. Anyhow, an approach like this just seems a lot more valuable than any amount of synthetic testing.
How would you ensure that an instance launched on hardware bought 6 months ago is identical to the hardware under an instance launched N years ago that's still running? Buy old hardware on eBay to prevent newer hardware from introducing variation?
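(You mostly can't, which is the point -- but you can at least record what each instance landed on. On Linux, for example:)

    # The same instance type can hide several hardware generations;
    # log this alongside every benchmark run.
    def cpu_model():
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()

    print(cpu_model())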