I call these "weekend benchmarks" -- the kind you run when you have a block of free time, then spend the rest of it optimizing for the benchmark itself. You roll in on Monday with some staggering results, only to find that one (or many) of your variables were off.
Did the author try multiple instances on each provider? VM tenancy is a bitch. (Think of how annoyed you get at the noise levels, when your neighbor in the next apartment throws a party)
Is the author's source benchmarking machine a physical machine or a virtualized guest? Does it have power saving turned off so that the process runs at full clock speed, instead of on a variably clocked-down core?
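For what it's worth, on a Linux box the power-saving question can be checked before a run. This is only a rough sketch: it assumes the kernel exposes the usual cpufreq files under /sys/devices/system/cpu (many virtualized guests don't, in which case you get an empty result), and the function name is made up for illustration.

```python
from pathlib import Path


def scaling_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu name: governor} for every core that exposes cpufreq.

    An empty dict means no cpufreq information was found, which is common
    inside virtualized guests.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        # path looks like <cpu_root>/cpu0/cpufreq/scaling_governor
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in scaling_governors().items():
        print(cpu, governor)
```

If any core reports something other than "performance", the clock speed may vary during the test.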
Did the author enable or disable TCP time-wait recycle, so he doesn't bump into that ceiling when running such tests back to back?
Did the author run the tests back to back, or leave a cool-down period between tests?
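To make the cool-down point concrete, here is a minimal sketch of a repeat-with-pause harness. Everything in it is an assumption for illustration: `run_trials`, the default of 5 trials, and the 30 second pause are not from the original benchmark, and `benchmark` stands in for whatever actually fires the load test and returns one number.

```python
import time
from typing import Callable, List


def run_trials(benchmark: Callable[[], float],
               trials: int = 5,
               cooldown_s: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Run `benchmark` `trials` times, pausing `cooldown_s` between runs.

    `sleep` is injectable so the harness can be exercised without waiting.
    """
    results = []
    for i in range(trials):
        results.append(benchmark())
        if i < trials - 1:
            # Let sockets drain and caches settle before the next run.
            sleep(cooldown_s)
    return results
```

Keep every result rather than only the best one, and repeat the whole thing on more than one instance per provider.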
Where was the author's network upstream located when he tried said tests? Were there any network issues at the time of the test? Would the author even be aware of them?
The page you're testing against makes the same database call, which is presumably cached. Did he throw out the cached results? Can he identify the cached results?
Are we firing up ApacheBench with HTTP keepalives enabled? With parallel connections? How many parallel connections?
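A small sketch of how those knobs could be pinned down and recorded, rather than left implicit. It only builds the command line; the `-n` (total requests), `-c` (concurrency), and `-k` (keepalive) flags are real ApacheBench options, while the helper name, the URL, and the request counts are placeholders.

```python
from typing import List


def ab_command(url: str, requests: int, concurrency: int,
               keepalive: bool) -> List[str]:
    """Build an ApacheBench argument list with the settings made explicit."""
    cmd = ["ab", "-n", str(requests), "-c", str(concurrency)]
    if keepalive:
        cmd.append("-k")  # reuse connections via HTTP keepalive
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Print the full matrix so each result can be tied to its settings.
    for keepalive in (False, True):
        for concurrency in (1, 10, 100):
            print(" ".join(ab_command("http://example.com/", 1000,
                                      concurrency, keepalive)))
```

Publishing the exact command next to each number lets someone else reproduce the run.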
How many Apache servers (StartServers, MinSpareServers, etc)? Which httpd modules were enabled? Which httpd modules were disabled? Which PHP modules were enabled?
You're trying to benchmark CPU and I/O horsepower across three different platforms but doing it through this narrow "straw" which consists of your "independent server", your upstream, your upstream's upstream, your upstream's upstream's peering connection with Amazon/Linode/DigitalOcean, your web server and its PHP module, your application, and MySQL.
If you're rolling your eyes at this, then you shouldn't be doing weekend benchmarks.
I'll leave you with this as well:
I'm particularly fond of the quote "lead, follow, or get the hell out of the way", which is a bit harsh in this case because a lot of your advice is good. It could be framed in a more constructive way, though - there's some Comic Book Guy tone there in your comment.
My intended tone was not "don't try", but "try harder".
I've listed at least 5 ways to improve/normalize the testing, as well as linking to a document that does a pretty good job of explaining statistics (particularly, how programmers do a bad job of statistics; baselines for benchmarks; etc).
"At least he's out there trying" -- with this not-so-great benchmarking, the author has just effectively SHITTED on 2 of the 3 companies, all of which have gone to great lengths to build amazing infrastructure, AND managed to spread his FUD around the web, to the point where it reached the HN front page -- and you want credit for trying?
Get the hell out of the way.
It's probably also true that your tone is more abrasive than it needs to be.
"At least we're doing something!" is a silly defense.
EDIT: I should probably mention I work at Rackspace, and thus everything I say on this subject should be taken with appropriate grains of sand :)
Many argue that the AWS ecosystem (25 services at last count) and the extensive featureset of AWS outweigh the bare bones "fast" metrics of other providers.
As the poster above mentions, I think there is generally more to it than a simple metric or two sampled a few times from a single endpoint. But I guess it all depends on one's definition of what is valuable...
Another major flaw is taking results for a single instance type and implying that those apply to all instance sizes and each provider as a whole.
If you're going to do a benchmark at least pick something realistic like the m3.* types:
At least the author had enough sense not to do the bench on a t1.micro
If you are working towards "best practices" on AWS, you should be running multi-region (who wants to be the one left holding the bag when US-EAST goes down again?). Say you've done all the heavy lifting to enable yourself to run in "pods" across multiple regions.
Well, if you can do that, why not treat Linode/DO/Rackspace as a separate region and deploy a "pod" of servers there?
At the end of a month you should have enough statistics that are directly applicable to your own app and your specific customers, as well as some real operational experience dealing with the new provider.
For example, maybe one of the other providers has really fast machines and their major upstream provider has a great peering relationship with whatever test node you were using for these microbenchmarks, but perhaps those servers are really flaky and crash all the time, or perhaps the majority of your customers see really bad latency when hitting those servers? Maybe their API isn't just "immature", maybe it crashes a lot and they have bad customer service.
Those are the sorts of things you aren't going to figure out after simply running a few load tests. Anyhow, it just seems like something like this would be a lot more valuable than any amount of synthetic testing.
I'd much rather see two or three synthetic benchmarks around hard drive throughput/latency, memory throughput/latency, and CPU.
Why choose the fastest one? That would be the least accurate way to give an indication of the performance of AWS instances. A mean, or perhaps a median depending on the skew, would be a better choice.
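A quick illustration of how much the choice matters. The numbers below are invented, not taken from the benchmark, and I'm assuming the metric is requests per second, so "fastest" means the maximum.

```python
import statistics


def summarize(samples):
    """Summarize throughput samples (requests/sec, higher is better)."""
    return {
        "fastest": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
    }


if __name__ == "__main__":
    # Hypothetical results from five instances; one is an outlier.
    samples = [410.0, 420.0, 430.0, 440.0, 900.0]
    print(summarize(samples))
```

With one lucky instance in the set, the fastest result is roughly double the median, and even the mean gets pulled up by the skew.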
Unless you're saying that the benchmarks on the other servers are effectively cherry-picked best results.