AWS Outperforms GCP in the 2018 Cloud Report (cockroachlabs.com)
214 points by awoods187 4 months ago | 98 comments

Hey all - Seth from Google here.

Thank you to the authors who worked on this report. These types of reports help us better understand the ways in which our customers and partners utilize our platform. Our team is reviewing the report and will provide a response as we conduct our own benchmarks.

Varying factors impact these types of benchmark analyses, many of which are difficult to isolate and control. As an example, I refer to some of the benchmarks others have posted in this very thread.

As a technical practitioner, I'm positive there are areas in which cloud A outperforms cloud B and vice versa, giving users choice and flexibility. As an employee of Google, I can assure you that we are committed to providing best in class performance and availability on our platform.

Thank you for your patience as we review these findings and craft our responses.

Where do you intend to publish your response? GCP blog [1]? I'd like to make sure I won't miss it, and this HN thread may be buried by the time you come up with a response.

[1] https://cloud.google.com/blog/

Hi there - sorry for the delayed response. I'm still working on getting an answer here, but I didn't want to give the impression that I was ignoring the question. My suspicion is that we'll either work with the original authors or publish on the GCP blog, but I can't confirm any of those options at this time.


Bumping does not work on HN. If you want to give something visibility, upvote it.

To me, the most interesting part is the large variance in networking performance on GCP, while AWS networking is solid (and roughly in line with what you'd see on a good 10 Gbit network). High variance in networking performance is far worse than a low mean, and in this case GCP has both: its mean is much lower and its variance is higher.

TIL that Seth Vargo is now working at Google.

Dude, I sure hope they deserve having you on board. I am not at all convinced that they are capable of understanding just how much you bring to the table.

My opinion of them has definitely gone up a few notches.

Where will we find the responses?

Probably on the front page of HN.

Not if it's "AWS outperforms GCP" - Google

It'll be something like, at least, GCP outperforms AWS on things that aren't cockroachlabs

If you review the findings and then respond with, "Actually, GCP outperforms AWS," how many pinches of salt should we take your response with?

Yeah, again, all this really shows is that instance types don't translate well across cloud platforms comparison-wise because they are strictly different. See other posts.

I've also benchmarked GCP vs. AWS [0], and, for the tests that I ran, found that GCP outperformed AWS by a factor of 3:1. Specifically, a GCP instance n1-highcpu-8 with a 256GB pd-ssd disk, clocked in at 11,728 IOPS vs an AWS c4.xlarge with a 256GB gp2 disk, clocking in at 3,634 IOPS.

To put that in context of the blog post, it means your setup can drastically affect your results. Using local NVMe disk, for example, yields excellent results at the expense of increased risk. Also, AWS's io1 disk is very expensive—after my first io1 bill from AWS, I never used that disk type again.

[0] http://engineering.pivotal.io/post/gobonniego_results/

I don't think the AWS side of your test is a good example.


You will never get more than ~3,000 IOPS on a 256GB gp2 disk because the IOPS cap for gp2 disks of this size is 768 or 3,000 IOPS. Yes, "or". If you ran the test above for longer, you'd find that the IOPS eventually drop to the baseline of 768 (the size of your disk in GB multiplied by 3). Or, if you test a much larger gp2 volume, you'll see higher numbers as well. Check out the description of the "burst bucket" in the link above.
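The "768 or 3,000" behaviour can be checked with a little arithmetic. This is only a sketch: it assumes the gp2 limits as AWS documents them (3 IOPS per GiB baseline with a 100 IOPS floor, a 3,000 IOPS burst ceiling, and a burst bucket of 5.4 million I/O credits), and the function names are made up for illustration.

```python
# Assumed gp2 limits, taken from the AWS EBS documentation rather than
# from the benchmark being discussed.
BASELINE_IOPS_PER_GIB = 3
MIN_BASELINE_IOPS = 100
BURST_IOPS = 3_000
BURST_BUCKET_CREDITS = 5_400_000


def gp2_baseline_iops(size_gib: int) -> int:
    """Sustained IOPS a gp2 volume gets once its burst credits run out."""
    return max(MIN_BASELINE_IOPS, BASELINE_IOPS_PER_GIB * size_gib)


def gp2_burst_seconds(size_gib: int) -> float:
    """How long a full burst bucket lasts when driven at the burst ceiling.

    Credits refill at the baseline rate while draining at the burst rate,
    so the net drain is (burst - baseline) credits per second.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= BURST_IOPS:
        return float("inf")  # large volumes never depend on bursting
    return BURST_BUCKET_CREDITS / (BURST_IOPS - baseline)


if __name__ == "__main__":
    size = 256
    print(gp2_baseline_iops(size))              # 768
    print(round(gp2_burst_seconds(size) / 60))  # 40 (minutes)
```

Under those assumptions, a benchmark on a 256GB gp2 volume that runs for less than about 40 minutes would mostly measure the burst rate, not the baseline.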

This is why I tend to like GCP SSD persistent disk. For a 333GB disk I get ~10k IOPS consistently. It's $.17/GB vs. $.10/GB for EBS GP2, but still way cheaper than IO1.

If you need raw speed for storage, why mess around with EBS or any other network storage device? Take your case with videos: you could put the videos on S3 for permanent storage and then use an I2 instance and its ephemeral storage to serve them. I think I2 NVMe drives push 250,000 IOPS per drive, and you can get up to 8 drives per instance.

If you expect your read/writes to be mostly sequential, and about throughput more than about IOPS, then local HDDs (d2 or h1) may even do the job for a lower $/GB. (You may even be able to serve straight from S3 or CloudFront but that's veering off the topic.)

I would compare against the C5 instance type; it uses the newer "Nitro" hypervisor.

Just looking at the report's second graph, c5d.4xlarge has twice the throughput of i3.4xlarge, which is incredible. I wish AWS would release a new generation of "i" instances, with the same large amount of local storage as i3, but built on the same platform (including Nitro and processor choices) as c5/m5.

To be fair, an i3.4xl has 10x the disk and 4x the memory as a c5d.4xl at 1.5x the cost. If you are optimizing for GB/cost, i3s are still a reasonable choice.
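To make the "GB/cost" point concrete, here is the arithmetic using only the ratios quoted in the comment above (10x the disk, 4x the memory, 1.5x the cost). The helper name is hypothetical and no actual prices are assumed.

```python
def per_dollar_ratio(capacity_ratio: float, cost_ratio: float) -> float:
    """How much more capacity per dollar one instance gives than another."""
    return capacity_ratio / cost_ratio


COST_RATIO = 1.5  # i3.4xl cost relative to c5d.4xl, as quoted above

print(round(per_dollar_ratio(10, COST_RATIO), 1))  # disk per dollar: 6.7
print(round(per_dollar_ratio(4, COST_RATIO), 1))   # memory per dollar: 2.7
```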

There's also the i3.metal which has no hypervisor. I'm curious how the performance of one of those compares to a traditional i3.

> I wish AWS would release a new generation of "i" instances, with the same large amount of local storage as i3

I figure they can't. [Economically, I mean.]

I always figured the "local per-instance storage" on most instance-types was actually not literally local, but rather a set of disks allocated from a per-rack iSCSI disk server. (The host "wastage" if this wasn't true would be quite large.) The "i" instance-type hosts—especially the ones for the large instance-types—were likely just making a claim for the entire disk-server of their rack (leaving the rest of the rack to be schedulable only by instances that need no instance disks.)

Since iSCSI and "bare metal" don't go together, on Nitro, you'd actually need to build instances with real physical reserved disk pools that the hardware can see. That may even mean putting the disks inside the computer (shock horror!) and thus needing complex 2Us that need to be "recycled" (= having the still-good stuff fished out of them when they die) and may need to be opened up by ops folks for more than one reason, rather than just having "throwaway" compute blades + equally "throwaway" hotswap disk pools. In such a 1990s-reminiscent setup, i3.metal seems like a sensible upper bound for how big such an instance could get, economically.

Maybe you don't need the 2Us; maybe you can build the disk pool as something like a disk server (i.e. a dedicated PCIe backplane leading to RAID-controller daughterboards? something something Thunderbolt?) That'd lower the TCO of these host machines a bit, but it'd still be questionable how much usage such instances would get—and when they're unscheduled, despite the disks being a separate physical box, those disks would still be unavailable for any other instance to use, even ones scheduled to machines in the same rack.

(Or maybe they could get really fancy, and have a disk-server rack that can present itself as a RAID controller over PCIe, such that, most of the time, it can just be a disk server, but when it gets reserved by a Nitro-i-instance, it can switch off its Infiniband cards and switch on its PCIe client interface cards, and moonlight as a RAID array. If AWS does get bigger i-type instances, I'd wager that this is what they would have built to achieve that. That or custom RAID controller cards that present Infiniband-rDMA targets as if they were local NVMe devices, and don't allow host configuration, only BMC configuration.)

This would clash with the AWS documentation and re:invent discussions:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instance... >This storage is located on disks that are physically attached to the host computer.


But c5d, m5d, r5d and other instance types use Nitro and have "local NVMe-based SSD storage" just like i3. Doesn't that contradict your entire point?

If you picked a r5d, and either gave it a larger piece of the local SSD pie or replaced the local SSDs with larger capacity ones, you would essentially get the i4 that I've been calling for. I'm sure it's more complicated than that, but I would be surprised if technical roadblocks were the main reason why this hasn't happened yet, rather than e.g. lack of customer demand.

AWS has already built that—on Nitro based systems, even EBS volumes are NVMe attached.

You can still have remote disks (same or next rack, 2m runway) with bare metal if you use an HBA with external SAS cabling.

This works reliably at native speed, is not too expensive, and is well tested.

Yeah I think the instance types are different enough that this doesn't measure anything generally useful. Now comparing something like Lambda vs GCF, or GCS vs S3, that would be at least telling on a platform-wide scale.

Developer experience on GCP is vastly superior to AWS.

- Pricing on GCP is much easier: no need to purchase reserved instances or figure out all the details and buried AWS billing rules. Run your GCP instances and automatically get discounts. AWS reserved instances require knowing your instance types, knowing that you can purchase the smallest type of an instance class and combine, and knowing that you can only purchase 20 reserved instances per zone/per region in an account. So many gotchas.

- GCP projects by default span all regions. It is much easier if you run multiple regions: all services can communicate with all regions. Multi-region in AWS is sort of a nightmare: setting up VPC peering, not being able to reference security groups across regions, etc.

- Custom machine types. With GCE, you simply select the number of cores you need and memory. No trying to decipher the crazy amount of AWS instance types T2, T3, M5, M5a, R5, R5a, C5, C5n, I3...

- Instance attached block storage is easier to grok and in my experience is much faster than EBS. The bigger the disk on GCE, the more IOPS. No provisioned IOPS madness.

Another great point for GCP: you can set up an 'organization' that will hold multiple projects, in a hierarchy of folders. You can set permissions/roles on a folder, and have them propagate down to all projects underneath.

> no need to purchase reserved instances

GCE has offered committed use discounts for quite some time (note: these are completely different from the sustained use discount, which is automatic). Because this discount tier is ignored, the results from this post look significantly worse.

The idea that "developer experience" is paramount is the entire reason why there is an entire sub-industry of vendors dedicated to cost optimization, following in the wake of choices made with completely the wrong business priorities in mind

AWS has cheaper discounted instances though too.

I'd argue that developer experience on AWS is better. I work on both. Two things stand out for me:

Serverless lambdas on AWS are more functional. Google does not have comparable functionality yet.

Auth and IAM permissions are far more configurable on AWS. Google's ACL does not have the same depth of tools there. Making the ACL do the same thing as AWS's IAM generally feels kludgy and is a pain.

Working with AWS there is a deep ecosystem of tools there and I only use maybe 30%. Working with Google, there often are tools and logging missing that I took for granted in AWS.

What feature do lambdas have that cloud functions don't? (asking not challenging)

I've never used lambdas, but the fact that gcloud functions allow Python imports makes them pretty versatile from what I can tell.

> The bigger the disk on GCE, the more IOPS

This is true on AWS, too. Common advice for AWS is to just make your disk bigger instead of paying for provisioned IOPS.

Yet more reasons my employer chose to partner with GCP instead of AWS

On the "enterprise" end of cloud computing, the only truly good options are GCP & Azure

AWS is very much the IBM of yesteryear - it's expensive, but "no one ever got fired for buying AWS" ... except AWS is about to have (even starting already) their own IBM moment - question is, will they fail to pivot the way IBM did?

Local SSDs can also be attached to any instance types.

There might be a networking cap & disk I/O issues with the instances you picked on GCP vs AWS.

The GCP instance has 8 Gbps vs 10 Gbps for AWS. Without seeing the graphs from the instances, I don't really know if you hit a cap, but this could make a difference in both transfer speeds and latency numbers for GCP. Also, for your local disk test, on GCP, disk size makes a difference to get the best performance. The larger the disk, the better the performance. PD disk read/write performance also comes out of the available network bandwidth! So, the instance you picked on GCP was at a disadvantage right from the start [3]. This likely explains the I/O Experiment graph and the "67x difference in throughput" as you're likely hitting caps, both in terms of network bandwidth and disk performance, compared to AWS. Seeing a 67x difference anywhere is a pretty big red flag that something strange is going on and needs further investigation.

GCP's n1-standard-16 = 8 Gbps max [1]

AWS's c5d.4xlarge = 10 Gbps max [2]

I guess the problem with comparing clouds is that it is never apples to apples, and I don't fault you for picking what you did (as it is not obvious). GCP typically gives you (core count / 2) = # Gbps network bandwidth. A good follow-up to your comparison might be to investigate why the numbers are different. Does adding more CPUs, memory, or network bandwidth increase performance?

[1] https://cloud.google.com/blog/products/gcp/5-steps-to-better... (see section #3).

[2] https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-inst...

[3] https://cloud.google.com/compute/docs/disks/performance#size... (see the table re: disk size to bandwidth)

> Also, for your local disk test, on GCP, disk size makes a difference to get the best performance. The larger the disk, the better the performance. Disk read/write performance also comes out of the available network bandwidth

Do you have a source for local SSD performance coming out of the available network bandwidth? According to GCP docs [1], this only applies to persistent disks. Local SSD perf depends only on disk size and choice of SCSI/NVMe interface.

According to another GCP doc [2], local SSDs are all 375 GB in size. For comparison, c5d.4xlarge has 400 GB, which is very close. So I don't see anything wrong in the benchmark unless they messed up and ran it against the persistent root disk instead of the local SSD.

[1] https://cloud.google.com/compute/docs/disks/performance#type...

[2] https://cloud.google.com/compute/docs/disks/#localssds

You are right. Sorry for the confusion. It is only PD (Persistent Disk) that comes out of network bandwidth. Anything on the NVMe SSD would be totally local to the machine (no network caps, etc). The article doesn't really say if they are using SSD PD or SSD NVMe for GCP. Also, disk size does matter for the NVMe SSD and performance (as you can stripe them together by adding more; up to 4). You can see the #'s by using the console and playing around with adding more NVMe SSDs (via this doc [1]).

  Size      Random IOPS                Throughput limit (MB/s)
  375GB     169,987 (r)  90,000 (w)      663 (r)   352 (w)
  750GB     339,975 (r) 180,000 (w)    1,327 (r)   705 (w)
  1125GB    509,962 (r) 270,000 (w)    1,991 (r) 1,057 (w)
  1500GB    679,950 (r) 360,000 (w)    2,650 (r) 1,400 (w)
[1] https://cloud.google.com/compute/docs/disks/local-ssd

I don't understand the throughput numbers given. 5.6GB/s for GCP and 9.6GB/s for AWS would be 44gbps and 76gbps respectively.

I don't know of any instances offering that kind of throughput.

I've personally validated GCP's statement that they offer 2gbps/core up to 16gbps. I can get 16gbps consistently between any two n1-standard-8 using iperf.

This generally makes network IO in GCP much cheaper.
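For reference, the byte-to-bit conversion behind the 44 and 76 figures above is just a factor of 8. A tiny sketch (the function name is made up, and the 5.6 and 9.6 inputs are the report's numbers as quoted in this comment):

```python
def gbytes_to_gbits(gigabytes_per_s: float) -> float:
    """Convert a rate in gigabytes/second to gigabits/second (8 bits per byte)."""
    return gigabytes_per_s * 8


# Throughput figures as quoted from the report, read literally as GB/s.
for name, rate in [("GCP", 5.6), ("AWS", 9.6)]:
    print(f"{name}: {gbytes_to_gbits(rate):.1f} Gbps")
# GCP: 44.8 Gbps
# AWS: 76.8 Gbps
```

Both results are far above the 10 to 16 Gbps per-instance limits mentioned elsewhere in this thread, which is why reading the report's unit as gigabytes looks implausible.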

I wouldn't read too much into it, it's clearly Gbps. The authors are just sloppy with capitalization. They also elsewhere talk about iperf having "128 kb" buffer which seems unlikely, and the throughput graph says "gb" where the text says "GB".

And then there's "iPerf" and "PING"...

If it were gbps then that's certainly weird. I consistently get way better network performance in GCP than in AWS.

GCP caps egress throughput to 2gbps per core up to 16gbps:


Are you saying gigabytes or gigabits?

GCP is 2 gigabits/second per core up to a max of 16 gigabits/second for a single VM. Persistent disk traffic (but not local SSD traffic) also eats into this network bandwidth.
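As a rough sketch, the rule described above (2 Gbps per core, capped at 16 Gbps per VM) looks like this. The function and parameter names are hypothetical, and the limits are the ones stated in this comment, not values checked against current GCP documentation.

```python
def gcp_egress_cap_gbps(vcpus: int,
                        per_core_gbps: float = 2.0,
                        vm_max_gbps: float = 16.0) -> float:
    """Egress cap for a VM under the '2 Gbps per core, max 16 Gbps' rule."""
    return min(vcpus * per_core_gbps, vm_max_gbps)


for vcpus in (2, 4, 8, 16):
    print(vcpus, gcp_egress_cap_gbps(vcpus))
# 2 4.0
# 4 8.0
# 8 16.0
# 16 16.0
```

Note that under this rule any persistent disk I/O would have to fit inside the same budget, so a disk-heavy workload leaves less for ordinary network traffic.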

Choosing a cloud provider is about more than just performance. For me, I lean towards GCP because of the combination of awesome UX, custom VM configurations (which also means GPUs attached to a custom # CPUs), no-bidding spot instances, and the multi-regional cloud storage offering that replicates data across many regions for much cheaper than AWS.

I wrote about this last year if you're curious: https://medium.com/@robaboukhalil/a-tale-of-two-clouds-amazo...

AWS API gateway is one of the worst UIs I've ever used in my life. Lambda isn't great either. That said I love AWS, but I haven't used GCP much.

Yeah, because who cares about performance when you have a pretty UX?

You're confusing UX with UI. You can have an excellent UX that's still just a CLI.

Google uses "gsutil" and "gcloud" whereas AWS just uses "aws". Both are clear but because Google arbitrarily has 2 tools I constantly have to remember which one had which feature.

It's not a big thing, but little stuff like that makes it more cumbersome to use.

There's also bq for BigQuery stuff.

If they did merge... it would be "gcloud bq..." or "gcloud storage...", my fingers protest :)

Honestly I would have expected performance on a cloud provider to be measured in x per $. You're pretty much renting everything, the hardware etc is really difficult to compare, and you can usually throw more machines at the problem anyway, so measuring a single machine vs a single machine doesn't make much sense if the two might cost vastly different amounts.

> At first glance it appears that GCP has a tighter latency spread (when compared to the network throughput) centered on 0.2 ms

Comparing a distribution of two completely different metrics is not particularly meaningful. You can change the histogram buckets to a different size and it will look just as spread out. When I first read it, I thought it was saying GCP has a tighter spread than AWS. This was confusing, especially since the chart immediately above that seems to have AWS and GCP numbers flipped.

I can't find the text you quoted in the article. Did they edit it out?

Yes, it looks like the text was edited, but the table is still incorrect.

(editor here) Table was corrected last night, and the text was edited per the comment above and a few keen eyes internally who also caught the confusing sentence.

Hmm, it'd take forever to dig into the respective documentation at AWS and GCP, but from a quick look the CPU frequency alone is quite different (3.0 GHz for AWS's Xeon Scalable and 2.0 GHz for GCP's Xeon Scalable), and we don't know anything about CPU cache sizes, etc. That's problematic to start. Then we have very little info to go by on the underlying storage performance. More problematic, I don't know how large the working data set is for TPC-C (i.e. how many warehouses are being simulated?), so I can't tell how much of the storage is being used. I assume it's larger than the 60GB of DRAM offered on the GCP instance (thus spilling into the storage), but with the CPU differences and unknown storage performance, I don't know what to make of this report.

Is there anything open source so I can reproduce these results?

Looking for the experimental settings as it would be difficult to reproduce without detail. Could you post the scripts to GitHub?

The report states they are using iperf, ping, stress-ng, and sysbench; these are all open-source. There might not be enough details to reproduce exactly, but I think there is enough there that you should be able to produce similar results.

Is this table labeled wrong?


All the text around it says GCP is worse, but the table shows GCP is better.

The labels are incorrect (we combined the charts in the final draft)--fix incoming. GCP was much worse than AWS on this test.

Looks fixed now. Thanks.

> What about network throughput variance? On AWS, the variance is only 0.006 GB/sec. This means that the GCP network throughput is 81x more variable when compared to AWS.

What would cause this particular effect? It's very interesting.

Is it, perhaps, that with GCP you're hitting the capacity of the network, while with AWS you're being artificially capped at that speed on a network that could theoretically go faster?

Or maybe it's just different strategies for bandwidth-limiting instances employed by AWS's SDN layer vs. GCP's? Probabilistic packet-drop (to force TCP window scaling) vs. artificially-induced nanosecond-scale egress latencies?

GCP Andromeda is software-based network virtualization which tends to have lower performance and higher performance variability. https://www.usenix.org/node/211244

AWS Nitro/ENA is hardware network virtualization which is faster and more consistent.

Did anyone consider comparing the cost of this vs. Cloud Spanner?

Because CockroachDB has the explicit inspiration of being Google's Spanner without the special hardware, so... why not just use Spanner with the special hardware instead?

Launch EKS on AWS, wait 20-30min. Launch GKE on GCP, wait 5min.

So, in conclusion: when using CockroachDB, with 32GB of RAM instead of 60GB, and with different throughput settings, you may consider AWS, as it provides slightly better cost because it uses fewer resources. And since there are no other big players in the market, we will test only two of them, and our only choice will be AWS.

Most small companies don't reserve for 3 years. The on-demand monthly cost is 388 (GCP) vs 562 (AWS) per instance for the instances in the report (omitting SSD costs).

Furthermore, GCP encrypts its disks by default. So I'd expect a slight degradation of performance.
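Taking the on-demand prices quoted above at face value (388 vs 562 per instance per month, SSD costs omitted), here is a quick sketch of how much faster the pricier instance has to be just to break even on performance per dollar. The function name is made up for illustration.

```python
GCP_MONTHLY_COST = 388  # on-demand figure quoted above
AWS_MONTHLY_COST = 562  # on-demand figure quoted above


def break_even_speedup(cheaper_cost: float, pricier_cost: float) -> float:
    """Speedup the pricier instance needs to match the cheaper one per dollar."""
    return pricier_cost / cheaper_cost


print(round(break_even_speedup(GCP_MONTHLY_COST, AWS_MONTHLY_COST), 2))  # 1.45
```

So at on-demand pricing, any benchmark where the AWS instance is less than roughly 1.45x faster would favor GCP on a per-dollar basis, assuming those two prices are right.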

I wonder if they will consider expanding the cloud report to things that are not necessarily relevant to Cockroach Labs' use case. In my work, latency to the block store (especially outliers) is very relevant. It would be great to see latency distributions of various workloads (sequential/random and read/write/mixed) to the platform's distributed block store.

edit: I missed that 95th percentile latency for read and write was included. That is helpful; I also typically look at p99.99 and p100.

You can trade off tail latency vs. cost yourself easily.

For example, you could have two copies of your data in the block store, and issue reads to both simultaneously, and use whichever returns first. Suddenly, your 99% latency becomes your 99.99% latency...

You can do the same for writes (albeit a bit more complex).
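A minimal sketch of the "read from both copies, take the first answer" idea, using asyncio. The replica reads are simulated with sleeps, and every name here is hypothetical; a real block store client would replace `read_replica`.

```python
import asyncio
from typing import Dict


async def read_replica(name: str, delay_s: float) -> str:
    """Stand-in for a block-store read; delay_s simulates its latency."""
    await asyncio.sleep(delay_s)
    return name


async def hedged_read(delays: Dict[str, float]) -> str:
    """Issue the same read to every replica and return the first result."""
    tasks = [asyncio.create_task(read_replica(name, delay))
             for name, delay in delays.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the slower copy is no longer needed
    return next(iter(done)).result()


if __name__ == "__main__":
    # Replica "a" hits a slow outlier; replica "b" answers normally.
    print(asyncio.run(hedged_read({"a": 0.20, "b": 0.01})))  # b
```

The 99% to 99.99% claim follows if the two copies are slow independently: the chance that both exceed their p99 latency is 0.01 * 0.01 = 0.0001. In practice outliers are often correlated, so the real gain may be smaller, and you pay for twice the reads.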

Yes, the cost/latency trade-off is exactly why this can be interesting. Significantly fewer or less-impactful outliers can save a lot of money (or allow a better SLA with the same cost).

We can consider expanding this in our testing next year! Glad you found it helpful

How does this compare to the number 2 in the cloud, Microsoft Azure?

For someone who has worked with both: which offers a better developer onboarding experience, for small web/data apps? Ease of learning and use wise.

There are many dimensions in which working with GCP is a lot better for small and medium apps, if only the proximity to Firebase.

These results are, to my mind, quite questionable. They certainly don't line up with my personal experience and the measurements I've done in the last year.

AWS is much harder to use correctly right now, to me.

I prefer aws to google and msft for account mgmt purposes. google and msft are so dead set on binding your logins to your global accounts which might be used for other things.

I know someone might argue that aws and amazon.com are sharing the same account, I guess they "can" but it's easy to make a new aws account with just an email - I think we manage about 12-15 different aws accounts. We tend to isolate major clients or projects by making entirely new accounts. I don't feel like I'm working uphill to keep multiple aws accounts from merging or somehow binding to my personal shopping account for amazon.com. With msft and google it feels like I am always almost "tricked" into binding multiple accounts and login states together.

I would argue that once you have an account up and running most of these guys are similar with differences. The ui on aws isn't glamorous but I find its utilitarian simplicity pretty easy to deal with and get most things done.

On the note of account management, you cannot be a Google Cloud Partner without GSuite or Google Cloud Identity accounts, period, end of story. This whole setup is just ludicrous, I have to pay Google money just to have a partner login? Even Microsoft offers a "good enough" tier of Azure AD that can be used to sign up for their partner portal, and AWS just uses normal Amazon accounts.

That seems to be dependent on whether you're working on your own personal/company accounts or on behalf of other clients.

For owned accounts, I prefer GCP and Azure because the logins are seamlessly integrated into GSuite and Office 365 so we can manage IAM on an individual basis in one place.

It depends on your goals. If you're planning on just setting up a couple of servers then it's all much the same. However for scaling and maintenance I prefer AWS with a tool like Terraform to manage infrastructure. It forces you to think of situations like replacing servers, as you'll have to replace them at some point.

I think Google's AppEngine is easier for starting, but AWS Lambda + API Gateway is pretty close IMO.

Is there anything that would preclude you from using the GCP provider in terraform?

In my experience, most folks choose GCP for the Kubernetes offering, which is vastly superior to EKS in terms of scaling and maintenance. Specifically, there are upgrades that actually work. There are also autoscaling node pools, which work pretty well.

I think Google does much better in terms of UX / DevX. Particular features that I appreciate are the cloud console shell and their API explorer/fiddler. Still, they do not have as many services, and the services they have do not have the same features so actually getting things done may still be easier on AWS.

If you have no experience with cloud providers and are set on one of the two: Google. UI/UX and documentation are so much better. Also, I can't really agree with the article's findings; maybe my personal benchmarks are too specific to my workload.

I’ve had much better experience with docs on aws than Google. And prefer the ux of aws.

I've also had a way better experience with AWS's docs and UX.

Honestly, for that kind of workload I'd adjust my sights to Digital Ocean or Firebase.

Very promising results. Look forward to seeing what’s in store for 2019!

This is interesting given that I heard on the grapevine that some major cloud players are actually using AWS on the back-end even though they are advertising say as.. "GCP".. wonder if anyone can confirm or deny...

By nature, large companies have massive global teams and there is no single provider for anything. Team A could be using AWS, while team B could be using GCP, and team C is using Azure. Just because team B says they are using GCP doesn't mean the others are lying, or that there is anything weird going on.

Actually I meant to say that the news on the grapevine is that the big cloud players (GCP etc.) are potentially outsourcing demand for cloud services in excess of their capacity to AWS...

To be blunt: no. This is totally absurd and would be trivial to detect if true. On the legal side, you'd be breaking all sorts of ToS, security, and privacy agreements (you'd have to disclose this via a data processor clause). On the technical side, the latency alone would make this obvious to detect. They are totally different hardware/software platforms and would have different characteristics (as proven by this thread). On the business side, no AWS/GCP/Azure CEO would ever do this, and staff would be aware of it too. This is 100% bogus.

I actually did not even want to reply, as this makes zero sense, but I felt invested. Whoever told you this doesn't know what they are talking about.

That said, hosting providers who historically ran services on only their own hardware could certainly be load-balancing to cloud hosting providers. This is certainly not what the comment you responded to implies. However, it is something I'd expect people to get confused about.

You are right, a good example might be Heroku. Heroku is a hosting platform that is hosted on AWS (maybe parts or all; I'm not sure).

"Heroku’s physical infrastructure is hosted and managed within Amazon’s secure data centers and utilize the Amazon Web Service (AWS) technology. Amazon continually manages risk and undergoes recurring assessments to ensure compliance with industry standards." See https://www.heroku.com/policy/security.

This is the type of disclosure I was talking about.

This is absolutely not happening. All the major clouds are investing billions into infrastructure build-out and I've toured their data centers first-hand.

I would question the people who told you this rumor as they either have no clue how things work or are blatantly lying to you.

It is very unlikely that google needs to buy computers from amazon.

not a chance.

Absolutely not for GCP. Heroku however, uses AWS.
