
AWS Outperforms GCP in the 2018 Cloud Report - awoods187
https://www.cockroachlabs.com/blog/2018_cloud_report/
======
sethvargo
Hey all - Seth from Google here.

Thank you to the authors who worked on this report. These types of reports
help us better understand the ways in which our customers and partners utilize
our platform. Our team is reviewing the report and will provide a response as
we conduct our own benchmarks.

Varying factors impact these types of benchmark analyses, many of which are
difficult to isolate and control. As an example, I refer to some of the
benchmarks others have posted in this very thread.

As a technical practitioner, I'm positive there are areas in which cloud A
outperforms cloud B and vice versa, giving users choice and flexibility. As an
employee of Google, I can assure you that we are committed to providing best
in class performance and availability on our platform.

Thank you for your patience as we review these findings and craft our
responses.

~~~
peferron
Where do you intend to publish your response? GCP blog [1]? I'd like to make
sure I won't miss it, and this HN thread may be buried by the time you come up
with a response.

[1] [https://cloud.google.com/blog/](https://cloud.google.com/blog/)

~~~
cybernoodles
bump

~~~
ahmedalsudani
Bumping does not work on HN. If you want to give something visibility, upvote
it.

------
brian_cunnie
I've also benchmarked GCP vs. AWS [0], and, for the tests that I ran, found
that GCP outperformed AWS by a factor of 3:1. Specifically, a GCP instance
n1-highcpu-8 with a 256GB pd-ssd disk, clocked in at 11,728 IOPS vs an AWS
c4.xlarge with a 256GB gp2 disk, clocking in at 3,634 IOPS.

To put that in the context of the blog post: your setup can drastically
affect your results. Using local NVMe disk, for example, yields excellent
results at the expense of increased risk. Also, AWS's io1 disk is very
expensive; after my first io1 bill from AWS, I never used that disk type again.

[0]
[http://engineering.pivotal.io/post/gobonniego_results/](http://engineering.pivotal.io/post/gobonniego_results/)

~~~
dpedu
I don't think the AWS side of your test is a good example.

[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolum...](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#EBSVolumeTypes_gp2)

You will _never_ sustain more than ~3,000 IOPS on a 256GB gp2 disk, because
the IOPS cap for gp2 disks of this size is 768 or 3,000. Yes, "or". If you
ran the test above for longer, you'd find that the IOPS eventually drop
to the baseline of 768 (the size of your disk in GB multiplied by 3).
Or, if you test a much larger gp2 volume, you'll see higher numbers as well.
Check out the description of the "burst bucket" in the link above.
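That burst-bucket behavior is easy to sketch. A minimal model in Python, assuming the gp2 parameters from the EBS docs at the time (3 IOPS/GB baseline with a 100 IOPS floor, a 3,000 IOPS burst ceiling, and a 5.4 million I/O credit bucket that refills at the baseline rate):

```python
def gp2_perf(size_gb):
    """Baseline IOPS, burst IOPS, and full-speed burst duration for a gp2 volume."""
    baseline = max(100, 3 * size_gb)  # 3 IOPS per provisioned GB, 100 IOPS floor
    burst = max(baseline, 3000)       # large volumes never need to burst
    bucket = 5_400_000                # I/O credits in the burst bucket
    if burst > baseline:
        # Draining at `burst` while refilling at `baseline` nets burst - baseline.
        seconds = bucket / (burst - baseline)
    else:
        seconds = float("inf")
    return baseline, burst, seconds

base, burst, secs = gp2_perf(256)
print(base, burst, round(secs / 60))  # 768 3000 40 -- about 40 min of full-speed burst
```

So a 256GB gp2 volume can hold ~3,000 IOPS for roughly 40 minutes before settling at 768, which is consistent with a short benchmark seeing ~3,000 and a long one seeing the baseline.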

~~~
halbritt
This is why I tend to like GCP SSD persistent disk. For a 333GB disk I get
~10k IOPS consistently. It's $.17/GB vs. $.10/GB for EBS GP2, but still way
cheaper than IO1.
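Those numbers line up with GCP's documented scaling of 30 read IOPS per provisioned GB for SSD persistent disk (per-instance caps aside); a quick check:

```python
SSD_PD_READ_IOPS_PER_GB = 30  # per the GCP persistent-disk performance docs

def pd_ssd_read_iops(size_gb):
    # Per-disk limit only; the instance's own IOPS/bandwidth caps still apply.
    return SSD_PD_READ_IOPS_PER_GB * size_gb

print(pd_ssd_read_iops(333))                       # 9990 -- the "~10k IOPS" above
print(round(333 * 0.17, 2), round(333 * 0.10, 2))  # 56.61 33.3 (monthly $, pd-ssd vs gp2)
```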

------
nodesocket
Developer experience on GCP is vastly superior to AWS.

\- Pricing on GCP is much simpler: no need to purchase reserved instances or
dig through the details of buried AWS billing rules. Run your GCP
instances and you automatically get discounts. AWS reserved instances require
knowing your instance types, knowing that you can purchase the smallest type
of an instance class and combine them, and knowing that you can only purchase
20 reserved instances per zone/per region in an account. So many gotchas.

\- GCP projects by default span all regions. It is much easier if you run
multiple regions: all services can communicate with all regions. Multi-region
in AWS is sort of a nightmare - setting up VPC peering, not being able to
reference security groups across regions, etc.

\- Custom machine types. With GCE, you simply select the number of cores you
need and memory. No trying to decipher the crazy amount of AWS instance types
T2, T3, M5, M5a, R5, R5a, C5, C5n, I3...

\- Instance-attached block storage is easier to grok and in my experience is
much faster than EBS. The bigger the disk on GCE, the more IOPS - no
provisioned IOPS madness.

~~~
_wmd
> no need to purchase reserved instances

GCE has offered committed use discounts for quite some time (note: completely
different from the sustained use discount, which is automatic). By ignoring
this discount tier, the results in this post look significantly worse than
they need to.

The idea that "developer experience" is paramount is the entire reason why
there is an entire sub-industry of vendors dedicated to cost optimization,
following in the wake of choices made with completely the wrong business
priorities in mind.

~~~
yovagoyu
AWS has cheaper discounted instances though too.

------
WestCoastJustin
There might be a networking cap & disk I/O issues with the instances you
picked on GCP vs AWS.

The GCP instance has 8 Gbps vs 10 Gbps for AWS. I don't really know, without
seeing the graphs from the instances, if you hit a cap, but this could make a
difference in both transfer speeds and latency #'s for GCP. Also, for your
local disk test on GCP, disk size makes a difference in getting the best
performance: the larger the disk, the better the performance. PD disk
read/write performance also comes out of the available network bandwidth! So,
the instance you picked on GCP was at a disadvantage right from the start [3].
This likely explains the I/O Experiment graph and the "_67x difference in
throughput_", as you're likely hitting caps, both in terms of network
bandwidth and disk performance, compared to AWS. Seeing a 67x difference
anywhere is a pretty big red flag that something strange is going on and
needs further investigation.

GCP's n1-standard-16 = 8 Gbps max [1]

AWS's c5d.4xlarge = 10 Gbps max [2]

I guess the problem with comparing clouds is that it is never apples to
apples, and I don't fault you for picking what you did (as it is not obvious).
GCP typically gives you (core count / 2) = # of Gbps of network bandwidth. A
good follow-up to your comparison might be to investigate why the #'s are
different. Does adding more CPUs, memory, or network bandwidth increase
performance?
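As a rough sanity check, that rule of thumb (the heuristic above, not an official formula; check the GCP docs for current caps) works out like this:

```python
def gcp_egress_cap_gbps(vcpus):
    # Heuristic from the comment above: ~(core count / 2) Gbps of egress.
    # Remember that PD reads/writes also draw from this same budget.
    return vcpus / 2

print(gcp_egress_cap_gbps(16))  # 8.0 -- n1-standard-16, vs 10 Gbps for c5d.4xlarge
```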

[1] [https://cloud.google.com/blog/products/gcp/5-steps-to-better-gcp-network-performance](https://cloud.google.com/blog/products/gcp/5-steps-to-better-gcp-network-performance) (see section #3).

[2] [https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/](https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/)

[3] [https://cloud.google.com/compute/docs/disks/performance#size_price_performance](https://cloud.google.com/compute/docs/disks/performance#size_price_performance) (see the table re: disk size to bandwidth)

~~~
peferron
> Also, for your local disk test, on GCP, disk size makes a difference to get
> the best performance. The larger the disk, the better the performance. Disk
> read/write performance also comes out of the available network bandwidth

Do you have a source for local SSD performance coming out of the available
network bandwidth? According to GCP docs [1], this only applies to persistent
disks. Local SSD perf depends only on disk size and choice of SCSI/NVMe
interface.

According to another GCP doc [2], local SSDs are all 375 GB in size. For
comparison, c5d.4xlarge has 400 GB, which is very close. So I don't see
anything wrong in the benchmark unless they messed up and ran it against the
persistent root disk instead of the local SSD.

[1]
[https://cloud.google.com/compute/docs/disks/performance#type...](https://cloud.google.com/compute/docs/disks/performance#type_comparison)

[2]
[https://cloud.google.com/compute/docs/disks/#localssds](https://cloud.google.com/compute/docs/disks/#localssds)

~~~
WestCoastJustin
You are right. Sorry for the confusion. It is only PD (Persistent Disk) that
comes out of network bandwidth. Anything on the NVMe SSD would be totally
local to the machine (no network caps, etc). The article doesn't really say if
they are using SSD PD or SSD NVMe for GCP. Also, disk size does matter for
NVMe SSD performance (as you can stripe disks together by adding more; up
to 4). You can see the #'s by using the console and playing around with adding
more NVMe SSDs (via this doc [1]).

    
    
      Size      Random IOPS                Throughput limit (MB/s)
      375GB     169,987 (r)  90,000 (w)      663 (r)   352 (w)
      750GB     339,975 (r) 180,000 (w)    1,327 (r)   705 (w)
      1125GB    509,962 (r) 270,000 (w)    1,991 (r) 1,057 (w)
      1500GB    679,950 (r) 360,000 (w)    2,650 (r) 1,400 (w)
    

[1] [https://cloud.google.com/compute/docs/disks/local-ssd](https://cloud.google.com/compute/docs/disks/local-ssd)
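The table scales roughly linearly with the number of 375GB devices striped together. A sketch of that scaling from the single-device row (numbers taken from the table above; treat them as point-in-time):

```python
# Per-device limits from the 375GB row of the table above.
PER_DEVICE = {"read_iops": 169_987, "write_iops": 90_000,
              "read_mbps": 663, "write_mbps": 352}

def local_ssd_limits(devices):
    """Approximate limits when striping `devices` local SSDs (linear model)."""
    assert 1 <= devices <= 4, "up to 4 local SSDs could be attached at the time"
    return {k: v * devices for k, v in PER_DEVICE.items()}

print(local_ssd_limits(2)["read_iops"])  # 339974 -- the table's 339,975, modulo rounding
```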

------
raboukhalil
Choosing a cloud provider is about more than just performance. Personally, I
lean towards GCP because of the combination of awesome UX, custom VM
configurations (which also means GPUs attached to a custom # of CPUs),
no-bidding spot instances, and the multi-regional cloud storage offering that
replicates data across many regions for much less than AWS.

I wrote about this last year if you're curious:
[https://medium.com/@robaboukhalil/a-tale-of-two-clouds-amazon-vs-google-4f2520516a38](https://medium.com/@robaboukhalil/a-tale-of-two-clouds-amazon-vs-google-4f2520516a38)

~~~
yovagoyu
Yeah, because who cares about performance when you have a pretty UX?

~~~
hueving
You're confusing UX with UI. You can have an excellent UX that's still just a
CLI.

~~~
yovagoyu
Google uses "gsutil" and "gcloud" whereas AWS just uses "aws". Both are
clear, but because Google arbitrarily has two tools, I constantly have to
remember which one has which feature.

It's not a big thing, but little stuff like this makes it more cumbersome to
use.

~~~
ernsheong
There's also bq for BigQuery stuff.

If they did merge... it would be "gcloud bq..." or "gcloud storage...", my
fingers protest :)

------
null000
Honestly, I would have expected performance on a cloud provider to be measured
in work per dollar. You're pretty much renting everything, the hardware etc.
is really difficult to compare, and you can usually throw more machines at the
problem anyway, so measuring a single machine against a single machine doesn't
make much sense if the two might cost vastly different amounts.
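For example, a hypothetical normalization along those lines (made-up numbers, not the report's):

```python
def perf_per_dollar(throughput, hourly_price_usd):
    """Work per dollar: the metric the parent suggests comparing on."""
    return throughput / hourly_price_usd

# Hypothetical machines: A is 20% faster but 40% more expensive than B.
a = perf_per_dollar(12_000, 1.40)
b = perf_per_dollar(10_000, 1.00)
print(a < b)  # True -- the slower machine wins on work per dollar
```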

------
planckscnst
> At first glance it appears that GCP has a tighter latency spread (when
> compared to the network throughput) centered on 0.2 ms

Comparing the distributions of two completely different metrics is not
particularly meaningful; you can change the histogram buckets to a different
size and it will look just as spread out. When I first read it, I thought it
was saying GCP has a tighter spread than AWS. That was confusing, especially
since the chart immediately above it seems to have the AWS and GCP numbers
flipped.

~~~
gamegoblin
I can't find the text you quoted in the article. Did they edit it out?

~~~
planckscnst
Yes, it looks like the text was edited, but the table is still incorrect.

~~~
orangechairs
(editor here) Table was corrected last night, and the text was edited per the
comment above and a few keen eyes internally who also caught the confusing
sentence.

------
Rafuino
Hmm it'd take forever to dig into the respective documentation at AWS and GCP,
but from a quick look the CPU frequency alone is quite different (3.0 GHz for
AWS's Xeon Scalable and 2.0 GHz for GCP's Xeon Scalable), and we don't know
anything about CPU cache sizes, etc. etc. That's problematic to start. Then we
have very little info to go by on the underlying storage performance. More
problematic, I don't know how large the working data set is for TPM-C (i.e.
how many warehouses are being simulated?), so I can't tell how much of the
storage is being used. I assume it's larger than the 60GB of DRAM offered on
the GCP instance (thus spilling into the storage), but with the CPU
differences and unknown storage performance, I don't know what to make of this
report.

------
verdverm
Is there anything open source so I can reproduce these results?

~~~
awoods187
All of the benchmarks we used to test are open source.

TPC-C: [https://www.cockroachlabs.com/docs/stable/performance-benchmarking-with-tpc-c.html](https://www.cockroachlabs.com/docs/stable/performance-benchmarking-with-tpc-c.html)

Sysbench: [https://github.com/akopytov/sysbench](https://github.com/akopytov/sysbench)

Stress-ng: [https://kernel.ubuntu.com/~cking/stress-ng/](https://kernel.ubuntu.com/~cking/stress-ng/)

iPerf: [https://github.com/esnet/iperf](https://github.com/esnet/iperf)

ping: [https://linux.die.net/man/8/ping](https://linux.die.net/man/8/ping)

~~~
verdverm
I'm looking for the experimental settings, as it would be difficult to
reproduce without that detail. Could you post the scripts to GitHub?

------
kyrra
Is this table labeled wrong?

[https://d33wubrfki0l68.cloudfront.net/f61cd6683f5c13f8d2b506a8cc45c8e9e8e70430/2d211/uploads/2018/12/aws_v_gcp_network-latency-table.png](https://d33wubrfki0l68.cloudfront.net/f61cd6683f5c13f8d2b506a8cc45c8e9e8e70430/2d211/uploads/2018/12/aws_v_gcp_network-latency-table.png)

All the text around it says GCP is worse, but the table shows GCP is better.

~~~
awoods187
The labels are incorrect (we combined the charts in the final draft)--fix
incoming. GCP was much worse than AWS on this test.

~~~
kyrra
Looks fixed now. Thanks.

------
derefr
> What about network throughput variance? On AWS, the variance is only 0.006
> GB/sec. This means that the GCP network throughput is 81x more variable when
> compared to AWS.

What would cause this particular effect? It's very interesting.

Is it, perhaps, that with GCP you're hitting the capacity of the network,
while with AWS you're being artificially capped at that speed on a network
that could theoretically go faster?

Or maybe it's just different strategies for bandwidth-limiting instances
employed by AWS's SDN layer vs. GCP's? Probabilistic packet-drop (to force TCP
window scaling) vs. artificially-induced nanosecond-scale egress latencies?

~~~
wmf
GCP Andromeda is software-based network virtualization, which tends to have
lower performance and higher variability.
[https://www.usenix.org/node/211244](https://www.usenix.org/node/211244)

AWS Nitro/ENA is hardware network virtualization which is faster and more
consistent.

------
riking
Did anyone consider comparing the cost of this vs. Cloud Spanner?

Because CockroachDB has the explicit inspiration of being Google's Spanner
without the special hardware, so... why not just use Spanner with the special
hardware instead?

------
nogbit
Launch EKS on AWS, wait 20-30min. Launch GKE on GCP, wait 5min.

------
iamgopal
So, in conclusion: when using CockroachDB, with 32GB of RAM instead of 60GB
and with different throughput settings, you may consider AWS, as it provides
slightly better cost by using fewer resources. And since there are no other
big players in the market, we will test only two of them, and our only choice
will be AWS.

------
planckscnst
I wonder if they will consider expanding the cloud report to things that are
not necessarily relevant to the Cockroach Labs use case. In my work, latency
to the block store (especially outliers) is very relevant. It would be great
to see latency distributions of various workloads (sequential/random and
read/write/mixed) against the platform's distributed block store.

edit: I missed that 95th percentile latency for read and write was included.
That is helpful; I also typically look at p99.99 and p100.

~~~
londons_explore
You can trade off tail latency vs. cost yourself easily.

For example, you could keep two copies of your data in the block store, issue
reads to both simultaneously, and use whichever returns first. Suddenly,
your 99% latency becomes your 99.99% latency...

You can do the same for writes (albeit a bit more complex).
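A minimal sketch of that duplicate-read trick in Python. The `read_replica` function here is a stand-in for a real block-store client, and the 1% outlier rate is made up for illustration:

```python
import concurrent.futures
import random
import time

def read_replica(replica_id, key):
    # Stand-in for a real block-store read; ~1% of requests are slow outliers.
    time.sleep(2.0 if random.random() < 0.01 else 0.01)
    return "value-of-%s-from-%s" % (key, replica_id)

def hedged_read(key, replicas=("a", "b")):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(read_replica, r, key) for r in replicas]
    # Return as soon as either replica answers; both outliers must coincide
    # (p squared) for the hedged read itself to be slow, which is why a p99
    # per-replica tail becomes roughly a p99.99 tail overall.
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return next(iter(done)).result()

print(hedged_read("some-key"))
```

The obvious cost is that every read consumes double the I/O (and requires double the stored copies), which is exactly the cost/latency trade-off being described.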

~~~
planckscnst
Yes, the cost/latency trade-off is exactly why this can be interesting.
Significantly fewer or less-impactful outliers can save a lot of money (or
allow a better SLA with the same cost).

------
ernsheong
Most small companies don't reserve for 3 years. The on-demand monthly cost is
$388 (GCP) vs. $562 (AWS) per instance for the instances in the report
(omitting SSD costs).
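Annualizing those figures per instance (on-demand, no committed/reserved discounts, SSD excluded as above):

```python
GCP_MONTHLY, AWS_MONTHLY = 388, 562  # USD, on-demand, from the comment above

yearly_delta = (AWS_MONTHLY - GCP_MONTHLY) * 12
print(yearly_delta)  # 2088 -- dollars/year more per instance on AWS on-demand
```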

~~~
ernsheong
Furthermore, GCP encrypts its disks by default. So I'd expect a slight
degradation of performance.

------
kharms
For someone who has worked with both: which offers a better developer
onboarding experience, for small web/data apps? Ease of learning and use wise.

~~~
social_quotient
I prefer AWS to Google and MSFT for account management purposes. Google and
MSFT are so dead set on binding your logins to your global accounts, which
might be used for other things.

I know someone might argue that AWS and amazon.com share the same account. I
guess they "can", but it's easy to make a new AWS account with just an email -
I think we manage about 12-15 different AWS accounts. We tend to isolate major
clients or projects by making entirely new accounts. I don't feel like I'm
working uphill to keep multiple AWS accounts from merging or somehow binding
to my personal shopping account on amazon.com. With MSFT and Google it feels
like I am always almost "tricked" into binding multiple accounts and login
states together.

I would argue that once you have an account up and running, most of these
providers are similar, with differences. The UI on AWS isn't glamorous, but I
find its utilitarian simplicity pretty easy to deal with, and I get most
things done.

~~~
snuxoll
On the note of account management, you cannot be a Google Cloud Partner
without GSuite or Google Cloud Identity accounts, period, end of story. The
whole setup is just ludicrous: I have to pay Google money just to have a
partner login? Even Microsoft offers a "good enough" tier of Azure AD that can
be used to sign up for their partner portal, and AWS just uses normal Amazon
accounts.

------
kerng
How does this compare to the number 2 in the cloud, Microsoft Azure?

------
elmo1788
Very promising results. Look forward to seeing what’s in store for 2019!

------
ta_271828
This is interesting, given that I heard on the grapevine that some major cloud
players are actually using AWS on the back-end even though they advertise,
say, "GCP". Wonder if anyone can confirm or deny...

~~~
WestCoastJustin
By nature, large companies have massive global teams and there is no single
provider for anything. Team A could be using AWS, while team B could be using
GCP, and team C is using Azure. Just because team B says they are using GCP
doesn't mean the others are lying, or that there is anything weird going on.

~~~
ta_271828
Actually I meant to say that the news on the grapevine is that the big cloud
players (GCP etc.) are potentially outsourcing demand for cloud services in
excess of their capacity to AWS...

~~~
WestCoastJustin
To be blunt: no. This is totally absurd and would be trivial to detect if
true. On the legal side, you'd be breaking all sorts of ToS, security, and
privacy agreements (you'd have to disclose this via a data processor clause).
On the technical side, the latency would make it obvious too: they are totally
different hardware/software platforms and would have different
characteristics (as proven by this thread). On the business side, no
AWS/GCP/Azure CEO would ever do this, and staff would certainly be aware.
This is 100% bogus.

I actually did not even want to reply, since this makes zero sense, but I
felt invested. Whoever told you this doesn't know what they are talking about.

~~~
JaRail
That said, hosting providers who historically ran services only on their own
hardware could certainly be load-balancing to cloud hosting providers. This is
certainly not what the comment you responded to implies, but it is something
I'd expect people to get confused about.

~~~
WestCoastJustin
You are right; a good example might be Heroku. Heroku is a hosting platform
that is hosted on AWS (maybe partly or entirely; I'm not sure).

 _" Heroku’s physical infrastructure is hosted and managed within Amazon’s
secure data centers and utilize the Amazon Web Service (AWS) technology.
Amazon continually manages risk and undergoes recurring assessments to ensure
compliance with industry standards."_ See
[https://www.heroku.com/policy/security](https://www.heroku.com/policy/security).

This is the type of disclosure I was talking about.

