AWS vs. GCP reliability is wildly different (freeman.vc)
544 points by icyfox 9 days ago | 234 comments





There were 84 errors for GCP, but the breakdown says 74 409s and 5 timeouts. Maybe it was 79 409s? Or 10 timeouts?

I suspect the 409 conflicts are probably from the instance name not being unique in the test. It looks like the instance name used was:

    instance_name = f"gpu-test-{int(time())}"
which has a 1-second precision. The test harness appears to do a `sleep(1)` between test creations, but this sort of thing can have weird boundary cases, particularly because (1) it does cleanup after creation, which will have variable latency, (2) `int()` will truncate the fractional part of the second from `time()`, and (3) `time.time()` is not monotonic.

I would not ask the author to spend money to test it again, but I think the 409s would probably disappear if you replaced `int(time())` with `uuid.uuid4()`.
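A minimal sketch of that change, assuming the harness only needs a unique, valid name. `make_instance_name` is a hypothetical helper, not code from the author's repository:

```python
import uuid


def make_instance_name(prefix: str = "gpu-test") -> str:
    # uuid4() is random, so two calls within the same second still get
    # different names, unlike int(time()), which repeats for every call
    # made within one second.
    # .hex is 32 lowercase hex characters, keeping the name short and simple.
    return f"{prefix}-{uuid.uuid4().hex}"


# Generate many names back to back; a set would collapse any duplicates.
names = {make_instance_name() for _ in range(1000)}
```

With the default prefix each name is 41 characters long, so it stays well under typical resource-name length limits.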

Disclosure: I work at Google - on Google Compute Engine. :-)


> (3) `time.time()` is not monotonic.

I just winced in pain thinking of the ways that can bite you. I guess in a cloud/virtualized environment with many short lived instances it isn't even that obscure an issue to run into.

A nice discussion on Stack Overflow:

https://stackoverflow.com/questions/64497035/is-time-from-ti...


> I just winced in pain thinking of the ways that can bite you.

Something similar caused my favorite bug so far to track down.

We were seeing odd spikes in our video playback analytics of some devices watching multiple years worth of video in < 1 hour.

System.currentTimeMillis() in Java isn't monotonic either; that's my short answer for what was causing it. Tracking down _what_ was causing it was even more fun though. Devices (phones) were updating their system time from the network and jumping between timezones.


That's a bad day at the office when you have to go and say "hey remember all that data we painstakingly collected and maybe even billed clients for?"

Luckily I worked for the company that made the analytics tools and consumed them!

Bosses actually came to us because our analytics team was trying to figure out who was causing it, because it had been caught by the team doing checks against the data. (a playback period should never have had > 30s of time)


Yes. When people write `time.time()` they almost always actually want `time.monotonic()`.

A lot of the places you use time.time and would be bitten by non-monotonicity you probably want something like time.perf_counter, which is useless for measuring absolute time but perfect for calculating time elapsed.
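A small sketch of the elapsed-time case, assuming you only care about durations and not wall-clock time. `timed` is a hypothetical helper:

```python
import time


def timed(fn):
    # time.monotonic() never goes backwards, even if the system clock is
    # adjusted, so the difference between two readings is a safe measure
    # of elapsed time. time.perf_counter() can be swapped in the same way.
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(lambda: sum(range(100_000)))
```

The readings themselves are meaningless as timestamps; only the difference between two of them is useful.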

This is a very good point - AWS uses tags to give instances a friendly name, so the name does not have to be unique. The same logic would not fail on AWS.

Which makes 2000% sense.

Why would any tenant supplied data affect anything whatsoever?

As a tenant, unless you are clashing with another resource under your own name, I don't see the point of failing.

aws S3 would be an exception, where they make that limitation on globally unique bucket name very clear.


Idempotency.

You're inserting a VM with a specific name. If you try to create the same resource twice, the GCE control plane reports that as a conflict.

What they're doing here would be roughly equivalent to supplying the time to the AWS RunInstances API as an idempotency token.

(I work on GCE, and asked an industry friend at AWS about how they guarantee idempotency for RunInstances).


GCP control plane is generally not idempotent.

When trying to create the same resource twice, all requests should report the same status instead of one failing and one succeeding.

In AWS, their APIs allow you to supply a client token if the API is not idempotent by default.

See https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_I....
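To make the distinction concrete, here is a toy in-memory model, not either provider's real API. `ToyControlPlane`, `create_by_name`, and `create_with_token` are made-up names; the sketch only contrasts name conflicts with client-token idempotency:

```python
import itertools


class Conflict(Exception):
    """Stands in for an HTTP 409 response."""


class ToyControlPlane:
    def __init__(self):
        self._ids = itertools.count(1)
        self.by_name = {}   # caller-supplied name -> instance id
        self.by_token = {}  # client token -> instance id

    def create_by_name(self, name):
        # Name-keyed creation: a repeat is rejected, so no duplicate VM is
        # made, but the retry sees a different status than the first call.
        if name in self.by_name:
            raise Conflict(name)
        self.by_name[name] = next(self._ids)
        return self.by_name[name]

    def create_with_token(self, token):
        # Token-keyed creation: a repeat returns the original result.
        if token not in self.by_token:
            self.by_token[token] = next(self._ids)
        return self.by_token[token]
```

Both variants avoid creating a second VM; they differ only in what the caller sees on the repeated request.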


> When try to create the same resource twice, the second should report success instead of failing.

Before I quibble with the idempotency point: I agree with this, entirely, but it is what it is and a lot of software has been written against the current behavior. So I'll cite Hyrum's law here: https://www.hyrumslaw.com/

> GCP control plane is generally not idempotent.

The GCE API occupies an odd space here, imo. The resource being created is, in practice, an operation to cause the named VM to exist. The operation has its own name, but the name of the VM in the insert operation is the name of the ultimate resource.

Net, the API is idempotent at a macro level in terms of the end-to-end creation or deletion of uniquely named resources. Which is a long winded way of saying that you're right, but that from a practical perspective it accomplishes enough of the goals of a truly idempotent API to be _useful_ for avoiding the same things that the AWS mechanism avoids: creation of unexpected duplicate VMs.

The more "modern" way to do this would be to have a truly idempotent description of the target state of the actual resource with a separate resource for the current live state, but we live with the sum of our past choices.


It's a lot shorter to write:

You're right, we did it wrong.

// And paradoxically makes engineers like you.


Sure, except I think that at a macro level we got it more right than AWS, despite some choices that I believe we'd make differently today.

The GCE API can be idempotent if you'd like. Fill out the requestId field with the same UUID in multiple instances.insert calls (or other mutation calls) and you will receive the same operation Id back in response.
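As a rough sketch of what that looks like on the wire, assuming the v1 REST endpoint and placeholder project and zone values (`my-project` and `us-central1-b` are illustrative). Nothing is sent here; the code only builds the URL that a retry would reuse:

```python
import uuid
from urllib.parse import urlencode


def insert_url(project: str, zone: str, request_id: str) -> str:
    # requestId is passed as a query parameter; reusing the same value on a
    # retry is what lets the server recognise the repeated request.
    base = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project}/zones/{zone}/instances"
    )
    return f"{base}?{urlencode({'requestId': request_id})}"


request_id = str(uuid.uuid4())  # generate once, keep it for every retry
first = insert_url("my-project", "us-central1-b", request_id)
retry = insert_url("my-project", "us-central1-b", request_id)
```

The important design point is that the UUID is generated once per logical operation, not once per HTTP attempt.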

Disclaimer: I work on GCE.


Today I learned! I'll admit I didn't know this functionality existed, and I've instead used instances.insert followed by querying the VM resource.

This is nicer!


Do you really need idempotency for runVM though?

I mean, it's kinda nice to know that if you reissue a request for an instance that could cost thousands of dollars per month due to a network glitch that you won't accidentally create two of them?

More practically, though, the instance name here is literally the name of the instance as it appears in the RESTful URL used for future queries about it. The 409 here is rejecting an attempt to create the same explicitly named resource twice.


Sounds like AWS got it right.

You're entitled to that takeaway, but I disagree. I believe GCP's tendency to use caller-supplied names for resources is one of the single best features of the platform, particularly when compared against AWS's random hex identifiers.

Note that whether this creates collisions is entirely under the customer's control. There's no requirement for global uniqueness, just a requirement that you not try to create two VMs with the same name in the same project in the same zone.


With GCE can you create 10 instances or do you need to create all 10 individually?

As far as I know, the `instances.insert` API only allows individual VMs, although the CLI can issue a bulk set of API calls[0], and MIGs (see below) allow you to request many identical VMs with a single API call if that's for some reason important.

You can also batch API calls[1], which also gives you a response for each VM in the batch while allowing for a single HTTP request/response.

That said, if you want to create a set of effectively identical VMs all matching a template (i.e., cattle not pets), though, or you want to issue a single API call, we'd generally point you to managed instance groups[2] (which can be manually or automatically scaled up or down) wherein you supply an instance template and an instance count. The MIG is named (like nearly all GCP resources), as are the instances, with a name derived from the MIG name. After creation you can also have the group abandon the instances and then delete the group if you really wanted a bunch of unmanaged VMs created through a single API call, although I'll admit I can't think of a use-case for this (the abandon API is generally intended for pulling VMs out of a group for debugging purposes or similar).

For cases where for whatever reason you don't want a MIG (e.g., because your VMs don't share a common template), you can still group those together for monitoring purposes[3], although it's an after-creation operation.

The MIG approach sets a _goal_ for the instance count and will attempt to achieve (and maintain) that goal even in the face of limited machine stock, hardware failures, etc. The top-level API will reject (stock-out) in the event that we're out of capacity, or in the batch/bulk case will start rejecting once we run out of capacity. I don't know how AWS's RunInstances behaves if it can only partially fulfill a request in a given zone.

[0]: https://cloud.google.com/compute/docs/instances/multiple/cre...

[1]: https://cloud.google.com/compute/docs/api/how-tos/batch

[2]: https://cloud.google.com/compute/docs/instance-groups

[3]: https://cloud.google.com/compute/docs/instance-groups/creati...


> unless you are clashing with another resource under your own name, I don't see the point of failing.

Is that not the conclusion? The tester was clashing with their own names?


What are your thoughts on the generally slower launch times with a huge variance on GCP?

The author failed to mention in which regions these tests were run. GPU availability can vary depending on the regions that were tested for both cloud providers.

The author linked to the code at the end of the post.

The regions used are "us-east-1" for AWS [1] and "us-central1-b" for GCP [2].

1: https://github.com/piercefreeman/cloud-gpu-reliability/blob/...

2: https://github.com/piercefreeman/cloud-gpu-reliability/blob/...


This is a big missed point.

At work we run some (non-GPU) instances in every AWS region, and there's pretty big variability over time and region for on-demand launch time. I'd expect it might be even higher for GPU instances. I suspect that a more rigorous investigation might find there isn't quite as big a difference overall as this article suggests.

Just remember this is for GPU instances. Other vm families are pretty fast to launch.

FWIW in our use case of non-GPU instances they launched way faster and more consistently on GCP than AWS. So I guess it is complicated and may depend on exactly what instance you are launching.

Time is difficult.

Reminds me of this post on mtime which recently resurfaced on HN: https://apenwarr.ca/log/20181113


I've naively used millisecond precision things for a long time - not in anything critical I don't think - but I've only recently come to more of an awareness that a millisecond is a pretty long time. Recent example is that I used a timestamp to version a record in a database, but it's feasible that in a Go application, a record could be mutated multiple times a millisecond by different users / processes / requests.

Unfortunately, millisecond-precise timestamps proved to be a bit tricky in combination with sqlite.


To put this into perspective, a game (or anything) running at 60 FPS only has a bit over 16 milliseconds to render each frame. These days higher frame rates are common enough, often putting you down into single-digit milliseconds per frame. Now think about how many things there are simulated and rendered in each frame for common games. Way more than 16.

Hope icyfox can try running this with a fix.

I wonder why someone would equate "instance launch time" with "reliability"... I won't go as far as calling it "clickbait" but wouldn't some other noun ("startup performance is wildly different") have made more sense?

Well, if your system elastically uses GPU compute and needs to be able to spin up, run compute on a GPU, and spin down in a predictable amount of time to provide reasonable UX, launch time would definitely be a factor in terms of customer-perceived reliability.

All the clouds are pretty upfront about availability being non-guaranteed if you don't reserve it. I wouldn't call it a reliability issue if your non-guaranteed capacity takes some tens of seconds to provision. I mean, it might be your reliability issue, because you chose not to reserve capacity, but it's not really unreliability of the cloud — they're providing exactly what they advertise.

"Guaranteed" has different tiers of meaning - both theoretical and practical.

In many cases, "guaranteed" just means "we'll give you a refund if we fuck up". SLAs are very much like this.

IN PRACTICE, unless you're launching tens of thousands of instances of an obscure image type, reasonable customers would be able to get capacity, and promptly from the cloud.

That's the entire cloud value proposition.

So no, you can't just hand-wave past these GCP results and say "Well, they never said these were guaranteed".


Ignoring the fact that the results are probably partially flawed due to methodology (see top-level comment from someone who works on GCE) and are not reproducible due to missing information, pointing out the lack of a guarantee is not hand-waving. The OP uses the word "reliability" to catch attention, which certainly worked, but this has nothing to do with reliability.

This isn't actually true, even for tiny customers. In a personal project, I used a single host of a single instance type several times per day and had to code up a fallback.

Try spinning up 32+ core instances with local SSDs attached, or anything not in the n1 family, and you will find that in many regions you can only have single digits of them.

I'd still consider it as "performance issue", not "reliability issue". There is no service unavailability here. It just takes your system a minute longer until the target GPU capacity is available. Until then it runs on fewer GPU resources, which makes it slower. Hence performance.

The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just re-try the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached) but reliability? not really
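A minimal sketch of that retry behaviour, assuming a hypothetical `create_instance` callable that raises on a transient failure. Real orchestrators distinguish retryable from non-retryable errors; this one does not:

```python
import time


def create_with_retry(create_instance, attempts=5, base_delay=1.0, sleep=time.sleep):
    # Retry a failing creation call with exponential backoff.
    # Returns (result, attempts_used); re-raises after the final attempt.
    for attempt in range(1, attempts + 1):
        try:
            return create_instance(), attempt
        except Exception:
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the backoff can be exercised without actually waiting.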


I’d like to see a breakdown of the cost differences. If the costs are nearly equal, why would I not choose the one that has a faster startup time and fewer errors?

With GCP you can right-size the CPU and memory of the VM the GPU is attached to, unlike the fixed GPU AWS instances, so there is the potential for cost savings there.

Sure but not anywhere remotely near clearing the bar to simply calling that “reliability”.

When I think “reliability” I think “does it perform the act consistently?”

Consistently slow is still reliability.


It is not reliably running the machine but reliably getting the machine.

Like the article said, the promise of the cloud is that you can easily get machines when you need them. The cloud that sometimes does not get you that machine (or does not get you that machine in time) is a less reliable cloud than the one that does.


It’s still performance. If this was “AWS failed to deliver the new machines and GCP delivered”, sure, reliability. But this isn’t that.

The race car that finishes first is not “more reliable” than the one in 10th. They are equally as reliable, having both finished the race. The first place car is simply faster at the task.


The one in first can more reliably win races however.

You cannot infer that based on the results of the race...that's literally the entire point I am making. The 1st place car might blow up in the next race, the 10th place car might finish 10th place for the next 100 races.

If the article were measuring HTTP response times and found that AWS's average response time was 50ms and GCP's was 200ms, and both returned 200s for every single request in the test, would you say AWS is more reliable than GCP based on that? Of course not, it's asinine.


If you want that promise you can reserve capacity in various ways. Google has reservations. Folks use this for DR, your org can get a pool of shared ones going if you are going to have various teams leaning on GPU etc.

The promise of the cloud is that you can flexibly spin up machines if available, and easily spin down, no long term contracts or CapEx etc. They are all pretty clear that there are capacity limits under the hood (and your account likely has various limits on it as a result).


I would still call it "reliability".

If the instance takes too long to launch then it doesn't matter if it's "reliable" once it's running. It took too long to even get started.


Why would you not call it “startup performance”.

Calling this reliability is like saying a Ford is more reliable than a Chevy because the Ford has a better throttle response.


that's not what reliability means

> that's not what reliability means

What is your definition of reliability?


unfortunately cloud computing and marketing have conflated reliability, availability and fault tolerance so it's hard to give you a definition everyone would agree to, but in general I'd say reliability is referring to your ability to use the system without errors or significant decreases in throughput, such that it's not usable for the stated purpose.

in other words, reliability is that it does what you expect it to. GCP does not have any particular guarantees around being able to spin up VMs fast, so its inability to do so wouldn't make it unreliable. it would be like me saying that you're unreliable for not doing something when you never said you were going to.

if this were comparing Lambda vs Cloud Functions, who both have stated SLAs around cold start times, and there were significant discrepancies, sure.


true, the grammar and semantics work out, but since reliability needs a target usually it's a serious design flaw to rely on something that never demonstrably worked like your reliability target assumes.

so that's why in engineering it's not really used as such. (as far as I understand at least.)


Why would you scale to zero in high perf compute? Wouldn't it be wise to have a buffer of instances ready to pick up workloads instantly? I get that it shouldn't be necessary with a reliable and performant backend, and that the cost of having some instances waiting for jobs can be substantial depending on how you do it, but I wonder if the cost difference between AWS and GCP would make up for that and you can get an equivalent amount of performance for an equivalent price? I'm not sure. I'd like to know though.

> Why would you scale to zero in high perf compute?

Midnight - 6am is six hours. The on demand price for a G5 is $1/hr. That's over $2K/yr, or "an extra week of skiing paid for by your B2B side project that almost never has customers from ~9pm west coast to ~6am east coast". And I'm not even counting weekends.
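Spelling out that arithmetic, taking the round numbers above ($1/hr, six idle hours a night) as assumptions rather than actual pricing:

```python
idle_hours_per_night = 6   # midnight to 6am
price_per_hour = 1.00      # assumed on-demand price in USD, from the comment
nights_per_year = 365

idle_cost_per_year = idle_hours_per_night * price_per_hour * nights_per_year
print(f"${idle_cost_per_year:,.0f} per year")  # $2,190 per year
```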

But that's sort of a silly edge case (albeit probably a real one for lots of folks commenting here). The real savings are in predictable startup times for bursty work loads. Fast and low variance startup times unlock a huge amount of savings. Without both speed and predictability, you have to plan to fail and over-allocate. Which can get really expensive fast.

Another way to think about this is that zero isn't special. It's just a special case of the more general scenario where customer demand exceeds current allocation. The larger your customer base, and the burstier your demand, the more instances you need sitting on ice to meet customers' UX requirements. This is particularly true when you're growing fast and most of your customers are new; you really want a good customer experience every single time.


Scaling to zero means zero cost when there is zero work. If you have a buffer pool, how long do you keep it populated when you have no work?

Maintaining a buffer pool is hard. You need to maintain state, have a prediction function, track usage through time, etc. Just spinning up new nodes for new work is substantially easier.

And the author said he could spin up new nodes in 15 seconds, that’s pretty quick.


GCP provides elastic features for that. One should use them instead of manually requesting new instances.

Hopefully anyone with a workload that's that latency sensitive would have a preallocated pool of warmed up instances ready to go.

Wouldn't Cloud Run be a better product for that use case?

It is clickbait, the real title should be "AWS vs. GCP on-demand provisioning of GPU resources performance is wildly different".

That said, while I agree that launch time and provisioning error rate are not sufficient to define reliability, they are definitely a part of it.


“ AWS vs. GCP on-demand provisioning of GPU resources performance is wildly different”

yeah i guess it does make sense that one didn’t win the a/b test


> wildly different

For this, I'd prefer a title that lets me draw my own conclusions. 84 errors out of 3000 doesn't sound awful to me...? But what do I know – maybe just give me the data:

"1 in 3000 GPUs fail to spawn on AWS. GCP: 84"

"Time to provision GPU with AWS: 11.4s. GCP: 42.6s"

"GCP >4x avg. time to provision GPU than AWS"

"Provisioning on GCP both slower and more error-prone than AWS"


84 of 3000 failed is only "one nine"
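For reference, the arithmetic behind that, using the figures quoted in this thread (84 failures out of 3000 attempts) and the usual convention that "nines" counts the leading 9s in the success rate:

```python
import math

attempts, failures = 3000, 84
failure_rate = failures / attempts      # 0.028
success_rate = 1 - failure_rate         # 0.972

# -log10(failure rate) is 1 for 90%, 2 for 99%, 3 for 99.9%, and so on;
# flooring it gives the number of whole nines.
nines = math.floor(-math.log10(failure_rate))
print(f"{success_rate:.1%} success, {nines} nine(s)")  # 97.2% success, 1 nine(s)
```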

GCP also had 84 errors compared to 1 for AWS

Another comment on this thread pointed out they had a potential collision in their instance name generation which may have caused this. That would mean this was user error, not a reliability issue. AWS doesn’t require instance names to be unique.

Maybe 1 reported. Not saying aws reliability is bad, but the number of various glitches that crop up in various aws services and not reflected on their status page is quite high.

Errors returned from APIs and the status page are completely separate topics in this context.

that was measured from API call return codes, not by looking at overall service status page

Amazon is pretty good about this, if their API says machine is ready, it usually is.


They were almost exclusively user errors (HTTP 4xx). They are supposed to indicate that the API is being used incorrectly.

Although, it seems the author couldn't find out why they occurred, which points to poor error messages and/or lacking documentation.


If not a 4xx, what should they return for instance not available?

503 service unavailable?

It's not the service that's unavailable. The resource isn't available. The service is running just fine.

GCP error messages will indicate if resources were not available, if you reached your quota, or if it was some other error. Tests like OP can differentiate these situations

Yeah, 4xx is client error, 5xx is server error.
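In Python terms, a quick sketch using the standard library's `http.HTTPStatus` (the `status_class` helper is made up for illustration):

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    # The first digit of an HTTP status code identifies its class.
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    return "other"


for code in (409, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```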

Yes, and trying to create duplicate resources is a client error.

Still, 409 seems inappropriate, as it is meant to signal a version conflict, i.e. someone else changed something, and the user tried to upload a stale version.

”10.4.10 409 Conflict

The request could not be completed due to a conflict with the current state of the resource. This code is only allowed in situations where it is expected that the user might be able to resolve the conflict and resubmit the request. The response body SHOULD include enough information for the user to recognize the source of the conflict. Ideally, the response entity would include enough information for the user or user agent to fix the problem; however, that might not be possible and is not required.

Conflicts are most likely to occur in response to a PUT request. For example, if versioning were being used and the entity being PUT included changes to a resource which conflict with those made by an earlier (third-party) request, the server might use the 409 response to indicate that it can't complete the request. In this case, the response entity would likely contain a list of the differences between the two versions in a format defined by the response Content-Type.”

Then again, perhaps it is the service itself making that state change.


That would be confusing. The HTTP response code should not be conflated with the application's state.

There will come a moment in time when you realize exactly what you have stated here and why it is not a good mental palace to live in.

Using HTTP error codes for non-REST things is cringe.

503 would mean the IaaS API calls themselves are unavailable. Very different from the API working perfectly fine but the instances not being available.


What? REST is just some API philosophy, it doesn't even have to be on top of HTTP.

Why would you think HTTP status codes are made for REST? They are made for HTTP to describe the response of the resource you are requesting, and the AWS API uses HTTP so it makes sense to use HTTP status codes.


Cloud reliability is not the same as a reliability of already spawned VM.

Here it's the possibility to launch new VMs to satisfy dynamic projects' needs. Cloud provider should allow you to scale-up in a predictable way. When it doesn't - it can be called unreliable.

Also, "unreliable" is basically a synonym for "Google" these days.


Let me unreliable that for you.

To be fair their search is so crap lately, throwing the dice is not the worst option in the world to find a result that will be actually useful.

I'll say it is valid to use reliability.

If I depend on some performance metric, startup, speed, etc, my dependence on it equates to reliability. Not just on/off but the spectrum that it produces.

If a CPU doesn't operate at its 2GHz setting 60% of the time, I would say that's not reliable. When my bus shows up on time only 40% of the time - I can't rely on that bus to get me where I need to go consistently.

If the GPU took 1 hour to boot, but still booted, is it reliable? What about 1 year? At some point it tips over a "personal" metric of reliability.

The comparison to AWS which consistently out-performs GCP, while not explicitly, implicitly turns that into a reliability metric by setting the AWS boot time as "the standard".


I mean if you're talking about worst case systems you assume everything is gone except your infra code and backups. In that case your instance launch time would ultimately define what your downtime looks like assuming all else is equal. It does seem a little weird to define it that way but in a strict sense maybe not.

Well, I mean it is measuring how reliably you can get a GPU instance. But it certainly isn't the overall reliability. And depending on your workflow, it might not even be a very interesting measure. I would be more interested in seeing a comparison of how long regular non-GPU instances can run without having to be rebooted, and maybe how long it takes to allocate a regular VM.

"AWS encountered one valid launch error in these two weeks whereas GCP had 84."

84 times more launch errors seems like a valid definition for "less reliable".


Reliability is a fair term, with an asterisk. It is a specific flavor of reliability: deployment or scaling or net-new or allocation or whatever you want to call it.

I won't go so far as saying "you didn't read the article", but I think you missed something.

They are talking about the reliability of AWS vs GCP. As a user of both, I'd categorize predictable startup times under reliability because if it took more than a minute or so, we'd consider it broken. I suspect many others would have even tighter constraints.

Anecdotally I tend to agree with the author. But this really isn't a great way of comparing cloud services.

The fundamental problem with cloud reliability is that it depends on a lot of stuff that's out of your control, that you have no visibility into. I have had services running happily on AWS with no errors, and the next month without changing anything they fail all the time.

Why? Well, we look into it and it turns out AWS changed something behind the scenes. There's a different underlying hardware behind the instance, or some resource started being in high demand because of some other customers.

So, I completely believe that at the time of this test, this particular API was performing a lot better on AWS than on GCP. But I wouldn't count on it still performing this way a month later. Cloud services aren't like a piece of dedicated hardware where you test it one month, and then the next month it behaves roughly the same. They are changing a lot of stuff that you can't see.


Those were my thoughts. People are probably pummeling the GCP GPU free tier right now with Stable Diffusion image generators, since it seems like all the free plug-and-play examples use the Google Python notebooks.

You've just perfectly characterized why on-site infrastructure will always have its place.

You can reserve capacity on both of these services as well.

Instance types and regions make a big difference.

Some regions and hardware generations are just busier than others. It may not be the same across cloud providers (although I suspect it is similar given the underlying market forces).


> The offerings between the two cloud vendors are also not the same, which might relate to their differing response times. GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed. AWS only provisions defined VMs that have GPUs attached - the g4dn.x series of hardware here. Each of these instances are fixed in their CPU allocation, so if you want one particular varietal of GPU you are stuck with the associated CPU configuration.

At a surface level, the above (from the article) seems like a pretty straightforward explanation? GCP gives you more flexibility in configuring GPU instances at the trade off of increased startup time variability.


I wouldn't be surprised if GCP has GPUs scattered throughout the datacenter. If you happen to want to attach one, it has to find one for you to use - potentially live migrating your instance or someone else's so that it can connect them. It'd explain the massive variability between launch times.

Yeah that was my thought too when I first read the blurb.

It’s neat…but like a lot of things in large scale operations, the devil is in the details. GPU-CPU communications is a low latency high bandwidth operation. Not something you can trivially do over standard TCP. GCP offering something like that without the ability to flawlessly migrate the VM or procure enough “local” GPUs means it’s just vaporware.

As a side note, I’m surprised the author didn’t note the number of ICEs (insufficient capacity errors) AWS throws whenever you spin up a G type instance. AWS is notorious for offering very few G’s and P’s in certain AZs and regions.


Fungible is selling a GPU decoupling solution via PCIe encapsulated over Ethernet today, so it can certainly be done.

And NVIDIA's vGPU solutions do support live migration of GPUs to another host (in which case the vGPU gets moved too, to a GPU on that target).


I've only ever used AWS for this stuff. When the author said that you could just "add a GPU" to an existing instance, my first reaction was "wow, that sounds like it would be really complicated behind the scenes."

I doubt it would be set up like that. Compute is usually deployed as part of a large set of servers. The reason for that is different compute workloads require different uplink capacity. You don't need a petabyte of uplink capacity for many GPU loads but you may for compute. Just switching ASICs are much more expensive for 400G+ than 100G. That hasn't even got into the optics, NICs and other things. You don't mix and match compute across the same place in the data center traditionally.

A few weeks ago I needed to change the volume type on an EC2 instance to gp3. Following the instructions, the change happened while the instance was running. I didn't need to reboot or stop the instance, it just changed the type. While the instance was running.

I didn't understand how they were able to do this, I had thought volume types mapped to hardware clusters of some kind. And since I didn't understand, I wasn't able to distinguish it from magic.


Dunno about AWS, but GCP uses live migration, and will migrate your VM across physical machines as necessary. The disk volumes are all connected over the network, nothing really depends on the actual physical machine your VM runs on.

Disclosure: I work for Amazon, and in the past I worked directly on EC2.

From the FAQ: https://aws.amazon.com/ec2/faqs/

Q: How does EC2 perform maintenance?

AWS regularly performs routine hardware, power, and network maintenance with minimal disruption across all EC2 instance types. To achieve this we employ a combination of tools and methods across the entire AWS Global infrastructure, such as redundant and concurrently maintainable systems, as well as live system updates and migration.


> AWS regularly performs routine hardware, power, and network maintenance with minimal disruption across all EC2 instance types. To achieve this we employ a combination of tools and methods across the entire AWS Global infrastructure, such as redundant and concurrently maintainable systems, as well as live system updates and migration.

And yet, almost every week I keep getting emails like this:

"EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-xxxxxxx) associated with your AWS account (AWS Account ID: NNNNN) in the eu-west-1 region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2022-09-21 16:00:00 UTC"

And we don't have tens of thousands of VMs in that region, just around 1k.


Live migration can't be used to address every type of maintenance or underlying fault in a non-disruptive way.

Azure, AWS and GCP all have live migration. VMWare has it too.

Not really. Or at least not in the same league.

AWS doesn't have live migration at all. You have to stop/start.

Azure technically does, but it doesn't always work (they say 90%). 30 seconds is a long time.

VMWare has live migration (and seems to be the closest to what GCP does) but it is still an inferior user experience.

This is the key thing you are missing – GCP not only has live migration, but it is completely transparent. We do not have to initiate migration. GCP does, transparently, 100% of the time. We have never even noticed migrations, even when we were actively watching those instances. We don't know or care what hypervisors are involved. They even preserve the network connections.

https://cloudplatform.googleblog.com/2015/03/Google-Compute-...


VMware's live migration is totally seamless, so I don't know what you mean by "inferior user experience". You typically see less than a second of packet loss, and a small performance hit for about a minute while the memory is "swapped" across to the new machine. Similarly, VMware has had live storage migration for years.

VMware is lightyears ahead of the big clouds, but unfortunately they "missed the boat" on the public cloud, despite having superior foundational technology.

For example:

- A typical vSphere cluster would use live migration to balance workloads dynamically. You don't notice this as an end user, but it allows them to bin-pack workloads up above 80% CPU utilisation in my experience with good results. (Especially if you allocate priorities, min/max limits, etc...)

- You can version-upgrade a vSphere cluster live. This includes rolling hypervisor kernel upgrades and live disk format changes. The upgrade wizard is a fantastic thing that asks only for the cluster controller name and login details! Click "OK" and watch the progress bar.

- Flexible keep-apart and keep-together rules that can be updated at any time, and will take effect via live migration. This is sort-of like the Kubernetes "control loops", but the migrations are live and memory-preserving instead of stop-start like with containers.

- Online changes to virtual hardware, including adding not just NICs and disks, but also CPU and memory!

- Thin-provisioned disks, and memory deduplication for efficiencies approaching that of containerisation.

- Flexible snapshots, including the ability for "thin provisioned" virtual machines to share a base snapshot. This is often used for virtual desktops or terminal services, and again this approaches containerisation in terms of cloning speed and storage efficiency.

In other words, VMware had all of the pieces, and just... didn't... use it to make a public cloud. We could have had "cloud.vmware.com" or whatever 15 years ago, but they decided to slowly jack up the price on their enterprise customers instead.

For comparison, in Azure: You can't add a VM to an availability set (keep apart rule) or remove the VM from it without a stop-start cycle. You can't make most changes (SKU, etc...) to a VM in an availability set without turning off every machine in the same AS! This is just one example of many where the public cloud has a "checkbox" availability feature that actually decreases availability. For a long time, changing an IP address in AWS required the VM to be basically blown away and recreated. That brought back memories of the Windows NT 4 days in the 1990s, when an IP change required a reboot cycle.


You're right to say that VMware has the right fundamental building blocks and that they are mature enough (especially the compute aspect).

But I think you underestimate the maturity and effectiveness of the underlying google compute and storage substrate.

(FWIW, I worked at both places)

Now, how Google's substrate maps onto GCP, that's another story. There is a non-trivial amount of fluff to be added on top of your building blocks to build a manageable multitenant planet-scale cloud service. Just the network infrastructure is mind-boggling.

I wouldn't be surprised if your experience with a "VMware cloud" would surprise you if you naively compare it with your experience with a standalone vsphere cluster.


So, I used to be a part time vSphere admin, worked with many others, and had to automate the hell out of it to deal as little as possible with that dumpster fire.

No, VMware didn't miss the boat, vCloud Air was announced in 2009 and made generally available in 2013. Roughly same timelines as Azure and GCP, slightly trailing AWS, and those were the early days, where the public cloud was still exotic. And VMware had the massive advantage of brand recognition in that domain and existing footprint with enterprises which could be scaled out.

Problem was, vCloud Air, like vSphere, was shit. Yeah, it did some things well, and had some very nice features - vMotion, DRS (though it doesn't really use CPU ready contention for scheduling decisions which is stupid), vSAN, hot adding resources (but not RAM, because decades ago Linux had issues if you had less than 4GB RAM and you added more, so to this day you can't do that). When they worked, because when they didn't, good luck because error messages are useless, logs are weirdly structured and uselessly verbose, so a massive pain to deal with. Oh and many of those features were either behind a Flash UI (FFS), or an abomination of an API that is inconsistent ("this object might have been deleted or hasn't been created yet") and had weird limitations like when you have an async task you can't check its status details. And many of those features were so complex that a random consuming user basically had to rely on a dedicated team of vExperts, which often resulted in a nice silo slowing everyone down.

Their hardware compatibility list was a joke - the Intel X710 NIC stayed on it for more than a year with a widely known terribly broken driver.

But what made VMware fail the most, IMHO, was the wrong focus, technically - VM, instead of application. A developer/ops person couldn't care less about the object of a VM. Of course they tried some things like vApp and vCloud Director etc. which are just disgusting abominations designed with a PowerPoint in mind, not a user. And pricing. Opaque and expensive, with bad usability. No wonder everyone jumped on the pay as you go, usable alternatives.


> many of those features were either behind a Flash UI(FFS)

My introduction to the industry. The memories.


Are you sure, because AWS consistently requires me to migrate to a different host. They go as far as shutting down instances, but don't do any kind of live migrations.

EC2 does not have live migration. On Azure it’s spotty, so not every maintenance can offer it.

EC2 does support live migration, but it's not public and only for certain instance types/hypervisors.

See: https://news.ycombinator.com/item?id=17815806


Here's a comment that I made in a past thread.

https://news.ycombinator.com/item?id=26650082


My experience running c5/6 instances makes me very confident EC2 doesn’t do live migration for these. FWIW, GCP live migration on latency-sensitive workloads is very noticeable and oftentimes straight up causes an instance crash.

Intrigued by this observation. What is it about your experience that leads you to conclude that EC2 doesn't do live migration?

And could it be phrased differently as "EC2 doesn't do live migration badly"?


Mainly the barrage of "instance hardware degradation" emails that I get, whereas on GCP those are just migrated (sometimes with a reboot/crash). Also there is no brownout. I've never used t2/3s, which apparently do support migration, which would make sense.

After some kinds of hardware failure, it can become impossible to do live migration safely. When a crash can ensue due to a live migration from faulty HW, I'd argue that it's much better to not attempt it.

Linode also uses live migrations now for most (all?) maintenance.

How does migrating a vm to another physical machine work?

VMware has been doing this for years, it's called vmotion and there is a lot of documentation about it if you are interested (eg https://www.thegeekpub.com/8407/how-vmotion-works/ )

Essentially, memory state is copied to the new host, the VM is stunned for a millisecond and the CPU state is copied and resumed on the new host (you may see a dropped ping). All the networking and storage is virtual anyway so that is "moved" (it's not really moved) in the background.


The clever trick here is that they'll pre-copy most of the memory without bothering to do it consistently, but mark pages that the source had written to as "dirty". The network cutover is stop-the-world, but VMware doesn't copy the dirty pages during the stop. Instead, it simply treats them as "swapped to pagefile", where the pagefile is actually the source machine memory. When computation resumes at the target, the source is used to page memory back in on-demand. This allows very fast cutovers.

> VM is stunned for a millisecond

This conjures up hilarious mental imagery, thanks


You just bop it on the head, and move it to the new machine quickly. By the time the VM comes to it won't even realize that it is in a new home.

Up to 500ms per your source, depending on how much churn there is in the memory from the source system.

Very cool.


That is really interesting I didn't realize it was so fast. Thanks for the post I will give it a read!

This blog post is pretty old (2015) but gives a good introduction.

https://cloudplatform.googleblog.com/2015/03/Google-Compute-...


Thanks for sharing, I will give it a read!

They pause your VM, copy everything about its state over to the new machine, and quickly start the other instance. It's pretty clever. I think there are tricks you can play with machines that have large memory footprints to copy most of it before the pause, and only copy what has changed since then during the pause.

The disks are all on the network, so no need to move anything there.


In reality it syncs the memory to the other host first, and only pauses the VM when the last state sync is small enough that the pause is barely measurable.

Indiana Jones and the register states

When it's transferring the state to the target, how does it handle memory updates that are happening at that time? Is the program's execution paused at that point?

No, but the memory accesses have hooks that say "This memory was written". Then, program execution is paused, and just the sections of memory that were written are copied again.

This has memory performance characteristics - I ran a benchmark of memory read/write speed while this was happening once. It more than halved memory speed for the 30s or so it took from migration started to migration complete. The pause, too, was much longer.


Ahh I think that was the piece I was missing, thanks! I didn't realize there were hooks for tracking memory changes.

No, they keep track of dirty pages.

Stream the contents of RAM from source to dest, pause the source, reprogram the network and copy any memory that changed since the initial stream, resume the dest, destroy the source, profit.
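To make the dirty-page idea concrete, here's a toy simulation of the pre-copy loop described in this subthread. It's only a sketch: the function name, the `dirty_ratio` knob and the stop threshold are made up for illustration, and no real hypervisor is this simple.

```python
import random


def live_migrate(source, dirty_ratio=0.3, stop_threshold=32, max_rounds=30, seed=0):
    """Toy pre-copy migration of `source` (a list of page contents).

    Returns (dest, rounds, pages_copied_while_paused).
    """
    rng = random.Random(seed)
    dest = [None] * len(source)
    dirty = set(range(len(source)))  # nothing copied yet, so every page is "dirty"
    rounds = 0

    # Pre-copy phase: the guest keeps running while pages are streamed.
    while len(dirty) > stop_threshold and rounds < max_rounds:
        to_copy, dirty = dirty, set()
        for page in to_copy:
            dest[page] = source[page]
        # Simulate guest writes that land during this round; the longer the
        # round (more pages copied), the more writes happen.
        for _ in range(int(len(to_copy) * dirty_ratio)):
            page = rng.randrange(len(source))
            source[page] += 1
            dirty.add(page)  # the write hook marks the page dirty
        rounds += 1

    # Stop-the-world phase: the guest is paused, so no new writes can occur.
    paused_pages = len(dirty)
    for page in dirty:
        dest[page] = source[page]
    return dest, rounds, paused_pages


memory = list(range(1000))
dest, rounds, paused_pages = live_migrate(memory)
print(f"pre-copy rounds: {rounds}, pages copied during pause: {paused_pages}")
```

The point is that each round only has to re-send what was written during the previous one, so the set left for the pause shrinks as long as the guest dirties memory slower than it can be copied. A write-heavy guest (a `dirty_ratio` near 1 here) never converges, and the pause gets long.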

vsphere vmotion has been a thing for years lmao

EBS is already replicated so they probably just migrate behind the scenes, same as if the original physical disk was corrupted. It looks like only certain conditions allow this kind of migration.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/modify-v...


Assuming this blurb is accurate: " General-purpose SSD volume (gp3) provides the consistent 125 MiB/s throughput and 3000 IOPS within the price of provisioned storage. Additional IOPS (up to 16,000) and throughput (1000 MiB/s) can be provisioned with an additional price. The General-purpose SSD volume (gp2) provides 3 IOPS per GiB storage provisioned with a minimum of 100 IOPS"

... then it seems like a device that limits bandwidth either on the storage cluster or between the node and storage cluster is present. 125MiB/s is right around the speed of a 1gbit link, I believe. That it was a networking setting changed in-switch doesn't seem to be surprising.
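Quick sanity check on the "right around 1gbit" arithmetic. This is just unit conversion; the only assumption is that the link speed is quoted in decimal gigabits, as is usual for networking.

```python
MIB = 2**20  # bytes in one mebibyte

# gp3 baseline throughput from the blurb, converted to decimal gigabits per second.
gp3_baseline_bytes_per_s = 125 * MIB
gp3_baseline_gbit_per_s = gp3_baseline_bytes_per_s * 8 / 1e9

# A 1 Gbit/s link expressed in MiB/s, ignoring protocol overhead.
gigabit_link_mib_per_s = 1e9 / 8 / MIB

print(f"125 MiB/s = {gp3_baseline_gbit_per_s:.3f} Gbit/s")  # 1.049 Gbit/s
print(f"1 Gbit/s  = {gigabit_link_mib_per_s:.1f} MiB/s")    # 119.2 MiB/s
```

So 125 MiB/s is slightly more than a raw 1 Gbit/s link can carry; 125 MB/s (decimal) would be an exact match.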


This would have been my guess. All EBS volumes are stored on a physical disk that supports the highest bandwidth and IOPS you can live migrate to, and the actual rates you get are determined by something in the interconnect. Live migration is thus a matter of swapping out the interconnect between the VM and the disk or even just relaxing a logical rate-limiter, without having to migrate your data to a different disk.

The actual migration is not instantaneous despite the volume being immediately reported as gp3. You get a status change to "optimizing" with a percentage, if my memory is correct. And the larger the volume, the longer it takes, so there is definitely a sync to faster storage.

Changing the volume type on AWS is somewhat magical. Seeing it happen online was amazing.

If I remember right they use the equivalent of a ledger of changes to manage volume state. So in this case, they copy over the contents (up to a certain point in time) to the new faster virtual volume, then append and direct all new changes to the new volume.

This is also how they are able to snapshot a volume at a certain point in time without having any downtime or data inconsistencies.


Look up AWS Nitro on YouTube if you are interested in learning more about it.

Having been a high-scale AWS user with a bill of $1M+/month, and now having worked for 2 years with a company which uses GCP, I would say AWS is superior and way ahead.

** NOTE: If you're a low scale company this won't matter to you **

1. GKE

When you cross a certain scale certain GKE components won't scale with you and SLOs on those components are crazy, it takes 15+ mins for us to update a GKE ingress controller backed Ingress.

Cloud Logging hasn't been able to keep up with our scale, so it has been disabled for 2 years now. This last Q we got an email from them to enable it and try it again on our clusters; we still have to confirm these claims, as our scale is even higher now.

Konnectivity agent release was really bad for us, it affected some components internally, total dev time we lost was more than 3 months debugging this issue. They had to disable konnectivity agent on our clusters, I had to collect TCP dumps and other evidence just to prove nothing was wrong on our end, and fight with our TAM to get a meeting with the product team. After 4 months they agreed and reverted our clusters to SSH tunnels. Initially GCP support said they can't do this. Next Q I'll be updating the clusters, hopefully they have fixed this by then.

2. Support.

I think AWS support was always more proactive in debugging with us; GCP support agents most of the time lack the expertise or proactiveness to debug/solve things even in simple cases. We pay for enterprise support and don't see ourselves getting much from them. At AWS we had reviews of the infra and how we could better it every 2 Qs, where we got new suggestions, and that was also the time when we shared what we would like to see in their roadmap.

3. Enterprisyness is missing from the design

A simple thing as cloudbuild doesn't have access to static IPs. We have to maintain a forward proxy just cause of this.

L4 LBs were a mess: you could only use specified ports in an (L4 LB) TCP proxy. For a TCP proxy based load balancer, the allowed set of ports was [25, 43, 110, 143, 195, 443, 465, 587, 700, 993, 995, 1883, 3389, 5222, 5432, 5671, 5672, 5900, 5901, 6379, 8085, 8099, 9092, 9200, and 9300]. Today I see they have removed these restrictions. I don't know who came up with this idea to allow only a few ports on an L4 LB. I think such design decisions make it less Enterprisy.


Unclear what the article has to do with reliability. Yes, spinning up machines on GCP is incredibly fast and has always been. AWS is decent. Azure feels like I'm starting a Boeing 747 instead of a VM.

However, there's one aspect where GCP is a clear winner on the reliability front. They auto-migrate instances transparently and with close to zero impact to workloads – I want to say zero impact but it's not technically zero.

In comparison, in AWS you need to stop/start your instance yourself so that it will move to another hypervisor (depending on the actual issue AWS may do it for you). That definitely has impact on your workloads. We can sometimes architect around it but there's still something to worry about. Given the number of instances we run, we have multiple machines to deal with weekly. We get all these 'scheduled maintenance' events (which sometimes aren't really all that scheduled), with some instance IDs (they don't even bother sending the name tag), and we have to deal with that.

I already thought stop/start was an improvement on tech at the time (Openstack, for example, or even VMWare) just because we don't have to think about hypervisors, we don't have to know, we don't care. We don't have to ask for migrations to be performed, hypervisors are pretty much stateless.

However, on GCP? We had to stop/start instances exactly zero times, out of the thousands we run and have been running for years. We can see auto-migration events when we bother checking the logs. Otherwise, we don't even notice the migration happened.

It's pretty old tech too:

https://cloudplatform.googleblog.com/2015/03/Google-Compute-...


EC2 live migrates instances too. Not sure where we are with rollout across the fleet.

The reason, from what I understand, why GCP does live migration more is because ec2 focused on live updates instead of live migration. Whereas GCP migrates instances to update servers, ec2 live updates everything down to firmware while instances are running.

Curious, what instance types are you using on EC2 that you see so much maintenance?


> Curious, what instance types are you using on EC2 that you see so much maintenance

We use a bunch of different types. M5 and R5 (different sizes) are the most commonly used types but we use many different families. I haven't done an analysis to figure out which types are hotspots.

This is across thousands of instances over many regions worldwide. The percentage is low, but that still translates to daily maintenance alerts.


> Yes, spinning up machines on GCP is incredibly fast and has always been. AWS is decent.

FWIW this article is saying the opposite--it's AWS that beats GCP in startup speed.


This article states that GPU instances are slower on GCP - it doesn’t make any claims about non-GPU instances.

I always wondered why you couldn't do that on AWS, mainly because I could do it at home with Hyper-V a decade ago.

https://learn.microsoft.com/en-us/previous-versions/windows/...


> Azure feels like I'm starting a Boeing 747 instead of a VM.

Huh... interesting, this has not been my experience with Azure VM launch times. I'm usually surprised how quickly they pop up.


Depends on your disks.

Premium SSD allows 30 minutes of "burst" IOPS, which can bring down boot times to about 2-5 seconds for a typical Windows VM. The provisioning time is a further 60-180 seconds on top. (The fastest I could get it is about 40 seconds using a "smalldisk" image to ephemeral storage, but then it took a further 30 seconds or so for the VM to become available.)

Standard HDD was slow enough that the boot phase alone would take minutes, and then the VM provisioning time is almost irrelevant in comparison.


I wouldn't call this reliability, which already has a loaded definition in the cloud world, and instead something along time-to-start or latency or something.

It is though based on a specific definition. If X doesn't do Y based on Z metric with a large standard deviation and doesn't meet spec limits, it is not reliable as per the predefined tolerance T.

  X = Compute instances
  Y = Launch
  Z = Time to launch
  T = LSL (N/A), USL (10s), Std Dev (2s)
Where LSL is lower spec limit, USL is upper spec limit. LSL is N/A since we don't care if the instance launches instantly (0 seconds).

You can define T as per your requirements. Here we are ignoring the accuracy of the clock that measures time, assuming that the measurement device is infinitely accurate.

If your criterion is, for example, to define reliability as how fast it shuts down, then this article isn't relevant. The article is pretty narrow in testing reliability; they only care about launch time.
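The X/Y/Z/T definition above can be written down as a small check. This is a rough sketch: the function name is mine and the sample launch times are invented, only the USL of 10s and the std dev limit of 2s come from the comment.

```python
from statistics import stdev


def within_tolerance(launch_times_s, usl_s=10.0, max_std_dev_s=2.0):
    """True if every launch meets the upper spec limit and the spread is in tolerance.

    There is no lower spec limit: an instant launch is fine.
    """
    return max(launch_times_s) <= usl_s and stdev(launch_times_s) <= max_std_dev_s


# Hypothetical launch times in seconds, not measurements from the article.
steady = [7.9, 8.4, 8.1, 9.0, 8.6]
erratic = [6.0, 7.5, 31.0, 8.2, 12.4]

print(within_tolerance(steady))   # True
print(within_tolerance(erratic))  # False
```

Change `usl_s` and `max_std_dev_s` to whatever T you care about; the verdict depends entirely on that choice, which is kind of the point.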


Reliability in general is measured on the basic principle of: does it function within our defined expectations? As long as it's launching, and it eventually responds within SLA/SLO limits, and on failure comes back within SLA/SLO limits, it is reliable. Even with GCP's multiple failures to launch, that may still be considered "reliable" within their SLA.

If both AWS and GCP had the same SLA, and one did better than the other at starting up, you could say one is more performant than the other, but you couldn't say it's more reliable if they are both meeting the SLA. It's easy to look at something that never goes down and say "that is more reliable", but it might have been pure chance that it never went down. Always read the fine print, and don't expect anything better than what they guarantee.


> In total it scaled up about 3,000 T4 GPUs per platform

> why I burned $150 on GPUs

How do you rent 3000 GPUs over a period of weeks for $150? Were they literally requisitioning it and releasing it immediately? Seems like this is quite an unrealistic type of usage pattern and would depend a lot on whether the cloud provider optimises to hand you back the same warm instance you just relinquished.

> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator

it's quite fascinating that GCP can do this. GPUs are physical things (!) do they provision every single instance type in the data center with GPUs? That would seem very expensive.


It probably live-migrates your VM to a physical machine that has a GPU available.

...if there are any GPUs available in the AZ that is. I had a hell of a time last year moving back and forth between regions to grab just 1 GPU to test something. The web UI didn't have a "any region" option for launching VMs so if you don't use the API you'll have to sit there for 20 minutes trying each AZ/region until you managed to grab one.
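If you do go through the API, the "try each zone until one works" loop is short. A sketch only: `NoCapacity` and `fake_create` are stand-ins I made up, and you'd replace `fake_create` with whatever your provider's client library uses to create the instance.

```python
class NoCapacity(Exception):
    """Hypothetical error raised when a zone has no free GPUs."""


def first_zone_with_capacity(zones, try_create):
    """Call try_create(zone) for each zone in order; return the first success."""
    for zone in zones:
        try:
            return zone, try_create(zone)
        except NoCapacity:
            continue  # this zone is out of GPUs, move on to the next one
    return None, None


def fake_create(zone):
    # Stand-in for a real API call; pretend only one zone has GPUs free.
    if zone != "us-central1-a":
        raise NoCapacity(zone)
    return f"gpu-test-in-{zone}"


zone, instance = first_zone_with_capacity(
    ["us-east1-b", "us-east4-a", "us-central1-a"], fake_create
)
print(zone, instance)  # us-central1-a gpu-test-in-us-central1-a
```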


GPUs are physical but VMs are not; I expect they just move them to a host with a GPU.

Unlikely. More likely they put your VM on a host with GPU attached, and use live migration to move workloads around for better resource utilization.

However, live-migration can cause impact to HPC workloads.


> $150

Was asking myself the same question. From the pricing information on gcp it seems minimum billing time is 1 minute, making 3000 GPUs cost $50 minimum. If this is the case then $150 is reasonable for the kind of usage pattern you describe.
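Spelling out that arithmetic, in case it helps. The hourly rate below is only what the $50 figure implies (3000 one-minute launches for $50); it is not a price quoted from GCP's pricing page.

```python
def minimum_cost_usd(launches, hourly_rate_usd, min_billed_minutes=1):
    """Lowest possible bill if every launch is released within the minimum billing window."""
    return launches * min_billed_minutes / 60 * hourly_rate_usd


gpu_hours = 3000 * 1 / 60             # 3000 launches billed 1 minute each
implied_hourly_rate = 50 / gpu_hours  # rate implied by "$50 minimum"

print(gpu_hours)                                    # 50.0
print(implied_hourly_rate)                          # 1.0
print(minimum_cost_usd(3000, implied_hourly_rate))  # 50.0
```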


AWS has different pools of EC2 instances depending on the customer, the size of the account and any reservations you may have.

Spawning a single GPU at varying times is nothing. Try spawning more than one, or using spot instances, and you’ll get a very different picture. We often run into capacity issues with GPU and even the new m6i instances at all times of the day.

Very few realistic company size workloads need a single GPU. I would willingly wait 30 minutes for my instances to become available if it meant all of them were available at the same time.


This is great.

I have always been feeling there is so little independent content on benchmarking the IaaS providers. There is so much you can measure in how they behave.


Heard from a Googler that the internal infrastructure (Borg) is simply not optimized for quick startup. Launching a new Borg job often takes multiple minutes before the job runs. Not surprising at all.

Echoing this. The SRE book is also highly revealing about how Google request prioritization works. https://sre.google/sre-book/load-balancing-datacenter/

My personal opinion is that Google's resources are more tightly optimized than AWS and they may try to find the 99% best allocation versus the 95% best allocation on AWS.. and this leads to more rejected requests. Open to being wrong on this.


A well-configured isolated borg cluster and well-configured job can be really fast. If there's no preemption (IE, no other job that is kicked off and gets some grace period), the packages are already cached locally, and there is no undue load on the scheduler, the resources are available, and it's a job with tasks, rather than multiple jobs, it will be close to instantaneous.

I spent a significant fraction of my 11+ years there clicking Reload on my job's borg page. I was able to (re-)start ~100K jobs globally in about 15 minutes.


Psh someone's bragging about not being at batch priority.

I ran at -1

This is mostly not true in cases where resources are actually available (and in GCE if they're not the API rejects the VM outright, in general). To the extent that it is true for Borg when the job schedules immediately, it's largely due to package (~container layers, ish) loading. This is less relevant today (because reasons), and also mostly doesn't apply to GCE as the relevant packages are almost universally proactively made available on relevant hosts.

The origin for the info that jobs take "minutes" likely involves jobs that were pending available resources. This is a valid state in Borg, but GCE has additional admission control mechanisms aimed at avoiding extended residency in pending.

As dekhn notes, there are many factors that contribute to VM startup time. GPUs are their own variety of special (and, yes, sometimes slow), with factors that mostly don't apply to more pedestrian VM shapes.


As another comment points out, GPU resources are less common so it takes longer to create, which makes sense. In general, start up times are pretty quick on GCP as other comments also confirm.

booting VMs != starting a borg job.

The technology may be different but the culture carries over. People simply don't have the habit to optimize for startup time.

Borg is not used for gcp vms, though.

It is used but borg scheduler does not manage vm startups

Is this testing for spot instances?

In my limited experience, persistent (on-demand) GCP instances always boot up much faster than AWS EC2 instances.


I noticed that too and it does appear to be using spot instances. I have a feeling if it was run without them you may see much better startup times. Spot instances on GCP are hit and miss and you sort of have to build that into your workflow.

In my experience GPU persistent instances often simply don't boot up on GCP due to lack of available GPUs. One reason I didn't choose GCP at my last company.

Oh interesting. Which region and GPU type were you working with? (Asking so I can avoid in future)

I think it was us-east1 or us-east4. Had issues getting TPUs as well in us-central1. I know someone at a larger tech company who was told to only run certain workflows in a specific niche European region as that's the only one that had any A100 GPUs most of the time.

Worth pointing out that the article is measuring provisioning latency and success rates (how quickly can you get a GPU box running and whether or not you get an error back from the API when you try), and not "reliability" as most readers would understand it (how likely they are to do what you want them to do without failure).

Definitely seems like interesting info, though.


That's interesting but not what I expected when I read "reliability". I would have expected SLO metrics like uptime of the network or similar metrics that users would care about more. Usually when scaling a system that's built well you don't have hard short constraints on how fast an instance needs to be spun up. If you are unable to spin up any that can be problematic of course. Ideally this is all automated so nobody would care much about whether it takes a retry or 30s longer to create an instance. If this is important to you, you have other problems.

The article only talks about GPU start time, but the title is "CloudA vs CloudB reliability"

bit of a stretch, right


> These differences are so extreme they made me double check the process. Are the "states" of completion different between the two clouds? Is an AWS "Ready" premature compared to GCP? It anecdotally appears not; I was able to ssh into an instance right after AWS became ready, and it took as long as GCP indicated before I was able to login to one of theirs.

This is a good point and should be part of the test: after launching, SSH into the machine and run a trivial task to confirm that the hardware works.
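Something along these lines could be bolted onto the test harness. It's a sketch under assumptions: the host string and the choice of `nvidia-smi` as the "trivial task" are mine, and it assumes key-based SSH access is already set up. It uses `time.monotonic()` so a clock adjustment can't skew the measurement.

```python
import subprocess
import time


def wait_until_usable(host, command="nvidia-smi", timeout_s=300, poll_s=5,
                      run=subprocess.run):
    """Return seconds until `command` succeeds over SSH on `host`, or None on timeout.

    `run` is injectable so the loop can be exercised without a real machine.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            result = run(
                ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, command],
                capture_output=True, text=True, timeout=30,
            )
            if result.returncode == 0:
                return time.monotonic() - start
        except subprocess.TimeoutExpired:
            pass  # SSH hung; treat it as "not ready yet" and retry
        time.sleep(poll_s)
    return None


# Example (hypothetical address): seconds = wait_until_usable("user@203.0.113.10")
```

Measuring from the create call to the first successful command would compare what users actually wait for, instead of whatever each cloud calls "ready".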


> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed.

That would seem to indicate that asking for a VM on GCP gets you a minimally configured VM on basic hardware, and then it gets migrated to something bigger if you ask for more resources. Is that correct?

That could make sense if, much of the time, users get a VM and spend a lot of time loading and initializing stuff, then migrate to bigger hardware to crunch.


This is not quite true - GPU's are limited to select VM types, and the number of GPU's you have influences the maximum number of cores you can get. In general they're only available on the n1 instances (except the a100's, but those are far less popular)

We have constant autoscaling issues because of this in GCP - glad someone plotted this - hope people in GCP will pay a bit more attention to this. Thanks to the OP!

This is all about cloud GPUs, I was expecting something totally different from the title.

This is not reliability. This is a measure of how much spare capacity AWS seems to be leaving idle for you to snatch on-demand.

This is going to vary a lot based on the time of year. Why don't you try this same experiment at around some time when there's a lot of retail sales activity (Black Friday), and watch AWS suddenly have much less capacity to dole out on-demand.

To me reliability is a measure of what a cloud does compared to what it says it will do. GCP is not promising you on-demand instances instantaneously, is it? If you want that ... reserve capacity.


AWS normally has machines sitting idle just waiting for you to use. That's why they can get you going in a couple of seconds.

GCP on the other hand fills all machines with background jobs. When you want a machine, they need to terminate a background job to make room for you. That background job has a shutdown grace time. Usually thats 30 seconds.

Sometimes, to prevent fragmentation, they actually need to shuffle around many other users to give you the perfect slot - and some of those jobs have start-new-before-stop-old semantics - that's why sometimes the delay is far higher too.


Borg implements preemption, but the delay to start VMs is not because they are waiting for a background task to clean up.

It may or may not matter for various use cases, but the EC2 instances in the test use EBS and the AMIs are lazily loaded from S3 on boot. So it may be possible that the boot process touches few files and quickly gets to 'ready' state, but you may have crummy performance for a while in some cases.

I haven't used GCP much, but maybe they load the image onto the node prior to launch, accounting for some of the launch time difference?


Thank you for this article, it confirms my direct experience. I've never run a benchmarking test, but I can see this every day.

I'd say that's a weak test of capacity. Would love to see on Azure - T4s or an equiv aren't even really provided anymore!

We find reliability a diff story. Eg, our main source of downtime on Azure is they restart (live migrate?) our reserved T4s every few weeks, causing 2-10min outages per GPU per month.


Anybody know if on GCP the cheaper ephemeral spot instances are available on managed instance groups and cloud run, where it spins up more instances according to demand, and how well it deals with replacing spot instances that drop dead, if so? How about AWS?

I wish Azure was here to round it out!

It's meant to say "ephemeral"... right? It's hard to read after that.

ephemeral and ethereal are commonly confused words.

I guess that's fair. It's sort of a smell when someone uses the wrong word (especially in writing) though. It suggests they aren't in industry, throwing ideas around with other folks. The word "ephemeral" is used extensively amongst software engineers.

Ephemerides really throws them. (And thank God for PyEphem, which makes all that otherwise quite fiddly stuff really easy...)

The author is using 'Quantile' which I hadn't heard of before, and when I did, it seems like it actually should be 'Percentile'. Percentiles are the percentages, which is what the author is referring to.

Quantiles are a generic term for percentiles, deciles, quartiles etc. Percentiles would have been a more precise term.
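
To illustrate the terminology with Python's standard library, a small sketch. The launch times below are made-up values, not the article's data:

```python
import statistics

# Hypothetical launch durations in seconds; not taken from the article.
launch_seconds = [11, 12, 12, 13, 14, 15, 18, 25, 40, 95]

# statistics.quantiles returns the n - 1 cut points that split the data
# into n equal-probability groups. Quartiles, deciles and percentiles
# are all quantiles with different n.
quartiles = statistics.quantiles(launch_seconds, n=4)      # 3 cut points
deciles = statistics.quantiles(launch_seconds, n=10)       # 9 cut points
percentiles = statistics.quantiles(launch_seconds, n=100)  # 99 cut points

median_from_quartiles = quartiles[1]       # 2nd quartile
median_from_deciles = deciles[4]           # 5th decile
median_from_percentiles = percentiles[49]  # 50th percentile
```

The 2nd quartile, 5th decile and 50th percentile are the same point, so "quantile" in the article is not wrong, just less specific than "percentile".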

It's not surprising. Amazon is an amazingly customer-focused company. Google is a spyware company that only wants to make more money by invading our privacy. Of course Amazon products will be better than Google's.

I would love to see the same for deploying things like a cloud/lambda function.

> This is particularly true for GPUs, which are uniquely squeezed by COVID shutdowns, POW mining, and growing deep learning models

Is the POW mining part true any more? Hasn't mining moved to dedicated hardware?


Bitcoin mining has used dedicated hardware for a long time. But I believe Ethereum mining used GPUs before the very recent proof-of-stake update.

This benchmark (too) is probably incorrect. It produces 409s, so there are errors in there that I doubt are caused by GCP.

Would be interested to see a comparison of lambda functions vs google 2nd gen functions. I think that gcp is more serverless focused

this doesn't really seem like a fair comparison, nor is it a measure of "reliability".

It seems entirely fair to me, but the term "reliability" has a few different angles. This time it's not about working or not working, but the ability to auto-scale by invoking resources on the spot, which can be a very real requirement.

unless you're willing to burn $150 a quarter doing this exact assessment, it tells you nothing other than the data center conditions at the time of running.

it would be like doing this in us-central1 when us-central1 is down for one provider, and not another, resulting in increased latency, and saying how much faster one is than the other.

unlike say a throughput test or similar, neither of these services promise particular cold-starts, and so the numbers here cannot be contextualized against any metric given by either company and so are only useful in the sense that they can be compared, but since there are no guarantees the positions could switch anytime.

that's why I like comparisons between serverless functions where there are pretty explicit SLAs and what not given by each company for you to compare against, as well as one another.


Given the stark contrast and that the pattern was identical every day over a two-week course, it tells me we're observing a fundamental systemic difference between GCP and AWS - and I think that's all the author really wanted to point out. I would not be surprised if the results are replicable three months from now.

Should probably change the title to "AWS vs GCP on-demand GPU launch time consistency"

Yep. Author colloquially meant, can I rely on a quick start.

Can you put this in the context of the problem/use case/need you are solving for?

Looks like the author has never heard of the word "histogram"

That graph is a pain to see.


A histogram would take away one of the dimensions, probably time, unless they resorted to some weird stacked layout. Without time, people would complain that they don't know if it was consistent across the tested period. The graph is fine.

What would you expect? AWS is an org dedicated to giving customers what they want and charging them for it, while GCP is an org dedicated to telling customers what they want and using the revenue to get slightly better cost margins on Intel servers.

I don’t believe this reasoning has been used since at least Diane

I haven't seen any real change from Google about how they approach cloud in the past decade (first as an employee and developer of cloud services there, and now as a customer). Their sales people have hollow eyes

Didn’t say they were good at making the switch

They don't really tell us what we want, we just buy what we need. Might work for you.

The link is broken?

Works for me using Firefox in Germany, although the article doesn't really match the title so maybe that's why you were confused? :p

Thanks for the report. It only confirms my judgment.

The word "Google" attached to anything is a strong indicator that you should look for an alternative.


... why does the first graph show some instances as having a negative launch time? Is that meant to indicate errors, or has GCP started preemptively launching instances to anticipate requests?

Perhaps if you read the line directly above the graph you would see it was explained and would not have to ask this question.

I don't know how that value (looks like -50?) was chosen, but it seems to correspond to the launch failures.

The y axis here measures duration that it took to successfully spin up the box, where negative results were requests that timed out after 200 seconds. The results are pretty staggering
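
A rough sketch of how such a plotting convention could be implemented. The 200 second timeout is from the article's description; the -50 sentinel is only the value guessed at earlier in the thread, and the function name is hypothetical:

```python
TIMEOUT_SECONDS = 200   # from the article's description of the graph
FAILURE_SENTINEL = -50  # assumption: the thread only guesses at this value

def plot_value(duration_seconds, succeeded):
    """Map one launch attempt to the y value drawn on the graph.

    Successful launches keep their measured duration; failures and
    timeouts are pushed below the axis so they stay visible over time.
    """
    if not succeeded or duration_seconds is None:
        return FAILURE_SENTINEL
    if duration_seconds > TIMEOUT_SECONDS:
        return FAILURE_SENTINEL
    return duration_seconds
```

Under this convention a negative point never means a launch finished before it was requested, only that it did not finish in time.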


