
Azure Appears to Be Full - estensen
https://www.theregister.co.uk/2020/03/24/azure_seems_to_be_full/
======
sz4kerto
This is a bit like the financial crisis in 2008.

In 2008, the idea was that if you bundle up a large bunch of mortgages, then
the bundle will have low risk because the chance of everything failing at the
same time is low. The cloud is designed so that resource usage spikes of
individual customers can always be served because one customer is very small
compared to the whole infrastructure.

However, in some cases, these mortgages/resource spikes become highly
correlated.
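
A rough way to see the effect of correlation is a small Monte Carlo sketch.
Everything here is an illustrative assumption, not a model of any real
provider: the customer count, spike probability, capacity, and the idea of a
single "shared event" that makes every customer spike at once.

```python
import random


def overload_probability(n_customers, p_spike, capacity,
                         p_shared_event, trials=1000, seed=0):
    """Estimate how often simultaneous spikes exceed capacity.

    Normally each customer spikes independently with probability p_spike.
    With probability p_shared_event a trial contains a shared event, and
    then every customer spikes together (the fully correlated case).
    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    overloads = 0
    for _ in range(trials):
        if rng.random() < p_shared_event:
            spiking = n_customers  # everyone needs capacity at once
        else:
            spiking = sum(rng.random() < p_spike for _ in range(n_customers))
        if spiking > capacity:
            overloads += 1
    return overloads / trials


# Capacity of 60 is more than double the expected 25 concurrent spikes.
independent = overload_probability(500, 0.05, 60, p_shared_event=0.0)
correlated = overload_probability(500, 0.05, 60, p_shared_event=0.02)
print(f"independent spikes: {independent:.3f}")
print(f"with rare shared event: {correlated:.3f}")
```

With independent spikes the headroom is almost never exhausted; once a rare
shared event is allowed, the overload rate is driven almost entirely by how
often that event happens, no matter how small each customer is.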

~~~
sksksk
It's a pretty common model in many industries...

If every gym member visited the gym at the same time, they wouldn't all fit.
Only a small fraction of the members use the gym at any one time, so it works.

Banks would crash if everyone tried to withdraw their money at the same time,
but they don't, so the bank can loan the money out.

~~~
axlee
As a matter of fact, I cannot think of a single industry that can provide full
service to all of its clients at a single point in time.

~~~
Intermernet
Garbage collection. As in the people who pick up your bins. They do this on a
weekly basis.

~~~
tdrgabi
But if all their customers want their garbage picked up today, or everyday, it
won't work.

~~~
dragonwriter
More relevant to the cloud analogy, if all their customers wanted to purchase
a large extra pickup (beyond their normal baseline) on their normal day, which
is part of garbage service offerings, they wouldn't be able to accommodate it.

~~~
Intermernet
Aha, thank you for finding the flaw in my example! I lounge corrected ;-)

------
estensen
I think it's very problematic that a major cloud provider is unable to update
their status page, even when this has been ongoing for days. All green ticks
here: [https://status.azure.com/en-us/status](https://status.azure.com/en-us/status)

~~~
logicallee
How to Use a Cloud Provider Status Page:

1\. Enter the URL of the cloud provider's status page into your browser and
press enter.

2\. If the status page loads instantly, all services are go.

3\. If the status page takes between 2 and 5 seconds to serve, the cloud
provider is experiencing a slowdown.

4\. If the status page takes between 5 and 30 seconds to load, the cloud
provider is experiencing a major problem.

5\. If the status page takes between 30 seconds and 1 minute to load, requires
you to refresh before you can see it, or fails to load completely such as with
missing images, then the cloud provider is experiencing widespread problems in
multiple regions and has only sporadic availability.

6\. If the status page doesn't load at all, all services are down. Check the
CEO's twitter page.

7\. If the CEO's twitter page has a pinned tweet telling you not to worry,
then all of your data has been lost.

------
imeron
I heard from an insider that some Azure services had a 10x growth because of
the recent changes in our society. It's not like you can prepare for a 10x
hit.

My personal experience with our AWS CI infra is that it's struggling more and more
recently. Builds are slower on average than a couple of weeks ago. Maybe those
VCPUs are not the same VCPUs as yesterday ;D.

------
dmos62
There is something satisfying about the bounds of the cloud metaphor being
reached (a cloud can't fill the sky).

~~~
ironic_ali
A trip to Scotland will change your mind, fast...

~~~
arethuza
That's a bit harsh, we've had at least 3 days this year where it wasn't
raining!

------
redwood
AWS and GCP are not full

~~~
jeffhuys
Feels like an opportunity for building a service that uses AWS, GCP or Azure
based on which is cheapest at that moment + which is not "full"... Unless that
already exists.
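
A minimal sketch of the selection logic, assuming price and availability were
somehow already known. The `Offer` type, its fields, and the prices are all
hypothetical; in practice whether a provider is "full" may only show up when
an allocation attempt fails.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """Hypothetical snapshot of one provider's current offer."""
    provider: str
    hourly_price: float   # made-up price for a comparable VM
    has_capacity: bool    # assumed to be known; usually it is not


def pick_provider(offers: list[Offer]) -> Optional[Offer]:
    """Return the cheapest offer that still has capacity, or None."""
    available = [o for o in offers if o.has_capacity]
    return min(available, key=lambda o: o.hourly_price, default=None)


offers = [
    Offer("azure", 0.09, has_capacity=False),  # cheapest, but "full"
    Offer("aws", 0.10, has_capacity=True),
    Offer("gcp", 0.11, has_capacity=True),
]
choice = pick_provider(offers)
print(choice.provider if choice else "nothing available")
```

This only works for resources that look the same everywhere, such as plain
VMs.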

~~~
tyingq
Lowest common denominator though. If you can use just plain old VMs, there's
probably little value in using the big cloud vendors. Traditional hosting
would be loads cheaper.

~~~
dannyeei
Really? Do you mean running your own data center or what services are you
referring to?

~~~
tyingq
No, not running your own data center. Traditional server hosting. Rackspace,
Liquidweb, Packet.net and similar.

Meaning that if you're going to use the lowest common denominator, why not pay
fair market prices for egress and compute?

Any value in cloud is typically the services that are higher level than a VM.
Those services would be hard to put a generic multi-cloud facade in front of.
It would be brittle and bug ridden.

------
tasubotadas
That would explain why I sometimes have to wait ~1h until queued Azure DevOps
pipeline builds start.

~~~
taspeotis
They do document capacity problems with the hosted pools on their status page
[https://status.dev.azure.com/_history](https://status.dev.azure.com/_history)
e.g. from their last event:

> Capacity constraints due to increased demand stemming from the global health
> pandemic are causing pipeline delays when using our hosted pools. We are
> working on mitigations, but currently expect the issue to persist for at
> least the rest of 25-March peak hours. You can work around these issues by
> temporarily moving critical pipelines to self-hosted agents.

------
paulcarroty
Worked on a small Azure setup several weeks ago, and my experience will
probably be useful for other people:

Pro:

\- you can use shell in browser

\- traffic is cheaper compared to AWS

\- fast 1GbE network

Cons:

\- VM deploy is VERY slow, 2-3 minutes

\- no IPv6 out of the box; you need a balancer(!) and 4-5 non-trivial shell
commands

\- attaching new storage was an extremely painful experience

In general, Azure feels like just a middle-of-the-road cloud service.

------
JackPoach
Yep, people are finally realizing that 'cloud' isn't something magical and
limitless. It's just a bunch of servers, connected together, with each having
a limit as to how much data it can store and process.

~~~
satanspastaroll
I doubt anyone actually thinks that.

It's understandable to be surprised; it's not every day that everyone needs
resources at the same time, although some foresight a month earlier
couldn't have hurt.

------
Just1689
I think this introduces some interesting points to the DR and BCP
conversation.

Is it a safe bet that we can rely on the cloud to have capacity? Normally I
wouldn't doubt it but in this sort of situation it becomes more likely they
will be put under capacity stress.

Will the cloud vendors learn and build slack in? I think they're very lean
operations and maybe this kind of slack would damage the profitability too
much.

If the cloud vendors can't guarantee capacity (I suspect this will be the
conclusion), then what does that mean for our DR and BCP planning?

~~~
redis_mlc
> Normally I wouldn't doubt it but in this sort of situation

Then you're very misinformed.

As a cloud administrator, I see resource availability and account limits on a
weekly basis going back years.

I tell people:

\- to pre-provision at least some extra servers rather than wait for an
autoscaling operation to fail.

\- that new instance types often are rolled out gradually, and lead time is
often 1 month in AWS

\- that killing a 1000-node cluster then expecting to immediately rebuild it
often doesn't work.

\- for DR and BCP planning, each region (or AZ) should be able to handle
enough load at all times in case one region (or AZ) is unavailable. I've never
seen anybody do that, even after I told them, because cost.
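
The last point comes down to simple arithmetic, sketched here under the
assumption that load spreads evenly over the surviving regions (or AZs). The
function names and the example load are hypothetical.

```python
def per_region_capacity(peak_load: float, regions: int,
                        tolerated_failures: int = 1) -> float:
    """Capacity each region needs so the survivors can carry peak_load.

    Assumes load is spread evenly across the regions that remain.
    """
    surviving = regions - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving region")
    return peak_load / surviving


def overprovision_factor(regions: int, tolerated_failures: int = 1) -> float:
    """Total provisioned capacity divided by peak load."""
    return regions * per_region_capacity(1.0, regions, tolerated_failures)


# Hypothetical peak of 900 requests/s served from 3 regions.
print(per_region_capacity(900, 3))   # capacity needed in each region
print(overprovision_factor(3))       # total capacity relative to peak
print(overprovision_factor(2))       # the two-region case
```

With only two regions, each one has to be able to carry the full peak alone,
so total capacity is double what normal operation needs. That is where the
cost objection comes from.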

~~~
ldoughty
For AWS, limit monitor is a handy tool for "small" customers:

[https://aws.amazon.com/solutions/limit-monitor/](https://aws.amazon.com/solutions/limit-monitor/)

It starts having issues when you get to 5,000+ ec2 instances, but it's
somewhat understandable that they don't aim to support that level of usage
within a single AWS account.

On another bullet point: if you go serverless (API/HTTP Gateway, Lambda,
DynamoDB), you automatically get full region DR. I personally recommend HTTP
Gateway if you can swing it; API Gateway is only worth it if you are doing
personal projects (mostly free tier) or are seriously leveraging the API
Gateway-specific features.

~~~
yekta_
Seems like there's some confusion on what that one really does.

It only notifies you about your own Service Limits, so you will know before
you hit one in an unfortunate moment. It's important to monitor that, but it
doesn't protect or notify you against the cloud provider's own limitations. A
scale-out event can still fail if AWS has no more extra capacity ("full") even
if your limits would otherwise allow it.

AFAIK there's currently no way to know beforehand whether they actually have
the capacity.

------
rzmnzm
That's been a recurring issue with the UK South region ever since they
introduced it.

The Register even reported on it a couple of years ago:

[https://www.theregister.co.uk/2017/05/04/microsoft_azure_cap...](https://www.theregister.co.uk/2017/05/04/microsoft_azure_capacity_woes_hit_uk_customers/)

------
tyingq
I'm watching many struggle also with on-prem VPNs, Citrix, WebEx, and so on.
Though there do seem to be honest efforts to shore those up and also try more
modern tools. I imagine a lot of stodgy companies will have a much better WFH
environment after all the dust settles.

~~~
jiggawatts
I'm running around fixing load balancers, SSL gateways, NetScalers, converting
double-hop to single-hop access, upgrading key components, etc...

It's kinda fun, but it's also infuriating that 90% of our customers decided to
wait until the _week before the lockdown_, which we warned them _months_ ago
would be coming.

------
swebs
Are there any good guides on switching from Azure to AWS?

------
abafazi
I bet AWS or GCP will never have a problem like this because they're not
stupid enough to operate at only 20% spare capacity at any given time

~~~
NomDePlum
It would appear GCP may be having some problems too:
[https://news.ycombinator.com/item?id=22694710](https://news.ycombinator.com/item?id=22694710)

Coincidental?

------
rcarmo
Disclaimer: I work for Microsoft. I have no particular info on this, but I did
read the article _yesterday_ over breakfast, followed links to complaints,
etc., and would like to point out two things:

1\. This appears to be a UK-centric thing (and those datacenters don't have
the full Azure portfolio, as can be seen here:
[https://azure.microsoft.com/en-us/global-infrastructure/serv...](https://azure.microsoft.com/en-us/global-infrastructure/services/?regions=non-regional,united-kingdom-south,united-kingdom-west,europe-north,europe-west&products=virtual-machines))

2\. The very last paragraph on the linked article reads: "Note that Azure is a
huge service and it would be wrong to give disproportionate weight to a small
number of reports. Most of Azure seems to be working fine. That said, capacity
in the UK regions was showing signs of stress even before the current crisis,
so it is not surprising that issues are occurring now."

All of this is public info, so maybe people should read up on facts first? :)

~~~
tibiapejagala
Not sure if it is a UK-only issue. I couldn't start a VM yesterday in other
European regions. Today I have some other problems with Azure search services.

