Ask HN: Azure has run out of compute – anyone else affected?
651 points by janober on Nov 25, 2022 | 342 comments
Last week we at n8n ran into problems getting a new database from Azure. After contacting support, it turns out that we can’t add instances to our k8s cluster either. Azure has told us they'll have more capacity in April 2023(!) — but we’ll have to stop accepting new users in ~35 days if we don't get any more. These problems seem to affect only the German region, but setting up in a new region would be complicated for us.

We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.

Is anyone else experiencing these problems?




> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You're new to Azure I guess.

I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant documented lies and general incompetence.

One consumer-grade fiber link is enough to serve my company's traffic and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.


It's worth pointing out that every cloud is the same when it comes to capacity and capacity risk. They all put a lot of time and effort into figuring out the optimal amount of capacity to order, based on the track record of both customer demand and supply chain fulfilment.

Too much capacity is money spent getting no return, up front capex, ongoing opex, physical space in facilities etc.

On cloud scales (averaged out over all the customers) the demand tends to follow pretty stable and predictable patterns, and the ones that actually tend to put capacity at risk (large customers) have contracts where they'll give plenty of heads-up to the providers.

What has been very problematic over the past few years has been the supply chains. Intel's issues for a few years in getting CPUs out really hurt the supply chains. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.

The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even that predictable which supply chain is going to be affected. Some of them are running far smoother and faster and capacity lands far faster than you'd expect, while others are completely messed up, then next month it's all flipped around. They're being paranoid, assuming the worst and still not getting it right.

This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

The best you can do is be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.


In my experience there are differences between clouds so while all have the same basic problem in practice some may be better than others. I've never had issues getting GPUs on AWS but GCP constantly has issues with GPU/TPU capacity.


Is this region dependent? In us-east I can’t get them to approve a quota for GPU instance families (G,P) for anything more than 4 CPUs. At one point they rejected my request citing “unprecedented demand”. Of course this is small time, just my personal account.

It is true I can get an instance most of the time, but not if I need >16GiB GPU memory.


We've been having the same problem getting GPU instances on us-east. Multiple week-long delays to escalate and talk to yet the next person up who can make a decision. It's a mess.


There probably are different occurrence rates. We had to modify how our test suite provisions instances, since we used to regularly run into instance availability constraints on EC2 during the holidays.


I’ve occasionally seen some of the internal AWS capacity management dashboards, and they can frequently be operating very close to 100% on some resource types.


I worked on a project about a year ago where we would have a colleague in a different time zone start instances with 4 GPUs, because they would almost always be unavailable during regular work hours for us-east.


It may be a risk borne by every cloud provider, but why does this only really happen to Microsoft among large providers?

As far as chip shortages, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.

Microsoft had to know that at some point they were going to run out of capacity. They should've either done something about it or let customers know.


There are all sorts of examples of AWS failing to provide capacity too. Just do a search for "aws InsufficientInstanceCapacity" or similar. I remember Fortnite talking about capacity limits in relation to an incident, but I'm struggling to find the post-mortem I saw it in.

Even when Microsoft was being open about Azure having difficulty getting Intel chips, AWS, GCP etc. were in the same position and just not really talking about it. From my time in AWS there were some other times when some services with specialised hardware came really, really close to running out of capacity and had to scramble around with major internal "fire drills" against services to recoup capacity.

Most people won't run into these issues, the clouds all tend to be good at it, but they still happen.

There are also advantages of the economy of scale and brand recognition. The more customers you have the more the capacity trends smooth out, the easier it is to predict need, even if you're still stuck with uncertainty on the ordering side.


It’s certainly true I run into these things with AWS as well, but it’s generally limited to a specific instance type/availability zone combination. I’ve never had all instance types unavailable.

If anything, I’m surprised we can just spin up a few hundred instances out of nowhere and not run into capacity issues.


AWS has capacity issues you can generally mitigate. Azure however will just lock you out of a solution completely and tell you to switch regions as if that was some trivial thing.
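
To give an idea of what "mitigate" looks like in practice, a rough sketch (the AMI, subnet, and instance types below are just placeholders): loop over a few acceptable types and fall back whenever EC2 refuses the launch.

  # Try a list of acceptable instance types until one launches.
  for type in m5.2xlarge m5a.2xlarge m6i.2xlarge; do
    if aws ec2 run-instances \
        --image-id ami-12345678 \
        --instance-type "$type" \
        --subnet-id subnet-0abc1234 \
        --count 1; then
      echo "launched $type"
      break
    fi
    echo "no capacity for $type (e.g. InsufficientInstanceCapacity), trying next"
  done

You can do the same across AZs by varying the subnet. Azure gives you far fewer knobs like this when a whole region is constrained.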


They have a lot of technical debt. They have like 6 different clouds (at least 4 gov clouds alone) and SLA commitments to things like O365 that silo their infrastructure.

MS also makes all sorts of crazy deals and commitments, and I wouldn’t be surprised if being collocated with a strategic customer may lead to local shortages of resources.


AWS has at least 3 publicly-discussed 'clouds' (or partitions, as they're called at AWS). There may or may not be other partitions that cannot be discussed publicly.


There’s a pretty clean demarc between the AWS clouds. With Microsoft, because they have O365 and Azure AD dependencies sprinkled everywhere with varying features, it’s a real mess. So you can do a government contract with a device managed by Windows Autopilot & Intune in a commercial cloud, have email in a Gov Community Cloud, and deliver apps in a US Gov cloud, all with different identities etc.


> As far as chip shortages, it probably helps that Amazon makes its own chips.

IDK what chips you are talking about, all x86 (which I assume is most of their compute) is Intel or AMD. If they make their own they are only making the ARM instances.


AWS has three processors: Graviton, Inferentia, and Trainium. They're made in-house.

https://aws.amazon.com/silicon-innovation/


And none of the above are x86. Even if they're making their own silicon, it is for specialized use (ML) and not general server provisioning.


Amazon's own chips are ARM. ARM requires somewhat specialized builds of software that are likely different than development instances, CI/CD, and/or local dev machines. It's not insurmountable but does certainly complicate usage.


Your local dev machines might be Macs though, in which case it might be easier for you to go with ARM servers than x86.


They might be. My local dev machine is a Mac. I've found Intel or Intel+ARM container images; never an ARM only. Again, not insurmountable but certainly more resistance than the straight intel route.


> This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.


All cloud providers are NOT equal here. Amazon over-provisions and sells the excess capacity as spot instances.


So does google, so does azure etc. etc. https://cloud.google.com/spot-vms, https://azure.microsoft.com/en-us/products/virtual-machines/...

Spot instances exist just to try to turn over-provisioning into not a complete loss. You're at least making some money from your mistake.

edit: You should consider "spot instances" in general to be a failure as far as a cloud provider is concerned. It means you've got your guesses wrong. You always want a buffer zone, but not that much of a buffer zone. The biggest single cost for cloud providers is the per-rack OpEx, the cost of powering, cooling etc.


Cloud providers aren't guessing at demand to plan capacity, they're literally building new data centers and then wheeling new racks into them as fast as they physically can (short-term decisions are more likely made at the other end, e.g. when to retire old systems, not add new ones). AWS was born out of the fact that Amazon's own compute needs are inherently variable so to meet peak demand they had to "over-provision" compared to average demand--this in turn meant they had a lot of excess compute power most of the time. At the point when Amazon still was a dominant consumer of AWS, spot instances were actually a deliberate convenience to Amazon, since it meant AWS could monetize resources while still ensuring Amazon could claim them instantly when needed (later they added a two minute warning, but early on they could literally disappear at any moment, and regularly did).


You're talking to someone who has spent the last decade working for major cloud providers, including AWS, on infrastructure and services sides of things, including work around data feeds for the capacity management teams. I have more than a passing familiarity with the way things actually work at a cloud.

They are constantly guessing at cloud capacity. Short, medium, and long term models with forecasting galore, all under constant recalculation based on customer actions (they literally take live feeds of creation/termination actions), and yes they also take into account hardware failure and repair rates. Consolidating racks of equipment is a pain in the neck and tends to be avoided, unless you can safely live migrate away all instances.

They all build up various models, using all sorts of forecasting techniques. The longer range forecasts are involved in data center provisioning, along with other business analysis, market research, legal analysis etc. that helps define where future regions should be.

It's still a guess. They can't tell what the actual demand will be, and they can't tell what is going to happen with the supply chain (supply chain issues are the biggest nightmare for capacity planning teams). Sometimes they get it wrong.

The capacity management teams spend a lot of time and expertise to keep the company just sufficiently ahead of demand. It's a crucial part of keeping costs under control.


It's logistics no more and no less. Logistics has been a thing for ever (satisfy a resource requirement). My old man (is not a dustman) but he was Commander Supply for quite a lot of people. At one job, he and his staff would worry about things like Austrian plain chocolate covered mint centred frogs (I'm not joking) to Gurkha rice and not much else (some very concentrated protein etc) water-proofed combat rations. This was in Cyprus in the '80s. Logistics on the green line in Cyprus is probably still as mad now due to the number of countries in the UN.

Anyway, capacity planning is very well understood in general but of course the devil is in the details.

At the moment the IT supply chain is pretty spotty and that affects my little IT firm up to the big boys.

When you buy Cisco + HPE + Dell or whatevs, you go to your reseller (me). I go to my distributor and they suck hardware out of Dell etc and take their cut and I install the gear and take my cut. Sometimes a disty thinks they can do reseller too. The thinking is that they can roll up two lots of margin and shave a bit. That's fine if you can actually do logistics and the "teeth arm" job too.

Clouds think they can go even further and sometimes they can and sometimes not. Now we have a sodding complicated resource on offer with a supply chain that is a bit random.

The whole hyperscale cloud premise is based on infinite availability of raw resources and that is complete bollocks. You can't hyperscale if you can't source stuff indefinitely.

Those Austrian mint filled choccy frogs became a thing for a while. I have no idea of the exact numbers but presumably Austria supplied quite a lot of them for the UN forces and families in Cyprus in the '80s - they became a bargaining chip for a while. They came in a cardboard package with a lid coloured light blue with outlines of frogs and I think the main box was dark brown or black.


So does Azure.


Never happened to me in AWS.

Wasn't the whole point of "the cloud" that these things shouldn't happen?


Azure has had some of the biggest outages, like when they went down on Feb 29th for a whole day.

https://azure.microsoft.com/en-us/blog/summary-of-windows-az...


It seems like in nearly 3 out of every 4 years the whole internet is unusable on February 29... why pick on microsoft?


10 years ago, has there been something similar recently?


The last one I remember is this one from August this year: https://redmondmag.com/articles/2022/08/30/microsoft-blames-... It was not a complete outage but these DNS issues caused a lot of pain.


Having worked for a company that's a very large customer of AWS's, it's not much better.

I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.


We have actually been using Azure for ~2 years now. It worked reasonably well most of the time, even though we also had a few issues. But our current issue + reading your and other comments will probably result in looking for a new home.


> One consumer-grade fiber link is enough to serve my company's traffic and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

I don't believe that is even remotely correct.

It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations.

I'm dealing with AWS and on-prem. On-prem spent some $5M to build out a whole new setup, took literal multiple months of racking, planning, designing, setting up, etc.

It's not even entirely in use because we got supply chain issues for 100 Gbit switches and they won't be coming until at least April of 2023 (after many months of delays upon delays already).


Depending on your scale, things are really not that complicated. If you can run your company from a single machine, having two for redundancy, and two internet links for redundancy, will likely go a loooooooooong way until something bad happens...


Out of curiosity (from someone inexperienced with Azure), is it a skill/ability chasm between MS engineering and outsourced support?

TAMs tend to be a bandaid organizational sign that support-as-normal sucks and isn't sufficient to get the job done (ie fix everything that breaks and isn't self-serve).


Microsoft support is really awful. Basically, if you need it regularly, you just pay for resident engineers who can bypass the wall between the product groups and you. I’ve had nothing but great experiences with those guys.

Otherwise, especially if there’s a broader problem, they play lots of games with SLAs, etc.


YES! We tried a big project in the cloud (many many many high end VMs), and Azure was SO unreliable. From BGP config fuck-ups to obscure bugs in their stack.

Their support was also amazing in the beginning... but after they hooked you up... you're just a ticket in their system. Takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have fixed in a very short amount of time at the beginning of the relationship.

Cloud isn't that magical unicorn!


Yes, and what is your contingency plan for said fiber going dark?


I have DB connection issues at least a few times a week. Annoying.


New Microsoft customer at all.


The common argument of "our own hardware would be more profitable in X years" is typically countered with "but you need to pay engineers to maintain it, which adds to the cost".

Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).

I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".


Depends on how stable your needs are, but sometimes it's cheaper even when you consider total cost, and not just for big deployments.

In the past two or three years, we probably moved more services off the cloud than the other way. That said, one reason for that is that most new services are built in the cloud, so there are still fewer services off the cloud than on it.

Cloud is best when you are starting out, when you don't know what you need, need high velocity in adding new stuff, or have very bursty demand for either traffic or CPU, etc. Or if you are just a small developer-only team.

But if you have applications that are relatively stable, are mostly feature complete and you don't expect much sudden growth etc, it's useful to run the numbers if cloud is still something you want/need.


Oof, that sucks and I feel for you. That said...

> setting up in a new region would be complicated for us.

Sounds to me like you've got a few weeks to get this working. Deprioritize all other work, get everyone working on this little DevOps/Infra project. You should've been multi-region from the outset, if not multi-cloud.

When using the public cloud, we do tend to take it all for granted and don't even think about the fact that physical hardware is required for our clusters and that, yes, they can run out.

Anyways, however hard getting another region set up may be, it seems you've no choice but to prioritize that work now. You may also want to look into other cloud providers, depending on how practical or how overkill going multi-cloud may or may not be for your needs.

I wish you luck.


"Multi-cloud from the outset" is probably the single-worst generic cloud advice that I think anyone could be given. In professional cloud consulting the rule of thumb is to do one cloud with excellence until you even think about another one. And even that is really just kicking the conversational can, as both becoming excellent and actually needing multi-cloud combined is one-in-a-billion.


That pretty much binds your hands since in our experience the one provider who can do “one cloud with excellence” is AWS.

(As an aside I also agree that multi cloud from the get go is a YAGNI violation. Just keep in the back of your mind “could we have an alternative to this?” when using your provider’s proprietary features.)


That generalizes to every kind of lock-in: have a viable escape plan, but only execute it if you need or it becomes cheap enough that it won't harm you.

Just having the plan is already expensive enough.


My experience is the opposite: AWS has more features on paper but most of them exist only to tick a checkbox. Azure has more integrations between their offerings, as well as Azure Active Directory, and Microsoft 365.


Why do I want AD or Microsoft 365?


O365 = teams, docs, outlook, etc. workspace tools

AD = identity, access, privileging, SSO


I know what they are. No company I’ve worked for recently used them.

In my mind they’re legacy business products.

Sure, most businesses use them, but I don’t necessarily believe that is a forever thing. At one point most businesses had mainframes.


Refuse reality at your own peril. These tools are actively used at scale across most enterprises, and likely will be for the majority of your career. More than 95 percent of Fortune 500 companies use Azure/O365.


How many Fortune 500 companies were founded in the SaaS era?

I'm going to guess that the 5% that don't use Azure/Office 365 represent newer entrants to the Fortune 500.

71% of Fortune 500 companies use mainframes [3], and yet they are considered a dead technology with essentially no future. Do you know anyone or anyone who knows anyone who learned how to develop on mainframes in college in the current millennium? It sure wasn't part of my CS curriculum!

Small businesses represent almost half of US economic activity [1] and represent 99.7% of firms with paid employees [2]

[1] https://advocacy.sba.gov/2019/01/30/small-businesses-generat...

[2] https://cdn.advocacy.sba.gov/wp-content/uploads/2021/12/0609...

[3] https://www.precisely.com/blog/mainframe/9-mainframe-statist...


You personally? No idea. You probably don't. But many (most?) businesses use AD and Office and aren't particularly interested in migrating to alternative solutions.


I mainly work in startups. None of them in recent memory have bothered with AD or Office. Okta and Google Workspaces take their place.

Those MS products have an “IBM mainframe” problem. New businesses won’t choose them.

That’s why I say “why do I want them?” If I was starting a new business I’d have no reason to use them.


Yes, Microsoft focuses on the customer (corporate IT), not the user.

This is how the iphone was able to nuke windows phones which were designed to meet the needs of IT


We’ve had reliability and availability problems, especially with Azure and also Google, less with AWS.

None are ideal.


And Active Directory integrates horribly with everything outside Microsoft.


Azure Active Directory has both SAML and OpenID Connect endpoints… what’s missing?


We use it across many non-MS services without issues. Care to expand?


Just to steel-man your statement: you should strive for excellence in deployment with all providers (dabble), but have your initial core setup on one cloud (YAGNI principle) until multi-cloud capacity is needed (at scale).


If you use generic enough services (container hosts, load balancers, VMs, object stores, even hosted SQL DBs, etc.), then the multi-cloud journey is not that hard. The challenge comes when you have built a whole architecture on top of some AWS magic that simply does not have an easy alternative in the non-cloud world.


Who are you quoting? I said "multi-region from the outset" and later acknowledged that multi-cloud would probably be overkill.


This was exactly what I was thinking; it's amazing what people read into things, probably myself included.


Completely agree, though certain aspects, such as running on k8s or Docker might make it easier to switch if you ever decide to, versus say, being tightly coupled with many bespoke cloud products.


My philosophy is to make switching to a new cloud possible. It doesn't have to be easy. We just shouldn't nail our feet to the floor.


Or you could just deploy on metal, which will be cheaper and sufficient for vast majority of cases. Plus you can always migrate to VMs with relatively low hassle.


You always only need another one when everything has gone to shit, either from failure or from the cost of vendor lock-in, so drinking your chosen provider's Kool-Aid means taking the reactive route and scrambling to rearchitect when the issues hit.

Multi-cloud is really not a big deal. Main nuisance is billing differences, followed by slight variations in e.g. Terraform config.


On top of which, startups often don't have that luxury; you often need to ruthlessly prioritise your effort.


True. When Nextel was bootstrapping in the 90's, a VP said, "We have to buy gas for the car now. Later we'll buy the seat belts"


Totally agree. If the service you're providing is so important, build your system so it can fly on one engine or at least land safely. Multi-cloud is the equivalent of trying to transfer all of your passengers to a different aircraft mid-air.

Multi-cloud should only be for mission critical infrastructure. Very little infrastructure is mission critical. Most other use cases can be temporarily wallpapered over with an "Under maintenance" page unless there's a good reason otherwise.

Multi-cloud introduces more risk than it prevents. Which is why things like simulated failovers and BCP testing are constantly required.


Surely it would depend on the reliability demands of your product.


I'll reply to my own comment in response to a since-deleted reply that went something to the effect of "this is terrible advice for a young startup trying to get to product-market fit":

I'm totally on board with the idea of being scrappy and taking shortcuts in order to get to PMF as soon as possible. However, it seems the proof is in the pudding here. If you can't service customers due to lack of compute resources, you can't get to PMF.

Also, yes there are certain infrastructure and network topologies that would absolutely be overkill for a young startup. I don't think multi-region is one of those things. I don't have experience with Azure directly, but on every other cloud provider, going multi-region is not something that requires huge amounts of time or resources. You just need to be mindful of it from the outset. And if you decide not to be, then at least be intentional and conscious about the risk and have a plan in place for what happens when you get bit by deciding not to go multi-region.


I'd add to this by asking: how much more PMF can you get when you have a two week horizon of new customers before you literally run out of compute resource in a major cloud provider data centre?

Sounds like customers are coming in thick and fast.

If this is the dynamic and the company can't spare a few weeks to solve it, something has gone seriously wrong in a very interesting way.


Also, n8n arguably has product-market fit so the advice was impertinent to start with…


GDPR might be a problem here. But this brings us to an important point: this is not your infrastructure, but someone else's.


There are quite a few European regions, but we don't know how the others stand with their computing limits...


GDPR isn't really related to the infrastructure, and isn't a problem if you built your product knowing you'll need to conform. Shopify is GDPR compliant, for all merchants, and runs on Google Cloud in multiple regions.


There are specific requirements in Germany that require user data to not leave the country. I believe that was what OP was referring to.


Genuine question: is knowledge on how to do this well known? Without that accessibility, I'm picturing folks operating in EU being unwilling to take the risk of not being compliant and just hosting everything in a single region.


GDPR is not some boogeyman; it can be a pain to do on existing products that were built pre-GDPR, but if you are starting a new project, being GDPR compliant is pretty straightforward and not hard/time-consuming, unless you are explicitly trying to do something shady*

*privacy-invasive, which GDPR is explicitly set up to make harder, so duh


They’re wrong.


Privacy Shield aside, it very much matters where your servers are. The EU has been cracking down hard on extraterritorial transfers in the past year, with more to come.

Also, lots of companies assert GDPR compliance via magical thinking. They most often are wholly wrong. Shopify can say whatever they want, but there’s no certification body.

Source: I’m the person who evaluates and builds compliance systems for a range of services you almost definitely use.


Good point, I didn't think about that.


I take offense at your comment. It's not the first time I'm hearing about multi-region/multi-cloud in online tech forums; however, reality doesn't match.

I don't want to be snarky, but when large service providers like AWS have their own cross-region downtime because one snowflake of a service in us-east-1 is down, I kind of dismiss the virtue signaling of highly resilient multi-(AZ/region/cloud) ever existing in practice.

If you can somehow have a separate database per region/cloud, sure, I can understand that, but if you have to shard your database across many clouds, I'd dread having to tame such a beast, especially within a startup.


> dismiss the virtue signaling of high resilient ...

So you're saying it's impossible to improve reliability from 97% to 99% because you can never make it to 100%.


If your single-AZ, single-region cloud is not giving you 3 or 4 9's of reliability out of the box, you are using the wrong cloud.

Multi-AZ and multi-region add complexity and cost much more quickly than they add reliability.

Sometimes it is worth it. Sometimes it is not.


Depends on your needs, but having your data & database multi-AZ to ensure durability can save you from having to restart from backups. I'm thinking about an old AWS incident where they actually lost EBS data: https://www.bleepingcomputer.com/news/technology/amazon-aws-... Also make sure your backups are in a different AZ (thinking about OVH ...) or region, or even at a different provider.
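
On AWS, for example, copying an EBS snapshot to a second region is a one-liner you can put on a schedule (the snapshot ID and regions here are just placeholders):

  aws ec2 copy-snapshot \
    --source-region eu-central-1 \
    --source-snapshot-id snap-0123456789abcdef0 \
    --region eu-west-1 \
    --description "cross-region copy of nightly backup"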


Thanks a lot! You are totally right, it is for sure something we will find a solution for. But honestly, I do not want to. As a startup, you have very few resources and deliberately place a few specific bets. Deprioritizing everything to work for a long time on something that was not prioritized, just to then end up again where you were before (a working cloud solution), is the last thing any startup should be forced to do. Anyway, it seems like we do not have much choice here.


I hear you. It's not a fun position to be in. And sometimes you're correct to take calculated risks and maybe the expected value was positive here, despite what ended up happening.

Without knowing the details about your services and infrastructure, it's hard for me to know what's involved in going multi-region now. Are you sure it's such a gargantuan effort? I would've thought one person working full-time on this for a week or two would be enough, but again I don't know the details of your setup.

One option would be to pay a consultant who is an expert in Azure/cloud stuff to come in and help. May not be cheap, but could be a lot better and quicker for you and better for the business, especially if none of you are really big experts in Azure.

I've been here before (I think)...had to wear many hats and scramble to make sales, build the tech, act as de facto DevOps person even without a lot of experience doing it, etc. That is the way, but stuff happens.

Happy to chat about specifics if you want to bounce ideas off of me or go through your particular situation. Can't promise I'll have concrete advice, but happy to talk it through.


Thanks a lot, that is really super nice of you and appreciated! Luckily we have somebody very knowledgeable on our team. Will tell him to reach out if he wants a peer to brainstorm some ideas with.


Glad to hear you have the right people. Good luck, my friend.


I disagree with the other poster that you should have been multi-region from the start. It adds a load of complexity and failure cases for early-stage startups.

Very poor position to be in; apparently this happened in Azure UK recently too.


I do not think it is a bad idea to be multi-region from the start. For the most part Azure has at least two regions in each country (Germany North/West Central, UK South/West, Sweden South/Central, Norway East/West, UAE North/Central, France Central/South, etc.). So if stuff happens, being able to bring up your service in a different region in the same country could be helpful. I do not know the specifics, but it seems to me that having an abstraction layer on top of the region is not that hard to do (most Azure services are supported in all regions). Of course, it is a lot easier if done at the outset. Being forced to do it quickly and with little notice is no fun at all....


I feel for you. Also it sucks to be in this position.

Let the scar you get from this be a learning experience; hopefully you will not fall into the same trap of trusting this company again.

In my career I'm in a place where anyone suggesting I do work on Azure gets an instant doubling of my asking day-rate, and I really hope they will be put off and find another victim for the gig.

That said, another learning experience would be to use Terraform or something (tbh for Azure the only sane thing is Terraform; ARM templates are just garbage). Having terraformed your one region, switching to another would be much easier, though not trivial.


Is it that much cheaper for you to build out a new region on Azure versus getting set up on AWS?

If you rely on Kubernetes for orchestration and have minimal cloud API dependency, it may be worth evaluating this option.

Also, do you have a TAM associated with your account? Are you just going through regular support channels? Can they deliver different instance types (not sure what the Azure parallel is), can they deliver short term capacity, etc?

I would try to push Microsoft more here. It's not like they've stopped on-boarding new customers into that region right? What happens if you create a new account in that region?


We already tried to push Microsoft; sadly they have not been very helpful. Still trying to get in contact with somebody who can actually make a difference. After all, we are not asking for a hundred machines. I really cannot imagine that they cannot somehow make the resources we require available.


I'll ask again what the person above asked: do you have a TAM.

If you don't you're at a big disadvantage.


we tried escalating this through CASM and it did not work. The region is blocked for every quota, even a single instance.


Sorry, missed that. No, we do not.


Get one asap. Your TAM is the insider and should push for you.


One trick you can try is to switch to a different SKU. Most Azure databases have different generations of underlying compute. They may be out of just one model. Try a different one.

Similarly, just keep trying to change the size. Often it’ll go through when someone else decommissions something.
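
If you want to automate the "keep trying" part for a VM, a crude loop does it; the resource group, VM name, and candidate sizes here are just placeholders:

  # Cycle through a few comparable SKUs until a resize goes through.
  while true; do
    for size in Standard_E4s_v4 Standard_E4s_v3 Standard_D8s_v5; do
      if az vm resize -g my-rg -n my-vm --size "$size"; then
        echo "resized to $size"
        exit 0
      fi
    done
    sleep 600  # capacity often frees up when someone else decommissions
  done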


“Being multicloud from the outset” is a very silly idea for most use cases.

The way to get more from most cloud is by becoming a partner, not just a customer. And the way to do that is increase dependency and usage.


t. Sales department of Cloud provider


Good sales (which I'm not) is all about aligning with the customer: solve their problems for them, and they'd be happy to pay for it. Getting them to buy stuff they won't need is a sure way to lose future business.


> Deprioritize all other work, get everyone working on this little DevOps/Infra project.

This is doubly worthwhile as if this stumble kills the startup (it can happen) this will be excellent experience to take to the next employer :)


Also, a lesson here is: don't ever create any prod assets manually. Use Terraform or some other software to define your cloud assets. Then this issue is just a matter of adding a top-level loop or maybe adding a region parameter to a layer of software. Cloud is only efficient if you take software-defined-everything seriously. Otherwise it is premium hosting where you are likely a small fish.
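
Even if you don't go full Terraform, the same idea works as a plain script where the region is just a parameter (a minimal sketch; all the names and sizes below are made up):

  # Stand up the same stack in whichever region has capacity.
  REGION=${1:-germanywestcentral}   # e.g. pass westeurope if Germany is full
  az group create -n myapp-$REGION -l $REGION
  az aks create -g myapp-$REGION -n myapp-aks --node-count 3 --node-vm-size Standard_E4s_v4
  az postgres flexible-server create -g myapp-$REGION -n myapp-db-$REGION -l $REGION

The point is the same either way: if everything is defined in code, "try another region" is an argument, not a project.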


Thanks largely to k8s, running on multiple cloud providers and your own hardware is a lot more convenient than it was a few years ago. Component interfaces and protocols are a lot more consistent across platforms as well.


> everyone working on this little DevOps/Infra project.

Everyone? That's not going to help.


This is nothing new; Azure has been having capacity problems for over a year now[1]. Germany is not the only region affected at all; it's the case for a number of instance types in some of their larger US regions as well. In the meantime you can still commit to reserved instances, there is just no guarantee of getting those instances when you need them.

The biggest advice I can give is: 1. keep trying and grabbing capacity continuously, then run with more than you need; 2. explore migrating to another Azure region that is less constrained. You mention a new region would be complicated, but it is likely much easier than another cloud.

1. https://www.zdnet.com/article/azures-capacity-limitations-ar...


>Azure has been having capacity problem for over a year now

This is also a problem internally for Microsoft. GitHub and LinkedIn still operate in private datacenters due to Azure capacity issues


> In the meantime you can still commit to reserved instances, there is just not a guarantee of getting those instances when you need them.

... wait, what? How are they defining 'reserved'?


RI are a billing concept (discounted rates for long term commitment).

Dedicated capacity exists, but it’s different (compute reservation groups or dedicated hosts).

You can combine CRG/DH with RI for the desired effect, although IMO it’s a bit confusing.

(Azure employee)


It's a billing mechanism. You pay less if you guarantee use. Sadly, they don't guarantee availability of things to use :)


Yup, I'm aware of reserved instances (from an AWS PoV) but I always assumed they were, at least theoretically, well, reserved!


Reminds me of the classic Seinfeld car reservation bit: https://m.youtube.com/watch?v=4T2GmGSNvaM


Great bit! Same with airline overbooking ;)


On AWS instances are only reserved if you reserve a specific instance type in a specific zone. Reservations across multiple zones or savings plans don't reserve capacity.
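
If you actually want guaranteed capacity on AWS (separate from the reserved-instance billing discount), the explicit mechanism is an On-Demand Capacity Reservation pinned to a zone; the type, zone, and count below are just examples:

  aws ec2 create-capacity-reservation \
    --instance-type m5.xlarge \
    --instance-platform Linux/UNIX \
    --availability-zone eu-central-1a \
    --instance-count 4

You pay for the reservation whether you use it or not, which is exactly the trade-off you're making for the guarantee.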


In the AWS context, they are, in fact. That's the original point of them - so during big AZ failures your reserved instances had first dibs on the available capacity.

The billing thing became more of the point as big AZ failures are so rare.


Non-zonal instance reservations do not reserve capacity. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capa...


Well, yeah. But the zonal ones still do.


With dedicated tenancy AWS Reserved Instances are physically reserved for you.


I worked briefly in an enterprise facing sales organization that targeted multi-cloud deployments. Azure always had capacity problems.

As ridiculous as it sounds, having an enterprise's applications exist on multi-cloud isn't terrible if the application is mission critical - not only does this get around Azure's constant provisioning issues but protects an organization from the rare provider failure. (Though multi-region AWS has never been a problem in my experience, there is a first time for everything.) Data transfer pricing between clouds is prohibitively expensive, especially when you consider the reason why you may want multi-cloud in the first place (e.g., it's easier to provision 1000+ instances on AWS than Azure for an Apache Spark cluster for a few minutes or hours execution - mostly irrelevant if your data lives in Azure Data Lake Storage).


Every cloud provider will have these issues with specific instance types in specific regions, although the Azure Germany situation sounds perhaps a bit more dire. At my past (much larger) employers we’ve always run into hardware capacity issues with AWS too - we’re just able to work around them.

Building on cloud requires a lot of trade offs, one being a need for very robust cross-region capability and the ability to be flexible with what instance types your infrastructure requires.

I’d use this as a driver to either invest in making your software multi regional or cloud agnostic. Multi regional will be easier. If you’re already on k8s you should have a head start here.


As much as this happens, I don't feel it's something to be expected or even okay.

The major cloud services are expensive. This extra cost is supposed to provide for cloud services' high level of flexibility. Running out of capacity should be a rare event and treated as a high priority problem to be fixed asap.

Without the ability to rapidly and arbitrarily scale, they're just overpriced server farms.


> Without the ability to rapidly and arbitrarily scale, they're just overpriced server farms.

I mean that's what cloud is (an outsourced server farm). Sure, they also offer services on top, but that's mostly because they want to lock you in and can charge more for it, so it's a win-win for them.

And there is no magic here, someone has to get the chips, build servers and connect them to the network. And while they will often overbuild for capacity, they will never do it to a degree where they can't run out, because that would be way too expensive and not financially viable.

I don't think any cloud will ever be able to guarantee to never run out of resources.


> I don't think any cloud will ever be able to guarantee to never run out of resources.

I agree with this, but clearly there's a disconnect between how often people expect these kinds of issues and how often they actually happen. The whole point of the cloud is you pay a premium for the added flexibility. If it turns out that flexibility isn't there when you need it then maintaining your own servers becomes a lot more attractive.


Some problems can't be fixed (eg. chip supply chain problems) even if you have more money.


>Some problems can't be fixed (eg. chip supply chain problems) even if you have more money.

They can't magic chips into existence, but leaving a major region like Germany high & dry for almost half a year sounds like planning went wrong, frankly. If it were a matter of chips, I would have thought that on a 3+ month timescale they could steal a few from another region that has a bit of fat.


> they're just overpriced server farms

That's exactly what a cloud is. It's someone else's datacenter with an API.


A really good API that makes it close to a software defined everything world. Which has promise.


There is a "minimal viable product" of documenting the configuration of your system so you can (1) run development, test, staging instances, (2) jump to another region when necessary, (3) from other disasters.

Ideally you have a script that goes from credentials to the service to a complete working instance.


Yes it’s weird that you have to ask them for instances which some actual physical person looks at your request, thinks about it and says yes or no to.

Instead of providing you with a list of the resources they do have, you have to play this weird game where you ask for specific instances in specific regions and then within several hours someone emails back to say yes or no.

If it’s no, you have to guess again where you might get the instance you want and email them again and ask.

I envisage going to an old shop, and asking the shopkeep for a compute instance in a region. He hobbles out the back, and after a long delay comes back and says “nope, don’t have no more of them, anything else you might want?”.

It’s surprising this is how it works. Not the auto-scaling cloud computing used to bring to mind.


I briefly worked on an Azure team, and what I remember hearing (a few years ago) was that they were building out datacenters as fast as they could, but they simply could not keep up with demand. A good problem to have, I thought, but maybe not in light of this news!


I recently spoke to a datacenter planner. I wouldn't be surprised if the global spend on new datacenters for 2023 is on the order of $100 billion. If this continues for a few more years, this is planet-shaping change.

I just Googled it. Gartner estimated $125 billion.

https://www.fiercetelecom.com/telecom/cloud-and-colocation-d...


Is this a joke comment?


No, they have very low quotas by default, and you have to request increases through the portal, which then get rejected; then you click the button to contact support/email, and then you sometimes have to negotiate with them.

You have to do this for every single instance type they have; you can't even experiment with or test other instance types because it's too much trouble to get quota.


In the future it will be possible to use computers to figure out what’s available and automatically give it to customers.

21st century man…. it’s coming.


But who will determine when more computers are needed to figure out what's available to give to more customers because there's been a spike in demand?

Computers don't fix everything. They just allow you to f*ck up bigger, harder, and faster, usually in the most banal way imaginable.


The comment I replied to was not talking about changing quotas but actually creating instances.

> Yes it’s weird that you have to ask them for instances which some actual physical person looks at your request, thinks about it and says yes or no to.


Well, you can't create an instance without having quota available.

And the default quota is low, like 10 CPUs. So, want to start a 2-node k8s cluster with 8 CPUs each? Nope, go request a quota increase.


No, Microsoft still isn't up to the 'use what you want, pay for your usage' level that other companies tend to be. They even still mix "licensing" with "usage" so you have to pay for something to then be allowed to pay for using it...


No, this is my actual experience using azure.


I can go on Azure right now and create an instance and nobody will check anything manually and email me back something. Maybe you're confusing Azure with some other small town colocation provider.


If you want 1 instance, you're right. If you want 10 - 20 instances of one type in a region, the other poster's experience matches my own: you have to open a support request to ask for a quota increase, and that is not an automated process.


Accounts have instance count quotas; you can get them raised, but it is a support ticket to do so.

And sometimes, that is hard. I've had Azure support not able to understand what quota they need to raise / what quota is being requested. I had to at least link them to their own documentation on it… (partly the confusion is that quota support tickets allow selecting the quota as a piece of metadata on the ticket, but only for some quotas, and of course, mine was for one of the ones not listed. Why they don't just list all of them is anyone's guess.)
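
For what it's worth, before opening the ticket you can at least see exactly which quota family you're hitting (the region here is just an example):

  # Current vCPU usage vs. limit per VM family in a region.
  az vm list-usage -l germanywestcentral -o table

That makes it easier to name the precise quota in the support request instead of letting them guess.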


Nope, I went through this process exchanging more than 30 emails trying to get the instances I wanted.


"an instance" lol


I am sorry to say but at this point Azure is so f’ed up I think it should only be considered after AWS and GCP.

The documentation is terrible and the Azure portal is so slow and laggy I can’t even believe it. Not to mention how unreliable their stack is.


This is not as rare as public clouds may lead people to believe. I have had to move workloads around since AWS began (even between public clouds on occasion).

In particular, GPU availability has been a continuing problem. Unlike interchangeable x64 / arm64 instances with some adjustments based on the new core and ram count... if no GPU instances are available then I simply cannot run the job. AMD's improved support has increasingly provided an alternative in some situations but the problem persists.

I recommend doing the work to make the business somewhat cloud agnostic, or at the very least multi-region capable. I realize this is not an option for some services that have no equivalent on other clouds but you mentioned databases and k8s clusters which are both supported elsewhere.


GPUs are better run in your own office.

All cloud providers charge much, much more for GPUs than if you run a local machine.

Cloud GPUs are also a lot slower than state of the art consumer GPUs.

Cloud GPUs: much slower, less available, much more expensive.


This is generally true for all accelerators (I work with cloud and on-prem FPGAs for my startup, Arbitrand).

However, lots of people only need those accelerators once in a while, so time sharing (aka cloud computing) makes a lot of sense and saves a ton of money overall. For FPGAs and some compute GPU applications, not having to handle support for your accelerators is also nice.


Say you want 100 GPUs, all interconnected to your multi-petabyte data lake that's being fed by your production workload.

Sure, you could buy all that equipment, but I'd wager it's cheaper, more agile, and higher velocity to have it in the cloud.


I would argue that the cost profile is different.

Local GPUs are a big up-front cost. But assuming that your workload is stable, in the long run I think local GPUs end up being cheaper per-hour than cloud.

For startups, it doesn't make sense to make the up-front purchase, fine. But if you're optimizing for long-term (amortized) costs, I'd be curious if cloud is cost-effective.


The long run being three months


An Nvidia DGX box is roughly 40k. And that's not including power/storage/rack space.

But yes, if a single workstation can meet your GPU training needs, then it'll be cheaper with sufficient usage.


For small orgs this makes sense, but this really depends on how big your data sets are that you're training against and how your ML Ops / Data Ops is set up.

GPUs are better run close to your data. If you're training on-prem then your data needs to be on-prem too.


You want to be in a position where you can spin up in a nearby region, pretend it's local, and have things be good enough for a while. Properly building out multi-region is hard, and multi-cloud isn't worth it because it trades ongoing operational toil for better handling of rare events (where half the internet is already down anyway).


I used to be a technical seller for Azure. This situation is obviously not great for you as a customer but there are proactive steps you can take to prevent this going forward. Reach out to your sales team and work with them on your roadmap for compute requirements going forward. The sales team has a forecast tool that feeds back into the department that buys and racks the equipment. If you can provide enough lead time, they will make sure you have compute resources available in your subscriptions.


What you describe is like the inverse of 90% of the reason companies host in the cloud. What makes needing to forecast and reach out to a sales guy to eventually stock hardware for your needs (while now competing against other customers for those resources) any better than hosting on-prem?

AWS for sure has had resource constraints in different AZs (especially during Black Friday and holiday loads), but I have never had an issue finding resources to spin up, especially if I was willing to be flexible on VM type.


Under most circumstances, this isn't needed unless you have a big ask. Say you need 1,000 specific cores and GPUs; then this process is the best way to ensure you have them available.

The original poster probably has the ability to spin up other instance types in their region. If there is no compute capacity in the entire region, something went wrong operationally.

I'm not suggesting you should put in a request for every new resource you need, but if you have a specific instance type or a large number needed, it helps. You're not losing the ability to shut them down the next day if you don't need them, you're just telling the Azure team that you expect to spin some up around a certain time. If you're making a significant request of compute capacity, the team has the ability to reserve those instances for your subscriptions so that you're not competing with others for those cores.


Why work with a human in the Azure sales department and plan cloud resources a year ahead? What's the point of the cloud at that point? Then it just becomes a 100x more expensive version of hiring an infrastructure person and planning your own physical resources a year ahead with them.


Thanks a lot, that is very helpful and great to know! Def. something we will do in the future.


What VM sizes?

Besides what’s already been said, internal capacity differs HUGELY based on VM SKU. If you need GPUs or something it’ll be tough. But a lot of the newer v4/v5 general compute SKUs (D/Da/E/Ea/etc) have plenty of capacity in many regions.

If changing regions sounds like a pain, consider gambling on other VM size availability.

(azure employee)


Actually nothing fancy, for sure no GPUs. Just Standard_E4s_v4.


Ah, bummer. If it helps, you can try this to list out VM sizes with comparable capabilities and see if you have better luck with any others (--all not really necessary since it filters by NotAvailableForSubscription and similar):

  az vm list-skus -l germanynorth -r virtualMachines --all > germanynorth.json
  jq '.[] | select( any( .capabilities[]; (.name == "vCPUs" and (.value | tonumber) >= 4 )) and any(.capabilities[]; (.name == "MemoryGB" and ( .value | tonumber ) >= 32) ) )' germanynorth.json
4/32 because that's what E4s_v4 would have.


Thanks a lot! Just checked internally. Apparently there are some instances which we could get, but they would not work cost-wise (they have, for example, a lot of CPUs but we mainly care about RAM). Additionally, there is also still a region-wide CPU limit that would still cause us problems. So sadly not a long-term solution. But thanks a lot!


Looking at the time you seem to be spending on this issue and the fact that you apparently only need low double digits of those instances:

Are you really sure you shouldn't just buy a bunch of machines (500 cores / 2 TiB go for ~60k€), throw them into a colo, and then spend that time on actually doing stuff?


Same goes for every App Service, no matter which instance size.


> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.

Yikes, this is totally the first thing you need to come to expect when working with MSFT.


When Amazon S3 was a new thing, and I managed to convince my company to move to it, the first week we moved to serving some of our stuff from S3, Amazon had an outage.


Probably a good learning for the future ;-)


Maybe Microsoft had just got their AWS bill?


Well I thought that was funny :-)


Most of Europe expects the winter to be quite painful from a power perspective. It would not be surprising if cloud providers (major power users) are being asked to not increase (or even decrease) power usage.

The timeframe they gave would match that kind of ask.

I wonder whether you see the same behavior from other cloud providers there (ie if you ask them whether new capacity is available, what do they say)


> It would not be surprising if cloud providers (major power users) are being asked to not increase (or even decrease) power usage.

I doubt it. It will be easier - and probably safer - to ask citizens and physical industry (eg, factories) to bear the brunt than to risk having problems in critical IT infrastructure. Ask people and factories to turn the heat 3 degrees down and the effects will be more or less predictable. Asking to shut compute power down at random will have unpredictable consequences.


It's not about shutting down existing machines. The power grid operator might be less willing to approve upgrades to serve increases in capacity. (No idea whether that's the case.)


That makes more sense.


Someone else confirmed my guess https://news.ycombinator.com/item?id=33744179


Obviously Azure failed its customer here, but everyone with data centers in Europe is tightening their belts and preparing for the worst.

I suspect AWS and GCP just have more headroom in EU.


Message to cloud providers:

List what you do have available so we can choose.

Do not force users to randomly guess and be refused until eventually finding something available.


Imagine if they did this in real time. There are already DDoS attacks happening which abuse cloud free trials at scale; this would give them another attack vector.

I can see why they wouldn't want to do this.


It only has to be available to paying customers. Or even customers over a certain paid usage threshold.


I need big m4n instances with 100gbe for product demos, and spinning them up lately is like trying to get Taylor Swift tickets on Ticketmaster. We end up wasting money running them for days at a time instead of on demand because we’re afraid of losing them.

It’s infuriating that AWS doesn’t have an API that returns a list of AZs with available inventory for a given instance type.
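As far as I know the closest thing is describe-instance-type-offerings, which only tells you where a type is offered at all, not whether there's capacity right now. A rough sketch (the instance type here is just a stand-in):

  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters Name=instance-type,Values=m5n.24xlarge \
    --region us-east-1 \
    --query 'InstanceTypeOfferings[].Location' --output text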


Why not run them elsewhere?

There’s lots of providers apart from AWS/Azure/GCP.

Or buy a machine and put it in your office.

Self hosting can often be cheaper and more available and probably faster than using a cloud.


We used to run demos on a local hardware cluster, but we found that prospective customers were reacting negatively to demos that were not on the same platform they would be running in production (AWS).


I’ve had great experiences running bare metal instances on packet.io but haven’t used them since the acquisition. For accurate benchmarking it was fantastic (and much cheaper than EC2 bare metal instances).


Which regions are you trying?


Why would they make any promises, or be upfront about their resources at the risk of becoming less attractive compared to competitors with more resources? It’s not like many people are shunning the cloud for that reason today (although maybe they should).


Your price point and the cloud's margin are tied to not sitting on lots of unused instances. You want there to be adequate capacity, not excessive capacity.


It goes both ways: cloud providers don’t want to make promises about capacity, and cloud users don’t want to make promises about usage.

I don’t know about price point. Dedicated servers can be cheaper than cloud in many cases, if you have the appropriate know-how, and the cloud business is very profitable for a reason.


At a bare minimum there should be a feature like "give me any VM that closely resembles an E8s_v5 with at least 32 GB RAM", or "anything from these 4 approved types".

I don't always care whether you give me an E8_v4 or a D8 instead, just give me something. With the hundreds of VM variants available, finding an exact match is obviously an unnecessary constraint. Maybe they already simulate this behind the scenes, I don't know, though given that the sizes are advertised with hardware capabilities I'd imagine they can't really simulate a v4 using a v5 and vice versa.

The only place I've seen compute treated this fluidly is Container Instances, which is a bad choice for many, many other reasons.
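In the meantime you can sort of fake it client-side. A minimal sketch (resource names, the image alias and the SKU list are made up) that walks a prioritized list of roughly equivalent sizes until one actually provisions:

  # try roughly equivalent SKUs in order until one succeeds;
  # adjust image/SKU names to whatever you actually run
  for sku in Standard_E8s_v5 Standard_E8s_v4 Standard_D8s_v5 Standard_D8as_v5; do
    az vm create -g my-rg -n my-vm --image Ubuntu2204 --size "$sku" && break
  done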


https://aws.amazon.com/ec2/spot/instance-advisor/

It's not the exact metric but you can find which have more availability without knowing the exact number (which is constantly changing anyway)


Interesting semi-confirmed anecdote: when lockdown hit, Azure began to refuse to allocate servers. One of the main reasons was they prioritised servers in this way:

1. Government/health/defence cloud customers

2. Teams, which was exploding in use and they wanted to capitalise on it

3. Regular cloud customers


Yeah this was real. I remember this. For a while they selectively deprioritized customers, like you say. I'm not judging, just confirming the observation.


Good news is that today is Black Friday, so the e-commerce industry is running at peak capacity. In 30 days it will be Christmas, and by then (the very latest!) everybody will scale back, so you have a good chance to gain access to more compute before you reach the end of your runway.


I've seen this before. I think it was in us-west1, ran out of VMs of the size we used for CI. Had to move to a different region. (Never moved back…)

It is shocking to me that it happened at all. Capacity planning shouldn't be so far behind in a cloud that wants to position itself as on par with AWS/GCP. (Which Azure absolutely isn't.) To me, having capacity planning be a solved problem is part of what I am paying for in the higher price of the VM.

> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.

Oh my sweet summer child, welcome to Azure. Don't depend on them being proactive about anything; even depending on them to react is a mistake, e.g., they do not reliably post-mortem severe failures. (At least, externally. But as a customer, I want to know what you're doing to prevent $massive_failure from happening again, and time and time again they're just silent on that front.)


I'm baffled to read stories that suggest Azure is a viable competitor to GCP/AWS - they're an absolute nightmare on capacity.

It took me six months to get approved to start six instances! With multiple escalations including being forcibly changed to invoice billing - for which they never match the invoices automatically, every payment requires we file a ticket.


> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You will be threatened by your own unreliability if you build something that's dependent on one region or one cloud.


This is an insidious argument to make. When building a startup you should choose 1 reliable cloud provider and use their best practices to support high availability.


No matter the provider, their best practices all say to be multi-region.


Multi-region in AWS means building it yourself.

I suspect that the skills for real HA are atrophying because for 99% of the people multi-AZ is enough and most of the AWS stuff supports multi-az automagically.

The problem with multi-region is that it means configuration, and there are probably lots of services that you can't actually configure to be multi-region. Cognito is one off the top of my head. It looks like the various aurora flavors do multi-region, but what about Neptune? SQS? API Gateway? AWS Lambda? MediaLive?

Maybe you can hide all that behind DNS failover, maybe you can't.

Real multi-region basically means going back to old-school HA, and that was hard to do when it was your own data centers. On AWS it'll be even harder.

That isn't to say it's not possible, it's just a tremendous amount of work.

I mean really, if us-east-1 is down, 80% of the internet is screwed... so from an expectations point of view, does HA of your particular service matter if that happens? Even for a financial institution, outages happen.

Once you have enough people it might be worth it. For a non mission critical startup? No fucking way.


Multi-AZ and then grow into multi-region if the need arises. Multi-region is a huge lift the moment all your data must live in two regions simultaneously. Very few shops are experienced enough to run clusters across datacenters in a way that can handle the unhappy paths.


Def not true with AWS, unless you reach a particular scale. Not for product market fit. My technology choices would be fully managed services so I could focus on my actual business.


Read the "Well Architected" paper. Go multi-region.


Unless eCF has made some major advancements in the last couple years, Amazon’s own retail business isn’t multi-region. So it’d be like the cobbler saying “Buy my shoes!” while wearing none, if AWS were to push everyone hard to have a multi-region strategy.

At least on AWS you typically can find capacity (outside of accelerators) by being flexible on instance types (C, M, R), instance sizes, and availability zones. Sounds like this region OP is in for Azure is constrained such that even this advice doesn’t work.


Can you link me to the well architected paper that talks about going multi-region for an early stage startup?


This may be the reference: https://learn.microsoft.com/en-us/azure/architecture/framewo...

I just watched the second video on the page, and it does discuss multi-region a bit.


def true with everything. what a ridiculous statement.


Cross-region architecture will be the first thing you hear about.


Totally agree. We could for sure have built multi-region and multi-cloud from the get-go, but we had good reasons not to do it. Depending on the product, technology, ... I would actually also strongly recommend almost every startup do the same.


No matter what stage of a service you are at you should have a documented procedure (ideally running a script) that can stand up a working instance of the system.

This has vast benefits for agility and fast development when developers are not always fighting the build system and have a "no fear" attitude about deployment.

If you have that, you can stand the system up in another region with more capacity and migrate wholesale, without being particularly concerned about the general problem of coordinating the service across multiple regions at the same time.
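A minimal sketch of what that can look like for a k8s-based stack (all names, sizes and chart paths here are hypothetical):

  #!/usr/bin/env bash
  # stand up the whole environment in whichever region currently has capacity
  set -euo pipefail
  REGION="${1:-germanywestcentral}"
  RG="myapp-$REGION"
  az group create -n "$RG" -l "$REGION"
  az aks create -g "$RG" -n myapp-aks --node-count 3 --node-vm-size Standard_D4s_v5
  az aks get-credentials -g "$RG" -n myapp-aks --overwrite-existing
  helm upgrade --install myapp ./charts/myapp -f "values-$REGION.yaml"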


Seems bold to recommend everyone do the same as you when you are running into problems you can't solve because of this exact choice you made.


I am still 100% sure it was exactly the right decision. In hindsight, however, choosing Azure and/or that data center was probably not the right call.


If you go under because of this, will you still be 100% sure?

Everything is for sure until it’s not.


A startup is about managing risks and spending your time/money appropriately. Your cloud provider running out of capacity isn't an obvious risk especially if it's just capacity for general compute.

For some clouds that seem to be run on a manual process (IBM, Oracle) that would be expected, since they're sort of clunky. For other places (Rackspace, etc.) it would be uncommon. For a major provider like Azure, well, it's bizarre. I mean, the whole point of cloud is that it's all-you-can-eat.

You would think that this would be something they would advertise/talk about up-front. But who would sign up if that was disclosed?


Azure Germany is a separate partition from the rest of Azure - presumably for compliance reasons. This is distinct from AWS, where Frankfurt is just another region, albeit one with high demand.


> AWS .. Frankfurt is just another region

Unlike GCP and Azure, all AWS regions are (were) partitioned by design. This "blast radius" is (was) fantastic for resilience, security, and data sovereignty. It is (was) incredibly easy to be compliant in AWS, not to mention the ruggedness benefits.

AWS customers with more money than cloud engineers kept clamoring for cross-region capabilities ("Like GCP has!"), and in last couple years AWS has been adding some.

Cloud customers should be careful what they wish for. If you count on it in the data center, and you don't see it in a well-architected cloud service provider, perhaps it's a legacy pattern best left on the datacenter floor. In this case, at some point hard partitioning could become tough to prove to audit and impossible to count on for resilience.

UPDATE TO ADD: See my123's link below, first published 2022-11-16, super helpful even if familiar with their approach.

PDF: https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-faul...


AWS has several different levels of region isolation.

There are AWS region partitions - general, China, US GovCloud (public), US gov secret and US gov top-secret.

Inside a partition, there can be some regions that are opt-in - see https://docs.aws.amazon.com/general/latest/gr/rande-manage.h...

My understanding is that opt-in regions are even more isolated inside a specific partition for partition-global services like IAM and maybe some other stuff.


There is a reason why GCP and Azure have had many more global outages than AWS. Fault isolation always entails some level of inconvenience.


> Unlike GCP and Azure, all AWS regions are (were) partitioned by design. This "blast radius" is (was) fantastic for resilience, security, and data sovereignty. It is (was) incredibly easy to be compliant in AWS, not to mention the ruggedness benefits.

Could you elaborate on this a little? We use AWS, but are evaluating OCI for certain (very specific) cases, and I'll love to know what questions to ask for comparison purposes.


You likely won't get anywhere asking Oracle questions, their sales is very good at (not) answering.

Here is how partitioned/isolated OCI is by design:

https://www.wiz.io/blog/attachme-oracle-cloud-vulnerability-...

While that's fixed, it speaks volumes to the architecture. Very little has changed since 2018: https://www.brightworkresearch.com/how-to-understand-the-pro...

As noted there, I'd argue OCI is more akin to Softlayer/Bluemix than to GCP, Azure, or AWS, but depending on your certain very specific cases OCI may still be appropriate.


Cross-region extensibility points are few and far between. See https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso... for more details.


Yep, it's run by the Telekom entirely IIRC from my time back at MSFT. Microsoft "just deploys" Azure on it.


this: compliance plus lack of energy for new datacenter capacity. source: colleague who works at msft. they have a true crisis there and it will get worse.


We have had this issue in 2018 and ever since: https://www.opencore.com/blog/2018/6/cloud-has-a-limit/

That said: We also had this issue on GCP last month.

We found that all three (AWS included) are unreliable in their own ways.


I’m sure Microsoft is just as surprised as you are. Almost every European facility I ever worked with was constrained by either space or power so you had to be really on top of your capacity management. Facilities in the US seem to have unlimited power and floor space so you never have to deal with either issue.


Who else has heard countless times something like "with company X's cloud platform you don't need to file a ticket and wait weeks for another team to provision a physical server, just spin some more up bro." The reality is you do, you've just outsourced the problem.


EC2 us-east-1 is chronically stocked out, too. Black Friday is the worst day of the year for this. At work, we pre-allocated tons of EC2 machines we don't really need, to hedge against EC2 stockout coinciding with some kind of incident. Yes, we are part of the problem.


In a former role, I used EC2 in us-east-1 to host the front door e-commerce site for a consumer electronics company. AWS suggested that we go through the Infrastructure Event Management process (https://aws.amazon.com/premiumsupport/programs/iem/) for Black Friday and Cyber Monday, so that staff on Amazon's side could guarantee that they'd have capacity to run our system at its forecasted peak.

The strategy they helped us arrive at was two-pronged:

1. Pre-launch all needed infrastructure. Yes, for all their "cloud scale", it was actually suggested that we preallocate all of our servers the week before, rather than rely on autoscaling.

2. Order capacity reservations for all of those instances (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capa...). This ensures that, if any of those instances go bad, we'd be able to relaunch them without going to the back of the line and finding out that there was no more compute capacity available.
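For reference, the reservation part is plain On-Demand Capacity Reservations; something along these lines (all values are placeholders):

  aws ec2 create-capacity-reservation \
    --instance-type c5.2xlarge \
    --instance-platform Linux/UNIX \
    --availability-zone us-east-1a \
    --instance-count 40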


Part of what problem? I don't remember us-east-1 ever running out of instances


This is a bit tangential, but now might be a good time to experiment with raising the price of your product. It might extend the time you have until you have to stop accepting new users entirely, in case your migration is taking longer than needed.


People saying "shame on you for not being multi-region" are missing the point: This is a German company with German customers subject to German data residency laws. For them to store German data in a region besides Germany requires getting informed consent from the "data subject", who must be "pre-informed about the potential risks involved in cross-border data transfer". [1] This is why Azure has a dedicated German partition, just as it has a dedicated Chinese partition.

Now, they could go the GDPR/Cookies route and prompt absolutely every user on pageload, but doing so would annihilate the purpose of the law into monotonous smithereens, just as it did with Cookies. Good on them for defaulting to the "more secure" mode, but yes this is a potential consequence.

Happy to hear from any German amigos present if I've got something wrong. (But watch out... you might be putting HN at risk - their servers aren't (likely) in Germany!)

[1]: https://incountry.com/blog/which-german-data-privacy-laws-yo...


This is a good point, but it's a reminder that a lot of these privacy laws are impractical to deal with. When they're universal, it's one thing, but if you're a medium-sized country trying to flex your legislative might, you're going to make the experience worse for your citizens and businesses.


No reason why Azure can't run multiple "regions" within Germany proper. Their current region is in Frankfurt; no reason why they couldn't launch regions in Munich or Hamburg as well. Then German companies could go multi-region while staying fully compliant with German data sovereignty laws.


There are already two Azure regions in Germany: Germany North and Germany West Central.


Really? "No reason"? You can't think of one rea$on?


Also note that the German Azure is in transition - I am not 100% sure the new version of Azure unique to Germany - held by a specific data trustee - is set up just yet...

https://learn.microsoft.com/en-us/previous-versions/azure/ge...


Time to sign up for AWS or GCP, then. If you're using kubernetes anyway, you'll be fine with the switch.


I’ve had capacity issues before with both gcp and aws in smaller regions so not a panacea


Says someone who has never done a large migration of any type…


You should be good to go except for debugging, accounts, billing, monitoring, habits, documentation, security evaluation, …


Ran into a similar issue last year in the East US region. We contacted support and they gave a similar response. From my understanding, talking to people who use AWS and GCP, this isn't uncommon across cloud platforms.

While we could've just swapped a deployment parameter to deploy to another region, we opted to just use a different SKU of VMs for a short period and switch back to the VMs when they were available again.

We haven't seen issues since.


Yeah, AWS tends to have capacity issues during high-volume periods like Black Friday (I think this is now actually because most large users pre-reserve a buffer pool of VMs that sit unused) -- but I have never had an issue where AWS has told me there would be no capacity for months. It's usually swapping AZ or regions, or being slightly flexible on your SKU. And if you are sensitive to this and find it happening, take a look at your SKU loadout; you may be choosing a very high-demand VM type, and shifting just slightly gets way more capacity.

^^ And by capacity I am talking like 10s or 100s of VMs being available, not 1.


Get in touch with your CSAM. They will be able to get you assigned a capacity manager, if you don't already have one assigned.

It is the function of the capacity manager to help you plan ahead based on what the data center capacities look like going into the future.

Meet monthly with your capacity manager. Get representation across different technology interests - database, compute, storage, event hubs, etc. Don't ever skip these meetings.


> Get in touch with your CSAM

Well that's an unfortunate acronym collision.


I thought for sure it was a military term (recalling SAM missiles), until I saw it in the news just today.

GODDAMN.

Sidebar: MSFT is the king of acronym collisions.


Not much better than "meet with Infrastructure in Nov to plan next year's capacity and server purchases" for on-prem -- has Azure really degraded down to this?


It's quite a bit better than that, in fact. They talk to their customers to learn about all the big deployments coming, so they can tell whether there is going to be a crunch at the region/AZ level.

I'd be surprised if other cloud providers aren't doing that in some form. I only have experience with Azure (so far).


Wow.

It’s crazy that this could be valid advice, but it is.


Ask your VCs/angels for help, this is the kind of thing they can definitely help with.

(Speaking from experience - one of our portfolio companies had a similar challenge and we used our network to get to one of the execs of the vendor involved)


Thanks a lot. Yes that is also something we are trying in parallel.


One of the biggest benefits of k8s is that you can easily mix in pools of different hardware types without a “rebuild”.

Something to try in scenarios like this is to add the “weird and wonderful” VM SKUs that are less popular and may still have capacity remaining.

For example, the HPC series like HBv2 or HBv3. Also try Lsv3 or Lasv3.

Sure they’re a bit more expensive, but you only have to use them until April.
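On AKS that's just an extra node pool; a rough sketch (resource group, cluster and pool names are placeholders):

  az aks nodepool add \
    --resource-group my-rg \
    --cluster-name my-aks \
    --name overflow \
    --node-vm-size Standard_HB120rs_v3 \
    --node-count 2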


There is no such thing as unlimited when it comes to resources and/or scalability in the hosting market. You might want to find a local colocation provider and buy a few network switches and servers as a secondary production and backup environment for your startup. Deploying your own infrastructure gives you full control over your startup. Yes, it will raise your overhead, and yes, it's not cheap, but for a sustainable operation it's a requirement in my opinion. I currently use Azure, but I also have my own deployment with my own IP addresses and ASN, where I keep spare capacity and some important servers in case something happens with Azure. Definitely helps me sleep better at night.


I’ve heard from a friend who works at Microsoft that due to energy crisis in Europe plus their data locality laws, Microsoft is indeed running short on datacenter capacity there and can’t do anything about it no matter how much they are willing to spend.


Very interesting. As mentioned in another post, I am sure they are not trying to screw us or anybody else over. It is, after all, not in their interest. But not flagging it to users is something I do not get at all. I would expect at least a weekly warning email about it, plus a warning in the dashboard, but there was literally nothing.


As I’ve heard it, this actually affects much more than Azure - also all their cloud-based productivity suite.


Couldn't even put up their own solar panels?


lol, what?


While some may immediately run a comparison between Azure, AWS and GCP, let it be noted that any cloud platform facing this and making headlines over it is not good for the cloud industry overall.


The thing is: if cloud vendors struggle to get new machines, imagine your small company trying to order new on-prem servers and get them delivered quickly.

I worked for a company that ran mostly on-prem until a year ago, and the last time they ordered machines, availability from Dell was scarce, with huge delays.


I remember, in the early days of the pandemic, that Azure Australia ran out of compute too. It happens at the regional level.

Are you stuck only to the German region, and can't go to other European regions?


>We never thought our startup would be threatened by the unreliability of a company like Microsoft

Had you never heard about (and this is unfortunately not a joke) Microsoft’s music service they once had, shut down after a few short years leaving customers without the ability to listen to the music they had paid to listen to?

The service was called, this was the trademarked name, Microsoft “Plays for Sure.” You cannot make this stuff up.


That's also the name of the DRM system it had.


Azure, despite being smaller than AWS, I think has more regions. So each one must be smaller, which likely means less spare capacity.

I also sort of suspect the spot market is less robust there. Lots of Azure is lift and shift on premises workloads, and those aren't using spot. Without people using spot, it's even harder to have spare capacity...


Azure uses much smaller datacenters than AWS or GCP. Microsoft wasn't a big compute user before cloud, and it's a lot easier to manage and build for smaller DCs. Amazon and Google both needed huge DCs before being clouds.


Sad to hear that, but people have the wrong idea about the cloud: it's just other people's hardware, and like everything, there's a limit.

They cannot warn you because it's very hard to predict how many new customers will come or whether the existing ones will create more instances.

I know about a bank with the same issue: basically, they've hogged all the resources in a specific region and yet they need more. Unfortunately these things take time; MS cannot set up a new datacenter in a couple of days.

>but setting up in a new region would be complicated for us.

Why? it's easy: https://learn.microsoft.com/en-us/azure/azure-resource-manag...

Latency issues from app to DB?


We have the same issue and escalated it through multiple Azure teams.

Our quota was silently set to 0 while there were still instances running. This worked fine until auto-scale scaled the instances down to 1 during the night. At the start of the day, auto-scale was not able to scale back up to the initial amount, which led to heavy performance issues and outages. We needed to move the instances, as Azure support did not help us. After many calls with Azure and multiple teams involved, we ultimately did not get the quota approved (even though we already had it and were not asking for "new" quota).

We also decided we can no longer host in the German Azure region. Even if we could get the quota, not being able to scale for unexpected traffic is a business risk we don't want to bear anymore.

This is huge for us, as our application requires German servers. We are still researching where to host in the future.
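For anyone in the same spot: you can at least inspect what your effective per-family quota and current usage look like in a given region, e.g.:

  az vm list-usage --location germanywestcentral --output table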


Interestingly, you can get instances in dev/test subscriptions without any trouble.


> but setting up in a new region would be complicated for us.

I've never done K8 on Azure, but my understanding is that Azure is pretty good about coordinating between your own datacenter running Windows and Azure. Maybe you can spin up some Windows boxes in a cheap datacenter to make it work?


Hetzner has a German presence I believe, and would work for running k8 on bare metal for n8n to burst to temporarily for running their orchestration and/or workflow runners. Might even be cheaper in the long run versus a cloud provider. Just gotta wire up the helm charts, containers, and whatever message bus is pushing their messages around. Can write to blob storage from anywhere if that’s a component of the app.


> Hetzner has a German presence I believe

I sure hope so, as a German company


Just don't trust the marketing and save yourself a lot of money. On-prem for all base or long-term (6+ month) resources, cloud only for peaks, and never use features that tie you to a single cloud provider. Then you will never have such troubles at all.


"There's no such thing as cloud - it is always someone's else computer". Although we may try to rely on the unwillingness of the cloud provider to lose revenue, probability of events like this can never be fully discounted.


Yes, my company found this out trying to add both a database and a serverless app to our existing infrastructure in Germany West Central in July. They had no ETA for more GWC capacity back then and told us to move to the North and West Europe regions.


My understanding is that the German region is not run by Microsoft, but a German company. This provides a legal shield required by Germany to try and prevent the US government from accessing data on those servers.


We've been having this problem in Singapore for a couple of years now. Can't add any VMs to our k8s cluster and can't provision a number of services which made our multi-region BCP more complicated.


Years?!?! Guess I then have to be happy that in our case it is "just" around 4 months.


>These problems seem only in the German region, but setting up in a new region would be complicated for us.

This seems like your fundamental problem. If you design an architecture that is limited to a single region of a single cloud provider, you are very likely to encounter issues at some point.

Luckily you have a full month to solve this problem before it will prevent you from accepting new users. My suggestion is to start making your app multi-regional or multi-provider ASAP.


The Batch Service schedule history monitor sucks. It is inaccurate and doesn't sync the job order correctly. You can call them, they will get on the phone and then say they fixed it. Then you call them again because they didn't and they give you the same answer. Can't blame them, most of them are on H1B's. Nobody wants to be the squeaky wheel in that position. So you will just get the runaround all the time.


I worked at a top-15 Azure customer. This is not unusual at all, especially in the newer regions. Talk to your TAM before you attempt major capacity changes in a region. They may have advice on specific SKUs to use or which zones have capacity (e.g. when australiaeast was being built, 80%+ of the capacity was in one zone for many months).

If you aren't a big spender you may not have a TAM who can get this info for you. Welcome to Azure.


Perhaps it’s a per customer limit to ration capacity? If so maybe you can legitimately work around it by creating multiple Azure billing accounts.


Could be possible. But as far as I know, two accounts in the same data center would not work for us for technical reasons.


Is this related to the hardware shortage during the pandemic? I'm assuming they couldn't scale at the rate they intended pre-pandemic...

This seems like a much larger issue than they're making it seem. The promise of the cloud was unlimited scalability. I never thought of cloud resources as finite.


While you're at it, making your "infrastructure as code" cloud-agnostic, perhaps take a look at tools like Terraform (the only one I'm familiar with). I've just started the work of defining whatever we need to provision in its notation, with the objective that it can be done with a single push of a button in the future.


There is nothing “cloud agnostic” about using Terraform. Anyone who says this has no experience actually trying to implement it.

Terraform has different providers for each cloud provider and the code is not transferable any more than saying if you use Python to script your infrastructure it will be transferable.


Agreed. I've advised people the same before. You can build to be Kubernetes cluster-agnostic (mostly), but the stuff that gets you to that point will be very cloud-specific.

The reason for Terraform, and it's a good one, is that your Terraform-related tooling doesn't have to change (e.g. if you route all your infra change approvals through Terraform Cloud), and you can coordinate multi-service changes, e.g. update Auth0 infra to do X, then AWS to do Y.


It has actually been done that way. For technical reasons, a move to a new data center is sadly very complicated and time-consuming even with that.


In Norway East, Azure was incapable of provisioning new VMs for several (4-5) days, caused by some IP issue. The only solution was "try to provision in the night, and don't turn it off if you get one". Their status page showed green through the whole period, even though nothing needing compute worked. So that was cool...


I don't have much knowledge about Azure, but is it possible to add different instance types and/or sizes? E.g. in the EC2 world, if AWS was out of m5.xlarge I would try to add a worker group with m6i.xlarge or m4.xlarge. If that did not work I might try to replace my xlarge with 2xlarge...


Learn to catch the errors before you deploy to production.

https://www.ernestech.com/product-details/146/docker-contain...


Stockouts have happened on both AWS and GCP too. Most of the time the problem is no longer a problem if you build your infrastructure not to rely on a single region or availability zone. On EC2 especially, even if you can't change to a different region, try changing to a new instance type and that might work.


Sort-of. I have a Postgres flexible database in the West Germany Central region that can no longer be scaled. It was only created for testing purposes, so no biggie. The backend is basically a managed Compute resource.

If you need more reliability, I see only one way out: Go multi-region or even multi-cloud.


>We never thought our startup would be threatened by the unreliability

Daily reminder that cloud services are vastly less reliable than traditional hosting; it’s just that they manipulate the terminology to deflect that, replacing reliability with availability, aka “making impression of working”.


Infinite scaling clouds, they said. In AWS at work we spin up large numbers of EMR nodes and every few days get stuck waiting for availability of certain instance types in our region too. I guess we could reserve more, but that defeats a lot of scale up and down advantages.


Serverless runs out of servers.


Worth repeating again: AWS, Azure and GCP are all adding capacity and new datacenters as fast as they can. We have enough demand to drive the next two generations of leading-edge nodes, that is TSMC N3 and N2. And I assume it will be similar at N1.4 or 14A.


They basically have far too many small regions and are growing like crazy; multi-region deployments will be a must, unfortunately.

Maybe you can spin up the parts of the infrastructure that are not latency-sensitive in a nearby region?


I think there is a general rule in business that you should not depend on a provider for whom losing your business would mean less than one percent of their revenue. Or be ready for when they drop the guillotine.


For anyone getting started, that means no dependencies at all. Even colocating would be out of the question, according to your metric.


It depends on how long it would take you to find another colocating company. If there is another co-location on the other side of the street you could simply take your computer there - then there is no dependency.


In the olden days you used to buy computers from Dell and were well under 1% of their revenue. But if they dropped you as a customer, you bought them from HP instead, no problem.


Use Azure and AWS so that you're not dependent on either one.

(You could depend on another startup with no revenue).


Infinite resources are only marketing, and no hyperscaler on the market should ever promise that or give people that impression unless they have solved scaling all the way through the entire supply chain.


I am so glad we made the decision to pull https://Bigger.Bio off azure a while ago. It was nothing but problems on their platform.


Reading these comments it looks like everyone runs into this all the time. As a counterpoint: never run into this on Azure, scaling up/down 20-30 vm's a day. Hope it stays that way...


As part of launching our global GPU edge network, we need to support low-volume regions, which means a small number of T4 GPUs in different timezones. Azure ran out last Christmas, or at least refused us capacity, and is only adding the next tier of A10s (~2x+ costlier?). We haven't had as much of a problem getting GPUs of different grades on GCP + AWS. I get a form email every 2 weeks from Azure IT saying they are working on it. Not as much of an issue for bigger GPUs.

(Also... If into k8s, python, GPUs, graphs, viz, MLOps, working with sec/fraud/supplychain/gov/etc customers on cool deploys, and looking for a remote job, we are hiring for someone to take ownership here!)


Why not just create a bigger DB instance in another region for a few months? Sure, you'll take a performance hit, but 99% of users won't notice or care.


Ah yes, that is what we did in the end for the database. But that is not our main issue; rather, we do not get any more instances for our k8s cluster, and those we sadly cannot just spin up somewhere else.


This is due to the energy crisis in Europe caused by the war.


Maybe they are doing this to push people into regions with lower energy costs. Of course Northern Virginia or Canada is going to give you much higher ping times.


I honestly do not think there is any bad intent behind it. I am just surprised that this is happening at all (especially with a resolution time of multiple months). They must have known for a long time that this would happen, so I would have expected an early heads-up!


Interesting thought. It would be crazy if turning down business was preferable to just raising prices to reflect increased energy costs. I'm not a cloud expert, but maybe they don't have the infra to price differently in some regions?


That’s the problem with charging average costs (assuming they do that): the cost of a new user is at the margin, which can be muuuuch higher.


Azure definitely has the ability to charge differently per region. They do it pretty frequently.


Why wouldn't they just price higher in those regions? People want/need regions for policy and compliance reasons, not just for ping, particularly with Europe and Germany I'd expect.


And potential data residency issues


M$ just don't want your money. We have experienced this problem many times in the Ireland and German regions. Never experienced it with Hetzner or AWS.


Any optimisations you can make? Will have the advantage of saving you money across all platforms/regions


Yes, some are possible and we are already doing that. Sadly, it will only delay the point at which we run out of resources. If we were talking about a few weeks, we could for sure make it, but over 4 months is sadly not possible.


Damn! It's leaky abstractions again


Surprised to see no mention of T-Systems, the subsidiary of Deutsche Telekom, that operates Azure Germany.


Create a new nodepool/scaleset in another region (I think that should be possible).


Is there a secondary market for reselling Azure capacity? Can you bid against other Azure customers?


I'm having trouble getting a instance with GPU in east US, but that's always a problem.


Not the first time this has happened to Azure, they are always under-provisioned. Move to AWS


What sort of nodes are you using, can you add a node pool with a different SKU?


If this is a serious problem for your business, you use K8s and require assistance quickly moving your workloads, consider contacting:

https://www.giantswarm.io/

(I work at Giant Swarm.)


Looks like github is down right now. Or is it only me?


Big fan of n8n!


Thanks a lot! That is always great to hear!


If you want help duplicating your k8s cluster workload, hmu. I love K8s and love contract work. $45/hr. Good luck!


Same issue in France fyi


it's a european site during the world cup haha


And people roll their eyes when I say I’m dedicated to AWS.


[flagged]


Clowns are much more solid than clouds, which are famously density-light. Given their traditional proximity to solid ground, too, clowns are a much better choice of foundational substrate than a cloud to build on.


Clown-car cluster sounds like it'd be a good name for a compute product.


help


I’m having a tough time also with Microsoft.

They seem to ignore, then repent, then finally apologise. :(

I think you should switch to new compute. GCP??

When we were running our own compute back in '09 and resources ran out or were unreliable, we could shout at the server maintainer and/or install better hardware ourselves. Not the case anymore. :( :((

-Vip


Ha. I knew something like this would happen eventually. Isn't limitless scalability one of the biggest selling points of using "the cloud"? If you have to buy your own computers anyway why even use the cloud? You could try using different clouds providers but eventually the clouds run out.

Which brings me to another important point. If we run out of computers meaning supply can't keep up with demand, then who are the winners? The people who own the computers. Cloud providers and self hosters. Because of the high demand cloud providers can raise their prices and that's directly converted to profit since expenses remain the same, i.e. price gouging. Good job all you cloud loyalists who use the cloud for everything.



