Hacker News new | past | comments | ask | show | jobs | submit login

It seems global to me. This is really strange compared to AWS. I don't remember an outage there (other than s3) impacting instances or networking globally.

You obviously don't recall the early years of AWS. Half of internet would go down for hours.

Back when S3 failures would take town Reddit, parts of Twitter .. Netflix survived because they had additional availability zones. I can remember the bigger names started moving more stuff to their own data centers.

AWS tries to lock people in to specific services now which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable .. and then if you trying migrating, you're stuck using so many of their services you can't even do cost comparisons.

Netflix actually added the additional AZs because of a prior outage that did take them down.

"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."


We went multi-region as a result of the 2012 inc. source: I now manage the team responsible for performing regional evacuations (shifting traffic and scaling the savior regions).

That sounds fascinating! How often does your team have to leap into action?

We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.

Do you do any chaos testing? Seems like it would slot right in, there.

I'd say yes. I heard about this tool just a week ago at a developer conference.


Netflix was a pioneer of chaos testing, right? https://en.m.wikipedia.org/wiki/Chaos_engineering

they have invented the term, so probably yes :)

I think some Google engineers published a free Meap book on service relatability and uptime guarantees. Seemingly counterintuitive, scheduling downtime, without other teams’ prior knowledge, encourages teams to handle outages properly and reduce single points of failure, among other things.

Service Reliability Engineering is on OReilly press. It's a good book. Up there with ZeroMQ and Data Intensive Applications as maybe the best three books from OReilly in the past ten years.

Derp, Site Reliability Engineering.


I think you’re misremembering about Twitter, which still doesn’t use AWS except for data analytics and cold storage last I heard (2 months ago).

Avatars were hosted on S3 for a long time, IIRC.

I am not sure if a single S3 outage pushed any big names into their own "datacenter". S3 has still the world record of reliability that you cannot challenge with your inhouse solutions. You can prove it otherwise. I would love to hear a solution that has the same durability, avabiality and scalability as S3.

For the downvoters, please just link here the proof if you disagree.

Here are the S3 numbers: https://aws.amazon.com/s3/sla/

It's not so much AWS vs. in-house. But AWS (or GCP/DO/etc.) vs. multi/hybrid solutions. The latter of which would presumably have lower downtime.

I don't see why multi/hybrid would have lower downtime. All cloud providers as far as I know, though I know mostly of AWS, already have their services in multiple data-centers and their endpoints in multiple regions. So if you make yourself use more then one of their AZs and Region, you would be just as multi as with your own data center.

Using a single cloud provider with a multiple region setup won't protect you from some issues in their networking infrastructure, as the subject of this thread supposedly shows.

Although I guess depending on how your own infrastructure is setup, even a multi cloud provider setup won't save you from a network outage like the current Google cloud one.

Hum, I'm not an expert on Google cloud, but for AWS, regions are completely independent and run their own networking infrastructure. So if you really wanted to tolerate a region infrastructure failure, you could design your app to fail over to another region. There shouldn't be any single point of failure between the regions, at least as far as I know.

Why would you think that self-managed has lower downtime than AWS using multiple datacenters/regions?

Actually, I imagine that if you could go multi-regional then your self-managed solution may be directly competitive in terms of uptime. The idea that in-house can't be multi-regional is a bit old fashioned in 2019.

For several reasons, most notably: staff, build quality, standards, knowledge of building extremely reliable datacenters. Most of the people who are the most knowledgeable about datacenters also happen to be working for cloud vendors. On the top of that: software. Writing reliable software at scale is a challenge.

Multi/hybrid means you use both self managed and AWS datacenters.

Cannot challenge with your own inhouse solutions, you say?

Challenge Accepted... and defeated: https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...

but to be fair, storage is core to Dropbox's business... this is not true for most companies.

disclaimer: I work for Dropbox, though not on Magic Pocket.

> For the downvoters, please just link here the proof if you disagree.

> Here are the S3 numbers: https://aws.amazon.com/s3/sla/




>> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

> 99.9%


There doesn't seem to be an SLA on S3-cross-region-replication configurations, but I am not aware of a multi-region S3 (read) outage, ever.

> https://azure.microsoft.com/en-au/support/legal/sla/storage/....

> 99.99%

99.99% is for "Read Access-Geo Redundant Storage (RA-GRS)"

Their equivalent SLA is the same (99.9% for "Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.").

Azure is a cloud solution. The thread is about how a random datacenter with a random solution is better than S3.

Wow, he’s comparing the storages SLA of the two biggest cloud services in the world. Pedantic behavior should hurt.

> For the downvoters, please just link here the proof if you disagree.


How can they possibly guarantee eleven nines? Considering I’ve never heard of this company and they offer such crazy-sounding improvements over the big three, it feels like there should be a catch.

11 9s isn't uncommon. AWS S3 does 11 9s (upto 16 9s with cross region replication?) for data durability, too. AFAIK, AWS published papers about their use of formal methods to ascertain bugs from other parts of the system didn't creep in to affect durability/availability guarantees: https://blog.acolyer.org/2014/11/24/use-of-formal-methods-at...

This is a pretty neat and concise read on ObjectStorage in-use at BigTech, in case you're interested: https://maisonbisson.com/post/object-storage-prior-art-and-l...

You have to be kidding me. 14 9's is already microseconds a year. Surely below anybody's error bar for whether a service is down or not.

16 9's and aws should easily last as long as the great pyramids without a second worth of outage.

What a joke

The 16 9's are for durability, not availability. AWS is not saying S3 will never go down; they're saying it will rarely lose your data.

This number is still total bullshit. They could lose a few kb and be above that for centuries

It's not about losing a few kb here and there.

It's about losing entire data centers to massive natural disasters once in a century.

None of the big cloud providers have unrecoverably lost hosted data yet, despite sorting vast volumes, so this doesn't seem BS to me.

AWS lost data in Australia a few years ago due to a power outage I believe.

on EBS, not on S3. EBS has much lower durability guarantees

Not losing any data yet doesn't give justification for such absurd numbers

Those numbers probably aren't as absurd as you think. 16 9s is, I think 10 bytes lost per exabyte-year of data storage.

There's perhaps the additional asterisk of "and we haven't suffered a catastrophic event that entirely puts us out of business". (Which is maybe only terrorist attacks). Because then you're talking about losing data only when cosmic-ray bitflips happen simultaneously in data centers on different continents, which I'd expect doesn't happen too often.

This is for data loss. 11 9s is like 1 byte lost per terabyte-year or something, which isn't an unreasonable number.

This is why I linked the SLA page which you obviously have not read. There are different numbers for durability and availability.

For data durability? I believe some AWS offerings also have an SLA of eleven 9's of data durability.

11 9s of durability, barely two 9s of availability

I'm sure that's okay if you do bulk processing / time-independent analysis, but don't host production assets on wasabi.

I was asking numbers of reliability, durability and availability for a service like S3. What does wasabi has to do with that?

Always in Virginia, because US-east has always been cheaper.

I know a consultant who calls that region us-tirefire-1.

I and some previous coworkers call it the YOLO region.

The only regions that are more expensive than us-east-1 in the States are GovCloud and us-west-1 (Bay Area). Both us-west-2 (Oregon) and us-east-2 (Ohio) are priced the same as us-east-1.

I would probably go with US-EAST-2 just because it's isolated from anything except perhaps a freak Tornado and better situated on the eastern US. Latency to/from there should be near optimal for most eastern US/Canada population.

One caveat with us-east-2 is that it appears to get new features after us-east-1 and us-west-2. You can view the service support by region here: https://aws.amazon.com/about-aws/global-infrastructure/regio....

Fair point. It depends on what the project is.

And for those of us in GST/HST/VAT land, hosting in USA saves us some tax expenditures.


At least in EU services bought from overseas are subject to reverse charge, i.e. self-assessment of VAT (Article 196 of https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02... ).

Though note that if you are an EU AWS customer, you are not buying from outside EU, you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.

My experience is with Canadian HST.

Since AWS built a DC in Canada, I’m paying HST on my Route53 expenses, but not on my S3 charges in non-Canadian DCs.

I’m not an HST registrant (small supplier, or if you’re just using services personally), so there’s nothing to self-assess.

Even if self-assessment was required, you get some deferral on paying (unless you have to remit at time of invoice?).

Makes sense.

I believe it works differently in EU (i.e. US DCs taxed) as per Article 44 the place of supply of services is the customer's country if the customer has no establishment in the supplier's country.

AWS is registered for Australian GST - they therefore charge GST on all(ish) services[0].

IBM/Softlayer, Rackspace, Google Cloud, Microsoft and I imagine everyone else large enough to count also does, too.

For Australian businesses, at least, being charged GST isn't a problem - they can claim it as an input and get a tax credit[1].

[0] https://aws.amazon.com/tax-help/australia/

[1] https://www.ato.gov.au/Business/GST/Claiming-GST-credits/

You know, normally you still have to pay that tax - just through a reverse charge process

Not the case in Canada if you’re not an HST registrant (non-business or a small enough business where you’re exempt).

Even if you did have to self-assess, better to pay later than right away.

Mostly because those sites were never architected to work across multiple availability zones.

Years ago, when I was playing with AWS in a course on building cloud-hosted services, it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage, so while technically all our VMs hosted out of other AZs were extant, all our attempts to manage our VMs via the web UI or API were timing or erroring out.

I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.

(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)

> it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage

Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.

And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.

Quoting https://docs.aws.amazon.com/general/latest/gr/rande.html

"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."

There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.

The original poster said that none of AWS services are in a single AZ, the quote you referenced says that IAMs do not support regions.

Your quote cd mean two things.

- that IAM services are hosted in one region (not one AZ)


- that IAM is for the entire account not per region like other services (which is true)

Just this year an issue in us-east-1 caused the console to fail pretty globally.

Quite possibly, it has been a number of years at this point, and I didn't dig out the conversations about it for primary sourcing.

Where are you based? If you’re in the US (or route through the US) and trying to reach our APIs (like storage.googleapis.com), you’ll be having a hard time. Perhaps even if the service you’re trying to reach is say a VM in Mumbai.

I am in Brazil, with servers in southamerica. Right now it seems back to normal.

I have an instance in us-west-1 (Oregon) which is up, but an instance in us-west-2 (Los Angeles) which is down. Not sure if that means Oregon is unaffected though.

us-west-1 is Northern California (Bay area). us-west-2 is Oregon (Boardman).

Incorrect. GCE us-west1 is the Dalles, Oregon and us-west2 is Los Angeles.

What I said is correct for AWS. In retrospect I guess the context was a bit ambiguous.

(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)

EUW doesn't seem to be affected.

My instance in Belgium works fine

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact