We're having what appears to be a serious networking outage. It's disrupting everything, including unfortunately the tooling we usually use to communicate across the company about outages.
There are backup plans, of course, but I wanted to at least come here to say: you're not crazy, nothing is lost (to those concerns downthread), but there is serious packet loss at the least. You'll have to wait for someone actually involved in the incident to say more.
AWS tries to lock people into specific services now, which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable... and then if you try migrating, you're stuck using so many of their services you can't even do cost comparisons.
"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."
For the downvoters: if you disagree, please just link the proof here.
Here are the S3 numbers: https://aws.amazon.com/s3/sla/
Although I guess, depending on how your own infrastructure is set up, even a multi-cloud-provider setup won't save you from a network outage like the current Google Cloud one.
Challenge Accepted... and defeated: https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...
but to be fair, storage is core to Dropbox's business... this is not true for most companies.
disclaimer: I work for Dropbox, though not on Magic Pocket.
> Here are the S3 numbers: https://aws.amazon.com/s3/sla/
There doesn't seem to be an SLA on S3-cross-region-replication configurations, but I am not aware of a multi-region S3 (read) outage, ever.
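For anyone curious what that kind of setup involves, here's a rough sketch of turning on cross-region replication with boto3 (bucket names and the IAM role ARN are made up, and versioning has to be enabled on both buckets first):

    import boto3

    s3 = boto3.client("s3")

    # Replicate every object in the source bucket to a bucket in another region.
    s3.put_bucket_replication(
        Bucket="my-primary-bucket",  # hypothetical source bucket
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication",  # hypothetical role
            "Rules": [{
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix = all objects
                "Destination": {"Bucket": "arn:aws:s3:::my-replica-bucket-us-west-2"},
            }],
        },
    )

Note that replication only copies the data; failing reads over to the replica bucket during an outage is still your application's job, which is presumably part of why there's no cross-region SLA.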
99.99% is for "Read Access-Geo Redundant Storage (RA-GRS)"
Their equivalent SLA is the same (99.9% for "Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.").
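For a sense of what those percentages actually buy you, here's the straightforward conversion into allowed downtime per 30-day month (plain arithmetic, not from either provider's docs):

    # Minutes of downtime allowed per 30-day month at a given availability SLA.
    minutes_per_month = 30 * 24 * 60  # 43,200

    for sla in (0.999, 0.9999):
        allowed = minutes_per_month * (1 - sla)
        print(f"{sla:.2%} availability -> ~{allowed:.0f} minutes/month of downtime")

    # 99.90% availability -> ~43 minutes/month of downtime
    # 99.99% availability -> ~4 minutes/month of downtime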
This is a pretty neat and concise read on object storage in use at BigTech, in case you're interested: https://maisonbisson.com/post/object-storage-prior-art-and-l...
16 9's and AWS should easily last as long as the Great Pyramids without a second's worth of outage.
What a joke
It's about losing entire data centers to massive natural disasters once in a century.
There's perhaps the additional asterisk of "and we haven't suffered a catastrophic event that entirely puts us out of business". (Which is maybe only terrorist attacks). Because then you're talking about losing data only when cosmic-ray bitflips happen simultaneously in data centers on different continents, which I'd expect doesn't happen too often.
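To put those nines in perspective, a back-of-envelope calculation, assuming "N nines of durability" means an independent 10^-N annual loss probability per object (a simplification, but fine for scale):

    # Expected objects lost per year at a given number of durability "nines".
    objects = 1_000_000_000  # one billion objects, as an example

    for nines in (11, 16):
        expected_losses = objects * 10 ** -nines
        print(f"{nines} nines: ~{expected_losses:g} objects lost per year out of {objects:,}")

    # 11 nines: ~0.01 objects lost per year out of 1,000,000,000
    # 16 nines: ~1e-07 objects lost per year out of 1,000,000,000

Durability isn't availability, though, and availability is what today's outage is about.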
I'm sure that's okay if you do bulk processing / time-independent analysis, but don't host production assets on Wasabi.
At least in the EU, services bought from overseas are subject to reverse charge, i.e. self-assessment of VAT (Article 196 of https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02... ).
Though note that if you are an EU AWS customer, you are not buying from outside the EU; you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.
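As a concrete (made-up) example of how the reverse charge works out for a business with full deduction rights:

    # Reverse charge: the customer self-assesses VAT instead of the supplier charging it.
    net_invoice = 1000.00  # hypothetical AWS invoice in EUR, no VAT charged by the supplier
    vat_rate = 0.21        # hypothetical local VAT rate

    output_vat = net_invoice * vat_rate  # VAT you declare as due
    input_vat = output_vat               # same amount deducted, assuming full deductibility

    print(f"self-assessed VAT: {output_vat:.2f}, net cash effect: {output_vat - input_vat:.2f}")
    # self-assessed VAT: 210.00, net cash effect: 0.00

So for a fully deductible business it typically nets out to zero; the reporting obligation is the real cost.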
Since AWS built a DC in Canada, I’m paying HST on my Route53 expenses, but not on my S3 charges in non-Canadian DCs.
I’m not an HST registrant (small supplier, or if you’re just using services personally), so there’s nothing to self-assess.
Even if self-assessment was required, you get some deferral on paying (unless you have to remit at time of invoice?).
I believe it works differently in the EU (i.e. US DCs are taxed): per Article 44, the place of supply of services is the customer's country if the customer has no establishment in the supplier's country.
IBM/Softlayer, Rackspace, Google Cloud, Microsoft, and I imagine everyone else large enough to count do, too.
For Australian businesses, at least, being charged GST isn't a problem - they can claim it as an input and get a tax credit.
Even if you did have to self-assess, better to pay later than right away.
I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.
(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)
Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.
And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.
"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."
There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.
Your quote could mean two things (see the endpoint check sketched below):
- that IAM services are hosted in one region (not one AZ)
- that IAM is for the entire account not per region like other services (which is true)
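One quick way to see the "no Region" part for yourself, sketched with boto3 (the printed endpoints are what I'd expect, not captured from a live session):

    import boto3

    # IAM uses a single global endpoint; most other services get a per-region one.
    iam = boto3.client("iam", region_name="us-west-1")
    ec2 = boto3.client("ec2", region_name="us-west-1")

    print(iam.meta.endpoint_url)  # https://iam.amazonaws.com  (no region in the hostname)
    print(ec2.meta.endpoint_url)  # https://ec2.us-west-1.amazonaws.com

Where that global IAM control plane physically lives is a separate question, which is exactly the ambiguity the two readings above get at.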
(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)
There's some irony in that.
I’m not in SRE so I don’t bother with all the backup modes (direct IRC channel, phone lines, “pagers” with backup numbers). I don’t think the networking SRE folks are as impacted in their direct communication, but they are (obviously) not able to get the word out as easily.
Still, it seems reasonable to me to use tooling for most outages that relies on “the network is fine overall”, to optimize for the common case.
Note: the status dashboard now correctly highlights (Edit: with a banner at the top) that multiple things are impacted because of Networking. The Networking outage is the root cause.
this column of green checkmarks begs to differ: https://i.imgur.com/2TPD9e9.png
Not long after that incident, they migrated it to something that couldn't be affected by any outage. I imagine Google will probably do the same thing after this :)
Reminds me of when I was working with a telecoms company. It was a large multinational company and the second largest network in the country I was in at the time.
I was surprised when I noticed all the senior execs were carrying two phones, of which the second was a mobile number on the main competitor (i.e. the largest network). After a while, I realised that it made sense: when the shit really hit the fan, they could still be reached even when our network had a total outage.
Like the black box on an airplane, if it has 100% uptime why don’t they build the whole thing out of that? ;)
So memegen is down?
If there are no locks that work this way, it sure seems like there should be. Using cloud services to enable cool features is great. But if those services aren't designed from the beginning with a fallback for when the internet/cloud isn't live, that's a weakness that is often unwise to leave in place, imo.
This does mean you need to set up a code in advance of people showing up, but it's an under-30-second setup that I've found simpler than unlocking once someone shows up. The cameras dropping offline are a hot mess, though, since those have no local storage option.
If the cloud is down, revocations aren't going to happen instantly anyway. (Although you might be able to hack up a local WiFi or Bluetooth fallback.)
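A rough sketch of what that local-first fallback could look like, assuming the lock keeps a cached copy of valid codes and treats the cloud purely as a sync source (everything here is hypothetical):

    import json, time

    CACHE_FILE = "/var/lib/lock/codes.json"  # hypothetical local cache on the lock

    def load_cached_codes():
        # Codes synced down from the cloud whenever connectivity is available,
        # e.g. {"4821": {"expires": 1760000000}}.
        with open(CACHE_FILE) as f:
            return json.load(f)

    def is_valid(code):
        entry = load_cached_codes().get(code)
        return entry is not None and entry["expires"] > time.time()

    def try_sync(fetch_from_cloud):
        # Best effort: refresh the cache when the cloud is reachable, otherwise
        # keep whatever we had. Revocations take effect at the next successful sync.
        try:
            with open(CACHE_FILE, "w") as f:
                json.dump(fetch_from_cloud(), f)
        except OSError:
            pass  # network or file error -- stay on the cached copy

The trade-off is explicit here: a revoked code stays valid until the next successful sync, which is no worse than a cloud-only lock that simply stops working.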
It's a fake trade-off, because you're choosing between a low-tech solution and bad engineering. IoT would work better if you made the "I" part stand for "Intranet", and kept the whole thing a product instead of a service. Alas, this wouldn't support user exploitation.
It's also my Plex media server, file server, and VPN, and I run some containers on there. I used to use it as a print server, but my new printer is wireless so I never bothered.
There’s a fine line or at least some subtlety here though. This leads to some interesting conversations when people notice how hard I push back against NIH. You don’t have to be the author to understand and be able to fiddle with tool internals. In a pinch you can tinker with things you run yourself.
There are also advantages to being part of the herd.
When you are hosted at some non-cloud data center, and they have a problem that takes them offline, your customers notice.
When you are hosted at a giant cloud provider, and they have a problem that takes them offline, your customers might not even notice because your business is just one of dozens of businesses and services they use that aren't working for them.
Who are you getting this steal of a deal from?
Cloud costs roughly 4x as much as bare metal for sustained usage (of my workload). Even with the heavy discounts we get for being a large customer, it's still much more expensive. But I guess op-ex > cap-ex.
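Back-of-envelope for the kind of gap being described (numbers are invented to illustrate a 4x ratio, not anyone's actual bill):

    # Monthly cost of a steadily loaded fleet, hypothetical numbers.
    cloud_per_node = 400   # discounted cloud instance, USD/month
    metal_per_node = 100   # amortized bare-metal node incl. colo and ops, USD/month
    nodes = 50

    cloud_total = cloud_per_node * nodes
    metal_total = metal_per_node * nodes
    print(f"cloud: ${cloud_total:,}/mo, bare metal: ${metal_total:,}/mo, "
          f"difference: ${cloud_total - metal_total:,}/mo")
    # cloud: $20,000/mo, bare metal: $5,000/mo, difference: $15,000/mo

Whether that difference pays for the extra engineering and the cap-ex risk is the actual argument; the arithmetic itself isn't subtle.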
I've never seen any of the providers listed offer "tons of ram" (unless we consider hundreds / low thousands of megabytes to be "tons") at that price point.
If you don't like that, you can order a KVM VM with dedicated cores at similar prices and the problem is no longer yours.
You want velocity for your dev team? You get that. You want better uptime? Your expectations are gonna have a bad time. No need for rapid dev or bursty workloads? You’re lighting money on fire.
Disclaimer: I get paid to move clients to or from the cloud, everyone’s money is green. Opinion above is my own.
With on-prem solutions, you can at least access the physical servers and get your data out to carry on with your day while the infrastructure gets fixed.
You can run your own hardware and pull in multiple power lines without establishing your own country.
I’ve run my own hardware; maybe people have genuinely forgotten what it’s like. Granted, it takes preparation and planning, and it’s harder than clicking “go” in a dashboard. But it’s not the same as establishing a country, sourcing your own fuel, and feeding an army. This is absurd.
Fun related fact: my first employer's main office was in a former electronics factory in downtown Moscow, powered by two thermal power stations (and no other alternatives), which had the exact same maintenance schedule.
I didn’t know the cloud-to-butt translator worked on comments too. I forgot that was even a thing.
And then “your data is in my butt” was just a play on that.
But yeah, it's still a thing, and the message behind it isn't any less current.
I made an IoT setup using cheap parts (Arduino, nRF24L01+, sensors/actuators) for local device telemetry, with MQTT, Node-RED, and Tor for connecting clouds of endpoints that aren't local.
Long story short, it's an IoT that is secure, consisting of a cloud of devices only you own.
Oh yeah, and GPL3 to boot.
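For the local-telemetry half of that, a minimal sketch with paho-mqtt (broker address and topic are placeholders; the Tor side is out of scope here):

    import paho.mqtt.client as mqtt

    # Publish a sensor reading to a broker on the local network -- nothing leaves the LAN.
    client = mqtt.Client()  # paho-mqtt 1.x style; 2.x also wants a callback API version argument
    client.connect("192.168.1.10", 1883)  # hypothetical local Mosquitto broker
    client.publish("home/livingroom/temperature", "21.5")
    client.disconnect()

Node-RED (or anything else) can then subscribe to those topics locally, with Tor only in the path when you deliberately bridge to endpoints outside the LAN.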
Edit: ah, looks like the LB is sending LA traffic to Oregon.
Can confirm with Gmail in Europe. Everything works but it's sluggish (i.e. no immediate reaction on button clicks).
Sounds like Google and Amazon are hiring way too many optimists. I kinda blame the war on QA for part of this, but damn that’s some Pollyanna bullshit.
Shouldn't that outage system be aware when service heartbeats stop?
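What's being suggested is basically a dead man's switch: treat the absence of heartbeats as a signal in itself. A minimal sketch of the idea (entirely hypothetical, not how Google's dashboard works):

    import time

    HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before we assume trouble
    last_heartbeat = {}     # service name -> timestamp of the last heartbeat seen

    def record_heartbeat(service):
        last_heartbeat[service] = time.time()

    def silent_services(now=None):
        # Services we haven't heard from recently -- possibly down, possibly just
        # unable to reach us, which is exactly the ambiguity a network outage creates.
        now = now or time.time()
        return [s for s, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT]

The catch, as this thread shows, is that the monitoring path can share the same broken network as the services it's watching.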
Could this be a solar flare?