The status page reports all green; however, the outage is affecting YouTube, Snapchat, and thousands of other services and their users.
We're having what appears to be a serious networking outage. It's disrupting everything, including unfortunately the tooling we usually use to communicate across the company about outages.
There are backup plans, of course, but I wanted to at least come here to say: you're not crazy, nothing is lost (to those concerns downthread), but there is serious packet loss at the least. You'll have to wait for someone actually involved in the incident to say more.
AWS tries to lock people into specific services now, which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable .. and then if you try migrating, you're stuck using so many of their services you can't even do cost comparisons.
"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."
For the downvoters, please just link here the proof if you disagree.
Here are the S3 numbers:
Although I guess depending on how your own infrastructure is set up, even a multi-cloud-provider setup won't save you from a network outage like the current Google Cloud one.
Challenge Accepted... and defeated: https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...
but to be fair, storage is core to Dropbox's business... this is not true for most companies.
disclaimer: I work for Dropbox, though not on Magic Pocket.
> Here are the S3 numbers: https://aws.amazon.com/s3/sla/
There doesn't seem to be an SLA on S3-cross-region-replication configurations, but I am not aware of a multi-region S3 (read) outage, ever.
99.99% is for "Read Access-Geo Redundant Storage (RA-GRS)"
Their equivalent SLA is the same (99.9% for "Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.").
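For a sense of what those tiers actually allow, here's a quick sketch of the downtime budget each implies (assuming a 30-day month; the real SLAs define their own measurement windows):

```python
# Rough downtime budgets implied by common storage SLA tiers.
# Assumes a 30-day (43,200-minute) month; actual SLAs define their own windows.

def downtime_budget_minutes(sla_percent, month_minutes=30 * 24 * 60):
    """Minutes of downtime per month before the SLA is breached."""
    return month_minutes * (1 - sla_percent / 100)

for tier in (99.9, 99.99):
    print(f"{tier}% -> {downtime_budget_minutes(tier):.1f} min/month")
```

So the jump from three nines to four nines is the difference between roughly 43 minutes and roughly 4 minutes of allowed downtime a month.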
This is a pretty neat and concise read on ObjectStorage in-use at BigTech, in case you're interested: https://maisonbisson.com/post/object-storage-prior-art-and-l...
Sixteen 9's means AWS should easily last as long as the Great Pyramids without a second's worth of outage.
What a joke
It's about losing entire data centers to massive natural disasters once in a century.
There's perhaps the additional asterisk of "and we haven't suffered a catastrophic event that entirely puts us out of business". (Which is maybe only terrorist attacks). Because then you're talking about losing data only when cosmic-ray bitflips happen simultaneously in data centers on different continents, which I'd expect doesn't happen too often.
I'm sure that's okay if you do bulk processing / time-independent analysis, but don't host production assets on wasabi.
At least in EU services bought from overseas are subject to reverse charge, i.e. self-assessment of VAT (Article 196 of https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02... ).
Though note that if you are an EU AWS customer, you are not buying from outside EU, you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.
Since AWS built a DC in Canada, I’m paying HST on my Route53 expenses, but not on my S3 charges in non-Canadian DCs.
I’m not an HST registrant (small supplier, or if you’re just using services personally), so there’s nothing to self-assess.
Even if self-assessment was required, you get some deferral on paying (unless you have to remit at time of invoice?).
I believe it works differently in the EU (i.e., US DCs are taxed too): per Article 44, the place of supply of services is the customer's country if the customer has no establishment in the supplier's country.
IBM/Softlayer, Rackspace, Google Cloud, Microsoft, and I imagine everyone else large enough to count do as well.
For Australian businesses, at least, being charged GST isn't a problem - they can claim it as an input and get a tax credit.
Even if you did have to self-assess, better to pay later than right away.
I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.
(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)
Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.
And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.
"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."
There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.
Your quote could mean two things:
- that IAM services are hosted in one region (not one AZ)
- that IAM is for the entire account not per region like other services (which is true)
(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)
There's some irony in that.
I’m not in SRE so I don’t bother with all the backup modes (direct IRC channel, phone lines, “pagers” with backup numbers). I don’t think the networking SRE folks are as impacted in their direct communication, but they are (obviously) not able to get the word out as easily.
Still, it seems reasonable to me to use tooling for most outages that relies on “the network is fine overall”, to optimize for the common case.
Note: the status dashboard now correctly highlights (Edit: with a banner at the top) that multiple things are impacted because of the Networking outage, which is the root cause.
this column of green checkmarks begs to differ: https://i.imgur.com/2TPD9e9.png
Not long after that incident, they migrated it to something that couldn't be affected by any outage. I imagine Google will probably do the same thing after this :)
Reminds me of when I was working with a telecoms company. It was a large multinational company and the second largest network in the country I was in at the time.
I was surprised when I noticed all the senior execs were carrying two phones, of which the second was a mobile number on the main competitor (ie the largest network). After a while, I realised that it made sense, as when the shit really hit the fan they could still be reached even when our network had a total outage.
Like the black box on an airplane, if it has 100% uptime why don’t they build the whole thing out of that? ;)
So memegen is down?
If there are no locks that work this way, it sure seems like there should be. Using cloud services to enable cool features is great. But if those services are not designed from the beginning with a fallback for when the internet/cloud isn't live, that's a weakness that is often unwise to leave in place, imo.
This does mean you need to set up a code in advance of people showing up, but it's an under-30-second setup that I've found simpler than unlocking once someone shows up. The cameras dropping offline are a hot mess though, since those have no local storage option.
If the cloud is down, revocations aren't going to happen instantly anyway. (Although you might be able to hack up a local WiFi or Bluetooth fallback.)
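That fallback idea can be made concrete with a local cache of valid codes that goes stale after a revocation window. This is a hypothetical sketch (all names and the 24-hour window are illustrative, not any vendor's actual design):

```python
import time

# Hypothetical sketch: a lock caches valid codes locally so it keeps working
# while the cloud is unreachable. Revocation is bounded by the cache max age.

class LocalCodeCache:
    def __init__(self, max_age_s=24 * 3600):
        self.codes = {}          # code -> timestamp of last cloud sync
        self.max_age_s = max_age_s

    def sync_from_cloud(self, cloud_codes, now=None):
        """Refresh the cache whenever the cloud is reachable."""
        now = time.time() if now is None else now
        self.codes = {c: now for c in cloud_codes}

    def is_valid(self, code, now=None):
        """Accept a cached code unless it has gone stale (revocation window)."""
        now = time.time() if now is None else now
        synced_at = self.codes.get(code)
        return synced_at is not None and now - synced_at <= self.max_age_s
```

The trade-off is explicit: a revoked code can keep working for up to `max_age_s` during an outage, which is usually an acceptable price for the door still opening.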
It's a fake trade-off, because you're choosing between lo-tech solution and bad engineering. IoT would work better if you made the "I" part stand for "Intranet", and kept the whole thing a product instead of a service. Alas, this wouldn't support user exploitation.
It's also my Plex media server, file server, VPN, I run some containers on there. I used to use it as a print server but my new printer is wireless so I never bothered
There’s a fine line or at least some subtlety here though. This leads to some interesting conversations when people notice how hard I push back against NIH. You don’t have to be the author to understand and be able to fiddle with tool internals. In a pinch you can tinker with things you run yourself.
There are also advantages to being part of the herd.
When you are hosted at some non-cloud data center, and they have a problem that takes them offline, your customers notice.
When you are hosted at a giant cloud provider, and they have a problem that takes them offline, your customers might not even notice because your business is just one of dozens of businesses and services they use that aren't working for them.
Who are you getting this steal of a deal from?
Cloud costs roughly 4x what bare metal does for sustained usage (of my workload). Even with the heavy discounts we get for being a large customer, it’s still much more expensive. But I guess op-ex > cap-ex.
I've never seen any of the providers listed offer "tons of ram" (unless we consider hundreds / low thousands of megabytes to be "tons") at that price point.
If you don't like that, you can order a KVM VM with dedicated cores at similar prices and the problem is not yours anymore.
You want velocity for your dev team? You get that. You want better uptime? Your expectations are gonna have a bad time. No need for rapid dev or bursty workloads? You’re lighting money on fire.
Disclaimer: I get paid to move clients to or from the cloud, everyone’s money is green. Opinion above is my own.
With on-prem solutions, you can at least access the physical servers and get your data out to carry on with your day while the infrastructure gets fixed.
You can run your own hardware and pull in multiple power lines without establishing your own country.
I’ve run my own hardware; maybe people have genuinely forgotten what it’s like, and granted, it takes preparation and planning and it’s harder than clicking “go” in a dashboard. But it’s not the same as establishing a country, sourcing your own fuel, and feeding an army. This is absurd.
Fun related fact: my first employer's main office was in a former electronics factory in downtown Moscow, powered by two thermal power stations (and no other alternatives) which had the exact same maintenance schedule.
I didn’t know the cloud-to-butt translator worked on comments too. I forgot that was even a thing.
And then “your data is in my butt” was just a play on that.
But yeah, it's still a thing, and the message behind it isn't any less current.
I made IoT using cheap (arduino, nrf24l01+, sensors/actuators) for local device telemetry, MQTT, node-red, and Tor for connecting clouds of endpoints that aren't local.
Long story short, it's an IoT that is secure, consisting of a cloud of devices only you own.
Oh yeah, and GPL3 to boot.
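The MQTT piece of a setup like this comes down to topic-based routing. As a minimal sketch of the idea (this is the standard MQTT wildcard semantics, not code from the project above): `+` matches exactly one topic level, `#` matches everything remaining:

```python
def mqtt_topic_matches(pattern, topic):
    """Minimal MQTT-style topic matching: '+' matches one level, '#' the rest."""
    p_parts = pattern.split("/")
    t_parts = topic.split("/")
    for i, p in enumerate(p_parts):
        if p == "#":
            return True                      # '#' swallows all remaining levels
        if i >= len(t_parts):
            return False                     # pattern is longer than the topic
        if p != "+" and p != t_parts[i]:
            return False                     # literal level mismatch
    return len(p_parts) == len(t_parts)

print(mqtt_topic_matches("home/+/temp", "home/kitchen/temp"))   # True
print(mqtt_topic_matches("sensors/#", "sensors/a/b"))           # True
```

A broker on your own intranet doing only this kind of matching is all the "cloud" many IoT devices actually need.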
Edit: ah, looks like the LB is sending LA traffic to Oregon.
Can confirm with Gmail in Europe. Everything works but it's sluggish (i.e. no immediate reaction on button clicks).
Sounds like Google and Amazon are hiring way too many optimists. I kinda blame the war on QA for part of this, but damn that’s some Pollyanna bullshit.
Shouldn't that outage system be aware when service heartbeats stop?
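The heartbeat idea is simple enough to sketch: track the last time each service phoned in, and flag anything that's gone quiet. Names and the 60-second threshold are illustrative, not how Google's dashboard actually works:

```python
import time

# Sketch of a heartbeat watchdog: flag a service as suspect when its
# heartbeats stop arriving, instead of waiting for a human status update.

class HeartbeatMonitor:
    def __init__(self, timeout_s=60):
        self.timeout_s = timeout_s
        self.last_seen = {}   # service name -> timestamp of last heartbeat

    def beat(self, service, now=None):
        self.last_seen[service] = time.time() if now is None else now

    def stale_services(self, now=None):
        now = time.time() if now is None else now
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout_s]
```

The catch, of course, is that if the heartbeats travel over the same broken network, the watchdog can't tell "service down" from "can't hear anything", which is exactly the failure mode in this incident.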
Could this be a solar flare?
Cloud services live and die by their reputation, so I'd be shocked if Google ever tried to get out of following an SLA contract based on a technicality like that. It would be business suicide, so it doesn't seem like something to be too worried about?
https://www.zdnet.com/article/some-internet-outages-predicte... 768k Day
According to https://twitter.com/bgp4_table, we have just exceeded 768k Border Gateway Protocol routing entries, which may be causing some routers to malfunction.
I was actually surprised, as they tend to have excellent networking. Now I'm not nearly as distrusting as I was initially, knowing it was likely their ISP getting screwed by routing table overflow.
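The "768k day" concern is basically an arithmetic one: some older routers reserve a fixed number of TCAM slots for IPv4 routes, and once the global table outgrows the allocation, routes spill into slower-path handling or get dropped. A toy illustration (the slot count mirrors the 768k figure discussed above; real TCAM partitioning varies by platform and configuration):

```python
# Toy illustration of the "768k day" concern: a fixed TCAM allocation for
# IPv4 routes versus the growing global BGP table. Numbers are illustrative.

TCAM_IPV4_SLOTS = 768_000

def tcam_headroom(route_count, slots=TCAM_IPV4_SLOTS):
    """Remaining slots; negative means the table no longer fits in TCAM."""
    return slots - route_count

print(tcam_headroom(700_000))   # still fits
print(tcam_headroom(768_500))   # overflow: -500 slots short
```

This is the same shape of problem as 2014's "512k day", just at the next partition boundary.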
From that linked page:
"Customer Must Request Financial Credit
In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Customer must also provide Google with server log files showing loss of external connectivity errors and the date and time those errors occurred. If Customer does not comply with these requirements, Customer will forfeit its right to receive a Financial Credit. If a dispute arises with respect to this SLA, Google will make a determination in good faith based on its system logs, monitoring reports, configuration records, and other available information, which Google will make available for auditing by Customer at Customer’s request."
Might be a good month to rebuild all your models ;)
I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.
But let's work backwards from the goal instead.
If you charge twice as much, and then 20-30% of months are refunded by the SLA, you make more money and you have a much stronger motivation to spend some of that cash on luxurious safety margins and double-extra redundancy.
So what thresholds would get us to that level of refunding?
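Working the proposed numbers: if the provider charges a markup and refunds a fraction of months under the aggressive SLA, expected revenue relative to normal pricing is just markup times the kept fraction. A quick check of the 2x / 20-30% figures from upthread:

```python
# Expected revenue under the proposed aggressive-SLA pricing:
# charge `markup`x normal price, refund a fraction `p` of months entirely.

def expected_revenue_multiplier(markup, refund_fraction):
    return markup * (1 - refund_fraction)

for p in (0.20, 0.25, 0.30):
    print(f"refund {p:.0%} of months -> {expected_revenue_multiplier(2.0, p):.2f}x revenue")
```

So even refunding nearly a third of all months, the provider still clears 1.4x normal revenue, which is the margin that's supposed to fund the extra redundancy.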
Minimum spends and a 50,000% markup based on adding that term to your contract.
Besides, a provider credit is the least of most companies' concerns after an extended outage; it's a small fraction of their remediation costs and loss of customer goodwill.
Just take the premium that you'd be willing to pay and put it in the bank.
The fines are large enough that (for example) companies will have a heavy plant mechanic on site who does nothing on the vast majority of jobs - they're just standing by, to mitigate the risk of a breakdown leading to such a fine. Some business analyst with a spreadsheet has worked out the heavy plant breakdown rate, the typical resulting delays, the expected fines, and the cost of having the mechanic on standby... and they've worked out it's a good business decision.
The purpose of having an SLA isn't to get yourself money when your provider fails. The purpose is to make costly risk mitigation a rational investment for your suppliers.
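The standby-mechanic decision is a straightforward expected-value comparison. With made-up numbers (the breakdown probability, fine exposure, and costs below are illustrative, not from the source):

```python
# The standby-mechanic logic as expected value: pay for the mechanic whenever
# the expected fines avoided exceed the standby cost. All numbers illustrative.

def standby_is_worth_it(breakdown_prob, expected_fine, mechanic_cost,
                        mitigation_factor=0.9):
    """True if the mechanic's expected savings beat their cost per job.

    mitigation_factor: assumed fraction of the fine avoided when the
    mechanic is on site.
    """
    expected_savings = breakdown_prob * expected_fine * mitigation_factor
    return expected_savings > mechanic_cost

# 5% breakdown chance, $100k fine exposure, $1,500/day mechanic:
print(standby_is_worth_it(0.05, 100_000, 1_500))   # True: $4,500 > $1,500
```

The SLA penalty plays the role of `expected_fine` here: it's what turns "keep someone on standby" from an expense into a rational hedge for the supplier.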
> I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.
It takes a lot of effort (exponential) to build something reliable (i.e., designed to keep working through failures) that is guaranteed to have this level of uptime at these penalties.
So I'm sure I can build something that works like this, but would you pay me $100 per GB of storage per month? $100 per wall-clock hour of CPU usage? $100 per GB of RAM used per hour? Because those are the premium prices for your specs.
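The "exponential effort" claim follows from how redundancy compounds: with independent replicas each available a fraction `a` of the time, each extra nine of overall availability costs roughly another full replica:

```python
# Availability of n independent replicas, each available a fraction `a`
# of the time: the system is down only when all n are down simultaneously.

def combined_availability(a, n):
    return 1 - (1 - a) ** n

for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each -> {combined_availability(0.99, n):.6f}")
```

The independence assumption is the optimistic case; correlated failures like this networking outage are exactly what breaks it, which is why the real cost curve is even steeper.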
AWS refunded me in the first reply on the same day!
GCP sales rep just copy pasted a link to a self support survey that essentially told me, after a series of YES or NO questions that they can't refund me.
So why not just tell your customers like it is? Google Cloud is super strict when it comes to billing. I have called my bank to do a chargeback and put a hold on all future billing with GCP.
I'm now back on AWS and still on the Free Tier. Apparently the $300 trial with Google Cloud did not include some critical products; the AWS Free Tier makes it super clear, and even then I sometimes leave something running and discover it on my invoice....
I've yet to receive a reply from Google and it's been a week now.
I do appreciate other products such as Firebase but honestly for infrastructure and for future integration with enterprise customers I feel AWS is more appropriate and mature.
I really wanted to try out their new autoML but I was paranoid of entering my credit card and getting banned from Google
this is FUCKED. It's akin to holding my YouTube and Google Play accounts hostage.
That way Google won't ban your main account for non-payment.
It's the only way, especially considering Google cloud has no functionality to cap spending.
For a ban, they need something concrete like using the same browser cookies, recovery email address or phone number.
IP isn't enough alone - you could be on shared WiFi.
Also, after account creation, you can log in from the same place without risk, or even use multi-login to log into both accounts at the same time.
Phone and desktop OSs should grow a pair and create a virtualization protocol to randomize tracking info to keep PII anonymous.
With that said, If I delivered you services and then you credit card chargebacked me, I'd cut all relations with you as well.
I think it's weird to say you get credit in dollars and then not be able to spend it on everything. That's not how money works. But that's the way hosting providers work and afaik it's quite well known. Especially with a large sum of "free money", even if it's not well known, it was on you to check any small print.
I didn't read it that way. I thought they were complaining about poor customer service that made it difficult to understand the bill or respond to it appropriately.
AWS is mostly easygoing.
It's only some people at the partner program who can vary.
I had a guy who wanted to help me out even though I was just a one-person shop. After he left, I got a woman who threw me out of the program faster than I could blink.
There's too much liability. And no support.
>I have called my bank to do a chargeback
You're issuing a chargeback because you made a mistake and spent someone else's resources? And you're admitting to this on HN? I'm not a lawyer, but that sounds like fraud and / or theft to me.
It’s pretty convenient for companies like Comcast and Google that have poor customer service.
Of course, I get one free pass at that and if I did it over and over, I’m hosed. The difference is that my utility is regulated and has a phone number and a human whose job it is to talk to all customers.
OP sounds like they're just defending themselves against ambiguous, draconian billing robots.
The infinite money spout that is Google Ads has created a situation in which devs are at Google just to have fun - there really is no incentive to maintain anything because the money will flow regardless of quality.
Source: I interned at Google.
So no matter where you go for your cloud services, you're guaranteed a useless status page. Yippee.
Having an excel file where people enter statuses is not very useful to me as a customer. That’s more like a blog.
And No, I don’t want to install a separate app to get push notifications about service disruptions for every service I use.
Now I'm on the web development side and I'm all, "Wait a minute... are there any progress bars that are based on anything real!?!?"
I should have known...