"errorCode" : "InternalError"
When I attempt to use the AWS Console to view S3
Apologies if you find this to be in poor taste, but GCS directly supports the S3 XML API (including v4):
and has easy to use multi-regional support at a fraction of the cost of what it would take on AWS. I directly point my NAS box at home to GCS instead of S3 (sadly having to modify the little PHP client code to point it to storage.googleapis.com), and it works like a charm. Resumable uploads work differently between us, but honestly since we let you do up to 5TB per object, I haven't needed to bother yet.
Again, Disclosure: I work on Google Cloud (and we've had our own outages!).
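If you want to kick the tires on that compatibility from Python, here's a minimal sketch, assuming you've generated HMAC "interoperability" keys in the GCS settings page (the bucket name is a placeholder):

    # Point a stock S3 client at GCS's S3-compatible XML API.
    import boto3

    gcs = boto3.client(
        "s3",
        endpoint_url="https://storage.googleapis.com",
        aws_access_key_id="GOOG...",   # GCS HMAC access ID
        aws_secret_access_key="...",   # GCS HMAC secret
    )
    gcs.put_object(Bucket="my-gcs-bucket", Key="hello.txt", Body=b"hi")
    print(gcs.get_object(Bucket="my-gcs-bucket", Key="hello.txt")["Body"].read())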
Our production Cloud SQL instance started throwing errors: we could not write anything to the database. We have Gold support, so we quickly created a ticket. While there was a quick reply, it took a total of 21+ hours of downtime to get the issue fixed. During the downtime, there is nothing you can do to speed this up - you're waiting helplessly. Because Cloud SQL is a hosted service, you cannot connect to a shell or access any filesystem data directly - there is nothing you can do other than wait for the Google engineers to resolve the problem.
When the Cloud SQL instance was up and running again, support confirmed that there is nothing you can do to prevent a filesystem crash; it "just happens". The workaround they offered is to have a failover set up, so it can take over in case of downtime. The worst part is that Google refused to offer credit, as according to their SLA this is not considered downtime. The SLA states: "with respect to Google Cloud SQL Second Generation: all connection requests to a Multi-zone Instance fail" - so as long as the SQL instance accepts incoming connections, there is no downtime. Your data can get lost, your database can be unusable, your whole system might be down: according to Google, this is no downtime.
TL;DR: make sure to check the SLA before moving critical stuff to Google Cloud.
Sure, you get downtime all the same, but not the waiting-for-support-to-solve-an-instance-crash part.
We've had to use it and can confirm that it works as advertised.
It's not in bad taste, despite other comments saying otherwise. We need to recognize that competition is good, and Amazon isn't the answer to everything.
I think there is little GCP does better than AWS. Pricing is better on paper, but performance per buck seems to be on par. Stability is a lot worse on GCP, and I don't just mean service outages like this one (of which they've had their fair share) but also individual issues like instances slowing down or the network acting up randomly. There's also the lack of service offerings: no PostgreSQL, Cloud Functions never leaving alpha, no hosted Redis clusters, etc. Support is also too expensive compared to AWS.
Management interfaces are better on GCP and sustained use discount is a big step up against AWS reservations. Otherwise, I think AWS works better.
Just last week I got an email saying that they'd discovered an issue on Google Cloud Datastore where certain (strongly consistent!) queries could have been returning incorrect results for a week-long period, and that I should check my logs to see if anything important had been affected in my application.
That's not the sort of behaviour that inspires confidence in a service.
Most notably, I know many people who run these types of sites and outside of GAE being mediocre, I've never heard them complain about anything like that.
Other services are a different story - from my perspective Google are better at supporting legacy interfaces than most.
> We are writing to inform you that we are winding down sales and renewals of Google Site Search (GSS). Starting April 1st, 2017, new purchases and renewals of GSS will not be available.
Site Search seems like an infra offering to me.
Not an expert by any means, but I would put more weight on Google's ONE-year promise than (to give an example) HPE's twenty-year promise. I know it is a cheap shot, because I am pretty sure HPE will be bought and sold at least once in the next twenty years.
We were users of the Google Mini Search appliance, went to a 3rd party in-house installed search solution that we did not like and then a year ago went to GSS. We are looking again for something suitable. The best part of the Google Site Search was search fidelity.
I.e., some douchebag who has no interest or stake in what you do has just dumped a potentially substantial amount of technical debt into your product backlog and, quite possibly, prioritised it all the way to the top.
As somebody else noted above: I don't need people creating more work for me. I can do that quite well enough on my own, thanks very much, and for side-projects this kind of chopping and changing is a pain in the ass.
By definition, with side-projects time is limited, so you absolutely have to focus on the most valuable activities to the exclusion of all else. For this reason, I only consider AWS and Azure for my projects: Google are just too fickle. Lucky you, if you have the time to deal with their nonsense.
(Btw, I'm not dissing Google on a technical level - they obviously do great, interesting work, and they're certainly one of the pioneers of PaaS. I just don't need the hassle of having to fix stuff because they keep killing APIs, projects, services.)
It might not, but doing it so much for other services destroys trust across the entire brand.
This whole idea of being angry at a vendor for deprecating something with 1yr notice is just ridiculous!
People need to realize they are choosing lock-in, and are choosing the risk of deprecation, every time they decide to use a cloud service with no drop-in competition/open source/etc.
Own your choices people, don't blame others...
The expectation of stability beyond a year is certainly not unreasonable when you're asking people to build their businesses/infrastructure on your platform.
And building redundancy across providers can be impractical, owing to learning curve, cost duplication, higher outbound bandwidth costs, effort duplication, solution complexity, etc.
Then, about a year or two ago - humans actually started responding to and fixing problems. A welcome change!
I used to work on the Azure Portal Team. As many negative things as I can say about Microsoft, they take making things just work for developers seriously, despite high prices and misc. service issues.
The since-nixed compute container project I initially worked on really exemplified this.
I tend to use Colo or AWS when possible but I have a client that insisted on Google GCE and Endpoints.
I've spent so much time digging through source code, working around broken dev tooling, and dealing with incorrect or out-of-date documentation thanks to that requirement.
In my personal opinion Google has a way to go in mature tooling. Silent failures, or worse failures that don't result in build failures are not acceptable. Requiring paid support contracts to resolve an issue in google infra is not acceptable. Incredibly poor support for local dev environments is not acceptable.
After dealing with this stuff, I find it unlikely that I will ever rely on their systems in the future. AWS/Colo or, with reservations, Azure all the way.
And good luck getting accurate documentation.
I suspect though that most people affected deemed the risks and costs of failure low enough to be acceptable, and for many people it still is - even with this outage. But that's a conscious decision, rather than plain ignorance.
Twice the persistence means always having at least one backup, and thus the occurrence of downtime goes down, not up.
Sounds like it basically coincides with Diane Greene coming on board to run the show -- which is great news for all of us with increased competition on not just the technical front but also support (which is often the deal maker/breaker)
I was at a talk last year where she spoke, and as much as I love Google, it was one of the most boring talks I've ever heard in my life. So monotone and uninteresting... and I'm probably one of the biggest Google fans out there.
Look at Safra Catz's public speaking (Oracle). Terrible public speaker, terrific operator.
Though we may easily disagree with their business practices.
Managing stateful services is still difficult, but we are starting to see paths forward, and the community's velocity is remarkable.
K8s seems to be the wolf in sheep's clothing that will break AWS' virtual monopoly on IaaS.
We (gravitational.com) help companies go "multi-region" or on-prem using Kubernetes as a portable runtime.
Some interesting projects from this comment (https://news.ycombinator.com/item?id=13738916):
* Postgres automation for Kubernetes deployments: https://github.com/sorintlab/stolon
* Automation for operating the etcd cluster: https://github.com/coreos/etcd-operator
* Kubernetes-native deployment of Ceph: https://rook.io/
In addition to Rook, Minio is also working to build an S3 alternative on top of Kubernetes, and the CNCF Landscape is a good way of tracking projects in the space.
Disclosure: I'm the executive director of CNCF, which hosts Kubernetes, and co-author of the landscape.
Anyway, one needs an on-ramp to containers on Google Cloud. And one can't open source the one that one has, which despite being nearly mature enough to own a driver's license, wouldn't really fulfill the precise need that Kubernetes fills without some frontend work. So one writes Kubernetes. An almost entirely different fundamental architecture, by the way, so it's interesting for those who've seen both to compare.
In other words, you're not entirely off the mark even with the generalization.
I remember reading somewhere in the K8s documentation that it is designed such that nodes in a single cluster should be as close as possible, like in the same AZ.
It took me about 15 minutes to spin up the instances on Google Cloud that archive these objects and upload them to Google Storage. While we didn't have access to any of our existing uploaded objects on S3 during the outage, I was able to mitigate not being able to store new objects going forward. (Our workload is geared much more towards being very, very write-heavy for these objects.)
It turns out this cost-leveraging architecture works quite well as a disaster recovery architecture.
Disclosure: I don't work for google but have an upcoming interview there.
Disclosure: I took a tour there one time and have used google.
EDIT: I realized that I was being mean, but why was that disclaimer relevant?
Also it could look suspicious if grandparent gets the job and at some point in the future someone looks back at this comment.
If in doubt, disclose. Especially in the tech industry, that's what Gamergate was actually about.
- transparency is always good
- adding a small disclosure to the bottom of a post is very low impact
- someone who is interviewing for a job at a company is likely to have a set of biases that influence what they say even if they think that they're being honest and objective.
I also want to personally thank Solomon (@boulos) for hooking me up with a Google Cloud NEXT conference pass. He is awesome!
I use CloudFlare. They handle generating a SSL certificate, can have a CNAME at the APEX, full-site static caching, 301 http => https redirects, etc.
Been trying to get one for IO (can't attend NEXT unfortunately)
There are a large number of people out there looking intently at ACD's "unlimited for $60/yr" and wondering what that really means.
I recently found https://redd.it/5s7q04 which links to https://i.imgur.com/kiI4kmp.png (small screenshot) showing a user hit 1PB (!!) on ACD (1 month ago). If I understand correctly, the (throwaway) data in question was slowly being uploaded as a capacity test. This has surprised a lot of people, and I've been seriously considering ACD as a result.
On the way to finding the above thread I also just discovered https://redd.it/5vdvnp, which details how Amazon doesn't publish transfer thresholds, their "please stop doing what you're doing" support emails are frighteningly vague, and how a user became unable to download their uploaded data because they didn't know what speed/time ratios to use. This sort of thing has happened heaps of times.
I also know a small group of Internet archivists that feed data to Archive.org. If I understand correctly, they snap up disk deals wherever they can find them, besides using LTO4 tapes, the disks attached to VPS instances, and a few ACD and GDrive accounts for interstitial storage and crawl processing, which everyone is afraid to push too hard so they don't break. One person mentioned that someone they knew hit a brick wall after exactly 100TB uploaded - ACD simply would not let this person upload any more. (I wonder if their upload speed made them hit this limit.) The archive group also let me know that ACD was better at storing lots of data, while GDrive was better at smaller amounts of data being shared a lot.
So, I'm curious. Bandwidth and storage are certainly finite resources, I'll readily acknowledge that. GDrive is obviously going to have data-vs-time transfer thresholds and upper storage limits. However, GSuite's $10/month "unlimited storage" is a very interesting alternative to ACD (even at twice the cost) if some awareness of the transfer thresholds was available. I'm very curious what insight you can provide here!
The ability to create share links for any file is also pretty cool.
- It supports Python 2.7 only. We need Python 3.4+ support.
- We can't increase CPU allocation without increasing RAM allocation, making them far more expensive than we need.
- Using psycopg2 on it is a PITA due to their handling of system dependencies.
- The system is entirely proprietary, making it impossible to run it locally for testing.
- Cloudwatch sucks for finding errors in the functions and is atrociously expensive.
- API gateway is an extremely crufty system, and used not to let you pass around binary data (this has changed)
- We can't disable/change the retry-on-error policy.
We have a pretty hard tie-in to S3 and Redshift, but when GCF can do better on a majority of these points, we'll begin moving to it. But yes, Python 3 at a minimum would be a requirement.
I assume that you are referring to emulating the triggering of lambdas behind API gateway...? I've found a project that sets up a node environment to do this. Very handy for js/lambda development. A google search suggests similar options may exist for python.
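For simple cases you can even skip the emulator and call the handler directly with a hand-rolled API Gateway proxy-style event. A sketch; the module name, handler name, and payload shape are all assumptions about your function:

    import json
    from my_function import lambda_handler  # hypothetical module/handler

    # Trim this fake proxy event down to the fields your handler reads.
    fake_event = {
        "httpMethod": "GET",
        "path": "/hello",
        "queryStringParameters": {"name": "world"},
        "headers": {},
        "body": None,
    }

    response = lambda_handler(fake_event, None)  # None stands in for context
    print(response["statusCode"], json.loads(response["body"]))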
I had a lot of "chicken and egg"-type questions about using it, and seeing that critical step of bootstrapping the whole thing via the API Gateway was really informative.
In support of my flippant remark I see three indicators that hold parallels to Betamax with detail to follow. I qualify that it is largely informed by my own anecdotal experience. Specifically by objections and responses that I've received/observed while myself and peers have proposed or implemented cloud adoption at various companies.
1. market share.
2. proprietary tech stack.
3. technical superiority syndrome.
1. Currently AWS has a major lead, then Azure, then Google. The implication is that market share translates to mindshare, which in turn yields blog articles, OSS libraries/tools, etc. This becomes a virtuous cycle.
For .NET shops that market share will tend to favour Azure, on the premise that MS knows best.
2. Some of Google's technology stack has a learning curve that is unique to Google. Take GAE as an example and compare to AWS's nearest equivalent Beanstalk (or Heroku). Beanstalk requires few if any changes to an existing application whereas GAE requires that you do it the App Engine way. It might provide a number of benefits, but it's invasive. Containers are shifting the requirement, however not everyone is in a position or has the desire to start with containers on day 1.
Further, Google Cloud's project-oriented approach, while not a bad organisation mechanism, detracts from learning. If you assume the premise that exploration is part of learning, it forces the user to hold two items in their head: their objective and Google Cloud's imposed objective.
AWS on the other hand generally provides defaults that allow you to launch resources almost immediately after sign-up. Google's approach is better for long-term support, maintenance and organisation but the user needs to have the maturity to understand that benefit.
3. It may be technically superior, but that statement in and of itself is divisive and can scare some away. It is not enough to simply be technically superior, and from my observation the statements tend to originate from X/Googlers.
A number of people will latch onto feature set (for beta, number of films available was a factor). The absence of features will often discount a choice out of the gate (even if those features are irrelevant) as an example:
- regional coverage:
AWS - 15 regions/~38 zones
Azure - 36 regions/zones
Google - 6 regions/18 zones
- partially/fully managed services: AWS is continually growing these, at a level that seems to outpace competitors.
- Outwardly Google appears to tackle the "hard problems" with technically superior solutions (e.g. TensorFlow, BigQuery) but often appears to neglect the "boring" problems a number of companies want solved as well (e.g. Cloud VDIs, Snowball, etc.).
- Some areas seem to be ossified due to tight coupling (e.g. servlet 3.0 and python support in GAE).
There is no silver bullet solution. Every provider will have an outage at some point and this could be a big reason that GCE won't be knocked out of the game. I also think Google is working really hard to build community and mindshare. I don't have a crystal ball so only time will tell what happens but technical superiority has rarely been the sole reason that drives adoption.
The S3 keys it produces are tied to your developer account. This means that if someone gets the keys from your NAS, he will have access to all the Cloud Storage buckets you have access to (e.g. your employer's).
I use Google Cloud but not Amazon. Once I wanted an S3 bucket to try with NextCloud (then OwnCloud). I was really frightened to produce an S3 key with my Google developer account.
As another option, you can continue using the XML API and switch out only the auth piece to Google's OAuth system while changing nothing else.
There's a lot more detail available at: https://cloud.google.com/storage/docs/migrating
Disclaimer: I work on Google Cloud Storage.
I like GCS (and the gsutil tool), but occasionally an S3-style bucket is needed. For example, you need an S3 bucket or a WebDAV server in order to send alerts with images from Grafana to Slack. A minor issue, but nice to have if possible without having to deal with Amazon's control panel.
To be honest, I do find the GCS permissions a bit complex. You have IAM, you have ACLs and you have S3 keys. Everything is set in a different place and ACLs aren't fully represented on the developers console. S3 keys give full access to everything, IAM service accounts give access per project and ACLs are fine grained (per bucket/object). On the other hand, IIRC, IAM has a write only setting, while ACLs do not. So I can have an account that can write only to all the buckets of my project but not an ACL (not that useful).
Kicked the tires, not impressed at all. Notes went missing from the interface; I could only get them back after manually digging through folders via FTP.
Your egress prices are quite a bit higher than CloudFront for sub-10TB ($0.12/GB vs. $0.085/GB).
Comparing the track record of S3 outages against the time you're up and serving egress, it seems like S3 wins on cost. If all you're worried about is cross-region data storage, you're probably a big player and have an AWS enterprise agreement in place, which offsets the cost of storage.
As to our network pricing, we have a drastically different backbone (we feel it's superior, so we charge more). But as you mention CloudFront, the right comparison is probably Google Cloud CDN (https://cloud.google.com/cdn/), which has lower pricing than "raw egress".
Not only is webpagetest.org a google product but it's also much better suited for the minute by minute billing cycle of google cloud compute. For any team not needing to run hundreds of tests an hour the cost difference between running a WPT private instance on EC2 versus on google cloud compute could easily be in the thousands of dollars.
Just saying, it gets you a foot in the door.
If you are API-compatible with S3, could you make it easy/possible to work with Google Storage inside Spark?
Remember, I may or may not run my Spark on Dataproc.
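In principle the Hadoop S3A connector can already be pointed at an S3-compatible endpoint. An untested sketch, assuming GCS HMAC interoperability keys and the hadoop-aws jars on the classpath (on Dataproc you'd normally just use gs:// paths via the preinstalled GCS connector instead):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Point S3A at GCS's S3-compatible XML endpoint.
        .config("spark.hadoop.fs.s3a.endpoint", "storage.googleapis.com")
        .config("spark.hadoop.fs.s3a.access.key", "GOOG...")  # HMAC access ID
        .config("spark.hadoop.fs.s3a.secret.key", "...")      # HMAC secret
        .getOrCreate()
    )

    # "my-gcs-bucket" is a placeholder.
    df = spark.read.text("s3a://my-gcs-bucket/logs/*.txt")
    print(df.count())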
The timeline, as observed by Tarsnap:
First InternalError response from S3: 17:37:29
Last successful request: 17:37:32
S3 switches from 100% InternalError responses to 503 responses: 17:37:56
S3 switches from 503 responses back to InternalError responses: 20:34:36
First successful request: 20:35:50
Most GET requests succeeding: ~21:03
Most PUT requests succeeding: ~21:52
So it's likely that the first 500s were the backend for S3 failing; then they took the failing backends offline, causing the load balancers to throw 503s because they couldn't connect to the backend.
There are a number of services behind the front end fleet in S3's architecture that handle different aspects of returning a response. Each of those will have their own code paths in the front end, very likely developed by different engineers over the years. As ever, appropriate status codes for various circumstances are something that always seems to spur debate amongst developers.
The change in status code would likely be a reflection of the various components entering unhealthy & healthy states, triggering different code paths for the front end... which suggests whatever happened might have had quite a broad impact, at least on their synchronous path components.
S3 has started working as of about 20 minutes ago, and things are running smoothly.
"Update at 2:08 PM PST: As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally."
For legacy customers, it's hard to move regions, but in general, if you have the chance to choose a region other than us-east-1, do that. I had the chance to transition to us-west-2 about 18 months ago and in that time, there have been at least three us-east-1 outages that haven't affected me, counting today's S3 outage.
EDIT: ha, joke's on me. I'm starting to see S3 failures as they affect our CDN. Lovely :/
Q: Why computers don't crash at the same time?
A: Because network connections are not fast enough.
(I think we are starting to get there)
Perspective is everything.
What are the odds of the server with your repo and your own hard drive crashing at the same time?
Quite interesting really!
I would suggest that situations where my machine and GitHub's/Bitbucket's servers are down due to the same event would be events of such magnitude that I would no longer be worried about my project, being more focused on basic survival...
I think the problem is that globally accessible APIs are impacted. As others have noted, if you can use region/AZ-specific hostnames to connect, you can get through to S3.
CloudFront is faithfully serving up our existing files even from buckets in US-East.
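For anyone who wants to try the regional-hostname workaround from Python, a sketch (the bucket name is a placeholder, and this only helps if the bucket actually lives in that region):

    import boto3

    # Pin the client to a regional endpoint instead of the global
    # s3.amazonaws.com hostname.
    s3 = boto3.client(
        "s3",
        region_name="us-west-2",
        endpoint_url="https://s3.us-west-2.amazonaws.com",
    )
    s3.head_bucket(Bucket="my-bucket-in-us-west-2")  # raises if unreachable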
EDIT: less arrogant. I need a coffee.
Even data replication has options for this, too.
And I work in Ops.
EC2: why are you replicating EC2 instances or AMIs across regions? Why aren't you using build tools to automatically create AMIs for you out of your CI processes?
ELB: Eh? Why do I need ELBs to be multi-regional? I'm a little confused by this one, sorry.
EBS: My systems tend to be stateless, storing as much log, audit, or data in external systems such as RDS, DynamoDB, S3, etc. Storing things on the local system's storage is a bit risky, but if you have to there are disk replication solutions available. EFS comes to mind for making that easier. Backups also come to mind in the event of data loss.
VPC: Why does a VPC need to be cross regional? This one is also lost on me.
RDS: Replication is easy -- it's done for you. Convincing developers their application needs to potentially work with a backup endpoint to the data is harder than data replication problems at times. More often than not, it's simply a case of switching to a read-only mode whilst you recover the write copy of your RDS instance, but this is the role of the developers, not ops.
Lambda, ElastiCache, API Gateway... all these things aren't arguments against my original point: architect correctly. Yes it involves more work (from the developer's perspective, mostly), but more often than not in the event of a failure you're left head and shoulders above your nearest competition and left soaking up the profits as a result.
Based on your responses, however, I think we can safely agree to disagree and move on.
Have a great day! I hope you weren't too badly affected by the S3 outage!
Our webservers were hit by this outage. In order to make these cross-regional, I'd need to set up VPCs properly, security groups, instances, datastores (several databases), so on and so forth. I don't store anything on the local disk, but I'm not going to run a server in Europe hitting my db servers in us-east-1. AWS doesn't offer all the databases we use. Cloudformation isn't trivial to use once you get past the tutorial examples either.
Basically, your comment is a version of "you're holding it wrong!"
Some solutions present more difficulties than others, that's for sure. From the limited information you've given me, though, your situation is far from unique, and it doesn't pose any unusual difficulties.
CloudFormation in YAML format is pretty easy. I recommend Terraform, however, which is much nicer again for this kind of stuff. It makes it rather "trivial" to get a multi-region solution in place.
As for the database replication: I highly doubt the solutions you're using don't offer replication, and if they don't, and they're not some very esoteric, highly specialised engines, then I would replace them with something that does.
It reads to me as though your primary point of contention is your databases. Not an easy problem to solve, I'll admit, but not impossible either.
Exactly to avoid single region outages?
HashiCorp's Terraform makes it a lot easier to go multi Cloud, and abstracting away configuration of the OS and applications/state with Ansible makes the whole process a lot easier too.
Disclosure: I work on Google Cloud (and didn't test this, but some other comment makes that clear).
EDIT: Found my answer. "Just to stress: this is one S3 region that has become inaccessible, yet web apps are tripping up and vanishing as their backend evaporates away." -- https://www.theregister.co.uk/2017/02/28/aws_is_awol_as_s3_g...
"Amazon EC2 Instance scheduled for retirement"
When I checked the logs it was clear the hardware failed 30 mins before they scheduled it for retirement. EC2 and root device data was gone. The e-mail also said "you may have already lost data".
So I know that Amazon schedules servers for retirement after they already failed, green check doesn't surprise me.
I order drives off newegg directly to my DC and I'm yet to lose data with the cheapest drives available in RAID10.
Simple solutions to this do scale - Linode and DigitalOcean don't have such issues for example - and while they're not Amazon scale, they are quite large and I'd say they prove the concept.
Local storage is not intended for permanent storage, and is more use at your own risk. That's also why most of the new EC2 instances don't even support local storage.
Availability =/= durability of course
For higher performance, you can use
1. EBS Provisioned IOPS (kind of expensive)
2. Aurora (for DB use)
3. The new I3 instances (super fast local storage at a reasonable price.)
I guess this just boils down again to Amazon not being cost effective enough for my use case in yet another way.
Oh, and good luck creating snapshots of your home RAID!
Definitely not $50 to my knowledge, but for ~$170 you can get a Samsung 850 EVO, which is rated for 98k IOPS. They're fairly reliable drives and much, much faster than anything you'll get on EBS. You could pay for that full 3x replication with less than a year of EBS fees.
> Oh, and good luck creating snapshots of your home RAID!
LVM, ZFS and Btrfs all do snapshotting quite nicely. FreeNAS - commonly used for consumer grade NASes will automatically manage ZFS snapshots for you too. Amazon will sell you extra space to store snapshots, sure, but increasing the size of your devices usually solves that problem. And quite cost effectively as you can probably tell by now...
Dropbox targets end users who don't have the knowledge required to use the alternative, if you're smart enough to use EBS you're probably smart enough to use ZFS snapshotting just as easily. Or could within a day or two. It's really not that hard.
Like I said, there are systems that pretty much manage the whole thing for you and just warn you when something is about to blow up like FreeNAS.
Shadow volume replication is entirely possible with several filesystems or Hot Copy kernel mod. Also LVM does snapshotting fairly easily
I may have got my prices a bit mixed up (I saw 120GB at Fry's for $60 last month) but my point stands.
Also why is discomfort such a big problem for folks? Learn stuff.
zfs snap tank/data@$(date '+%Y%m%d')
zfs send tank/data@$(date '+%Y%m%d') | zfs recv backup/data
Advanced magic for off-system backup:
zfs send tank/data@$(date '+%Y%m%d') | ssh cheapdiskserver zfs recv tank/data
So about as many as this SD card, and nothing compared to a real SSD.
SD cards have much worse write IOPS.
It is, yes, but I wouldn't refer to it as comparable to an SSD.
> SD cards have much worse write IOPS.
Surprisingly not! Testing in ATTO I got read and write speeds that were almost identical, and a peak of 2,000 IOPS.
EBS (gp2) is flash based, has far better performance than high end magnetic disks, with excellent latency and consistent performance. So, it's more comparable to SSD than anything else.
>Surprisingly not! Testing in ATTO I got read and write speeds that were almost identical, and a peak of 2,000 IOPS.
Really? Were you looking at 4K write? Typically that would be under 1 MB/s for an SD card.
It's a relatively high-quality SD card, unfortunately hampered by my reader's inability to use bus speed over 25MB/s.
Amazon should take notes.
I notice even Cloudflare is starting to have problems serving up pages now.
Seriously: I don't understand why you guys stay with AWS.
You can use AdWords as a self-service user. Without knowing many of the details you can run your ads, but you can also very easily ruin your budget. But many enterprise customers use it very differently than those users do, and they optimize the cost aggressively. Cloud is the same. If you don't know how big customers use AWS, it is normal that you are surprised, because AWS is still leading the market.
You say GCP is better than AWS. Which part is better? GCP does not have many of the AWS services we benefit from. How can you compare totally different providers? You can only say that AWS EC2 is worse than GCP's equivalent. But you cannot compare whole platforms in one sentence.
After spending a year evaluating both AWS and GCP (with an emphasis on their managed database services; both SQL and no-SQL) my general feeling is this:
"Microsoft Windows is to Unix as AWS is to GCP".
(Or perhaps closer to the truth: "VMS is to Unix as AWS is to GCP".)
Basically, AWS services seem like they are badly designed by bureaucratic, mediocre engineers following some bureaucratic template for "a service".
GCP feels a lot saner (both API- and UI/console-wise). I often got the feeling it's designed by people who:
a) are smart and well-rounded in terms of experiences. It does take cleverness and experience to design something elegant that is also useful.
b) take pride in their work (it does show)
(And then, as a bonus: It's cheaper!)
I specifically spent a lot of time on Lambda and found it quite annoying compared to GCP AppEngine. So much bureaucracy. Just this thing that you have to specifically register every single Lambda API call and its parameters using an interface built by non-thinking people... Sheesh.
For on-demand processing I just want a single HTTP-ish entry point, like AppEngine provides. (That way I can I move my service between different providers, if I wanted to move away from e.g. AWS.)
Personally I've been using it for ages and I know most services inside and out. They do suffer downtime in some regions occasionally, but it'd be too expensive at this point to move.
And who doesn't suffer downtime? You can't avoid it; you just need a plan to deal with it. For example, having a backup replica bucket in another region and the ability to quickly switch your CDN over would probably be a good idea here; that's what I did.
If you want to go further you can replicate your data to another cloud provider entirely and use low TTLs to switch to a backup CDN if your system is that mission-critical (in the event of a worldwide AWS failure doomsday scenario).
All systems will fail you and it's our responsibility as IT professionals to have a plan to mitigate this.
Anyway, I agree with your conclusion.
I do agree that we should all plan for failures.
However, I also think it's a sign of failure in planning and architecture foresight if it's too expensive to move away from a particular cloud provider.
There are plenty of cases where it just wouldn't make sense to switch after looking at the costs, opportunity costs, etc. For example, if his site makes him $10 a month, outages cost him $1 a month that could be mitigated by moving, and it would cost $1000 of labor to swap providers. (Depends on interest rates.)
Perhaps it was originally a failure to not have a plan to easily move from a provider, but it doesn't seem unreasonable to me that right now it may cost too many hours of work to justify the move.
There needs to be a clear financial win. Even taking into account the failures we've seen so far, I don't see a compelling reason to leave AWS.
Still stand behind the other two points I made in that post though.
Who do you recommend instead (assuming in-house or Hetzner-equiv is out of reach)? Google Cloud? Azure? Rackspace?
(I'm guessing a relatively large part is also selfish attachment to the market leader because of employment reasons. I hate wasting money, both for myself and for my employer, so I don't really understand this kind of thinking - but I do understand how it could flourish in a venture capital-rich time/locale.)
I also recommend reading:
I have used GCP for some time without being affected from any incident.
Disclosure: I work on Google Cloud (and wouldn't want to be an incident responder at AWS today...)
All instances going down in all regions is an order of magnitude worse than a single service going down in a single region. You're deluding yourself if you think GCE is any more reliable than any other reputable cloud hosting platform.
Looks like CDN has a 10MB limit:
(work at Google Cloud)
B2 is based out of a single DC (or at least, was at launch and I don't see anything that suggests that has changed?) You've got to decide what's most important to you. Data persistence or $$$.
The last year or two has seen a remarkable improvement according to those customers of mine that host there.
Which would make sense (and is sorta-kinda a best practice) if Amazon wrote services such that they "crashed early". Instead, they're seemingly written so the backend locks up and is rendered completely useless at "doing its job" but continues to run just fine.
Either of those two design decisions is potentially a good thing on its own, but they need to be considered in light of one-another if you want your status page to make any sense. If you want to report cluster failures, code your clusters to actually fail. If you want to keep your clusters up, write your monitoring checks as whole-stack acceptance tests.
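A whole-stack check along those lines can be tiny. A sketch of the PUT-then-GET canary idea (bucket name hypothetical):

    import uuid
    import boto3

    def s3_roundtrip_ok(bucket):
        """End-to-end canary: if a tiny PUT + GET fails, the service is
        down for users, whatever the internal health checks claim."""
        key = "canary/" + str(uuid.uuid4())
        s3 = boto3.client("s3")
        try:
            s3.put_object(Bucket=bucket, Key=key, Body=b"ping")
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            return body == b"ping"
        except Exception:
            return False

    print(s3_roundtrip_ok("my-status-canary-bucket"))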
You don't seem to have enough experience to comment on the issue.
Comparing technology and saying "it seems" or "i feel" isn't really a good argument to convince me one way or the other.
I tried them all and Amazon is still the best.
Being able to run distributed D4M/GraphBLAS queries in Cloud Bigtable would be killer.
"From NoSQL Accumulo to NewSQL Graphulo:
Design and Utility of Graph Algorithms
inside a BigTable Database" https://arxiv.org/pdf/1606.07085.pdf
> Increased Error Rates
> We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.
The worst "increased error rate" problem I had was when the API was failing and my autoscale system couldnt deal and launched thousands of instances because it couldnt tell when instances were launched (lack of API access) and the instances pummelled the fuck out of all other parts of the system and we basically had to reboot the entire platform....
Luckily, amazon is REALLY forgiving with respect to costs in these (and actually most) circumstance....
Yes. Yes they are. Thankfully.
At best, when there are problems (not like now, I guess), I will see the "note" green icon https://status.aws.amazon.com/images/status1.gif
They had some convoluted but fairly specific wording in their TOS; whoever wrote it must have had a lot of fun.
> 57.10 Acceptable Use; Safety-Critical Systems. Your use of the Lumberyard Materials must comply with the AWS Acceptable Use Policy. The Lumberyard Materials are not intended for use with life-critical or safety-critical systems, such as use in operation of medical equipment, automated transportation systems, autonomous vehicles, aircraft or air traffic control, nuclear facilities, manned spacecraft, or military use in connection with live combat. However, this restriction will not apply in the event of the occurrence (certified by the United States Centers for Disease Control or successor body) of a widespread viral infection transmitted via bites or contact with bodily fluids that causes human corpses to reanimate and seek to consume living human flesh, blood, brain or nerve tissue and is likely to result in the fall of organized civilization.
Second, I know the lawyer and yes he had fun.
I'd bet that something broke (causing InternalError responses) and then nodes started marking themselves as failed (causing the timeouts and 503s soon after).
It's possible that the console won't work however as I believe that's served from us-east-1.
From https://status.aws.amazon.com/ : Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
Then I refreshed and the event disappeared altogether.
Increased Error Rates
We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.
Amazon Simple Storage Service (US Standard) Service is operating normally
A pyrrhic victory... ;)
 - http://status.hrpartner.io
EDIT UPDATE: Well, I spoke too soon - even our status page is down now, but not sure if that is linked to the AWS issues, or simply the HN "hug of death" from this post! :)
EDIT UPDATE 2: Aaaaand, back up again. I think it just got a little hammered from HN traffic.
You don't use S3 but because they do, your entire infrastructure crumbles.
Disclosure: I'm one of the cofounders
My startup's op team had a great discussion today because of this that basically boils down to "if we hit our sales goals, an incident like this a year from now would end our company".
Looks like our plans to start prepping for multi-cloud support will be a higher priority.
I'm genuinely curious, what kind of business are you in that a four hour outage would end the company? High frequency trading or something?
99.9964583 = 100 - 153/(30*24*60)
99.6458333 = 100 * (1 - 153/(30*24*60))
99.9997089 = 100 - 153/(365*24*60)
99.9708904 = 100 - 100*153/(365*24*60)
153 is the number of minutes they were down going off the reported updates at https://status.aws.amazon.com/ - 11:35AM PST was when they fixed the status page, 2:08PM PST was when S3 was fully back online. (And 153 is underestimating it, because there were errors going on for long before they fixed the status page, but I don't have timestamps on that.)
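For anyone checking the arithmetic: the second and fourth figures are the right ones; the other two drop the factor of 100 when converting the downtime fraction to a percentage.

    downtime = 153               # minutes, per the status page timestamps
    month = 30 * 24 * 60         # 43,200 minutes
    year = 365 * 24 * 60         # 525,600 minutes

    # Uptime % = 100 * (1 - downtime/total).
    print(100 * (1 - downtime / month))  # 99.6458...
    print(100 * (1 - downtime / year))   # 99.9708...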
The dashboard not changing color is related to S3 issue.
See the banner at the top of the dashboard for updates.
Warning sign, octagonal sign, no Entry (all filtered by HN).
There are plenty of possibilities.
Outage first reported around 11:35 CST.
"We are investigating increased error rates for Amazon S3" translates to "We are trying to figure out why our mission critical system for half the internet is completely down for most (including some of our biggest) customers."
I've been fuzzing S3 parameters for the last couple of hours...
And now it's down.
(Yes it sucks and yes we're working on fixing it. We hate slow software too!)
CloudFront is currently experiencing problems with requesting objects from Amazon S3.
edit: Since posting my comment they added a banner of
"Increased Error Rates
We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region."
However S3 still shows green and "Service is operating normally"
I did that out of paranoia but it turns out this could happen to us. Does that sound like a sensible approach?
Fortunately all my company's stuff is in eu-west-1 which still seems to be fine.
Somewhere a sysadmin is having to explain to a mildly technical manager that AWS services are down and affecting business critical services. That manager will be chewing out the tech because the status site shows everything is green. Dishonest metrics are worse than bad metrics for this exact reason.
Any sysadmin who wasn't born yesterday knows that service metrics are gamed relentlessly by providers. Bluntly there aren't many of us, and we talk. Message to all providers: sysadmins losing confidence in your outage reporting has a larger impact than you think. Because we will be the ones called to the carpet to explain why <services> are down when <provider> is lying about being up.
Looks like they are both using the same solution for their status pages. The icon for Trello's status page also failed to display.
I would rather access HN without Cloudflare as man-in-the-middle, especially over HTTPS.
You will never know the exact damage, the only thing you can do to play it 100% safe is to rotate all credentials on sites using Cloudflare.
And you can't access HN without going through Cloudflare (unfortunately, but HN is having a hard enough time to keep up with traffic as it is, without Cloudflare it would perform a lot worse than it does).
Google Analytics, Cloudflare, AWS, those are things you can never escape from.
AWS Employee #1: Hey, people are catching on that our status page isn't accurate
AWS Employee #2: Tell them it's cause of S3
The status information is hosted there.
Poor show when a service disruption means the status page can't be updated....
Disclosure: I work on Google Cloud, and we've had our fair share of outages.
edit: oh, it actually is because of the outage! So if they can't get a fresh read on the service status from S3, they just optimistically assume it's green... even though the service failing to provide said read is one of the services they're optimistically showing as green XD
That raises many more questions about how accurately outages have been accounted for and reported in the past.
Then there's the design aspect, which in itself highlights: if you run things in the cloud, what fallback do you have if that goes wrong? So certainly the impact from this outage is going to echo for a while, with many questions being asked.
But isn't that the whole point of lying: to the less technical manager (often the only person whose view matters at major customers), the status board saying "up" means the problem is the sysadmins, not the vendor.
For example, by experience and gossip I know Windstream has awful reliability, but they handwave that away. By including a requirement I knew they couldn't meet (dynamic E911), they were knocked out of a 200-site VoIP RFP early.
Greenish ELB, RDS.
Yellow EC2, Lambda.
Red S3, Auto Scaling.
EDIT: A few dozen services in us-east-1 are down/degraded.
When SLAs are in play, and so are job performance scores and bonuses, there is probably a strong incentive to fudge numbers. It can be done officially ("Ah, but sub-chapter 3 of chapter X in the fine print explains this wasn't technically an outage") or unofficially.
- Architectural SPOFs (single points of failure) need to be carefully weighed up in any design, and "ALL our files are on $single_provider" is one such huge red flag. Unfortunately these considerations are all too frequently drowned out by the ease of going with the least path of resistance.
For example GitHub occasionally goes down, which breaks a remarkable amount of infrastructure: a huge number of people don't know how to use Git, do full clones from scratch each time, and have no idea how to work without a server (even though Git is built to work locally); CI systems tend to want to do green-field rebuilds, so start out with empty directory trees and need to do full clones each build (I'm not sure if any CI systems come with out-of-the-box Git caching); GH-powered authentication systems fall apart; etc. Kinda crazy, scary and really annoying, but yeah.
In terms of "nail in the coffin", that depends on a lot of factors, including a subjective analysis of how much local catastrophe was caused by the incident; subjective opinions about the provider's reaction to the issue, what they'll do to mitigate it, perhaps how transparent they are about it; etc.
Ultimately, the Internet likes to pretend that AWS and cloud computing is basically rock-solid. Unfortunately it's not, and stuff goes down. There were some truly redundant architecture experiments in the 80s (for example, the Tandem Nonstop Computer, one of which was recently noted to have been running continuously for 24 years: https://news.ycombinator.com/item?id=13514909) but x86 never really went there, and superscalar computing is built on a sped-up version of the same ideas that connect desktop computers together, so while there are lots of architectural optical illusions, well, stuff falls apart.
- Everyone in this thread is talking about Google Compute Engine, but it really depends on your usage patterns and requirements. GCE is pretty much the single major competitor to AWS, although the infrastructure is _completely_ different - different tools, different APIs, different pricing infrastructure. The problem is that it's not like like MySQL vs PostgreSQL or Ubuntu vs Debian; it's like SQL vs Redis, or Linux vs BSD. Both work great, but you basically have to do twice the integration work, and map things manually. With this said, if you don't have particularly high resource usage, VPS or dedicated hosting may actually work out more cost-effectively.
TL;DR: you go back to the SPOF problem, where _you_ have to foot the technical debt for the reliability level you want. Yay.
The bad guys are the providers who report false positives to preserve metrics.
But if you go to your personal health dashboard (https://phd.aws.amazon.com/phd/home#/dashboard/open-issues) they report an S3 operational issue event there.
Edit: Mine is reporting region us-east-1
Edit 2: And now the event disappeared from my personal health dashboard too. But we are still experiencing issues. WTH.
* Slack file sharing no longer works, hangs forever (no way to hide the permanently rolling progress bar except quitting)
* Github.com file uploads (e.g. dropping files into a Github issue) don't work.
* Imgur.com is completely down.
* Docker Hub seems to be unavailable. Can't pull/push images.
This appears to be a normal doctor's office where there are routine appointments. Emergencies would be referred to the ER anyway. And while I obviously don't know the details of how his office is run, you'd think that you could get by on a pen-and-paper fallback to manage the office. Maybe that's an advantage to keeping experienced office staff on board.
they just now put up a box at the top saying "We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region."
increased error rates? really?
Amazon, everything is on fire. you are not fooling anyone
edit: in the future, please subscribe to @MyFootballNow for timely AWS service status updates https://pbs.twimg.com/media/C5xdm9_WMAAY7y_.jpg:large
“The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.”
It does exist, apparently.
"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates."
As for other cloud providers seeing green: Or maybe people will come to their senses and will see that monocultures are bad, whether in biology or hosting.
I bet they're related. The moment I got an alert of the S3 outage I started refreshing a bunch of status pages at a fever pitch. Multiply that by thousands of others doing the same and boom, you've got the equivalent of a DDoS.
Ray ID: 33863460edf54231
The dashboard is not changing color due to the S3 issue. We're updating the banner in place of that.
Edit: Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
Then there's the frontend, which apparently periodically reads this file from S3 and caches the results.
I guess the comment they added at the top after two hours of being in the dark was manually added to the web frontend.
Obviously all of this would be hilariously badly designed if it was made this way. Still...
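If that guess is right, the fix would be a few lines. A toy sketch of "fail to unknown, not green" (the URL and JSON shape are invented for illustration):

    import json
    import urllib.request

    last_known = {"status": "unknown"}

    def poll_status(url):
        """On a failed fetch, report 'unknown' rather than defaulting
        to green -- the fetch failing is itself a signal."""
        global last_known
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                last_known = {"status": json.load(resp)["status"]}
        except Exception:
            last_known = {"status": "unknown"}
        return last_known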
This would address issues that are only visible from the outside.
Fun story, when I was an intern at Amazon there was actually a warehouse fire. The result was a lot of manual database entry updating as products were determined to be destroyed or still fit for sale.
The military is not exactly known for being great at keeping track of things that aren't nuclear weapons, and sometimes falls short even on those.
They showed up in my town in the early 1980s after one of our local malls had a smokey fire. They sold a bunch of stuff that came from other places, too, including a ton of 15mm miniature soldiers.
This is getting crazy.
So this is what centralization looks like.
"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates." https://twitter.com/awscloud/status/836656664635846656
"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates."
Also, there are incentives based on colors, so the managers really don't want to admit any failure.
This is a great case in point if true.
As a cute example, one of their senior people (in a stats heavy role) couldn't explain how they'd detect if people wanted to be able to automatically order socks and tshirts on a buying cycle outside of what I call the "scheduling horizon", eg every 3-6mos. (Things I need regularly, but sparsely enough it doesn't stand out to do proactively -- eg, I buy socks when they all have holes, not on a reasonable replacement cycle.)
A textbook case of "wrong incentives". #1 incentive should be satisfied customers.
Such an approach has better ROI than actually making high-quality products or services, which is why so much of what we buy is utter shit. That's especially true on the mass market, where the satisfaction of individual customers doesn't impact your company at all, as long as they're not complaining too loudly.
This would extend to a service like Amazon, actually, where keeping the service alive would take extraordinary effort if this problem lasted a long time.
The way you imagined it, as 100% uptime, is incorrect.
We've seen this story play out in other industries and it never works out well for average people. It's been astounding for me to watch the pace of this centralizing and who is helping it along.
The tldr; point is that a single service provider should not have the amount of control Amazon does over the Internet. At least that's my take.
I know my opinion on this wildly differs from the HN crowd and SV "decision makers" these days - what is so curious to me is that this is a complete 180 from that same demographic even 10 years ago.
If you want other infrastructure companies or decentralized internet, you are free to do that yourself via voluntary means.
Through some dumb luck (and desire to procrastinate a bit), I opened HN and, subsequently, the AWS status page and actually read the US-EAST-1 notification.
HN saves the day.
AWS Department: "Wellll, if we don't change the status to red, it's as if we were up all the time!"
"We've identified the issue as high error rates with S3 in US-EAST-1, which is also impacting applications and services dependent on S3. We are actively working on remediating the issue."
I do love corporate-speak.
It's a rapidly oxidising waste receptacle (rather than a dumpster fire).
I don't know about reliability, but it's a fraction of the price of S3.
Disclosure: I work on Google Cloud.
If you're interested:
You can set a cache invalidation time too.
Always online is a slightly different feature I believe.
We also offer a large number of boilerplates such as Flask, ASP.NET, and Node.js.
We have been making lots of changes lately; check us out!
static page deploy guide: https://www.ibm.com/blogs/bluemix/2014/08/deploying-static-w...
Disclosure: I work for GitLab
I'm trying to downgrade to an older version because our install is not working but can't get the DEB unfortunately.
ping cloudron.io -> 188.8.131.52 -> server-54-192-7-94.dfw3.r.cloudfront.net (Amazon Technologies) 
Yup, they took those portions of our service down, but we now have redundant status page hosting setups and prerendering that is not tied to S3 (the latter is the only part of our service that was affected, and it was fixed within an hour of the outage)
The only con is that it is a Google product that could be deprecated at any point in time. But, with all the acquisition stuff happening over at RS, I'd be lying if I said I wasn't worried about them killing off their cloud offering.
1) Google Cloud Storage can host static websites:
2) Google Cloud Platform has a 1-year deprecation policy, which would never happen with a product that so many companies and customers rely on (Google Reader had a small but passionate base)
Disclaimer: I work on Google Cloud Platform
Also just wanted to say that I've been extremely happy with GCP thus far, and all the services I've tried have more features than RS. I really hope GCP is here for the long haul.
"Increased API Error Rates - 9:52 AM PST We are investigating increased error rates in the US-EAST-1"
"S3 operational issue - us-east-1"
What else should I add?
S3 is not a CDN!
I'm curious how much $ this will lose today for the economy. :)
Many AWS SDK libs don't remove \n for you.
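Concretely: if you read credentials from a file, strip them yourself, since a trailing newline quietly breaks request signing. A sketch (file name hypothetical):

    # A trailing "\n" in a secret key produces baffling signature errors.
    with open("aws_secret_key.txt") as f:
        secret_key = f.read().strip()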
(I hope it wasn't me who broke it lol)
You would have to host your own software which can also fail, but then at least you could do something about it. For example, you could avoid changing things during critical times of your own business (e.g. a tradeshow), which is something no standard provider could do. You could also dial down consistency for the sake of availability, e.g. keep a lot of copies around even if some of them are often stale - more often than not this would work well enough for images.
S3 offers cross-region replication functionality, and you can use CloudFront or another CDN to load balance between buckets.
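A sketch of wiring that up with boto3 (bucket names and the IAM role ARN are placeholders; cross-region replication requires versioning on both buckets):

    import boto3

    s3 = boto3.client("s3")

    # Versioning must be enabled on source and destination first.
    for b in ("my-primary-bucket", "my-replica-bucket"):
        s3.put_bucket_versioning(
            Bucket=b, VersioningConfiguration={"Status": "Enabled"}
        )

    s3.put_bucket_replication(
        Bucket="my-primary-bucket",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/replication-role",
            "Rules": [{
                "ID": "failover-copy",
                "Prefix": "",  # replicate everything
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-replica-bucket"},
            }],
        },
    )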
Here's their official SLA. This outage so far brings them to less than 3 nines of uptime this month (43.8 minutes) but still more than 2 nines (7.2 hours), so it sounds like everyone gets 10% off their S3 bill.
Very curious if Amazon will apply this automatically or only if you complain.
Edit: from further down the same page, it looks like only if you write in to support do you get these broken SLA credits. Kind of lame since everything else about their billing is so precise and automatic.
Well good thing I have my backups on [some service that happens to also use S3 as a backend].
No, they didn't. Large portions of AWS's documentation details how you, the developer, are responsible for using their tools to engineer a fault-tolerant, highly available system. Everything goes down. AWS promises varying amounts of nines everywhere, not 100%.
S3 is not the cloud, it's one system running in the cloud. The cloud is not down, S3 and services dependent on (and possibly related to) it are.
One of the selling points of the cloud is that dynamically provisioned services from multiple providers enable engineering fault tolerant systems that are relatively secure against the failure of any single backend. But, yeah, if you are dependent on one infrastructure vendor's service -- particularly running in one particular region/zone -- you are probably better off than running on a single server for reliability against failures, but you aren't anywhere close to immune to failures. I don't think even cloud vendors have been particularly reluctant to make that point.
As someone who's really only a yellow belt (assuming you're all black belts!), just so I understand ('cos I'm cacking myself!) ...
I'm seeing the same issue. Does this mean there's a problem with Amazon? I can't access either of my S3 accounts even if I change the region, and I'm concerned it may be something I've done wrong and that I've deleted the whole lot. It was working yesterday!!!
Would be massively grateful for a heads up. Thanks in advance.
"Believe" is not inspiring.
(I think the AM means PM)
> Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.
"500 The server encountered an error processing your request." message
It appears to be impacting gotomeeting, I get this error when trying to start a 12pm meeting here:
CloudFront is currently experiencing problems with requesting objects from Amazon S3.
Edit: ironically, my missed 12pm meeting was an Azure training session.
There is something to be said for not being located in the region where everything gets launched first and where most of the customers are [imo all the benefits of the product, processes, and people, but less risk].
Good luck to everyone impacted by this...crappy day.
Some big names and services popular with HN mentioned there. Quora, AirBnb, SendGrid, Downdetector(heh).
AMZN stock down $3.45 (0.41%).
"I know I'm piling on here, but Amazon's stock price is a better uptime indicator than their status page. #AWS #S3 #awscloud"
"http://www.isitdownrightnow.com/" and DownDetector are down.
YES! Buy on rumor, sell on fact as the saying goes.
The only services my team uses directly are EC2 and RDS, and I'm thinking of moving RDS over to EC2 instances.
We are entirely portable. We can move my entire team's infrastructure to a different cloud host really quickly. Our only dependency is a Debian box.
I flipped the switch today and cloned our prod environment, including VPN and security rules, over to a commodity hosting provider.
Changed the DNS entries for the services, and we were good to go. We didn't actually need to do anything, because everyone was so busy freaking out about everything else being down, but our internal services were close to unaffected.
At least for my team.
Obviously, we aren't Trello or some of the other big players affected, and we don't have the same needs they do. But setting up the DevOps stuff for my team the way I thought was correct to begin with (no dependencies other than a Debian box) really shined today. Having a clear and correct deployment strategy on any available hardware platform really worked for us.
Or at least it would have if people weren't so upset about all our other external services being down that they paid no attention to internal services.
Lock-in is bad, mmkay?
If your company is the right size, and it makes sense, do the extra work. It's not that hard to write agnostic scripts that deploy your software, create your database, and build your data from a backup. This can be a big deal when some providers are flipping out.
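As a rough illustration (every hostname, package, and path below is made up), the whole rebuild can be a handful of steps run over SSH against any plain Debian box:

    import subprocess

    HOST = "deploy@new-box.example.com"  # any provider, any region

    STEPS = [
        "sudo apt-get install -y postgresql nginx",                   # base packages
        "sudo -u postgres createdb myapp",                            # recreate the database
        "sudo -u postgres pg_restore -d myapp /backups/myapp.dump",   # rebuild the data
        "sudo systemctl restart myapp",                               # start the application
    ]

    for step in STEPS:
        subprocess.run(["ssh", HOST, step], check=True)  # fail loudly on any error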
All-your-junk-in-one-place is really overrated, in my opinion. Be able to rebuild your code and your data at any given point in time. If you don't have that, I don't really know what you have.
I don't necessarily disagree with what you are saying, but there is a cost to doing everything yourself.
You would have been equally protected if you had been in more than one region.
But the developer cost here (my time) was worth it. Our shit wasn't down, while everyone else's was.
I also want to point out that I spent minimal time setting this up. We can deploy to GCE or commodity VPSes at a moment's notice, and that was a project I did over a couple of weekends, piggybacking on the Ansible playbooks I wrote for AWS.
It's not that hard. You have to get your developers on board with being provider agnostic, and you have to be agnostic yourself. But it is not insurmountable.
It also helps when you're the lead dev on your team and have a good relationship with the devops guy. :)
The EC2 instances themselves are fine, but the affected ELBs are spitting out 500s.
Hearing reports of EBS down as well.
From http://status.aws.amazon.com/ Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
[edit- looks like they do have a pretty heavy reliance on S3, per https://github.com/WhisperSystems/Signal-Server/blob/master/... and various other sources.]
As part of the release they wanted to make sure everybody gets a chance to see "red" metrics.
Increased Error Rates
Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
The dashboard not changing color is related to S3 issue.
See the banner at the top of the dashboard for updates.
Is there a part of this hosted on S3? I cannot open Atom anymore; it keeps crashing on the check-for-updates screen...
In the last couple of minutes that forum post has gone from not existing to 175 views and 9 posts.
Amazon Elastic Compute Cloud (N. Virginia): Increased Error Rates
11:38 AM PST We can confirm increased error rates for the EC2 and EBS APIs and failures for launches of new EC2 instances in the US-EAST-1 Region. We are also experiencing degraded performance of some EBS Volumes in the Region.
Amazon Elastic Load Balancing (N. Virginia): Increased Error Rates
Amazon Relational Database Service (N. Virginia): Increased Error Rates
Amazon Simple Storage Service (US Standard): Increased Error Rates
Auto Scaling (N. Virginia): Increased Error Rates
AWS Lambda (N. Virginia): Increased Error Rates
In the meantime, EC2, ELB, RDS, Lambda, and autoscaling have all been confirmed to be experiencing issues.
When I go to my orders I get "There's a problem displaying some of your orders right now.
If you don't see the order you're looking for, try refreshing this page, or click "View order details" for that order."
It seems that Amazon is eating its own dog food.
And then I see the news.
"Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue."
(I work on Cloud, specifically Datastore.)
For example, I have images and various assets stored on S3; would there be a way to change the storage provider on the fly on a website?
The other case: could I have apps hosted on Heroku and set up a service to duplicate the app code and database over to Google for redundancy? This isn't super critical, as the apps are not customer-facing, but they generate content that is.
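For the images case, one hedged sketch (all names hypothetical): mirror the objects to a second provider ahead of time and build every asset link through a single base-URL setting, so the switch is a config change, not a code change.

    import os

    # Hypothetical: repoint this at the mirror during an outage, e.g.
    # ASSET_BASE=https://storage.googleapis.com/my-bucket
    ASSET_BASE = os.environ.get("ASSET_BASE",
                                "https://my-bucket.s3.amazonaws.com")

    def asset_url(key: str) -> str:
        """Every template builds asset links through this one helper."""
        return f"{ASSET_BASE}/{key}"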
It shows up in the event log now too.
For S3, we believe we understand root cause and are working hard at repairing. Future updates across all services will be on dashboard.
Amazon Web Services (@awscloud), 8 minutes ago:
The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.
Increased API Error Rates
09:52 AM PST We are investigating increased error rates in the US-EAST-1 Region.
S3 operational issue
February 28, 2017 at 6:51:57 PM UTC+1
S3 promises four nines of availability (11 nines of durability), so today we got about 3-4 years' worth of downtime in one fell swoop. Oops.
https://aws.amazon.com/s3/sla/ shows 99.9%
At least now we can see all the network failures in full RGB.
Half the internet is down; the data center in Virginia, the one with the cloud, is apparently totally dead. Enjoy the cloud bullshit :)
$ s3cmd ls
WARNING: Retrying failed request: / ([Errno 60] Operation timed out)
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: / ([Errno 60] Operation timed out)
WARNING: Waiting 6 sec...
I'd rather my app load but appear broken, so I can show my own status, rather than every single app just shutting down...
Interesting tweet from last month.
Technology leads to technology (and wealth) monopolies; in other words, more centralization, which has always been bad.
Just like with Cloudflare leaking highly sensitive data all over the Internet, a couple of days ago.
After two hours, they have finally updated their dashboard.
It seems their statu