DynamoDB is throttling API access and seems to be having issues with the management of metadata.
We noticed that RDS and ElastiCache backups and restores were taking much longer than expected, and once the first set of errors about DynamoDB came in, we decided to abort and try again at a later date. An hour later we got notifications that RDS was having issues as well. I'm disappointed that it takes so long to update the AWS status page when things aren't working properly.
Here's a copy from the status page.
3:00 AM PDT We are investigating increased error rates for API requests in the US-EAST-1 Region.
3:26 AM PDT We are continuing to see increased error rates for all API calls in DynamoDB in US-East-1. We are actively working on resolving the issue.
4:05 AM PDT We have identified the source of the issue. We are working on the recovery.
4:41 AM PDT We continue to work towards recovery of the issue causing increased error rates for the DynamoDB APIs in the US-EAST-1 Region.
4:52 AM PDT We want to give you more information about what is happening. The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information. Our recovery efforts are now focused on restoring metadata operations. We will be throttling APIs as we work on recovery.
5:22 AM PDT We can confirm that we have now throttled APIs as we continue to work on recovery.
5:42 AM PDT We are seeing increasing stability in the metadata service and continue to work towards a point where we can begin removing throttles.
7:12 AM PDT We continue to work on removing throttles and restoring API availability but are proceeding cautiously.
7:22 AM PDT We are continuing to remove throttles and enable traffic progressively.
7:40 AM PDT We continue to remove throttles and are starting to see recovery.
7:50 AM PDT We continue to see recovery of read and write operations and continue to work on restoring all other operations.
8:16 AM PDT We are seeing significant recovery of read and write operations and continue to work on restoring all other operations.
If you want alerts on this sort of thing, my side project StatusGator https://statusgator.io will alert you when services post downtime on their status pages. My dashboard blew up this morning with a ton of red and yellow as soon as Amazon started flaking.
Edit: I suppose it's time to invest in a multi-region setup. Since StatusGator is hosted on Heroku in the US-East region, it is in theory affected by this problem, though so far it is still up.
> Our service provider is still working towards resolution of this issue. We will update when we have news, or in 1 hour.
I wonder why they don't say that AWS is their service provider. Is it wrong to make the information less obscure?
It's because Heroku's choice of vendors shouldn't matter to their customers. They see it as an implementation detail, and their responsibility to manage.
So I don't think that's an obfuscation. The people I know at Heroku all have an attitude of, "The buck stops here."
I assume this is to maintain their SLA. We really need independent third parties to record uptime for SLAs instead of trusting hosts to do it themselves.
This outage may be the last straw with Heroku for me. They also stated years ago that they would end their dependence on AWS East, and yet today shows that obviously hasn't happened.
Parent seems to imply there's something wrong with choosing AWS. There is not. (forgive me if I mistook the tone)
Blaming it upstream is just passing the buck for your decision.
AFAIK this outage only affected dyno restarting (which may have been triggered by a number of reasons) and creation of new dynos. Perhaps your EU dynos were lucky enough to not have done either of these things during the outage?
Maybe they should host the ticket system on a separate provider?
At least in the event of an instance outage you could conceivably migrate off Amazon to another VPS provider. No one using DynamoDB has an alternative.
No, they are fixing their infrastructure. The point here is that all single provider systems are destined for periodic failure. Not relying on one single provider is, in theory, a service's means of providing higher reliability. This is the general argument for regions and availability zones on Amazon, but that relies on trusting there is no single point of failure with the system (i.e. inside Amazon).
I've worked for several companies that run the majority of their services on AWS, yet maintain functional systems on other providers in case of an AWS outage.
It might or might not, depending on how you built it.
> you have all-hands of one of the biggest tech companies in the world working to fix your infrastructure.
And it was still out for hours... You might see it as a great success, but you can also interpret it as "what good is it if, even with so many people behind the scenes, they still failed for hours?"
The problem is that because the service ends up being embedded in so many services and products, half the internet ends up being down. Even Amazon dogfoods their own stuff, so services rely on each other: DynamoDB is down, so maybe SQS or analytics will be down as well.
This was also part of the strategy when we decided to open source Kubernetes. Having an open alternative made the commercial offering (Google Container Engine, GKE) much stronger because of the reduced dev lock-in.
You don't need automatic failover, but you do need your own scripts to talk to the AWS API. The whole point of AWS is automation.
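For what it's worth, such a script doesn't have to be elaborate. Here's a minimal sketch using boto3; the regions, AMI ID, instance type, and the health heuristic are all placeholder assumptions, not anyone's actual setup:

    # Minimal manual-failover sketch against the AWS API (boto3).
    # Regions, AMI ID, and instance type below are hypothetical.
    import boto3

    PRIMARY_REGION = "us-east-1"
    FALLBACK_REGION = "us-west-2"
    FALLBACK_AMI = "ami-00000000"  # hypothetical AMI pre-copied to the fallback region

    def primary_healthy() -> bool:
        # Treat "at least one instance passing status checks" as healthy.
        ec2 = boto3.client("ec2", region_name=PRIMARY_REGION)
        statuses = ec2.describe_instance_status(
            Filters=[{"Name": "instance-status.status", "Values": ["ok"]}]
        )["InstanceStatuses"]
        return len(statuses) > 0

    def launch_fallback():
        # Bring up a replacement instance in the fallback region.
        ec2 = boto3.client("ec2", region_name=FALLBACK_REGION)
        ec2.run_instances(ImageId=FALLBACK_AMI, InstanceType="m4.large",
                          MinCount=1, MaxCount=1)

    if __name__ == "__main__":
        if not primary_healthy():
            launch_fallback()

The irony, of course, is that when the control plane itself is throttled (as in this outage), even these calls can fail; the script is only as available as the API it talks to.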
Serious AWS web apps are also distributed across multiple regions/availability zones, but many companies don't do that. It would be good if hosting services like Heroku, which I use, replicated across zones as well.
To be fair to Amazon, they tell you to architect and build across availability zones.
"The replica tables are intended to serve as read-only copies of the data; however, it is possible to write data to a replica table. If you write data to a replica, those changes will not be propagated to the master, or to any other replicas."
This prevents you from getting too locked in and losing control over your application.
I honestly can't believe that Netflix can't even load their home page without DynamoDB and all this other stuff.
Edit: Looks like Netflix is back online (10:53AM EST)
I mean, if even Netflix can't stay up during this, what hope does a startup have?
We do serious hosting for government and corporate clients, and if we had an outage like Amazon had yesterday, we would lose at least half of our clients and pay heavy cash penalties.
I don't understand why people cut Amazon so much slack. They have one of the worst uptime records.
You can create a Datomic backup from any of these databases and restore them into a different one with the exact same semantics.
However, unlike DynamoDB, it has a quite usable free version and an even better starter version, which are good enough for production use. (The DynamoDB Local version is really only good for development.)
By production use, I mean for example a non-replicated MongoDB/PgSQL/MySQL, which is what I see at many smaller companies... They would already be much better off with Datomic.
If you really need more features, like high availability and transparent query caching with memcached, then the cheapest paid version has a one-time cost of $3000:
You can easily save that amount of money by not spending 3-6 months developing tons of unnecessary queries, trying to maintain that code, and trying to optimize its performance at scale...
On the "open-sourceness" note, I feel obliged to mention https://github.com/tonsky/datascript which IS open-source, and has a Datomic-like interface, but it's "just" an in-memory implementation.
But we are getting off-topic here I guess...
You say you don't use SQS or SNS? When they go down, you might not be able to get logs or even log in to the web console.
Same goes for things like AutoScaling, OpsWorks, etc.
It is not really measuring the time you're going to be up; that interpretation is based on faulty assumptions. It's like how the statement "the sun will burn out before one bit is flipped" is wrong: it is quite likely that by that time, all the bits will be gone.
Not funny when it's Sunday.
So we wound up with over 1,000 of these machines running, which then, due to the fan-out of the DB they needed to load into memory from other machines, crashed our whole environment until we could kill off the erroneously launched instances.
This meant an effective full reboot of our entire platform...
It was not a fun weekend.
Don't advertise it if you can't offer it then.
> If you think those nines include software upgrades, you are probably over optimistic.
If you advertise a product with a specific SLA, and you can't meet that SLA, you're a liar. Don't try to blame the victim because of inaccurate/untruthful marketing or engineering.
Not meeting an SLA is not lying.
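To put numbers on those nines, here's the trivial arithmetic for the annual downtime budget each SLA level implies:

    # Downtime budget per year implied by an availability percentage.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
        budget_min = MINUTES_PER_YEAR * (1 - sla / 100)
        print(f"{sla}% -> {budget_min:.1f} min/year ({budget_min / 60:.2f} h)")

A 99.95% SLA allows roughly 4.4 hours of downtime per year, so a single multi-hour event like this one can consume the entire annual budget.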
However, us-east is usually the cheapest one as well.
That's why Netflix stays up when us-east or us-west are down.
So are they saying they are throttling SQS because of the DynamoDB issue?
3:14 AM PDT We are investigating increased error rates in the US-EAST-1 Region.
4:06 AM PDT We can confirm increased error rates for CreateQueue, SendMessage and ReceiveMessage API calls in the US-EAST-1 Region and continue to work towards resolution.
5:07 AM PDT We can confirm increased error rates for CreateQueue, SendMessage and ReceiveMessage API calls in the US-EAST-1 Region. As we work towards recovery, error rates may temporarily increase.
6:06 AM PDT We can confirm significantly increased error rates for CreateQueue, SendMessage and ReceiveMessage API calls in the US-EAST-1 Region. As we work towards recovery, error rates may temporarily increase.
* If adding to SQS fails, temporarily store the item on disk or S3, then add it to SQS when it's back up? (rough sketch after this list)
* any other options?
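On the first option: a rough sketch of the spill-to-S3 idea, assuming boto3 and a hypothetical bucket/queue, and leaving out retries and partial-failure handling:

    # Spill-to-S3 fallback for SQS writes. Queue URL, bucket, and key
    # scheme are hypothetical; retries and cleanup are omitted.
    import uuid
    import boto3
    from botocore.exceptions import ClientError

    sqs = boto3.client("sqs", region_name="us-west-2")
    s3 = boto3.client("s3")

    QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/work-queue"
    FALLBACK_BUCKET = "my-sqs-spillover"  # hypothetical bucket

    def enqueue(body: str):
        # Try SQS first; park the message in S3 if SQS is erroring.
        try:
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
        except ClientError:
            s3.put_object(Bucket=FALLBACK_BUCKET,
                          Key="pending/" + str(uuid.uuid4()),
                          Body=body.encode())

    def replay_pending():
        # Once SQS recovers, drain the spillover prefix back into the queue.
        listing = s3.list_objects_v2(Bucket=FALLBACK_BUCKET, Prefix="pending/")
        for obj in listing.get("Contents", []):
            body = s3.get_object(Bucket=FALLBACK_BUCKET,
                                 Key=obj["Key"])["Body"].read().decode()
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
            s3.delete_object(Bucket=FALLBACK_BUCKET, Key=obj["Key"])

The obvious caveat: this only helps if S3 (or local disk) isn't in the same failure domain as SQS.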
I'm not completely sure, though, whether DynamoDB would benefit from relying internally on SQS.
Hopefully there is a relatively fast recovery on this.
Can anyone even log into their AWS console right now?
I wonder if SQS uses DynamoDB, not the other way around.
Not sure if this is related.
Morning is about to start. Traffic will ramp up, and I'm not sure whether new servers will be launched, because CloudWatch has failed. Polling SQS for lifecycle notification messages fails too.
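For context, the polling in question looks roughly like this (a sketch of consuming Auto Scaling lifecycle notifications from SQS; the queue URL is hypothetical). When SQS itself is erroring, the receive call raises and no scaling events get processed at all:

    # Sketch: polling SQS for Auto Scaling lifecycle notifications.
    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lifecycle-events"

    def poll_lifecycle_events():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)  # long polling
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            if event.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_LAUNCHING":
                pass  # register the new instance, complete the lifecycle action
            sqs.delete_message(QueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"]) if False else \
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])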
This might come across as tooting our own horn a bit, but it's more about sounding a warning to other startups providing SaaS built on the public cloud. My own misgivings about relying on a cloud-provider-specific stack (both for reasons of visibility/debuggability and of vendor lock-in) meant that PacketZoom services were not affected by this failure at all, because we only use them as one of many providers of raw machines. We use our own techniques to load-balance/fail over among multiple cloud providers too (so even if the raw compute/network went away, our service would take a perf hit but not be completely down).
Not when the original goal of the very service is to have presence in all geographical regions. If AWS us-east is hit, I want users to transparently fail over to another server on the East Coast (perhaps one hosted by Google or SoftLayer) rather than be directed all the way to us-west or the EU.
And as for ELB, one doesn't use ELB for a custom protocol that load-balances/fails-over itself from the client :-)
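To illustrate the client-side failover idea (hostnames are made up; real logic would add health scoring, caching, and jitter):

    # Toy client-side failover across provider endpoints in one geography.
    import socket

    ENDPOINTS = [
        "edge-east.aws.example.com",        # primary: AWS us-east
        "edge-east.gce.example.com",        # same geography, different provider
        "edge-east.softlayer.example.com",
    ]

    def connect_with_failover(port=443, timeout=2.0):
        # Walk the list in order; return the first live connection.
        for host in ENDPOINTS:
            try:
                return socket.create_connection((host, port), timeout=timeout)
            except OSError:
                continue  # provider down or unreachable; try the next
        raise ConnectionError("all endpoints unreachable")

The point being: failover stays in the same geography instead of bouncing users to us-west or the EU.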
"RWC2015ppv.com has been affected by an internet outage. Watch here. Not all mobile devices are compatible"
> The Datomic software consists of the peer library and the transactor
> These components connect to one of several storage service options
I'm pretty well versed in computer science and buzzwords, but that means nothing to me. That was from the overview, where I expected a dead-simple "this is what it does and why you need it". There are several important questions that don't seem to be addressed:
- I'm guessing it's a database proxy that's intelligent?
- Is it better than HAProxy?
- How is it different/better?
And most importantly:
- Do I need to modify my current code base to interact with this thing?
In case you think I'm being overly dramatic - here are 2 examples:
On the front page I know exactly what they offer.
On the product page I can see exactly what slack is and offers.
To gain some insights into our differences, here is my thought process:
I agree the description is quite fuzzy, but to me it clarifies that it's a database, not just a proxy. Saying "transactional" suggests ACID properties, and the "cloud ready" part hints at scalability.
Then you would think "Oh, not again, another DB", so it continues with the "Why Datomic?" heading.
After that, the first link is "Read the Rationale", which explains everything about the software quite concisely. It's a database, though, so don't expect to understand it in a few seconds.
That being said, the video on the front page gives a very solid explanation in its first 40 seconds...
Thanks for the examples; now I understand what you were thinking.
Those tag lines are really well done, but I think they had an easier job, since they were describing much simpler services.
We detached this comment from https://news.ycombinator.com/item?id=10247727 and marked it off-topic.
We are constrained by time. We can't invest our time into investigating every claim that we come across.
Drugs can help some, but they are not necessarily the answer in all cases (including this case).
We are using the starter version combined with DynamoDB in production, and we found the payment structure very clear; no obfuscation whatsoever, unlike a Microsoft or Adobe pricing matrix ;)
(It's made by the same guy who made the very open-source Clojure programming language, btw, and he is very much against obfuscation.)
Anyway, it's a competitive advantage for the folks who are building a bank on top of it in Brazil (https://www.youtube.com/watch?v=7lm3K8zVOdY).
We feel we can avoid writing a lot of authorization and audit-log code by using Datomic; maybe you can save such work too.
Just to clarify: you appear to be linking 'onetom' to the project, but I can't see any evidence she is in any way related.
Only before that, eh?
Also, mind disclosing the fact that you are blatantly advertising your own services?
How did you come to this conclusion? From smoyer's comment?
(btw, thx smoyer, nice joke. :)
Also I wrote about ADHD on HN in the past.
I really didn't mean it in an offensive way:
Against a background of such high volatility, it would be hard to pinpoint a material effect from such a small disruption (in the big picture, of course; I'm betting there are some pretty angry customers today due to the loss of a few sigmas of reliability from this outage alone).
Now if a study were published indicating customers were switching providers over incidents like this, then I think you'd have some material evidence. But is anyone else any better? Azure was apparently out for 12 hours last year...