AWS status updates not working due to S3 (twitter.com/awscloud)
126 points by joshua_wold on Feb 28, 2017 | 47 comments

It baffles me that AWS, a leader in cloud computing, can make such a rudimentary mistake. Seriously, I interviewed there and they asked me to write a b+ tree and I failed. And then you see fundamental errors like this which possibly cannot be made by people who had the smarts to write b+ trees in 15 minutes...

I want to take this opportunity to complain about the interview system. Hire people who care about the product and company. Such mistakes cannot be made by people who care.

Writing a B+ tree from memory and making sure your infrastructure isn't doing something stupid are fundamentally different skills. One requires that you regurgitate the contents of a textbook on a whiteboard, the other that you can engineer a solution. I wish them well on an interview set up for hiring the former; I try to hire the latter.

> Writing a B+ tree from memory and making sure your infrastructure isn't doing something stupid are fundamentally different skills.

It's funny. You know it. I know it. Entire HN knows it. And yet _no_ interview follows any such common sense rules. Just go to a Google/FB interview and they ask you all sorts of questions. It doesn't matter what you are interviewing for. In fact, in many cases they don't even tell you which group/team/project you will be assigned to. Since they will "assess" where you fit best.

> Entire HN knows it

I don't think that's true. A large portion of companies/hiring managers don't get this, all the way down to actual engineers.

I've come to believe it's largely an "if I had to go through that BS, you do too!" tit-for-tat thing engineers now play.

> Since they will "assess" where you fit best.

Disclaimer: I work for Google

At Google this just isn't true unless we are talking about new grads. After you pass the interviews, you have to do team matching, where you will have informal two-way interviews with prospective teams. Only after you find a team that you like and that likes you can you get an offer, assuming your application is approved.

It's definitely not a case of sticking you in a team without your input. It's true that at the interview stage you won't know, but by the time you have an offer you know what team you'll be on.

I believe at Facebook you have even more freedom. You get to go to boot camp for 3 months and after that get to choose what team you want to be on. I haven't worked there, so this is just based on what recruiters and friends have told me; hopefully someone else can correct me or elaborate.

Thanks for the reply. I am talking about the interviewing stage. I have more than 10 years of engineering experience, but Google HR actually sent me a PDF of all the topics I should be well versed in. The booklet was basically my entire grad course and master's and more. I can confirm from more than 5 sources that this is the case with Google interviews.

You stated two things: the fact that Google asks algorithms questions and "Since they will "assess" where you fit best". The GP was refuting the second of these things (the GP even quoted it) while you were defending the first statement.

BTW, I think that second point is a common misconception that deserves rebuttal, because I think that until five years ago or so Google wouldn't tell you which team you would join (or give you much choice) before deciding on an offer, even if you were experienced.

Lazy interviewers looking for magic shortcuts / lazy managers not mentoring/teaching/structuring how to dig into candidate ability to solve real, hairy problems quickly and get a window into candidate problem-solving thought-processes. Hiring by random "intuition" is both evidence-free and scatter-gun half-assery that results in wasting everyone's time by not attracting/selecting for great candidates.

> cannot be made by people who had the smarts to write b+ trees in 15 minutes...

Writing B+ trees at the drop of a hat is probably more a signal of memorization and recency of taking a data structures class than smarts, particularly the smarts necessary to develop and maintain robust distributed infrastructure.

1. The number of people at Amazon who need to be able to write a b+ tree is negligible.

2. The number of people at Amazon who need to be able to choose a search/sort/etc. algorithm or data structure and understand why one is more appropriate than another for a given use case is much higher.

3. The number of people at Amazon who need to demonstrate common sense is very high. This skill is much more closely related to #2 than #1.

Yup, it's BS in lieu of real problems.

CS fundamentals are nice to know, but how often does one implement something custom like BigTable/Colossus from scratch vs. buy/use OTS? The support/scalability/technical debt/unforeseen costs of implementing something entirely new are typically much greater than those of using adequate "lego" that already exists.

Judgement of cost/benefit DIY vs. OTS can be gained (hopefully) without too much wasted effort, time, money, morale & business life-expectancy.

I doubt that tech companies have data to back up this type of interview, something like the correlation between people who do well in whiteboard binary inversion interviews and people who do well in real-world jobs. Or they just do it because that's the way Google does it.

>Such mistakes cannot be made by people who care.

That's a very naive assertion. Humans make mistakes, they always have and they always will, no matter how smart they are and how much they care. That's why pilots have checklists that they go through before they're even allowed to leave the gate.

This statement (it starts with 'such') was about a very specific mistake. This is not some hard engineering problem. People who 'care' about the authenticity of the status page will not make the error of basing it on the infrastructure it monitors.

Takeoff in a plane isn't a hard engineering problem either, but without checklists even one of the best pilots in the world forgot to unlock his rudder before heading down the runway. [1]

Of course no one who cares would intentionally make the choice of building the status page on the infrastructure that it monitors, but it's not that difficult for something to creep into the dependency chain and then you don't find out until the next outage a year later.

Mistakes were made, but attributing it to lack of caring is misguided. Good intentions don't work, you need mechanisms to enforce good practices. Those mechanisms obviously failed here, but the solution is to fix the mechanisms, not ask people to care more.

[1] http://www.ww2hc.org/emailarchives/2011/checklistorigin.htm
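One mechanism of the kind argued for above is an automated check that the status page never transitively depends on anything it reports on. The following is only a minimal sketch: the dependency graph, the component names, and the function names are all hypothetical, and a real check would need to build the graph from actual deployment metadata.

```python
from collections import deque


def transitive_dependencies(graph, start):
    """Return every component reachable from `start` by following dependency edges."""
    seen = set()
    queue = deque(graph.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def forbidden_dependencies(graph, status_page, monitored):
    """Return the monitored services that the status page depends on, directly or not."""
    return transitive_dependencies(graph, status_page) & set(monitored)


# Hypothetical graph: each key depends on the components in its list.
DEPENDENCIES = {
    "status-page": ["status-renderer", "icon-host"],
    "status-renderer": ["status-definitions"],
    "status-definitions": ["s3"],  # the dependency that crept in
    "icon-host": [],
}

if __name__ == "__main__":
    leaked = forbidden_dependencies(DEPENDENCIES, "status-page", ["s3", "ec2"])
    if leaked:
        print("status page depends on monitored services:", sorted(leaked))
```

Run in CI, a check like this fails the build when a new dependency creeps in, rather than leaving it to be discovered during the next outage.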

Because bureaucracy, siloization, Tragedy of the Commons and likely a massive infrastructure with tons of technical debt and inability to make major changes except by gradual incrementalism, oft too late. Infrastructure needs active SimianArmy-style breakage finding without "sacred cows" but with less than 100% uptime across all services to ferret out outage edge-cases.

So now you know that a deadman switch is the better way to report availability. The logic was backwards for this signal. The default condition is failed. Not failed requires proof.
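A minimal sketch of that deadman-switch logic, for illustration only. The class name, the timeout parameter, and the status strings are assumptions, not anything AWS is known to use. The point is that the default answer is "failed" and only a recent heartbeat can change it.

```python
import time


class DeadMansSwitch:
    """Reports "failed" unless a heartbeat arrived within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._last_heartbeat = None    # no proof of life yet

    def heartbeat(self):
        """Called by the monitored service to prove it is alive."""
        self._last_heartbeat = self._clock()

    def status(self):
        # Default condition is failed; "ok" requires recent proof.
        if self._last_heartbeat is None:
            return "failed"
        age = self._clock() - self._last_heartbeat
        return "ok" if age <= self._timeout else "failed"
```

If the service, the network path, or the reporting tool dies, heartbeats stop arriving and the status falls back to "failed" without anyone having to push an update.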

It's interesting how easy it is to accidentally invert logical operations. I see it in code all the time. A condition will test that A is true when what they really need to know is if B and C are both false. It's like some kind of cognitive tic.

That's good practice, sure, but their problem was even more fundamental than that. Their status page was dependent on the service it was reporting on being up. That fails the most basic requirement of a status page.

This should be the official anti-pattern when designing a status page.

It is... it's literally the reason products like statuspage.io exist, because if your status page has any dependencies on the services for which it provides statuses, then it's not really a useful status page.

And yet, I cannot find any obvious information on where statuspage is hosted.

http://metastatuspage.com/ should give a hint

Builtwith seems to think they are hosted on EC2 https://builtwith.com/statuspage.io

Their status page says so anyway - http://metastatuspage.com/incidents/lb3rpt031vmx

The status page should work with just IPv4 or IPv6, BGP and round-robin on a bunch of location-diverse, simple, real metal, web-boxes that only serve status.

Looks like they've fixed it now. (The status page, not s3)

This is a problem of "monoculture" dependencies and failure to implement HA by using multiple services. All Github releases are down, atom downloads are down and so on. Companies, including Amazon, should be using other CDNs for HA purposes, even if NIH.

It's a similar mistake of making DNS a dependency for monitoring/control infrastructure when DNS is down.

Assuming that it actually makes business sense to do so. There are certainly cases where you can make a perfectly rational business decision to depend on someone else's services and you're OK with your uptime not being any better than their uptime.

> It's a similar mistake of making DNS a dependency for monitoring/control infrastructure when DNS is down.

This one is relatively easy to "fix" at least, it's nothing having multiple DNS providers for public records can't handle as well as ensuring redundancy for your internal DNS services. Bonus points if you run your own recursive resolver so you aren't dependent on some other party not screwing up somehow.

> The dashboard not changing color is related to S3 issue.

I don't understand this. The icon URL is in the HTML. Both icons https://status.aws.amazon.com/images/status0.gif and https://status.aws.amazon.com/images/status3.gif have been working for us all along. Plus clearly they are able to update the status page contents, because they added the "increased error rates" message there too. I don't want to believe it but is it fair to assume they did not want to replace status0.gif with status3.gif in HTML? Please correct me if I'm not getting this straight.

In any case, it's a bad day for AWS folks, I'm feeling their pain too. Being a cloud provider is a tough business to be in and the pressure is really high.

One explanation might be that they use an internal tool to update the status page definitions, and parts of that tool are hosted on S3. Or that the status definitions themselves are hosted on S3 (and then read and transformed into the HTML page everyone sees)

Drone crashes into my living room with groceries. Receive email that my package was successfully delivered.

I would hate to imagine drones dropping out of the sky if S3 went down in the future.

Meanwhile, on InfoWars "S3: NWO AI mothership ready to hijack everything" ;)

Cracker! ROTFL

So the obvious answer would be to host it on something like Azure or Google Cloud Storage, but I can just imagine the institutional pushback that would get.

What if I told you you could make a red dot without hosting an image anywhere?

Red dot as a service?

What if I told you that the status text didn't update either?

Seems like another commenter beat me to the punch, but to me the obvious answer would seem to be not hosting images at all, as they could create a red circle in CSS.

Or as others noted reverse the logic so that it shows red icons by default but as long as the services are working then it replaces that with a green icon. And when those external services are down it would go back to a red icon.
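Both ideas can be combined in a few lines. This is a rough sketch with made-up names (`status_dot`, the `healthy` parameter, the label text): the dot is drawn with inline CSS so no image has to be hosted anywhere, and it is red unless health has been positively confirmed.

```python
from html import escape


def status_dot(service, healthy=None):
    """Render a status dot as inline-styled HTML, with no external image.

    `healthy` must be exactly True for a green dot; False or None
    (unknown, e.g. the check itself could not run) both render red.
    """
    confirmed = healthy is True
    color = "green" if confirmed else "red"
    label = "operational" if confirmed else "unavailable or unknown"
    return (
        f'<span title="{escape(service)}: {label}" '
        'style="display:inline-block;width:12px;height:12px;'
        f'border-radius:50%;background:{color}"></span>'
    )


if __name__ == "__main__":
    print(status_dot("S3"))        # red: nothing has confirmed health
    print(status_dot("EC2", True)) # green: health was confirmed
```

As the replies note, this only removes the icon dependency; whatever supplies the `healthy` value still has to be independent of the service being reported on.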

The status text also didn't update. Seems the S3 dependency is more than just icon hosting.

Just to be clear...best practices with designing status pages:

1. ensure it does not depend on your infra (if your api server goes down - it should not take down your status api with it)

2. make sure your service reports to your status page instead of your status page looking for the service.

3. redundancy for your status page?

Anything anyone else wants to add?

> 1. ensure it does not depend on your infra (if your api server goes down - it should not take down your status api with it)

Many people forget DNS in the equation.

If it's on a subdomain of your regular site, it will go down in case the domain is accidentally/maliciously transferred or legal authorities seize/block it (we're seeing the extremely long arm of the US law enforcement with Mr. Dotcom, as well as Erdogan and other dictators or the Chinese firewall).

If it's on a different domain that's on the same DNS hoster (e.g. Amazon's Route 53, or for that matter your own hoster!) you're screwed if the DNS fails.

If it's via the same registrar, you're screwed if someone obtains access to your registrar account (this once again includes law enforcement).

Obviously this also holds true for the TLD itself: if, say, Verisign (which runs .com and .net) has problems, you want your status page on a different TLD, for example a .info.

Conclusion: different datacenter/provider for the HTTP server part, different DNS provider(s), different TLD. For the datacenter and DNS provider level you can use high-availability (multiple different NS entries, multiple different servers), this can also protect from legal overreach.
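The checklist above can be turned into a small audit. This is only a sketch under stated assumptions: the `Deployment` record, its field names, and the example providers are hypothetical and filled in by hand, and the TLD is naively taken as the last label of the domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    domain: str
    hosting_provider: str
    dns_providers: frozenset
    registrar: str


def tld(domain):
    """Naive TLD: the last dot-separated label."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def shared_failure_points(main, status):
    """List what the status page shares with the main site."""
    shared = []
    main_domain = main.domain.lower().rstrip(".")
    status_domain = status.domain.lower().rstrip(".")
    if status_domain == main_domain or status_domain.endswith("." + main_domain):
        shared.append("domain")
    if tld(status_domain) == tld(main_domain):
        shared.append("tld")
    if status.hosting_provider == main.hosting_provider:
        shared.append("hosting")
    # The status page needs at least one DNS provider the main site does not use.
    if not (status.dns_providers - main.dns_providers):
        shared.append("dns")
    if status.registrar == main.registrar:
        shared.append("registrar")
    return shared


if __name__ == "__main__":
    main = Deployment("example.com", "aws", frozenset({"route53"}), "registrar-a")
    status = Deployment("status.example.com", "aws", frozenset({"route53"}), "registrar-a")
    print(shared_failure_points(main, status))
```

An empty list means none of the listed single points of failure is shared; it says nothing about dependencies the record does not model, such as certificates.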

Also, your status page may have a negligible load as long as your service is operating fine, but people tend to go to status pages and manically press Cmd+R until there's a green light - so best use nginx/lighttpd with static pages and minimal assets only.

If you're running HTTPS on your main site and you do choose to name it "status.mydomain.com", also deploy HTTPS on your status page. Otherwise people visiting status.mydomain.com may transmit session cookies in cleartext in case you forgot the Secure flag or the client does not honor it (for whatever reason).

Oh, and do buy a separate HTTPS cert instead of using your usual wildcard cert or your primary cert with the status page as SAN, so your status page stays up when your primary cert expires...

Not sure about 2 - What are your arguments for this?

If the status page relies on getting updated information from the service, it may not even notice when the whole thing just crashes and goes down in flames. Attempting to do some predefined calls to the service to evaluate whether it is working correctly appears like a better solution?

Yea I was wondering about that comment too. I mean you can do both. Your status page should be static, updated by a service which both polls and accepts information from your services. You ideally want to go yellow if one of the two fails.

But yes, in general, the status page and status services should be entirely on their own independent infrastructure; and in a different data centre. A number of providers offer independent status page services. If your entire company runs off Digital Ocean, your status page/services should probably be running on Linode or AWS or whatever.
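The push-plus-poll idea above can be sketched as a tiny decision function. The function name and the color strings are illustrative, and treating "both signals failed" as red is an assumption; the comment only specifies yellow when one of the two fails.

```python
def combined_status(push_ok, poll_ok):
    """Combine two independent health signals into one status color.

    push_ok: the service reported in recently (heartbeat received).
    poll_ok: the status service's own probe of the service succeeded.
    """
    if push_ok and poll_ok:
        return "green"
    if push_ok or poll_ok:
        # The signals disagree: either the service or one reporting path is broken.
        return "yellow"
    return "red"
```

The static status page can then be regenerated from this value, so serving the page never requires the monitored service to be reachable.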

Ironic: the status update, doomed by its own downtime.
