I want to take this opportunity to complain about the interview system. Hire people who care about the product and company. Such mistakes cannot be made by people who care.
It's funny. You know it. I know it. All of HN knows it. And yet _no_ interview follows any such common-sense rules. Just go to a Google/FB interview and they ask you all sorts of questions. It doesn't matter what you are interviewing for. In fact, in many cases they don't even tell you which group/team/project you will be assigned to, since they will "assess" where you fit best.
I don't think that's true. A large portion of companies/hiring managers don't get this, all the way down to actual engineers.
I've come to believe it's largely an "if I had to go through that BS, you do too!" tit-for-tat game engineers now play.
Disclaimer: I work for Google
At Google this just isn't true unless we are talking about new grads. After you pass the interviews, you have to do team matching, where you have informal two-way interviews with prospective teams. Only after you find a team that you like and that likes you can you get an offer, assuming your application is approved.
It's definitely not a case of sticking you in a team without your input. It's true that at the interview stage you won't know, but by the time you have an offer you know what team you'll be on.
I believe at Facebook you have even more freedom. You get to go to boot camp for 3 months and after that choose what team you want to be on. I haven't worked there, so this is just based on what recruiters and friends have told me; hopefully someone else can correct me or elaborate.
BTW, I think that second point is a common misconception that deserves rebuttal, because until five years ago or so Google wouldn't tell you which team you would join (or give you much choice) before deciding on an offer, even if you were experienced.
Writing B+ trees at the drop of a hat is probably more a signal of memorization and recency of taking a data structures class than smarts, particularly the smarts necessary to develop and maintain robust distributed infrastructure.
2. The number of people at Amazon who need to be able to choose a search/sort/etc. algorithm or data structure and understand why one is more appropriate than another for a given use case is much higher.
3. The number of people at Amazon who need to demonstrate common sense is very high. This skill is much more closely related to #2 than #1.
CS fundamentals are nice to know, but how often does one implement something custom like BigTable/Colossus from scratch vs. buy/use OTS? The support/scalability/technical debt/unforeseen costs of implementing something entirely new are typically much greater than using adequate "lego" that already exists.
Judgement of the cost/benefit of DIY vs. OTS can (hopefully) be gained without too much wasted effort, time, money, morale, and business life expectancy.
That's a very naive assertion. Humans make mistakes, they always have and they always will, no matter how smart they are and how much they care. That's why pilots have checklists that they go through before they're even allowed to leave the gate.
Of course no one who cares would intentionally make the choice of building the status page on the infrastructure that it monitors, but it's not that difficult for something to creep into the dependency chain and then you don't find out until the next outage a year later.
Mistakes were made, but attributing them to a lack of caring is misguided. Good intentions don't work; you need mechanisms to enforce good practices. Those mechanisms obviously failed here, but the solution is to fix the mechanisms, not ask people to care more.
It's interesting how easy it is to accidentally invert logical operations. I see it in code all the time. A condition will test that A is true when what they really need to know is whether B and C are both false. It's like some kind of cognitive tic.
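To make the inversion concrete, here's a tiny made-up example in Python (the names and flags are invented for illustration): the buggy guard reads plausibly, but it answers a different question than the one we actually need.

```python
# Hypothetical example: we want the "all clear" icon only when *both*
# degradation flags are false.

def show_green_buggy(api_degraded: bool, queue_degraded: bool) -> bool:
    # Inverted by accident: this is true whenever at least one flag is false,
    # so a single healthy component keeps the icon green during an outage.
    return not (api_degraded and queue_degraded)

def show_green_correct(api_degraded: bool, queue_degraded: bool) -> bool:
    # What we actually wanted: both flags must be false.
    return not api_degraded and not queue_degraded

assert show_green_buggy(True, False) is True      # looks fine, isn't
assert show_green_correct(True, False) is False   # correctly goes red
```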
It's a similar mistake to make DNS a dependency of your monitoring/control infrastructure: when DNS is down, so is the tooling you need to respond.
This one is relatively easy to "fix", at least: use multiple DNS providers for your public records and ensure redundancy for your internal DNS services. Bonus points if you run your own recursive resolver so you aren't dependent on some other party not screwing up somehow.
I don't understand this. The icon URL is in the HTML. Both icons https://status.aws.amazon.com/images/status0.gif and https://status.aws.amazon.com/images/status3.gif have been working for us all along. Plus clearly they are able to update the status page contents, because they added the "increased error rates" message there too. I don't want to believe it but is it fair to assume they did not want to replace status0.gif with status3.gif in HTML? Please correct me if I'm not getting this straight.
In any case, it's a bad day for AWS folks, I'm feeling their pain too. Being a cloud provider is a tough business to be in and the pressure is really high.
Or, as others noted, reverse the logic: show red icons by default and only replace them with green icons while the services are confirmed working. Then when those external services are down, the page falls back to red.
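A rough sketch of that fail-closed rendering in Python (the heartbeat file, field names, and threshold are all made up): red is the default, and only a fresh, healthy heartbeat earns the green icon.

```python
import json
import time

# Hypothetical heartbeat store: each successful check writes
# {"<service>": {"ok": true, "ts": <unix time>}} somewhere the status page
# can read without touching the production infrastructure it reports on.
HEARTBEAT_FILE = "heartbeats.json"
STALE_AFTER_SECONDS = 120

def icon_for(service: str) -> str:
    """Default to the red icon; only a recent healthy heartbeat earns green."""
    try:
        with open(HEARTBEAT_FILE) as f:
            beats = json.load(f)
        beat = beats[service]
        fresh = (time.time() - beat["ts"]) < STALE_AFTER_SECONDS
        if beat["ok"] and fresh:
            return "status0.gif"   # green
    except (OSError, KeyError, ValueError):
        pass  # missing, unreadable, or malformed data all fall through to red
    return "status3.gif"           # red: the fail-closed default
```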
1. ensure it does not depend on your infra (if your api server goes down - it should not take down your status api with it)
2. make sure your service reports to your status page instead of your status page looking for the service (see the sketch after this list).
3. redundancy for your status page?
anything anyone else wants to add?
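For point 2, a minimal sketch of the push model (the endpoint and names are hypothetical, not any real status service's API): the monitored service emits heartbeats, and the status page marks it down on its own when they stop arriving.

```python
import time
import requests  # third-party: pip install requests

# Hypothetical endpoint hosted on completely separate infrastructure.
STATUS_ENDPOINT = "https://status.example.com/api/heartbeat"

def report_heartbeat(service: str) -> None:
    """The monitored service pushes its own 'I'm alive' signal.

    If heartbeats stop arriving, the status page flags the service as down
    by itself; it never has to reach into the production network to find out.
    """
    try:
        requests.post(
            STATUS_ENDPOINT,
            json={"service": service, "ok": True, "ts": int(time.time())},
            timeout=5,
        )
    except requests.RequestException:
        # Deliberately swallow errors: the production service must never
        # depend on the status page being reachable.
        pass

# e.g. call report_heartbeat("api") from a cron job or the service's main loop
```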
Many people forget DNS in the equation.
If it's on a subdomain of your regular site, it will go down if the domain is accidentally/maliciously transferred or legal authorities seize/block it (we're seeing the extremely long arm of US law enforcement with Mr. Dotcom, as well as Erdogan and other dictators, or the Chinese firewall).
If it's on a different domain that's on the same DNS hoster (e.g. Amazon's Route 53, or for that matter your own hoster!) you're screwed if that DNS fails.
If it's via the same registrar, you're screwed if someone obtains access to your registrar account (this once again includes law enforcement).
Obviously this also holds true for the TLD itself - e.g. if Verisign (which operates .com and .net) has problems, you want your status page on a .info, for example.
Conclusion: different datacenter/provider for the HTTP server part, different DNS provider(s), different TLD. At the datacenter and DNS provider level you can use high availability (multiple different NS entries, multiple different servers); this can also protect against legal overreach.
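If you want to sanity-check that diversity, a quick script along these lines (using the third-party dnspython package; the domains are placeholders and the provider grouping is a crude heuristic) shows whether all your NS records collapse onto a single provider.

```python
import dns.resolver  # third-party: pip install dnspython

def nameserver_providers(domain: str) -> set[str]:
    """Group a domain's NS records by provider, e.g. {'awsdns-45.com'}.

    If the main site and the status domain both resolve to one provider,
    a failure there takes out resolution for both at once.
    """
    answers = dns.resolver.resolve(domain, "NS")
    providers = set()
    for record in answers:
        ns_host = record.to_text().rstrip(".")
        # Crude heuristic: group by the last two labels of the NS hostname.
        providers.add(".".join(ns_host.split(".")[-2:]))
    return providers

print(nameserver_providers("example.com"))        # your main domain
print(nameserver_providers("example-status.net")) # hopefully a different set
```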
Also, your status page may have a negligible load as long as your service is operating fine, but people tend to go to status pages and manically press Cmd+R until there's a green light - so best use nginx/lighttpd with static pages and minimal assets only.
If you're running HTTPS on your main site and you do choose to name it "status.mydomain.com", also deploy HTTPS on your status page - else people visiting status.mydomain.com may transmit session cookies in cleartext in case you forgot the SECURE flag or the client does not honor this (for whatever reason).
Oh, and do buy a separate HTTPS cert instead of using your usual wildcard cert or your primary cert with the status page as SAN, so your status page stays up when your primary cert expires...
If the status page relies on getting updated information from the service, it may not even notice when the whole thing just crashes and goes down in flames. Making some predefined calls to the service to evaluate whether it is working correctly seems like a better solution?
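Something like this is what I mean by predefined calls (a rough sketch with a made-up endpoint and response shape): actively probe a real request path instead of trusting the service's own self-reporting.

```python
import requests  # third-party: pip install requests

# Hypothetical health probe: hit a real, cheap endpoint and check the answer.
PROBE_URL = "https://api.example.com/v1/ping"

def probe_ok() -> bool:
    """Return True only if the service answers correctly within the timeout."""
    try:
        resp = requests.get(PROBE_URL, timeout=5)
        # A crashed process, a hung box, or a bad deploy fails one of these.
        return resp.status_code == 200 and resp.json().get("status") == "ok"
    except (requests.RequestException, ValueError):
        return False
```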
But yes, in general, the status page and status services should be entirely on their own independent infrastructure; and in a different data centre. A number of providers offer independent status page services. If your entire company runs off Digital Ocean, your status page/services should probably be running on Linode or AWS or whatever.