Fastmail 30 June outage post-mortem (fastmail.com)
163 points by _han on July 6, 2023 | 75 comments



Incidents happen, that's life. You can hedge your bets, but some things are out of your control. Communicating with your customers, however, is entirely within your control, and Fastmail did a poor job of it. Their status page was useless beyond an initial "we found an issue" and then nothing for almost 11hrs. Their Twitter account was the same story, and they didn't bother with the Mastodon account at all. Unfortunately, they don't seem to realise or recognise that they dropped the ball on this, and it goes entirely unaddressed.

I'm also not really charmed with how they try to minimise the importance of the incident by repeating it only affected 3-5% of the customers. That may well be true, but those are real people and real businesses relying on your services, and they were unavailable for the whole of the EU workday and a significant part of the US workday. Everyone I know who was affected is a paying customer, and none of us have received so much as a communication or apology for it.

For a company that's been on the internet since 1999, the single-homed setup is a little shocking. Fine, it's being addressed, but the communication both during and after the incident doesn't inspire a ton of confidence.


On one hand, I too would like better communication in situations like this. On the other, you knew they were aware of the issue and working on it. Updating the status page with a "we're trying to fix it" every hour or so wouldn't speed up the process of fixing the problem or help you in any way.


I didn’t ask for an update every hour, but an 11hr stretch of silence is not cool. An update every 3 would’ve been fine. Even if there isn’t much new to share, reiterating that you’re working on it reassures your customers. I’m fairly certain they could’ve found something more meaningful to communicate than “still twiddling our thumbs”. For example, the update after 11hrs included the tip to switch to a VPN. That would’ve been useful to communicate much earlier.


> I'm also not really charmed with how they try to minimise the importance of the incident by repeating it only affected 3-5% of the customers.

It's not meant to be charming, it's meant to convey the scale of the outage.

As a customer, even if I'm not affected, there's a world of difference between 3-5% and 99%.


I'll admit that finding out they're single-homed makes Fastmail's infrastructure sketchier than I'd assumed. I've been using Fastmail for years and I like them, but they are clearly big enough to have a second transit provider, and have been for many years. I'm amazed it took an outage for them to decide to get one. I appreciate the post-mortem, but I felt better before I had read it.


Judging by the comments and upvotes here, and previously, I am actually amazed how almost everybody thinks they themselves are perfect and infallible, and that they surely deserve 100% perfection from the day they're born to the day their great-great-... grandkids die.


I'm surprised they don't own their own IPs. In the email world I would say that's quite important. Seems a tad casual to say "luckily they are willing to lease them"...


Yeah, me too, but it sounds like they have good reason and are working to fix it:

> Thankfully, NYI were willing to lease us those addresses, because IP range reputation is really important in the email world, and those things are hardcoded all over the place — but it has caused us complications due to more complex routing. Over the past year, we’ve been migrating to a new IP range. We’ve been running the two networks concurrently as we build up the trust reputation for the new addresses.

Might be one of those things that they did when they were small, but then got hard to change. Hopefully they will fully own their new addresses.

The thing I'm surprised about is that they only have "single path for traffic out to the internet."


I'm a little rusty in this area, but I'm pretty sure you'll have a hard time getting a direct IP allocation (vs from your transit provider) unless you are multi-homed, which apparently they were not.

There is a nice step up in complexity when you move to advertising your address space via BGP to multiple providers.
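As a rough illustration of the multi-homing check (nothing to do with Fastmail's actual setup): public BGP data will show how many neighbours an AS has. The sketch below uses RIPEstat's data API; the endpoint and field names are recalled from its docs and may differ, so treat it as an assumption to verify.

```python
# Sketch: count the BGP neighbours of an AS using RIPEstat's public data API.
# Assumption: the "asn-neighbours" endpoint and its response fields are recalled
# from the RIPEstat docs and may differ; verify before relying on this.
import json
import urllib.request

def neighbour_count(asn: int) -> int:
    url = f"https://stat.ripe.net/data/asn-neighbours/data.json?resource=AS{asn}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    neighbours = payload.get("data", {}).get("neighbours", [])
    # "left" neighbours are (roughly) the ASes seen to the left in AS paths,
    # i.e. upstreams/peers; a single-homed network shows essentially one.
    upstreams = [n for n in neighbours if n.get("type") == "left"]
    print(f"AS{asn}: {len(neighbours)} neighbours, ~{len(upstreams)} upstream/peer")
    return len(upstreams)

if __name__ == "__main__":
    neighbour_count(13335)  # Cloudflare's ASN, just as a well-known example
```

The complexity the parent mentions is mostly in keeping announcements, prefix filters, and failover behaviour consistent across two or more of those sessions.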


First, someone has to be willing to sell them…


I'm not clear from the post-mortem why the outbound packets were having issues. Was it Cloudflare? Did someone accidentally delete an outbound route? Why couldn't they see the issue themselves? I only have more questions now.


It looks like their upstream provider fucked something up, so until they release that info we can only speculate.


Crazy that they were single homed.

Edit: After seeing the network diagram I have even more questions. What happens if CF is down? This all seems cobbled together and very prone to failures.


If CloudFlare is down, a significant portion of the Internet is down. Not that it's an excuse, but this isn't Microsoft or Apple. I'm sure funds have to be allocated to take into account the likelihood of something being down. But by all means write a blog post and tell them what they're doing wrong and how you'd fix it. Maybe they'll hire you...


And you don't have to have the resources of Microsoft or Apple to plan and build for the eventuality that a provider becomes intermittent or unavailable. There are fundamental aspects of running an internet-facing service and they failed at one of the most basic.


LOL ok, they "failed". They haven't had an outage like this in decades and this one only affected a small number of their clients. But sure, let's spend money on providing a backup for CF. Armchair QBs are the worst.


The issue isn't that they needed a backup to cloudflare. The problem was they only have a single internet provider at their datacenter, so they couldn't communicate with Cloudflare.

I've honestly never had a service with a single outbound path. Most datacenters where you rent colo have two or three providers as part of their network. In the cases where I've had to manage my own networking inside of a datacenter I always pick two providers in case one fails.

> Work is now underway to select a provider for a second transit connection directly into our servers — either via Megaport, or from a service with their own physical presence in 365’s New Jersey datacenter. Once we have this, we will be able to directly control our outbound traffic flow and route around any network with issues.

Having multiple transit options is High Availability 101 level stuff.


> The issue isn't that they needed a backup to cloudflare. The problem was they only have a single internet provider at their datacenter, so they couldn't communicate with Cloudflare.

That's not the issue. With Cloudflare Magic Transit, packets come in from Cloudflare and egress normally. They were able to get packets from Cloudflare, but egress wasn't working to all destinations. I wasn't able to communicate with them from my CenturyLink DSL in Seattle, but when I forced a new IP that happened to be in a different /24 (because I was seeing some other issues too), the Fastmail issues resolved, although the timing may be coincidental. Connecting via Verizon and T-Mobile, or a rented server in Seattle, also worked. It's kind of a shame they don't provide services over IPv6, because if 5% of IPv4 failed and 5% of IPv6 failed, chances are good that the overall impact to users would be less than 5%, possibly much less, depending on exactly what the underlying issue was (which isn't disclosed); if it was a physical link issue, that's going to affect v4 and v6 traffic that is routed over it, but if it's a BGP announcement issue, those are often separate.
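If you want to check that kind of partial reachability yourself, a minimal sketch (my own illustration, nothing Fastmail provides) is to probe the same hostname over IPv4 and IPv6 separately, so a failure on one address family or prefix doesn't hide behind Happy Eyeballs fallback:

```python
# Minimal sketch: probe a host over IPv4 and IPv6 separately and report each
# address, instead of letting the resolver/stack silently fall back.
import socket

def probe(host: str, port: int = 443, timeout: float = 5.0) -> None:
    for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            print(f"{label}: no address published")
            continue
        for *_, sockaddr in infos:
            s = socket.socket(family, socket.SOCK_STREAM)
            s.settimeout(timeout)
            try:
                s.connect(sockaddr)
                print(f"{label} {sockaddr[0]}: reachable")
            except OSError as exc:
                print(f"{label} {sockaddr[0]}: FAILED ({exc})")
            finally:
                s.close()

if __name__ == "__main__":
    probe("www.fastmail.com")  # per the above, expect IPv4-only answers
```

Run from a couple of vantage points (home DSL, mobile, a rented server), that gives a crude picture of how partial an outage like this actually is.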


You’d be surprised at how many things break when different routes are chosen. Like etcd, MySQL, and so much more.


Those are generally on internal networks and rarely need to communicate with the internet. They shouldn't be affected by this.


Twould be nice…


Yet they still had the outage. I take exception to being called an 'armchair QB' when most of my career has been spent being called in to repair failures like this, provide post-mortem advice to weather future ones, and fix the technical and cultural issues that give rise to exactly this type of thinking: it won't happen to us because it has never happened to us.


In your experience, what kind of cost multiple is involved in remediation of the kinds of failure you deal with?

Is it x2 or x100 or somewhere in between?


Since you need two (or more) of everything (two switches, two physical links, and hopefully two physical racks or cabinets), it's a minimum of x2, but nowhere near x100. The cost of additional physical transit links is generally pretty reasonable, depending on the provider, and if you have more links you can negotiate better rates; same with committed bandwidth, where you get better rates if you buy more.

There are a lot of aspects to it, but the cost of doing all of the above is a lot less than the cost of not having it at the wrong moment and losing money that way. Each business needs to weigh that risk against how much they want to invest and how much downtime they think they can tolerate.


Seems logical, thanks for engaging with the question.


You're essentially saying "They haven't had an outage yet, so they don't need redundancy". I hope you realize how bad of an idea that is.

Also calling them an armchair QB? Very mature. Their comment is more correct than yours.


> This all seems cobbled together and very prone to failures.

AFAIK it's not like FastMail has a crazy number of network-related outages, so overall it doesn't seem that "prone to failure". As with many things, it's a trade-off with complexity and costs.


I'd argue that often the CDN or transit isn't drop-in replaceable, so it's usually more than 2x the cost, since you have to maintain two architectures (or at least abstractions). That includes the extra expertise, and either not optimising for the strengths of either provider or building really robust abstractions/adapters.


It is truly crazy that they do not have their own ASN and IP blocks.

I cannot imagine running a service like that with cobbled together DIA circuits and leased IPs.
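For anyone curious, whose ASN originates a given address is easy to check against public routing data. Here's a quick sketch using Team Cymru's IP-to-ASN DNS zone; the TXT response format is from memory and dnspython is assumed installed, so verify before relying on it:

```python
# Sketch: look up the origin ASN for an IPv4 address via Team Cymru's
# IP-to-ASN DNS service (reversed octets under origin.asn.cymru.com).
# The TXT format "ASN | prefix | CC | registry | date" is recalled from their
# docs and may vary; treat this as illustrative.
import dns.resolver  # pip install dnspython

def origin_asn(ipv4: str) -> str:
    reversed_ip = ".".join(reversed(ipv4.split(".")))
    answer = dns.resolver.resolve(f"{reversed_ip}.origin.asn.cymru.com", "TXT")
    record = answer[0].strings[0].decode()
    asn, prefix, *_ = [field.strip() for field in record.split("|")]
    print(f"{ipv4} is announced as {prefix} by AS{asn}")
    return asn

if __name__ == "__main__":
    origin_asn("1.1.1.1")  # well-known example; should come back as Cloudflare's AS13335
```

If the answer is your transit provider's ASN rather than your own, you're in exactly the leased-space situation the post-mortem describes.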


> This all seems cobbled together and very prone to failures.

The entire internet.


To be fair, I've been a customer for 5+ years for my main personal email account, and this is the first outage that has impacted me.


15+ here, and there have been a few, but nothing particularly notable. Gmail has had outages too, and I couldn't tell you from personal experience which is more reliable, which is interesting given the big difference in the complexity of the deployments (obviously Gmail also has the burden of much bigger scale).


I've had more Microsoft Office 365 outages than Fastmail outages in the past 10+ years, and I'm sure Microsoft has a much deeper pocket than Fastmail.

Things break. More things will break in the future due to increased complexity, brittle network automation processes and poorly written code. You can mitigate failures to a certain extent, but you can't guarantee 100% uptime, even with a triple redundant system. Every business decision is a compromise among various constraints.


If CF is down, they're down.

The problem here is that there isn't an alternative to Cloudflare.

They say this in the article. None of their DDoS solutions can take the heat except for Cloudflare.

So, if you want resilience in the face of Cloudflare being down, you need to build another Cloudflare. Let me know when you build it. Lots of people will sign up.


There are several providers in this space. Path, Voxility, etc.


They're still single-homed, right?

They just added redundancy to their inbound/outbound routes.


Why assume CF is a simple box?

I think the whole point of CF is that it isn't.


You have to plug in the cable somewhere. If the hardware where the x-connect is plugged in dies, has issues, has to be rebooted, etc., you have a problem, and it's not like there have never been CF issues.


From the post-mortem, it doesn't sound like the problem was a single network cable. Redundant network switches have existed for a long time, and they're certainly using them (but not bothering to mention it in the post-mortem).

Their problem was that they only have two transit providers, and one of them black-holed about 3-5% of the internet. Since it was a routing issue, I'd guess it was either a misconfiguration, or that traffic was being split across dozens of paths and one path had a correlated failure.


Why does there have to be just one cable?


Do you think there are two cables that are never bundled in the same underground tube?

All the redundancy in the world can’t protect you from some random person digging in the wrong place.


Sure, maybe the underground tube was breached while someone was moving goalposts?


Doesn't matter.

Single box or a million boxes, if they apply the wrong config, it doesn't work.

You dual-home to hedge your bets and hope no two ISPs have a fuckup at the same time.


I think you’re suggesting that some redundancy you get by accident — by having two ISPs — is better than the redundancy a single ISP could engineer.

That’s certainly possible in specific cases, but not a very good general principle to rely on. One CF could very well be better than two given ISPs.


I'd suggest dusting off some math and calculating JUST HOW MUCH better CF would need to be compared to having two different ISPs fail at the same time.

We haven't had that happen in 10 years.

> That’s certainly possible in specific cases, but not a very good general principle to rely on. One CF could very well be better than two given ISPs.

You might think that if you have no idea what you are doing.
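For what it's worth, the back-of-the-envelope version of that math, with made-up availability numbers and assuming the two ISPs fail independently:

```python
# Back-of-the-envelope: how good would one provider have to be to match two
# independent ISPs? The availability figure is illustrative, not measured.
isp_availability = 0.999                    # each ISP up 99.9% (~8.7h down/year)
both_down = (1 - isp_availability) ** 2     # assumes failures are independent
combined = 1 - both_down

print(f"Two independent ISPs: {combined:.4%} available, "
      f"~{both_down * 365 * 24 * 60:.1f} minutes of overlapping downtime/year")
# A single provider would need roughly "six nines" to match that, which is why
# the independence assumption is doing all the work in this comparison.
```

Of course real-world failures are often correlated (shared conduit, shared upstream, fat-fingered config pushes), which is the counterargument upthread.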


That’s a false dichotomy. You can absolutely have two ISPs, one of whom is CF.


Sorry, that's not a false dichotomy.

A second ISP isn't free; it has significant costs in terms of dollars and complexity. The question is whether CF plus another provider has benefits significant enough to justify the additional costs. For it to make sense you have to believe the redundancy CF provides is significantly lacking (and in a way that adding a second provider addresses). Maybe it's true, but it would be nuts to just assume it and start spending a lot of money.


The cost is a few k at most, not exactly a massive cost for a medium-sized company or bigger.

We pay 10x that for power alone.


I don't know if IP range reputation is part of SMTP, but I wish sending email worked without it.


It's not part of SMTP, it's part of the kludge of anti-spam measures we bolted on top of it.


...bolted on top of DNS.


Sure, though in addition to this: IP reputation comes from Spamhaus and the like, and is implemented in SpamAssassin or something similar, at the SMTP(ish) level.

What you're talking about there is SPF/DKIM etc., which is an anti-spam measure but not IP reputation :)
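Concretely, the IP-reputation side is usually a DNSBL lookup when the connection comes in. A minimal sketch of the mechanism (the zone and the 127.0.0.x return convention are the commonly documented Spamhaus ones; real-world or commercial use has its own terms, so this is purely illustrative):

```python
# Sketch of an MTA-side DNSBL check: reverse the connecting IP's octets, query
# them under the blocklist zone, and treat any 127.0.0.x answer as "listed".
import socket

def is_listed(ipv4: str, zone: str = "zen.spamhaus.org") -> bool:
    query = ".".join(reversed(ipv4.split("."))) + "." + zone
    try:
        result = socket.gethostbyname(query)   # e.g. "127.0.0.2" when listed
        print(f"{ipv4} is listed in {zone} ({result})")
        return True
    except socket.gaierror:
        print(f"{ipv4} is not listed in {zone}")
        return False

if __name__ == "__main__":
    is_listed("127.0.0.2")  # the documented DNSBL test entry, always listed
```

That lookup keys on the sending IP (and by extension its whole range), which is why Fastmail cares so much about keeping their old, well-reputed addresses.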


Redundancy is the name of the game. Glad they realize this.


They will still be in only one datacenter. That is worse than their competitors.


That doesn't seem to be true from a durability perspective:

https://www.fastmail.help/hc/en-us/articles/1500000278242

Read the section on "slots". They keep two copies in New Jersey, and one in Seattle.

However, based on the post-mortem, it sounds like they're not willing to invoke failover at the drop of a hat. They allude to needing complex routing to keep their old good-reputation IP addresses alive. That might have something to do with it. (They were "only" 3-5% down during the outage, which is bad for them, but not unusual by industry standards.)


You might have multiple datacenters, but until everyone isn’t in the same office, you still have the same problem (office can burn down, or fall down and bury everyone).

Also, it’s email. It was literally designed to work in such a way that you can be down for days and still get your email.


> You might have multiple datacenters, but until everyone isn’t in the same office, you still have the same problem (office can burn down, or fall down and bury everyone).

Fastmail has multiple offices.

> Also, it’s email. It was literally designed to work in such a way that you can be down for days and still get your email.

Delivery is designed that way, not storage. Once Fastmail tells the sender that a message was delivered to an inbox, Fastmail cannot ask the sender to redeliver it if its inbox storage is lost.
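Right. In SMTP terms, once the receiving server returns its final acceptance after DATA, responsibility transfers and the sender is free to drop its copy. A toy sketch of that handoff with Python's smtplib (hosts and addresses are placeholders, so this won't actually deliver anything):

```python
# Toy sketch of the SMTP handoff: send_message() returns only after the
# server's final acceptance of DATA, at which point the receiving side owns
# durability of the message and the sender will not retry.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.org"       # placeholder addresses
msg["To"] = "someone@example.com"
msg["Subject"] = "handoff demo"
msg.set_content("Once this is accepted, durability is the receiver's problem.")

with smtplib.SMTP("mx.example.com", 25, timeout=10) as smtp:  # placeholder MX
    refused = smtp.send_message(msg)     # raises if the server rejects it
    print("accepted; recipients refused:", refused)
```

So a queueing sender will happily retry for days if you're unreachable, but nothing upstream will re-send mail you've already accepted and then lost.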


NYI. Good people.


Am I the only one who thinks that this post-mortem, for an outage affecting a sliver of email users worldwide, is a pure marketing gimmick to attract more "technical" users?


I don't think the LastPass approach is how anyone gets any new customers, to say nothing of the technical ones.


Netgear switches? In an environment like this? I'll give them the benefit of the doubt that that's maybe a provider-owned thing, and that they have an 'enterprise' line, but really... Netgear. The firewall brand isn't revealed in the network diagram, but what is it, a $100 SonicWall? Keeping all my email there, business and personal, should I be concerned about what other parts of their infrastructure they're cheaping out on?

When you are running a service like this, redundancy among transit providers is the most basic, table-stakes thing you can do. It's almost negligent to not have that.


Netgear do some half decent fully managed switches. It’s not all blue crap off Amazon.

The worst switches I ever used were HPE ones in the old C5000 blade chassis. Absolute turds. Packet loss, constant port failures and complete hangs. HPE’s solution was to tell us to buy new ones.


The worst switches I've ever used would probably be various 'Cisco' switches from their small business line, usually ones that ran the same OS used when they were sold under different names like 3Com or Linksys.


Oh god, when I worked for a small telecom in the Midwest they heavily used the 3Com switches. They were the bane of my existence: things would power-loop, or, my favorite, continue to work but prevent any sort of access to them.


> It’s not all blue crap off Amazon.

To be honest, those little blue unmanaged Netgear switches aren’t bad at all. We have dozens of them in our lab at work running 24/7 for like decades and have never had a failure that I remember.


They aren’t terrible as long as you have a supply of wall warts available. I ended up powering mine off a little Mean Well switching supply, as those don't blow up as often.


I'd be far more worried about the single-homed internet connection than the brand of the switch.


If Netgear has you worried, search for "white box switches". Most big iron runs off those these days.


I am aware; I deploy white-box gear. What concerns me is the software (some is better, some is worse), less so the merchant silicon the system is based on.


To be fair, broadcasting your hardware via topology isn't what I consider a safe practice.

As to the Netgear switches, I would figure they'd have hot spares, considering the cost savings. I'm not entirely sold on any specific vendor as the be-all-end-all for switching needs; support for most is less than stellar, and the case for hot spares grows with every quarterly report from the big-name vendors as they continue to push for larger margins.

In environments like this, it's less about the vendor-specific product and more about the redundancy setup (which it appears they lacked, but are transparent about).

Just my $0.02.

edit: didn't realize this was such a "hot take"


> To be fair, broadcasting your hardware via topology isn't what I consider a safe practice.

Ah yes, security through obscurity.


You're not wrong, but considering all of the recent 0-day exploits, I would argue that it's a better practice than the whack-a-mole response from vendors like Fortinet & Barracuda.

https://www.bleepingcomputer.com/news/security/fortinet-new-...

When the vendors you're buying from aren't taking security seriously, I suppose you take any step necessary to limit exposure. I'd also argue that, outside of the big boys like Cloudflare, no one else is displaying their topology on their own website.


That's fair, but let's still call it what it is. We shouldn't normalize hiding information as a form of security.


Hard to agree.

In an ideal world everyone would share their architecture, stack, and so on; we as an industry could learn from each other, and everyone would come out ahead from that information sharing.

In reality, the moment you share something in good faith, someone will try to exploit it.

One example: I worked on a production computer-vision API for recognising certain documents. More than 900 days with no spikes and only real users in the system.

Then the CTO went to a conference to talk about how great our performance was and advertised our system very loudly. End result? An 1800% spike, and tons of fraud and adversarial traffic coming in.

Not to be cynical, but I don't think we're entitled to any disclosure from any private company in that regard.


Fair enough.



