Gmail having issues (google.com)
644 points by mangoman 30 days ago | 432 comments



Just got this from the ProtonMail team:

> Dear ProtonMail user,

Starting at around 4:30PM New York (10:30PM Zurich), Gmail suffered a global outage.

A catastrophic failure at Gmail is causing emails sent to Gmail to permanently fail and bounce back. The error message from Gmail is the following:

550-5.1.1 The email account that you tried to reach does not exist.

This is a global issue, and it impacts all email providers trying to send email to Gmail, not just ProtonMail.

Because Gmail is sending a permanent failure, our mail servers will not automatically retry sending these messages (this is standard practice at all email services for handling permanent failures).

We are closely monitoring the situation. At this time, little can be done until Google fixes the problem. We recommend attempting to resend the messages to Gmail users when Google has fixed the problem. You can find the latest status from Google's status page:

https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

Best Regards, The ProtonMail Team


This is the Nightmare Scenario for mailing lists.

Many of them auto-unsubscribe after a bounce.


I said this in another comment but this seems like a naive way to react to an "address does not exist error" that they've already delivered to before. The only legit scenario in which that happens is when the user deletes the address, which is a rare event (pretty much always <= 1 time in the lifetime of any address), and there shouldn't be anything wrong with treating that kind of situation the same as any soft error. If you're wrong, your mail will just get rejected a few more times anyway, and you'll know it's genuinely a dead end.

The underlying issue (wherever this occurs) seems to be lack of nuance regarding error codes when people try to implement robust systems. Different codes imply different things and shouldn't all just fall back into generic buckets.
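
For what it's worth, here is a minimal Python sketch of the kind of nuance I mean. Everything in it is made up for illustration: the `Recipient` record, the `handle_bounce` function, and the threshold of 3 are assumptions, not how any particular mailing list package behaves.

```python
from dataclasses import dataclass

# Assumed threshold: how many hard bounces in a row we tolerate from an
# address that has accepted our mail before. Purely illustrative.
MAX_BOUNCES_IF_PREVIOUSLY_DELIVERED = 3


@dataclass
class Recipient:
    address: str
    delivered_before: bool = False
    consecutive_hard_bounces: int = 0


def handle_bounce(recipient: Recipient, smtp_code: int) -> str:
    """Return "retry" or "suppress" for a failed delivery attempt."""
    if 400 <= smtp_code <= 499:
        # Transient failure: always eligible for a later retry.
        return "retry"
    if 500 <= smtp_code <= 599:
        recipient.consecutive_hard_bounces += 1
        # A known-good address gets a few spaced-out retries before we give up.
        if (recipient.delivered_before
                and recipient.consecutive_hard_bounces
                < MAX_BOUNCES_IF_PREVIOUSLY_DELIVERED):
            return "retry"
        return "suppress"
    raise ValueError(f"not a failure code: {smtp_code}")
```

An address that never accepted mail is suppressed on the first 550, while a previously good one is only dropped after the third consecutive hard bounce. A real system would presumably also space the retries out and reset the counter on a successful delivery.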


> I said this in another comment but this seems like a naive way to react to an "address does not exist error" that they've already delivered to before.

Like HTTP, SMTP is also designed to be stateless so, in the first place, the remote server shouldn't return a permanent error in temporary failure scenarios.

The default error should be 450: "Requested action not taken – The user’s mailbox is unavailable", not "the user has deleted everything and left".

These standards worked well before the big players came along and said, "My responses say what I choose them to say, and their meaning doesn't always overlap with the established standards". The only exception is spam, and we now have standards to help reduce it.


Your answer kind of misses the point GP was trying to make.

Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record. In this case the returned "user doesn't exist" error is intended behavior of the mail server and the post you replied to still stands. If you sent to that email successfully earlier, it's much more likely that the server is responding erroneously than that the email actually got deleted.


> Your answer kind of misses the point GP was trying to make.

Actually, I don't think so.

> Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record.

As a system administrator and/or provider you have to think about worst case scenarios and provide sensible defaults. Your mail gateway should have some heartbeat checks to the subsystems it depends on (AuthZ, AuthN, Storage, etc.) and it should switch to fail-safe mode if something happens. Auth is unreliable? Switch to soft-fail on everyone regardless of e-mail validity. You can hard-fail others later, when Auth is sane.

Storage is unreliable? Queue until buffer fills, then switch to error 421 (The service is unavailable due to a connection problem: it may refer to an exceeded limit of simultaneous connections, or a more general temporary problem) or return a similar error.

SMTP allows a lot of transient error communication. Postfix and similar servers have a lot of hooks to handle this stuff. Just do it. Being Google doesn't allow you to manage your services irresponsibly. If we can think of it, they should be able to do it too.
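
To make the fail-safe idea concrete, a rough Python sketch follows. The function name `rcpt_reply` and the health flags are hypothetical; a real gateway would feed them from its own heartbeat checks, and this is not what a Postfix hook looks like. The reply codes themselves (250, 421, 450, 550) are the standard SMTP ones.

```python
def rcpt_reply(user_known: bool, auth_healthy: bool,
               storage_healthy: bool, queue_has_room: bool) -> int:
    """Pick an SMTP reply code for a recipient, failing soft when unsure."""
    if not auth_healthy:
        # The lookup result cannot be trusted, so soft-fail everyone,
        # regardless of whether the address looked valid.
        return 450
    if not user_known:
        # Auth is sane and says the user does not exist: a hard fail is safe.
        return 550
    if not storage_healthy and not queue_has_room:
        # Nowhere to put the message: temporary "service not available".
        return 421
    # Either storage is fine, or we can buffer the message until it recovers.
    return 250
```

The ordering is the point: the health of the auth subsystem is checked before its answer is believed, so a 550 can only come out when the "no such user" verdict is trustworthy.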


Technically speaking it's possible to soft bounce upon 5xx errors, but in practice, retrying even when the destination tells you not to is the quickest way to get reputation ruined.

Google SMTP servers should have returned a soft bounce here (not hard bounce), so then retry can work.


But then why would Google's mailserver not know that it once delivered email to that mailbox?

If the protocol were stateful, why should the state be kept by the "sender" and not by the "receiver"? Being stateless removes this ambiguity, in my opinion.

Also, we should remember how bad it is for spam reputation to send emails to a non-existent address, so I would not blame the mailing list for being "overly cautious".


The situation here is that the service was so borked that it didn't know what it didn't know.

Hard-failing good addresses is a much worse bad than soft-failing bad addresses. In the latter case, remote sender tries again later and eventually gets a hard bounce. In the former, good addresses are permanently dropped from numerous services, and sent mail is lost rather than retried.

Critical failures should soft bounce until positively determined otherwise.


Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place. This issue is Google sending the wrong error code because of a problem on their end.

Mailing lists believing what an email provider tells them and acting in an overly cautious way is a separate issue.


> Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place.

This can't work; you can say that gmail's system should have a component that recognizes the difference between various failures, but that new component can itself fail. You can't solve the problem of "what if something fails" by saying "just add a new component that won't fail".


Of course it can. Software is complex and that complexity can cause all kinds of problems, as can the fact that the networks linking computers are unreliable, but software is fundamentally deterministic. If you write a piece of code that returns a temporary failure when it can't look up whether a user exists, that code will not mysteriously change itself to start returning permanent user does not exist errors. (Now, if your overall stack is designed in such a way that you can't reliably tell the difference between lookup failures and users that don't exist, you have a problem - but the problem is with the design of the system, not some inherent problem with software.)
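
As a toy illustration of that design point, here is a Python sketch in which "lookup failed" and "user does not exist" travel on different paths and so cannot be confused. The names `DirectoryUnavailable`, `user_exists` and `smtp_code_for` are invented for the example, and the in-memory set is only a stand-in for a real user service.

```python
from typing import Optional, Set


class DirectoryUnavailable(Exception):
    """Raised when the user directory cannot answer at all."""


def user_exists(directory: Optional[Set[str]], address: str) -> bool:
    # `directory` stands in for a real user service; None models an outage.
    if directory is None:
        raise DirectoryUnavailable(address)
    return address in directory


def smtp_code_for(directory: Optional[Set[str]], address: str) -> int:
    try:
        exists = user_exists(directory, address)
    except DirectoryUnavailable:
        # Temporary failure: "local error in processing".
        return 451
    # 550 is only reachable after a definite "no such user" answer.
    return 250 if exists else 550
```

If the lookup instead returned `False` or an empty result on failure, the caller would have no way to tell the two cases apart, which is the design problem described above.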

Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.


> but software is fundamentally deterministic.

That's true, but human behavior is also fundamentally deterministic, and those two observations are about equally useful.

> Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.

No it isn't. Those are deterministic too.


> that code will not mysteriously change itself to start returning permanent user does not exist errors

That is true in a perfect world. In the current world, there are all sorts of ways that code implemented one day does not run the same the next day. Say the code is in an interpreted language and an unrelated sysop updates the language runtime in a way that changes the behavior. Again, in a perfect world that doesn't happen, but that is not always the world we live in. I have great sympathy with people who treat software systems AS IF they were "physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways".


> doesn't fail completely but cannot access part of the data

If a mail server can't tell whether a user/email is valid, it should either return a temporary failure or accept and queue.

Unless of course you're too big to fail, then you just do whatever you want.


I think we’re just teasing at the notion that “permanent failure” isn’t a hard and fast distinction. I think some polite retry policy is not unreasonable even for the most explicit “permanent failure” response from a remote server. Imagine the most extreme example: hackers take over the remote server and make it respond with “permanent failure.” After a day, the legit owners regain control of the system. You can’t really argue that “the remote server never should have delivered that response unless the failure truly was permanent,” because clearly there was a mismatch between the apparent intent behind the response and the actual intent.


The issue is that hard bounces can cause big issues with your email sending reputation, and too many can make you lose access to mailing services such as Amazon SES, so you're encouraged at all points during the implementation of anything that sends email to blacklist any bounced emails. This of course works fine, right up until Gmail starts bouncing all emails.


I think it’s spot on. Gmail’s failure mode in this scenario isn’t correct. The rest of the internet is functioning as designed.


This is exactly it. The RFC has error codes for temporary failures (just like HTTP 503 for example). If you fail to implement the RFC, the joke's on you.


If Google and other major mail providers weren't opaque about this, then fine, but for me a single bounce is an immediate removal. I can't take the risk. I can't imagine the hell that would ensue trying to get through to Google to ask them to take me off their deliverability shitlist.


Has anybody ever received a reply from gmail's postmaster address?

I have good experience with them fixing issues related to their spam-related flagging for messages that are coming from our self-hosted email server, but never got any specific reply.


I 100% assure you that everyone handling gmail errors and getting burned isn’t just tossing failures into a single bucket. There’s a zillion reasons mail can bounce and all of them are taken into account. This is a particular bounce code that signifies that an ESP shouldn’t send email again to this address.

Email service providers are HIGHLY incentivized to act 100% in accordance with the wishes of the system where the mailbox exists because it’s highly likely that acting in any way that’s considered abusive could get your emails landing in a spam folder.

Mailboxes cease to exist thousands of times a day at places I’ve worked previously. Employees leave all the time and people shut down mailboxes. This is Google’s fuckup, nobody else’s.


There is actually a very good reason to drop these email addresses: a high rate of non-deliverable emails hurts your sender score. It's a total pain to get emails delivered to the major email providers in the first place, and you immediately land in spam (or with emails not delivered at all) if they don't trust the sending email server or your score is anything but stellar!


I have 2 responses to the sender reputation concern:

1. If the user's mail service penalizes you equally regardless of whether the recipient's address existed 1 day ago vs. never existed, that itself is absolutely inexcusable nonsensical behavior that needs to be fixed. You shouldn't do that, just as you shouldn't shoot the mailman (or even arm yourself...) merely because he knocked a second time.

2. Notwithstanding the previous point, I don't buy this as valid justification anyway. The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two. The bounce rate increase due to such an event is very negligible here—people don't suddenly delete their accounts en masse. When that happens, it's clearly due to an outage, not because half the users at that domain suddenly decided to delete their accounts. (Which is something you can also easily detect across the domain as another useful signal to drastically lower the bounce rate across the entire domain, btw, if you're absolutely paranoid about your immaculate delivery rate dropping by an epsilon. But it shouldn't be necessary given how negligible the impact should be.)

So I don't buy this excuse one bit.
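
The domain-wide signal mentioned in point 2 could be sketched roughly like this in Python. The function name, the input shape `(address, smtp_code, delivered_before)`, and the `min_attempts` and `threshold` defaults are all assumptions picked for illustration, not tuned values.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def suspect_outage_domains(results: Iterable[Tuple[str, int, bool]],
                           min_attempts: int = 20,
                           threshold: float = 0.5) -> Set[str]:
    """Flag domains where known-good addresses suddenly hard-bounce en masse."""
    attempts = defaultdict(int)
    hard_bounces = defaultdict(int)
    for address, smtp_code, delivered_before in results:
        if not delivered_before:
            # Never-delivered addresses say nothing about a sudden change.
            continue
        domain = address.rsplit("@", 1)[-1].lower()
        attempts[domain] += 1
        if 500 <= smtp_code <= 599:
            hard_bounces[domain] += 1
    return {domain for domain, n in attempts.items()
            if n >= min_attempts and hard_bounces[domain] / n >= threshold}
```

Bounces from a flagged domain could then be held for a delayed retry instead of being suppressed right away, since half of a domain's previously working mailboxes vanishing at once looks far more like an outage than like mass account deletion.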


> The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two.

What you're proposing is to explicitly ignore the specification (which says that you should _not_ retry when you receive a 550) and try to implement a custom smart retry logic that handles temporary error cases, but also does not get you blocked.

> So I don't buy this excuse one bit.

I'm all for building resilient services, but "try to detect when a server incorrectly returns 550" is not something I would prioritize at all. I'd rather manually clean up after this occurrence than have this complicated retry logic. It's not an "excuse", it's a very sensible trade-off.


No, I am quite explicitly not ignoring the spec. It quite deliberately says should not, not must not. If anyone is ignoring the spec here, it's you, not me. Should not is sound advice; it's telling you what you're supposed to do when you don't have a reason to behave differently. You know, like how you "should not" leave the lights on when you leave your room. Or—more pertinently here—how you "should not" assume everyone is a liar. But when you actively see evidence that deviates from the norm, you are given the power—and arguably the responsibility—to exercise your discretion here to adapt to the situation. If the spec wanted blind obedience, it would say "must not" like it did in 60 other places, but it quite obviously and intentionally decided that would be unwise, and this scenario seems like a pretty clear illustration of that.


But the RFC isn't only for senders it's also for receivers, isn't it?

That means there are two sides to the interpretation of what SHOULD NOT means. And in this case, senders have, due to experience, interpreted what Google does when someone SHOULD NOTs:

- The sender SHOULD NOT send us the same sequence again when we reply 550, if they do they MUST go on our shitlist.

Obviously it's not so binary and it takes retrying to several different recipients, but people have very good reason to interpret this SHOULD NOT as MUST NOT.


No, that's not a sane way to interpret this RFC for the receiver either. I already answered this, so you'll have to go back to my earlier comment (this might be my last comment as I won't keep repeating myself): any system (be it Google's or anyone else's) that penalizes you equally regardless of whether the recipient's address existed 1 day ago vs. never existed is just plain trash. A sender that attempts delivery to an address that accepted their email a day ago is obviously unlikely to be a spammer; there's no justification for treating them as one. It is absolutely unreasonable to interpret the sentence this way. Just as it's unreasonable to interpret "the mailman shouldn't knock a second time when he's told the recipient has moved" as "I should never open the door for the mailman ever again if he does so".


Good callout. The underlying issue of the lack of nuance is probably /state/. Being more nuanced about these errors probably requires managing state, which tends to increase the complexity and scaling challenges.


Nuance is not called for. The standard states that a 5xx SMTP error is a permanent error and "The SMTP client SHOULD NOT repeat the exact request"

Gmail screwed up here, returning a 550 error, it's not anyone else's job to try to second guess that or retry in contradiction of the accepted standard.

https://tools.ietf.org/html/rfc5321


Gmail screwed up, but that's beside the point. We're talking about designing robust systems. You don't design a robust system by assuming nobody will screw up!

Re: the RFC, note it says "should not", not "must not". That seems to suggest they acknowledge repeating might actually make sense in some cases. And honestly the practicalities of this situation and the risk-reward tradeoff seriously tilts toward repeating the request later regardless of what the RFC says. The world isn't going to end.


Try delivering to invalid email addresses too many times (too many of course being up to each mail provider), and you will be the one shitlisted (and rightfully so, as you are likely bruteforce enumerating valid email addresses).

For any small provider, getting on the shitlist is catastrophic as unlike the big providers, getting off of it will be hard / impossible.


Rules for thee, not for me


> And honestly the practicalities of this situation and the risk-reward tradeoff seriously tilts toward repeating the request later regardless of what the RFC says. The world isn't going to end.

That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

If you believe the standard is not robust enough to handle problems like this, first work towards a fix to the standard and then implement the solution. Not the other way round.


> That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

I didn't suggest people should apply this thought process in arbitrary cases. I said it should be applied in this case. You can take any thought process that gives a good outcome in one situation and obtain a bad outcome by applying it to the wrong situation. That's not an indictment of the thought process. It's just an indictment of the person failing to correctly judge its applicability.

That said, by all means, do try and go fix the standard; I wasn't trying to imply you shouldn't do that.


Ah I think I did not describe the repercussions of making exceptions (even if they are in highly specialized cases like this). If you allow yourself to make such exceptions, you diminish the motivation for you (or someone else) to fix the problem at the right place. Most workarounds tend to live forever.


There's no clear-cut rule here. Some workarounds stay workarounds and never get standardized. Some become so well-accepted and adopted that people then put them into standards. It's great to put things into standards, so by all means, do try to improve standards. But that shouldn't block you from everything. At the end of the day, standardization is just a means to an end, and the end is what matters here. Nobody cares if their mailman's knocks follows an RFC or not. They just want their mailman to deliver packages with reasonably minimal disruption.


> There's no clear-cut rule here

Exactly, that is why it is important to follow standards. Most engineering decisions are not clear-cut and are born out of tradeoffs. That is why we agree on standards that define those tradeoffs instead of every one of us having our own take on situations.

> Nobody cares if their mailman's knocks follows an RFC or not

If there is a Mailman RFC which says: "If someone opens the door and says `Mike does not live here' then DO NOT attempt delivering the same package"

THEN I expect the mailman to not bother me again, EVEN IF it was actually my mistake that I forgot my roommate Mike actually does live at this address.


I'm tired of arguing about this. Engineers agree on standards for a good reason, yes, but they also agree on "should not" rather than "must not" for a good reason too. I'll leave this as my last comment, but you might want to read the post-mortem. Turns out their implementation of the RFC wasn't even buggy. They just messed up the domain name in the configuration. Which you can only be resilient to by retrying the request sometime later.


But here’s the thing: the standard (like all standards) is obviously not robust enough to physically prevent responses which incorrectly indicate permanent failure.

These incorrect responses could be caused by mistakes which the remote server admins could reasonably avoid, like software bugs. I understand not having much sympathy for that case, especially from an organization with no shortage of resources. But they could also be caused by, for example, hackers or governments exerting control over the remote server temporarily.

A standard which explicitly refuses to acknowledge these possibilities is not what I would describe as “robust.” An obvious better alternative would be to set some standards around what constitutes a polite retry policy.


My understanding is that "should not" means that you should not retry. If I do retry, then the other party can rightfully claim that I am DDoSing their service or trying to send emails to deleted accounts, or put me on a spam list. I do not think that ignoring the RFC and trying to cover up for Google is the best course of action here. Maybe, just maybe, this is the right time for people to realise what it really means to have an entity like Google. Because as it stands, we are going to have the DNS infrastructure moved over to them with DoH and a similar outage is going to be even more devastating. The internet was designed to be resilient to failure because of its distributed nature and right now it just shows why concentrating resources in one place is bad.


You "should not" repeat delivery in basically the same way the mailman "should not" knock a second time if he's told the recipient doesn't reside at the designated address. What "should not" means in these cases is: "knock only once, and assume you're being told the truth in the absence of further evidence to the contrary". But when you clearly saw the recipient reside there yesterday, it makes sense to try to knock and catch him again tomorrow. Because, you know, maybe something went wrong, e.g. maybe the person who opened the door didn't recognize the name (or whatever). At the end of the day, the mailman's job is to deliver the mail with minimal disruption, not to play hot potato with envelopes.


The terminology is well defined [0], so in this case, retrying is not ignoring the RFC.

It's a difficult one though, because as you rightfully state, covering up for Google is not the best course of action for the system as a whole, yet it's likely a good course of action for those users who didn't get their emails.

[0]: From RFC 2119 [1]: 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

[1]: https://tools.ietf.org/html/rfc2119


In most Internet Engineering Task Force (IETF) RFCs, the standard verbiage for "must not" is in fact usually "should not".


The phrase "must not" appears some 60 times in this RFC.


Thanks for pointing that out. I suppose an RFC writing style guide would be helpful for consistency in language and interpretation.


The standard says “don’t resend,” it doesn’t say “assume the worst and begin removing user from all systems.” That was the mailing list software’s decision.


You generally avoid sending to known bad addresses or your reputation will be destroyed very quickly. The 550 response is (read: was) a clear "you fucked up, this user doesn't exist" prior to this.

I saw someone on Reddit say his SES was suspended for sending tons of bounced emails in a short period of time - it's taken very seriously by ESPs.

E: also user rtx a few comments below


We're not talking about repeating the exact request; a subsequent request for the same recipient would be to deliver a completely different message: whatever subsequent message is sent to the mailing list.


Right. In this case it's already pretty typical for mailing lists to track bounces and retry under some errors, so I imagined that part is mostly done, and the missing piece would be taking more care in checking the error conditions.


Aside - I'm not an expert but systems like MailChimp will get very worked up if your list has lots of undeliverable addresses on it. This can trigger an audit of your list which prevents sending, etc. These audits seem to take quite a while, in my very limited experience.


So what you're saying is, if you're annoyed by "subscribe to our mailing list" modal popups, "doesno5exist@garbage.blah" is better than "jeff@amazon.com" ?


In practice, no, it's more nuanced than that. Any mailing list operated through any remotely legitimate ESP will require subscriptions to be confirmed/acknowledged up front before any delivery is attempted to a recipient. If the confirmation step fails, i.e. the "check your email and click a link to verify you really signed up" email bounces, or nobody ever clicks the link, the list owner isn't generally going to be penalized for that.

If you want revenge for modal popups, your best bet is to create a bunch of throwaway email accounts, subscribe to the mailing list from them, and start reporting the individual messages as spam when they arrive. Flag them as junk at the mailbox provider (Gmail, Outlook, etc.) and use the links in the List-Unsubscribe headers to flag them at the ESP's end, too.


If you're trying to get the web site's mail server blacklisted, definitely.


Aka throw the RFC out of the window and implement a broken system because Google did that?


> I said this in another comment but this seems like a naive way

That's the standards-compliant way. Also I'd argue that spec'ing your code to handle cases where Google fails that badly is (was?) a poor allocation of LoCs.


You're entirely missing the point by blaming this on Google. This is meant to detect and handle some failure modes, and they could happen to anyone (including Google), for reasons that can be both inside and outside their control.


I had this issue with GitLab. My email provider returned a permanent error one day (due to an issue on their end), so GitLab silently stopped sending any emails to my address. I checked my email in the preferences many times and had no idea it was blocked on GitLab's end. Eventually, after not getting any notifications, I contacted their customer service and was told of this hidden setting.

So if you are not getting any notifications from GitLab, even though your email is correct, I suggest contacting them and asking if you have been blocked due to an error.


My account with Amazon went into review because of this. I hope their team is aware of it.


I posted this as a problem in my problem validation platform[1] and a user has built a quick solution by displaying a token if the email service received an email from the sender.

[1]'Check email service status before sending emails' - https://needgap.com/problems/178-check-email-service-status-...


Great point. And potentially email delivery services that have auto-suppression lists to protect reputation, at least they might be able to remove entries on behalf of their customers.


Good. I was hoping this was the case. Unfortunately I already moved to fastmail so there will be little benefit to me.


Oh no.


Interesting response. And spot on from the technical integrity side. It’s also more fair to email providers as a whole to treat them all the same and respect their error messages. I mean, maybe there’s even requirements in some jurisdictions to deal with the address not found error in a specific way. As an email sender I think I’d prefer the message get auto re-sent after Gmail comes back online though.


> Because Gmail is sending a permanent failure, our mail servers will not automatically retry sending these messages (this is standard practice at all email services for handling permanent failures).

I fear that this will lead to many lost mails. In my experience, users are often confused by the technical "Mail delivery failed" mails and tend to ignore them or write them off as spam.


>P.S. You might also consider asking your contacts who are still using Gmail to switch to ProtonMail for more private communications


Confirmed likewise.


This feels like a cheap shot at Google. Shit happens, and they're not immune to it even if the servers are located in Zurich. Running a datacenter is no easy task.


I see it as Protonmail explaining to their users that the failure is not on their end and why they can't do much about a remedy. Seems purely factual. A cheap shot would be generalizing from the event, but I don't see them doing that.


I think I got this completely wrong. What you and other responses are saying makes sense.


Being down is okay. Returning an error message that results in the data being thrown away instead of being requeued is not. Block incoming smtp connections until your app layer is fixed.


> Block incoming smtp connections until your app layer is fixed.

Or return one of the 4xx status codes, which indicate a less-permanent failure state, like:

- 451 Requested action aborted: local error in processing

Which is kinda like an HTTP internal server error, as it can mean anything.
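
For anyone not used to reading SMTP replies, the first digit of the three-digit code carries the class, and that is all a sender needs to choose between "retry later" and "give up". A small Python sketch, with `classify_reply` being a name invented here and the parsing deliberately simplified to a single reply line:

```python
import re

# A reply line starts with a three-digit code, followed by a space, a hyphen
# (for multi-line replies), or the end of the line.
_REPLY_LINE = re.compile(r"^([2-5]\d\d)(?:[ -]|$)")

_CLASSES = {
    2: "success",
    3: "intermediate",
    4: "transient failure",   # sender is expected to queue and retry
    5: "permanent failure",   # sender should not repeat the same request
}


def classify_reply(line: str) -> str:
    """Map one SMTP reply line to its reply class via the first digit."""
    match = _REPLY_LINE.match(line)
    if match is None:
        raise ValueError(f"not an SMTP reply line: {line!r}")
    return _CLASSES[int(match.group(1)[0])]
```

With this, the 451 line above comes out as a transient failure, while the `550-5.1.1 The email account that you tried to reach does not exist.` line from the outage comes out as a permanent one, which is exactly why senders stopped retrying.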


For my comment’s purposes, I assume if this was possible with a flag or config setting (and the code path existed), it would’ve already been done. Doesn’t seem like they can, so they should’ve pulled the handbrake and gone “full stop” without throwing everyone’s mail away (hence blocking incoming connections and letting the mail sit in all of the external MTA queues).

Another option would’ve been to accept everything with a very lightweight smtp ingest service, journal it all, and play it back to the full frontend after their code fix was pushed out.

Not an SRE so ¯\_(ツ)_/¯ just some thoughts from my time in a similar role and similar pain points (but thankfully not at this scale)


Yeah, this is a particularly pernicious failure given how email works. Many mailing providers will just mark these as blacklisted, now, and lots of unsophisticated users likely won't notice.


I consider myself sophisticated enough, but my Bitwarden has 700 accounts, of which ~30% old ones are registered with a gmail address, and the rest are handled behind g suite. Granted that last bit might be partly my fault, even though I paid for it. But even for a "sophisticated" user, I have no easy way of knowing if any of these accounts have silently failed to function now, other than by the passage of time and eventually finding out.


Oh, absolutely, even for sophisticated users mitigating may be difficult or impossible depending on exactly what bounced and how. But you at least are aware that this happened, and that you have a problem. Think how many people are out there with no clue what this error meant, or that it signaled an ecosystem problem, or that just had hundreds or thousands of emails silently bounce and unsubscribe.


A lot of people and companies use Gmail. Email providers are definitely getting support requests from users that don't know what's going on.

This is not a cheap shot, but a message to inform users that it's an issue with Google that Protonmail can do nothing about.


More like "if mail to Gmail fails it's not us, so please don't flood support with complaints".

> Running a datacenter is no easy task.

Sure, but then there are very few companies which have more experience with running data centers and (normally) providing reliable email service.

So any outage lasting more than just a short time is very unusual. I'm really interested in what went wrong.


CEO of an email marketing platform here (EmailOctopus). If anyone's curious, here's a chart showing our bounce rate to Gmail addresses over the course of the week:

https://pbs.twimg.com/media/EpUE20UXYAEa_Uv?format=jpg&name=...

That's a peak of 90% of Gmail inboxes bouncing – and this has been going on for almost 24 hours.


I know this is your livelihood, but as someone who basically never wants marketing emails, all I can think is "nice" hopefully I get auto-unsub'ed from a ton of lists.


If they normally successfully deliver to gmail, it's safe to assume a large number of people who do receive their emails want to receive them.


This is very charitable. How many people live with the nuisance of mailings (that they unknowingly or knowingly subscribed to) versus those who actually go through the trouble of unsubscribing or marking as spam in the hope of getting rid of them from their inbox?


I normally just delete mailing list mails. I don't even read them.

This year i decided to do "something" about it, so every mailing list mail received in my inbox that i don't want/care for gets an unsubscribe. It has already reduced my daily mails by a somewhat large amount. It's hard to say exactly how much, but i estimate around 10 emails less every day.

Most of the unsubscribed lists are from companies where i've purchased something and the seller took the liberty of subscribing me to their mailing list. Those are mostly pre-GDPR that i've just never gotten around to dealing with.

The exception is of course obvious spam mails, for which unsubscribing will probably do more harm than good.


That conclusion makes zero sense to me unless counting on the nebulous nature of the descriptor, “a large number”. They deliver successfully to my Gmail account on a regular basis so I must want to receive it? Feels like you’re telling me to stop dressing like a slut. ;)


Totally agree, especially as I signed up for exactly zero of them.

Rant: As a side note, I usually try to buy direct when shopping online rather than through Amazon (for all but the most trivial purchases), and this is the 2nd largest drawback (behind filling in CC and shipping info): just because I bought one item from you, once in my life, does not mean you get to send me a daily email and then, when I unsubscribe, pretend like I signed up for them! For me it’s one of the easiest ways to destroy brand loyalty/reputation.


This would affect all email types including emails like receipts, shipment confirmations, password resets, account verification.

Plenty of critical communications get caught in this storm...


How do the public gmail addresses compare to the enterprise (used to be G Suite, now Google Workspace) ones?


I would be very interested to know this as well. I am trying to switch my company over to Google Workspace right now and support has been telling me my signup issues will be "resolved in 48 hours or less."

What a joke. And this after we're leaving AWS Workmail because of bounced emails.

No luck with signing up so far.


Heavily recommend you don't switch your company over to Google. Microsoft seems to understand that in the enterprise world you actually have to have support personnel, not just an opaque AI without chance for appeal


Google has decent support for paying customers.


You can actually appeal things when you start paying.


Consider yourself lucky. I have had some AdWords ads in the "approval" process for 6 months now. I kid you not: every Friday I receive an email stating that the update will be sent to me on Monday (insert date here). Then nothing happens on Monday, until Friday comes and I get the exact same copy, only the date is different. At this point I literally laugh.

About your query

I gather that you are concerned about your Ads Disapproval for your Google Ads Account.

Observation

I understand that this is taking a bit longer as we are working with a limited staff due to Global pandemic and there is another team who reviews the account so there can be a slight delay in the decision I apologize for the inconvenience caused as I understand this is not the answer which you are looking for but be rest assured I will get back to you on coming Friday 12/18/2020 end of business day.

For any further assistance, I am just an email away.

Sincerely,


SLA of less than 99,5%... Or if there is multiple issues even sub 99%... That really is a joke...


Anecdotally, my enterprise account seems unaffected.


Also anecdotally, during the outage, test messages from my non-gmail account to my standalone/non-enterprise gmail accounts consistently bounced; test messages from my non-gmail account to my G Suite Business-associated account went through.


Serious question: how would you know that you are receiving ALL emails from ALL senders?


Totally valid, and I wouldn't. The status page indicates that "Google Workspaces" is affected, but I don't know if that is synonymous with what I have (which was Google Apps a decade ago, unsure now). All I can say is I was receiving emails during the affected window.


As an ESP, how much of a headache will this be for you in weeks/months to come? I'm guessing this throws a huge wrench in deliverability techniques--how're you handling it?


It's a real headache but should be fully reversible. @shmoogy hit the nail on the head: we'll run through our events in that timeframe, inspect the raw bounce reason to check it relates to the Gmail outage, then undo the actions that the bounce caused.

The reason why this is so nasty is not because Gmail went down, but because they returned a 5XX permanent failure and not a 4XX temporary failure for these bounces. Literally every email provider will respond to a permanent bounce by suppressing all further emails to that email address (it's permanent, after all!), so the fallout from this will be huge.
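
For anyone who hasn't dealt with bounce handling, here is a minimal sketch of the distinction in Python. The function name and the action labels are made up for illustration and are not any ESP's actual API; the only thing taken from the SMTP spec is that the first digit of the reply code (2, 4 or 5) carries the success/temporary/permanent meaning.

```python
def classify_smtp_reply(code: int) -> str:
    """Map an SMTP reply code to the action a sender typically takes.

    The action names are hypothetical; only the 2xx/4xx/5xx split is standard.
    """
    if 200 <= code < 300:
        return "delivered"
    if 400 <= code < 500:
        return "retry"      # transient failure: keep the message queued
    if 500 <= code < 600:
        return "suppress"   # permanent failure: bounce, usually suppress the address
    return "unknown"


print(classify_smtp_reply(421))  # what an outage would normally look like
print(classify_smtp_reply(550))  # what Gmail actually returned
```

Everything downstream (suppression lists, auto-unsubscribes) hangs off that one branch, which is why the choice of 550 hurt so much.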


I would imagine since it's a known timeframe, domain, and error response, they can cleanly remove the suppression lists.

I logged into our sendgrid and mailgun accounts and manually purged all the failed gmail records.
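
If you would rather script it than click through a UI, the filter is roughly the following. This is only a sketch over an exported list: the `Suppression` record, the field names, and the time window are placeholders I made up, not the SendGrid or Mailgun export format, so adjust to whatever your provider actually gives you.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Suppression:
    """Hypothetical shape of one exported suppression record."""
    address: str
    created_at: datetime
    reason: str


def outage_suppressions(records: List[Suppression], start: datetime, end: datetime,
                        domain: str = "gmail.com",
                        marker: str = "550-5.1.1") -> List[Suppression]:
    """Return suppressions that match the outage: right window, domain and error."""
    suffix = "@" + domain
    return [
        r for r in records
        if start <= r.created_at <= end
        and r.address.lower().endswith(suffix)
        and marker in r.reason
    ]


# Placeholder window; substitute the real start and end of the incident.
START = datetime(2020, 1, 1, 20, 0, tzinfo=timezone.utc)
END = datetime(2020, 1, 2, 0, 30, tzinfo=timezone.utc)

records = [
    Suppression("alice@gmail.com", datetime(2020, 1, 1, 21, 0, tzinfo=timezone.utc),
                "550-5.1.1 The email account that you tried to reach does not exist."),
    Suppression("bob@gmail.com", datetime(2019, 12, 25, 9, 0, tzinfo=timezone.utc),
                "550-5.1.1 The email account that you tried to reach does not exist."),
    Suppression("carol@example.com", datetime(2020, 1, 1, 21, 5, tzinfo=timezone.utc),
                "550 5.1.1 user unknown"),
]

to_restore = outage_suppressions(records, START, END)
print([r.address for r in to_restore])
```

Note that this only covers @gmail.com addresses; hosted domains would need their own matching, for example by MX lookup.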


Might also be affecting GSuite/Workspace emails.


The hard bounce status might be stored outside of your lists. I am not sure customers can easily change a hard bounce status themselves. Do you mean you just deleted those records with intent to re-add to reset the status? On our BigMailer platform this wouldn't work as hard bounce status would get preserved.


We use SendGrid and Mailgun right now, and both of these expose the suppression list, email address, time, and reason code + description. In SendGrid you can filter and mass-select to remove suppressions easily (which was great). In Mailgun I had to export a CSV and just removed them manually, as there were not too many across my accounts.

Customers generally cannot change this on their end as far as I can imagine -- this is on the ESP end and is a protection built in because you are sending from their IP / Server and they don't take kindly to that.


+1 what Jonathan said. Typically, when email service providers are down the response code indicates a temporary issue with a soft bounce code, so you can still try to send to that address in the future.

The action for rectifying isn't too difficult, but the implications are still pretty big...


Mailgun added a few new suppressions due to bounced Gmail addresses. Hope ESPs just flush those out.


Thanks for sharing Jonathan, unprecedented situation. And that's just gmail.com addresses we can see data on, while there are all those business domains that use Google Apps for their email that probably experienced a similar issue...


What's this do to your mail-queue size - let's see that chart


Permanent failures, as these are being flagged, don't stay in the queue.


"Type: Permanent; SubType: General; Code: smtp; 550-5.1.1 The email account that you tried to reach does not exist. Please try 550-5.1.1 double-checking the recipient's email address for typos or 550-5.1.1 unnecessary spaces. Learn more at 550 5.1.1 https://support.google.com/mail/?p=NoSuchUser y128si147264pfg.177 - gsmtp"
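
In case it helps anyone matching these in their logs: a rough Python sketch for pulling the reply code and the enhanced status code out of a bounce string like the one above. The regex and helper names are my own, and real bounce text varies a lot between providers, so treat this as a starting point rather than a robust parser.

```python
import re
from typing import Optional, Tuple

# Three-digit reply code, then a space or hyphen, then an enhanced status
# code of the form class.subject.detail (e.g. 5.1.1).
STATUS_RE = re.compile(r"\b([245]\d\d)[ -]([245]\.\d{1,3}\.\d{1,3})\b")


def parse_status(bounce_text: str) -> Optional[Tuple[int, str]]:
    """Return (reply_code, enhanced_code) for the first match, or None."""
    m = STATUS_RE.search(bounce_text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


def is_permanent(enhanced: str) -> bool:
    """Class 5 enhanced codes signal a permanent failure."""
    return enhanced.startswith("5.")


sample = ("Type: Permanent; SubType: General; Code: smtp; 550-5.1.1 "
          "The email account that you tried to reach does not exist.")
print(parse_status(sample))
```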


This is pretty much the worst response possible. Hard bounces mean that email delivery services are going to start automatically removing, or at least stopping delivery to, entire slews of email addresses.

A lot of clean up is going to be needed as a result of this.

To add some more details, when using a 3rd party email delivery service, those services will either black-list or just outright remove email addresses when they get a hard bounce "email address no longer exists" message back.

Some providers make re-adding an address after a hard bounce a non-trivial task, since after all, the authority on that email address just said it doesn't exist.

This is going to be really ugly.


I really cannot believe they did not immediately hack in a new rule to their SMTP server: never return a 5xx (permanent failure), instead return a 421 (temporary failure try again later).

That simple fix buys them 24-72 hours to solve this properly.

Yeah, it burdens servers sending mail to them because now they have to hold on to all mail (including mail that really is permanently undeliverable) for another day or so, but that's still better than what's happening right now.
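
To be concrete about the rule I have in mind, here is a toy sketch in Python. I obviously have no idea what Gmail's SMTP frontend looks like, so the function, the flag, and the reply text are all invented; the point is only that the rewrite itself is a few lines of logic.

```python
from typing import Tuple


def soften_reply(code: int, text: str, emergency: bool) -> Tuple[int, str]:
    """While the (hypothetical) emergency flag is on, turn any permanent
    5xx reply into a temporary 421 so that senders queue and retry."""
    if emergency and 500 <= code < 600:
        return 421, "4.3.0 Temporary system problem, try again later"
    return code, text


print(soften_reply(550, "5.1.1 The email account does not exist", emergency=True))
print(soften_reply(250, "2.0.0 OK", emergency=True))
```

The downside is the one already mentioned: genuinely dead addresses also get a temporary failure until the flag is turned off.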


Why would that be better than just shutting off the delivery stack altogether?


5xx error results in suppression list addition of an email, so future emails won't be delivered (by most ESPs), and not returning MX response would probably be just as bad, or worse (or result in millions/billions of emails being re-queued due to timeouts?)

His solution would result in exponential retry failures baked into most services, which would buy them a few hours, and result in no lost emails, and no suppression list additions.
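
For a feel of what that retry behaviour looks like on the sending side, here is a small sketch of a backoff schedule in Python. The numbers (15 minute start, doubling, 4 hour cap, give up after 72 hours) are assumptions picked for illustration, not the defaults of any particular MTA.

```python
from typing import List


def retry_schedule(base_minutes: int = 15, factor: int = 2,
                   max_interval_minutes: int = 240,
                   give_up_after_hours: int = 72) -> List[int]:
    """Return the wait (in minutes) before each retry attempt.

    Intervals grow by `factor` up to a cap; retries stop once the next
    attempt would fall after the give-up deadline.
    """
    limit = give_up_after_hours * 60
    delays: List[int] = []
    elapsed = 0
    interval = base_minutes
    while elapsed + interval <= limit:
        delays.append(interval)
        elapsed += interval
        interval = min(interval * factor, max_interval_minutes)
    return delays


schedule = retry_schedule()
print(schedule[:6], "...", len(schedule), "attempts")
```

With a temporary failure the message just rides this schedule until the receiver recovers; with a 5xx it never enters the schedule at all.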


Failure of response from the server is nearly always treated as temp failure, because it could be down to network connectivity, name resolution, etc.

That is a better scenario, than 5xx.


Inability to contact the destination would be treated as a temp-failure by the origin, and taking the service off the air could be effected instantly.


In case less than 100% of gmail is experiencing this bug.


This outage seems to have lasted for about 2.5 hours. Probably this was fixed by rolling back whatever caused it. (I don't think the rollout was finished before they resolved it; my mail server sends a lot of emails to Gmail addresses, and even at peak I was only seeing maybe about 1/3 mails be rejected.)

There is no way that putting in a hardcoded hack like that would have been faster. Making the change is, of course, fast.

But then you need to review it (and this is a super risky change, so the review can't be rubber stamped). Build a production build and run all your qualification tests. (Hope you found all the tests that depend on permanent errors being signalled properly). And then roll it out globally, which again is a risky operation, but with the additional problem that rolling restarts simply can't be done faster than a certain speed since you can only restart so many processes at once while still continuing to serve traffic.

The kind of thing you describe simply can't be done by changing the SMTP server, in 2.5 hours. The best you could get is if there was some kind of abuse or security related articulation point in the system, with fast pushes as required by the problem domain but still with the sufficient power to either prevent the requests from reaching the SMTP server at all, or intercept and change the response.

As a trivial example, something like blocking the SMTP port with a firewall rule could have been viable. Though it has the cost of degrading performance for everyone rather than just the affected requests.


This has been going on for 2 days, not 2 hours.


The linked status page shows a 2.5 hour duration.

My mail server logs show about 20 failures in all of the last week until yesterday 20:43 CET, then 350 failures between 20:43-00:21, then nothing after that. So fair enough, from the client side rather than the status page it looks like 3.5 hours rather than 2.5.

But still, given that resolution time, the suggested solution of changing the SMTP server is absolutely ludicrous.


Yes. I email hundreds of thousands of Gmail users each week (yes, double opt in, they all want the mails!) and we immediately delete any user for whom any Gmail error comes up at all in order to keep a solid delivery record with them. Sounds like we might have deleted 80% of our list if we'd sent today..!


My sanity tests started acting flaky ~3 hours ago, I never thought it was Gmail...

Kind of happy I had to do something else and I didn't burn hours investigating.


So, new thing to do: quarantine addresses instead of deleting them, and if most addresses for one provider fail, give them another (maybe manually triggered) try later on.

(And if no such thing is detected, delete the quarantined mail addresses.)


My guess is that's how most email service providers handle this: they don't actually delete the email address and just put a flag on it (bounced, complained, unsubscribed). This way the list owner can run an export and see all the status codes.


Hope you have a backup just in case.


Yes, we're unusual in not relying on third parties for list management. We can rollback. Or I might just comment out the 'unsub on hard bounce' code for the rest of the week..! :)


Unsub on two consecutive bounces seems more reasonable to catch flukes (or Gmail going down)?
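
Something like this, as a rough Python sketch (class and method names are invented for illustration):

```python
from collections import defaultdict


class BounceTracker:
    """Unsubscribe only after `threshold` bounces in a row for an address."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.streak = defaultdict(int)  # address -> consecutive bounce count

    def record(self, address: str, bounced: bool) -> bool:
        """Record one send result; return True if the address should be unsubscribed."""
        if not bounced:
            self.streak[address] = 0  # any successful delivery resets the streak
            return False
        self.streak[address] += 1
        return self.streak[address] >= self.threshold


tracker = BounceTracker()
print(tracker.record("alice@gmail.com", bounced=True))   # first bounce: keep
print(tracker.record("alice@gmail.com", bounced=True))   # second in a row: unsubscribe
```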


Yes, most likely! That is a common approach for 'soft bounces' in most list management systems (e.g. MailChimp).

The problem here is Gmail has been throwing out "NoSuchUser" errors which are an instant unsub in most systems because Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.

I'm extremely paranoid about email hygiene, tiny bounce rates and high delivery rates, so we aggressively unsubscribe troublesome addresses (often to the point of getting reader complaints about it) for many reasons beyond that, however.


> Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.

I think you mean "reputation purposes"?

If so, wow, that sucks. Their opaque rules have conditioned their counterparties to punish Google as hard as possible for a screwup.


> Their opaque rules have conditioned their counterparties to punish Google as hard as possible for a screwup.

Good for karma, bad for everyone though.


> I think you mean "reputation purposes"?

That better describes what I was trying to say, yes. Reputation then affecting deliverability.

Over 80% of our subscribers use Gmail so to say I'm paranoid about maintaining a good record with them is an understatement ;-) Gmail is a huge weak link for us.


Ah, thanks for the explanation.


Logically you'd expect unsubscribe to only act after lots of bounces of this format when the address has been receiving mail fine before. It also seems reasonable not to trust such bounces for the entire domain for a while when this happens to lots of other addresses that have worked fine before. Not that I expect software currently works this way, but it does seem like a common sense thing to code in.


I mean, it's possible, but you'd need to queue up a day's worth of bounces, do the analysis, and then handle the bounces asynchronously later on to do that.

Most systems operate more immediately in isolation on individual addresses than that right now, because such analysis is generally not needed (until today, of course ;-)).


Mail agents already queue emails that bounce though; it's a matter of changing the conditions for when you retry and/or unsubscribe. I imagine you can do the analysis in real time too... just look at the bounce and see if it pertains to an email you sent to in the past, and if so, increment some rolling counter for that domain.
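
Roughly what I have in mind, as a Python sketch. The class, the action names, and the thresholds are all invented for illustration; a real system would persist this state and tune the numbers.

```python
from collections import defaultdict, deque


class DomainBounceMonitor:
    """Rolling count of hard bounces to addresses that previously accepted mail."""

    def __init__(self, window_seconds: float = 3600, threshold: int = 50) -> None:
        self.window = window_seconds
        self.threshold = threshold
        self.delivered = set()             # addresses with at least one past delivery
        self.events = defaultdict(deque)   # domain -> timestamps of suspicious bounces

    def record_delivery(self, address: str) -> None:
        self.delivered.add(address.lower())

    def _trim(self, domain: str, now: float) -> None:
        q = self.events[domain]
        while q and q[0] <= now - self.window:
            q.popleft()

    def record_hard_bounce(self, address: str, now: float) -> str:
        """Return 'unsubscribe' for never-delivered addresses, else 'retry_later'."""
        address = address.lower()
        if address not in self.delivered:
            return "unsubscribe"
        domain = address.rsplit("@", 1)[-1]
        self.events[domain].append(now)
        self._trim(domain, now)
        return "retry_later"

    def suspect(self, domain: str, now: float) -> bool:
        """True if the domain has bounced many known-good addresses recently."""
        self._trim(domain, now)
        return len(self.events[domain]) >= self.threshold


monitor = DomainBounceMonitor(window_seconds=60, threshold=3)
for name in ("a", "b", "c"):
    monitor.record_delivery(name + "@gmail.com")
for i, name in enumerate(("a", "b", "c")):
    monitor.record_hard_bounce(name + "@gmail.com", now=10.0 * i)
print(monitor.suspect("gmail.com", now=20.0))
```

When `suspect` fires you stop trusting hard bounces from that domain for a while, instead of acting on each one in isolation.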


Their SMTP server being unreachable is a 4xx temporary error. The sender MUST keep trying for at least 24 hours, and 72 hours is recommended.

"Gmail going down" would not have caused this problem. Even if all their SMTP servers went offline.


Yeah, they would have been better off pulling the (metaphorical) plug—maybe block incoming traffic to port 25 or something—until they had this fixed.


Mailgun sent a warning mail about increased bounces from our account. Sure, they know what's going on... but we send 4-5 digit numbers of mails per hour, so it's a lot of bounces.

That means I can't just resend the emails blindly, because I'm too scared to trigger some sort of automatic suspension...

(I don't do this regularly, so I'm not familiar with all features... additional mail verification could help probably ....)


They should be returning 421 for backend outages so that sending servers queue and retry the emails. 550 can be interpreted by some as deleted [1] or even banned accounts in some cases. Maybe someone here could convince them to change the logic that occurs during an outage.

[1] - https://en.wikipedia.org/wiki/List_of_SMTP_server_return_cod...


Yah. Maybe there's an unexpected way that things can fail resulting in 550's. But maybe at Google's scale you should have some kind of kill switch to stop answering SMTP or to not send permanent errors at all, so that you could flip a switch and prevent the worst consequences of this rather than let it go on for a couple of hours.


Absolutely this.

I am astonished that either (a) this switch has not been flipped yet or (b) this switch does not exist.

Somebody is incompetent here.


Perhaps Gmail is just being discontinued ;)


don't get my hopes up!


A lot of people will lose transactional email messages, because of this.

I'd absolutely hate to be hit by this at this time. Thankfully I made a time investment to run my own mail server years ago. The handful of times it broke down, it either went offline or started returning 4xx codes due to a misconfigured or broken milter after an update. Neither meant lost messages from normal senders that use queuing MTAs.


Same for me, mainly for privacy concerns. And I back it up daily to my local NAS. It's so easy to configure and run your own mail server, that I'm surprised we are the minority in the tech community.


> It's so easy to configure and run your own mail server

Is it? Is dealing with IP reputation, getting your emails accepted by major providers, and being on the hook for fixing everything yourself very easy? I haven't tried, so I don't have personal experience, but I've heard enough horror stories to think that it's not a good use of my time.


Sending side of the MTA can be set up manually in about an hour on a Debian server, with dmarc, dkim, spf, etc. Make that a day if you want to read up on and understand each of the things in more detail, if you haven't configured them before. There's really not much to play with in this direction for a typical personal mail server.

Receiving side is where there is a great range of options, and many things to try and have fun with. You can have anything from a single catchall mailbox with no filtering, no GUI, and a simple IMAP or POP3 access for MUA, to a multi-account, multi-domain setup with server side filtering, database driven mailbox and alias management, proper TLS, web MUA access, etc. It can also be built up gradually, starting from very simple setup to something more complicated so that you never lose account of how things work.


Mine are accepted by Gmail so I am good. Considering how dominant Gmail is, that's all that really matters.

Regarding getting a bad IP rating, normally that's due to having an insecure config, like acting as an open relay, or not having DKIM enabled. There are lots of tutorials online about this, if you know Linux it really is easy.


I had an IP reputation issue and managed to resolve it after some time.

TLDR: Before you spin up a mail server, check if your IP address is on any of the blacklists [0]-[1] as well as Proof Point's list [2]. If it is, then try and get a different IP address.

I spun up a hosted server on Digital Ocean and received an IP address. I checked several black lists from a few email testing/troubleshooting sites [0] and [1] and all was groovy; my IP address wasn't on any list.

I got a bunch of 521 bounces when I tried emailing a neighbor who had an att.net address.

So, I checked the troubleshooting websites, and my IP address was listed as clean.

My logs said I should forward the error to abuse_rbl@abuse-att.net, so I did.

Those emails were never delivered, because abuse-att.net had its own blacklist. I was getting 553 errors. In the logs, the message from their server told me to check https://ipcheck.proofpoint.com.

Proof point runs their own blacklist that some enterprises use (e.g. att and apple [3]). I checked their list, and lo and behold, my IP address from Digital Ocean was blocked [2]. Digital Ocean wasn't able to remove the IP address from their blocklist and suggested I spin up a new droplet with a different IP address.

I didn't want to do that, so I sent Proof Point an email that went unanswered; the email asked them to remove my IP address. I forgot about the issue for five or six months (this is a personal server), and ran into the issue again a few months ago. So I sent Proof Point an email again, this time with different wording emphasizing that "my clients" were having delivery issues. Within a day, they removed my IP address from their block list.

So, my main suggestion is to check if your IP address is on any of the blacklists as well as Proof Point's list before you start on your server. If it is, then try and get a different IP address.

Does anyone have more "enterprise" lists, like Proof Point, to check?

[0]: https://www.mail-tester.com/

[1]: https://mxtoolbox.com/blacklists.aspx

[2]: https://ipcheck.proofpoint.com

[3]: https://www.reddit.com/r/email/comments/6toxzr/ip_blocked_by...
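
For the DNS-based lists (this does not apply to web-only checkers like the Proofpoint page), you can also script the lookup. The convention is to reverse the IPv4 octets, append the list's zone, and see whether the name resolves. A minimal Python sketch; the zone `dnsbl.example.org` is a placeholder, not a real list, and the helper names are my own:

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL lookup name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """True if the lookup name resolves. A failed lookup is treated as
    'not listed', although it can also mean a DNS problem."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# 192.0.2.10 is a documentation address; the zone is a placeholder.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Check each list's usage policy before querying it in bulk; some refuse queries coming through large public resolvers.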



It may be helpful to note that Google has acknowledged they are working on similar issues (the description is vague!) with an ETTR of 1900 EST:

https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

On the other hand, their status dashboard reported similar issues yesterday and here we are again: https://www.google.com/appsstatus#hl=en&v=status


Yes, hard bounces even between Gmail addresses.


just curious, how did you check bounces stats for Gmail?


I also had the same hard bounce (when emailing from a non-gmail address -- fastmail -- to a gmail address). Sent it again minutes later and then it worked.


Incoming Gmail is bouncing, but I'm still able to access all prior received messages.


TL;DR: Don't send your newsletters today if you can avoid it.


Over the past 24 hours, I've had GitHub request that I re-verify my gmail three times (roughly 22 hours ago, 2 hours ago, and now), each time resetting my primary email's status to "Undeliverable" and "Unverified"

The triggering event may be an email bounce. I get a lot of github notifications sent to my email, and the failure of just one/a few may trigger the reverification.


This is another good reason to have email @yourowndomain.tld

When this happens, you can spin up a temporary server and have a mechanism in place to redirect email so you don't go down when your provider does.


I've had way more downtime trying to run my own domain's mailserver for a year than I have with gmail for more than a decade.


That's not what I said. With some emphasis added:

> When this happens, you can spin up a temporary server and have a mechanism in place to redirect email so you don't go down when your provider does.

Use a commercial provider, but fall back to your own server when it goes down without changing your email address.


I see two problems here: The likelihood your service is restored before you spin up your own mail server, and the fact that, not expecting this failure, their DNS may have a fairly lengthy TTL.


https://mailinabox.email/ Can be set up relatively quickly


What about a permanent problem, like a suspended account?


In that case, owning your own domain is golden. I just don't see "spin up your own mail server" as a short term solution.


Having run my own mail server for over a decade, I have yet to see a single time where the server responded with a permanent error saying accounts did not exist and bounced the email.

Losing incoming email is pretty much the worst-case scenario when it comes to configuration errors. It's about as bad as not having backups, in that both cases result in unrecoverable loss of data.


Use a paid email host, just anything but Google. Life's too short to put up with managing your own email server.


It can as well be Google, just the paid Apps version. Zero time to get used to a different UI. I suspect there must be a solution to easily migrate all your tags and filtering rules. (Tags are the killer feature to me. Outlook sort of has them but they are less flexible.)


does the paid apps version have better uptime? Is it not affected by the current issues?


My company has paid apps, and we have been facing issues same as everyone else.


I switched to a custom domain only when gmail torpedoed one of my secondary gmail accounts.


You can redirect to a commercial service as well.


Not me, and I'm not even paying for the services I've been switching between.


Keep in mind other stuff like DNS will go down randomly. At least they won't result in a permanent address-doesn't-exist error, but you'll be putting out potentially more fires that way.


I just switched to Fastmail before all this.


Except as an academic exercise, trying to roll and maintain your own email is fraught with difficulties.


You can forward handling to a provider, like gmail. The idea is that you own your email address and can switch providers more easily if you are not satisfied with them or they turn out to be evil.


still use gmail to manage email lmao


Yep, there was a very similar event yesterday, approx. 22 hours ago: https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=10...


I figured one major incident for Google was enough for the day! We had a bunch of email bounce to @gmail domains yesterday in that timeframe.


When that happened I panicked a little, realizing how much Google Sheets data I had that wasn't really backed up anywhere since Sheets files in Google Drive are basically just links. I started a Takeout, but it looks like I wasn't the only one - it took well over a day to complete.


Be sure to verify that it worked. Some settings of Takeout don’t download docs/sheets/slides files. I don’t remember what the default is, unfortunately.


Same from LinkedIn


As quite a few googlers appear to read and write on HN, I'd really welcome an insider info on what's going on the last few days.

Sure, there will be some internal turmoil going on right now, but isn't there some non-confidential info to share? I can't imagine this would hurt the image of Google in either the short or the long run; quite the opposite.


I don’t work at Google, I’m at a different big tech that’s in the news frequently. Sharing inside info on an ongoing incident is a great way to get fired. Big tech companies are way different than startups where everyone can do a bit of anything. There are people whose job it is to handle that communication. You make their job a lot harder if you disclose information. The company is so big that as an engineer you may not know all the factors involved in what would hurt the company long term - undisclosed relevant litigation, compliance commitments, partner obligations, etc.

How much do you hate it as an engineer when sales people make tech promises to customers without asking you? For comms people, engineers leaking info publicly feels the same way.


I am very pleased to see this response, genuinely. Our technical curiosity aside, there are literally people and teams in such big firms dedicated to this.


What you're saying makes sense but I don't think it really applies to anything the OP said. The "non-confidential" qualifier indicates to me that they only want people to share what they can responsibly.


And the parent post’s point is that there are people whose job it is to specifically share that information, and so we should let them do their job. They are the domain expert in this particular task.


For any incident like this there are tons of details that are both

1) Harmless to share

2) Will never be shared by PR teams

I don't see anything wrong with asking people to share what they can.


There’s nothing wrong with asking. I’m just explaining that as a Google employee, sharing such details is poor form.


[flagged]


> These companies wouldn’t hesitate to kick you out on the street if they had to

> Sharing inside info on an ongoing incident is a great way to get fired

You're not disagreeing.


He literally just said they wouldn't hesitate to kick you out on the street if they had to


In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world class Devops people that roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.


Googler but nowhere near Gmail, so just educated speculation:

* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)

* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).


> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.

As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."


Rollback proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting)


Is "exciting" a synonym for "harrowing" where you're from? :P


Chrome web store has no rollback strategy, there is only roll forward :(


You can build rollbacks out of rollforwards, although it certainly isn't particularly fun. You patch an update to version N version code so that it's higher than N+1 and roll out the N+2 labelled N.


> what if you're in a situation where rolling back could make the problem worse?

Here comes the poison pills!


You don’t really have to speculate, they disclosed yesterday that yesterday’s issue had to do with the automated quota system deciding the auth system had zero quota:

https://status.cloud.google.com/incident/zall/20013#20013003


Thanks for providing this. It's funny to read the speculations when you have read the actual root cause :D

Well, I guess the thing left unanswered for now is why the quota management reduced the capacity for Google's IMS in the first place.

Maybe we will know someday :)


Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.


> It blows my mind that Google can even have problems like this.

When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.


Ex-Googler here.

Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.

But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.

When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.


I speculate that for many companies, working from home has been less disruptive than they expected.

However, I'd speculate that in this instance, when you hit that .0001% problem, having fewer hands on deck makes things harder when working from home. It's akin to remotely fixing somebody's PC versus standing behind them.

With that premise, I'd speculate that in this instance working from home, whilst not the root cause, may have been a small ripple that led to that root cause and/or to a slower resolution than you would normally get.

Those speculations aside, it will only highlight that some tooling needs to adjust for remote workers, as do design and set-ups. Water cooler talk is not just for gossip, and a counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt and embrace the work medium, and so that the kinks and areas that need polishing can be polished and made better for all.

Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match the anecdotes of some out there and resonate with others.


You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.


All the access to the services is remote, but I'd say having the entire team in the same room does help coordinate incident response.


Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the Googlers. Hope they get this right.


When I was there they had an IRC network for this reason. I hope they still do. Not quite the same as VoIP but fewer dependencies...


That's why the network folks at Google and AWS use IRC for just that purpose. Simple, no external dependencies, just works.


Software isn't as simple as splitting across different locations to prevent global failures.


I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing at the sender MTA side, etc.) and there's an easy hard boundary at the user mailbox level you can use to partition your system.

It should not be a problem that Gmail is "down". Unless this went on for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code, not a temporary one.


It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily. If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of what email addresses are valid, and the accepting server has to look it up and react in the time that the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email. They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.


I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.


Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.

I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.
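
As a rough illustration of that categorization, here is a minimal Python sketch of a sender-side decision. It relies only on the SMTP convention that 4xx replies are transient and 5xx replies are permanent. The function name, the `delivered_before` override, and the retry limit are assumptions for illustration, not how any particular MTA behaves:

```python
def next_action(code: int, delivered_before: bool = False,
                attempts: int = 0, max_soft_attempts: int = 3) -> str:
    """Decide what a sending system does with an SMTP reply code."""
    if 200 <= code <= 299:
        return "done"    # accepted by the receiving server
    if 400 <= code <= 499:
        return "retry"   # transient failure: keep the message queued
    if 500 <= code <= 599:
        # Application-specific choice (an assumption, not standard behaviour):
        # if this address accepted mail before, treat a permanent failure as
        # soft for a few attempts before giving up on it.
        if delivered_before and attempts < max_soft_attempts:
            return "retry"
        return "bounce"  # permanent failure: report back to the sender
    raise ValueError(f"unexpected SMTP reply code: {code}")


print(next_action(451))
print(next_action(550))
print(next_action(550, delivered_before=True, attempts=1))
```

The override branch is exactly the kind of one-off rule that depends on the calling application: a mailing list may want it, a password-reset sender probably doesn't.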


If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaving email server should send temporary failure codes when it can't look up if addresses are valid, and let the sender retry later when the address lookup is working and it can give a definite acceptance or rejection of the email. This is not even remotely a new problem, it comes up in email systems all the time because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.
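
A minimal Python sketch of that rule on the receiving side, assuming the address lookup can report three distinct outcomes. The `Lookup` enum, the `rcpt_reply` helper, and the reply texts are hypothetical; only the 4xx-is-temporary / 5xx-is-permanent convention comes from SMTP itself:

```python
from enum import Enum


class Lookup(Enum):
    EXISTS = "exists"            # directory says the mailbox is valid
    MISSING = "missing"          # directory says, authoritatively, no such mailbox
    UNAVAILABLE = "unavailable"  # directory could not answer (timeout, error, ...)


def rcpt_reply(result: Lookup) -> str:
    """Map an address lookup outcome to an SMTP reply for RCPT TO."""
    if result is Lookup.EXISTS:
        return "250 2.1.5 OK"
    if result is Lookup.MISSING:
        return "550 5.1.1 The email account that you tried to reach does not exist."
    # Anything short of a definite answer gets a temporary failure, so the
    # sender keeps the message queued and retries later.
    return "451 4.3.0 Temporary lookup failure, please try again later"


for outcome in Lookup:
    print(outcome.name, "->", rcpt_reply(outcome))
```

The important property is that a permanent rejection is only reachable from a definite "missing" answer, never from a failed lookup.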


> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

You don't have milliseconds. You can take quite some time to handle the client: tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for HELO is 5 minutes.


If there is something I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how you design your architecture in a distributed way, you will always have a single point of failure (SPOF), and sometimes you discover a SPOF you didn't think of.

Sometimes it's a script responsible for deployment that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example, when AWS routed all production traffic to the test cluster instead of the production cluster).


[flagged]


Your contribution has greatly enhanced this conversation, thank you.


Because, maybe, like in every big company, the thing actually doing the work is some old Oracle database with some huge monolith around it...


Out of all the companies Google might be relying on in their back-end, I think Oracle is probably pretty far down the list.


I can’t imagine what part of Google’s history would lead someone to believe there was any third party system in their production stack anywhere.


Now their corporate/finance stack on the other hand... shudder.


Well, Google did use a bunch of off-the-shelf technologies in the early days, but now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.


Didn't they use GNU/Linux from day one on?


Closed-source like Oracle, I meant. They've been big boosters of all kinds of open-source stuff like Linux, LLVM, MySQL, ...


Hush, you'll scare the shiny-eyed FAANG wannabes away, they aren't supposed to know this until employed for at least two decades.


I would advise anyone not to share any information that his company hasn't explicitly agreed to share.


Your username is rat9988. Been burned in the past?


Management at Google are poking in to check up on their staff, to make sure nothing leaks.


[flagged]


There used to be times when people didn't care for technicalities like this because the focus was on the person's contribution to the discussion.

Now that everyone's replaceable, the popular culture desperately tries to shift focus into arguing about pronouns and terms.

Watch out, this is a road to nowhere. Forcing others to use the right pronoun won't build up your retirement fund, but will distract you from worrying about not having one. And the fact that you care about it more than about your opponent's T-shirt color could be an indication that you are being manipulated to not think about the long-term things.


This is a surprisingly profound and insightful comment so deep in the subthread of, more or less, a shitpost.

Thank you, sir, for elevating our collective level of discourse.


> you are being manipulated to not think

This is where it crosses from insightful into conspiracy theory territory for me. People seem perfectly capable of groupthink-deluding themselves. Why cheapen your argument by postulating some master manipulator when it's not necessary for the deeper point you're making?

It will only lead to people focussing the discussion to challenge this particular aspect, or them disregarding all you've said, instead of engaging with the actual meat of the argument.


'Their' works fine and has been gender-neutral English for ages.


[flagged]


Okay, so use "their." It is gender neutral, so should work for everyone.


It's also wrong, because it's not singular. Makes for difficult reading.


From https://www.pemberley.com/janeinfo/austheir.html:

'Singular "their" etc., was an accepted part of the English language before the 18th-century grammarians started making arbitrary judgements as to what is "good English" and "bad English", based on a kind of pseudo-"logic" deduced from the Latin language, that has nothing whatever to do with English... And even after the old-line grammarians put it under their ban, this anathematized singular "their" construction never stopped being used by English-speakers, both orally and by serious literary writers.'


It's not "wrong." Language is fluid and singular they is widely accepted. A previous poster linked to an article showing centuries of such usage.


> "so what does it matter anymore"

The same reason it ever mattered how you refer to people, politeness and respect. If someone you consider "him" asks you to refer to them as "her" it's like someone asking you to call them by their full name "Rebecca" instead of "Becky" or "Jonathan" instead of "Jon". If you like and respect them, you do as they request because things which matter to them matter to you, and being polite to them is important to you. If you ignore what they ask, call them what you want, you communicate that you don't respect them and don't want to be polite, that you want to dominate and 'win' instead.

> "Pronouns can mean whatever you want them to mean"

Only one way. A specific person asking you to use a specific pronoun for themselves is wildly different from you unilaterally and universally saying that all women should feel included by the word "him" because "him" has no meaning anymore.


"Ages" is subjective; it came back into popularity only recently.


That varies based on location and regional dialect. Here in the northeast US, I remember using singular they/their since the '80s. It would be interesting to know when this became popular elsewhere.


80s in Australia too, been hearing/using it my whole life.

Though with respect to 'ages' apparently it's been around since at least the 14th century but certain purists tried to stamp it out at various times (just like the singular 'you' which no one currently has grammatical issues with I hope).

https://public.oed.com/blog/a-brief-history-of-singular-they...


I remember some people tried to get BLM into German discussions, which made absolutely zero sense, as we have a completely different history and culture. Now I see this popping up. I really hope Europe can get some cultural distance between itself and the USA in the near future. The time is ripe.


> s/his/her/

s/his/their


I believe you mean:

s/s\/his\/her/s\/his\/their/


s#s/his/her#s/his/their# also works and avoids awkward escaping. The first symbol after s is used as the separator. Works in vim, at least.

In other words:

s%s/s\\/his\\/her/s\\/his\\/their/%s#s/his/her#s/his/their#%
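
To make the separator rule concrete, here is a small Python sketch that parses an `s` expression by taking whatever character follows `s` as the delimiter. It is a simplification (flags are ignored, only the first match is replaced, and Python's `re` syntax stands in for sed's), and `apply_substitution` is a hypothetical helper:

```python
import re


def apply_substitution(expr: str, text: str) -> str:
    """Apply a sed-style s<d>pattern<d>replacement<d> expression to text."""
    if len(expr) < 3 or expr[0] != "s":
        raise ValueError("expected s<d>pattern<d>replacement<d>")
    delim = expr[1]  # the first symbol after `s` is the separator
    parts = []       # completed fields: pattern first, then replacement
    current = []     # characters of the field currently being read
    i = 2
    while i < len(expr) and len(parts) < 2:
        ch = expr[i]
        if ch == "\\" and i + 1 < len(expr):
            nxt = expr[i + 1]
            if nxt == delim:
                # An escaped separator stands for the literal character.
                current.append(re.escape(delim) if not parts else delim)
            else:
                current.append(ch + nxt)  # leave other escapes for `re`
            i += 2
            continue
        if ch == delim:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if len(parts) == 1:
        parts.append("".join(current))  # tolerate a missing final separator
    if len(parts) != 2:
        raise ValueError("could not find both a pattern and a replacement")
    pattern, replacement = parts
    return re.sub(pattern, replacement, text, count=1)


# The escaped and the unescaped spellings perform the same edit.
print(apply_substitution(r"s/s\/his\/her/s\/his\/their/", "s/his/her"))
print(apply_substitution("s#s/his/her#s/his/their#", "s/his/her"))
```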


Did you just assume my regex engine is pcre gendered???


Wow, I've never heard this joke before. Original and well-applied to the situation.


> awkward escaping

Or as I've seen it called, "leaning toothpick syndrome".


The question is what exactly the new “feature” that got pushed, skipping canary, is.


NSA backdoor? <smirk>


Since so little time has passed since the last issue, I am wondering if it could be the same cause. Maybe they didn’t fix it properly the first time.


Or they're simply trying to roll out again the same thing that failed before.


It's got a similar flavor: that was identity management going down, this is "that email account doesn't exist".


I wonder if Gmail is just not a very well maintained codebase. Here's an issue where old emails just become inaccessible. Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.

https://support.google.com/mail/thread/6187016

Maybe time to switch to a more reliable provider.


> Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.

Did you try pulling them down using the API tester?: https://developers.google.com/gmail/api/reference/rest/v1/us...

Some of the internal formatting that Gmail uses has changed over the years, so more likely than not the API that parses the stored message for display in the Gmail UI is just throwing some kind of error.


I didn't but I did try Takeout and they weren't in it.

Either way my point is that this is a pretty serious bug and they haven't even acknowledged it! Not a good look.


I've never had issues over IMAP with old (decade-old) messages in Gmail.


Right but the version of an email message you download via IMAP is different than the version of an email message you see in the Gmail UI. That's my point, that the error is probably in the way Google is processing messages for Gmail, so you wouldn't see it in IMAP or via the API.


Yes, I’ve been hearing about this issue from non-technical friends too. An explanation of “X crashed” helps even if they don’t actually understand what X is. The fact that someone figured it out and knows is reassuring.


Uneducated speculation: some sort of security incident. Whenever there is a major security issue in the wild, one of the big providers tends to have a problem within a few days.

