Gmail, Google Maps and YouTube had outage issues (thenextweb.com)
126 points by pthomas551 5 months ago | 96 comments

I think it's a little premature to be calling this a "meltdown". It's like calling one block of houses with power fluctuations a "blackout".

Media outlets are in the business of making sensational claims.

Googlegate 2017. Cataclysmic meltdown where I had to wait a whole 10 more seconds for a video to load as it found another source.

I share the irony, but there's one nuance to bring: in our centralized web, sometimes there just is no such thing as "another source". Was trying, during that outage, to watch Radiohead's "Lift" video posted earlier today. Impossible, as it was posted today and only on YouTube, and all media coverage are YouTube iframe embeds.

Welcome to the recentralized web.

Yes. We toned down the overheated title above.

GSuite has status reports for things like this, and it tends to be a good way to confirm a Google service outage[0]. Direct link to the issue[1].

[0] https://www.google.com/appsstatus#hl=en&v=status

[1] https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=d4...

Kind of nice of them to show past results. AWS, for example, tends to "forget" past issues and even has problems admitting to current ones until they are very noticeable, making their status page quite useless.

400s for me.

Does not seem to affect Europe.

In fact, right now YouTube loads far quicker than it has for the last seven to ten days, where it would take ages to load any YouTube page.

A reboot helps. Maybe Google runs Windows. Then they just need to reboot more frequently, i.e. proactively.

Funnily enough, at a previous company where I worked, which used Linux exclusively, the software was very buggy, and the servers were rebooted daily. We jokingly called that the "Windows solution".

When we proposed adding a daily reboot to cron, the tech lead (who encouraged the practices that led to this low-quality software) retorted that "this is not Windows, it doesn't need constant reboots", totally missing the point that just using Linux doesn't make you a developer of reliable software.

I think that if you are around in this business long enough, you will eventually experience something like this.

We had a Linux server that had some issue with a network driver that was causing kernel oops with long uptimes (about once a week). This one was particularly nasty because the oops would disable the network interface, meaning that you couldn't ssh to diagnose the fault (or, more likely, reboot), which meant that you had to drive to the premises and kick the server in the guts.

Of course, Murphy's law ensured that the server would fail at the worst possible time. Late Friday, weekends, your mom's birthday, etc... Not fun, at all.

The solution was to write a cron job to reboot every day at ~4:30AM. Stupid? Yes. But we all agreed that it sure beat the alternative (driving, kicking, sobbing).

The driver was eventually fixed by the vendor, and this "hack" became unnecessary.

Sounds familiar. I once had a client who had to reboot one particular router every day because it stopped working after 8h of uptime. So one of their employees did that every morning by logging into some server via remote desktop to click the "reboot" button. I asked why they didn't use some kind of cron job to automate that task; they just said "it doesn't work that way, you have to do it manually".

Some people are anti-automation.

I am an economist, turned into "data scientist" since I've learned to program in the past 5 years (hate that name...).

At a macro consultancy firm where I worked, everybody lost it when I suggested that we move our manually downloaded data from a bunch of Excel spreadsheets to a proper database (28 years of macroeconomic data) so that we could programmatically extract data for the online reports we sent/hosted. They said I was being lazy...

I hope you took that as a compliment. Laziness is a virtue in programming, according to Larry Wall: http://threevirtues.com/

I had a fun issue for about 6 months with a cable (DOCSIS 3.0) modem I'd purchased, a Motorola/Arris Surfboard 6183.

The modem would randomly seem to keel over with some unknown fault, causing my internet speed to drop from 300Mbps to 0.25Mbps. Ping to Google.com, for instance, would then spike from 5ms to 1900ms (or more).

Curiously the upload speed would stay pegged at 30Mbps however!

After a few days of this happening, I picked up a Chinese "Smart Switch" that ran OpenWRT, and set up a small shell script to simply ping Google.com, then cycle the modem if the average ping results exceeded a certain threshold (I think 100ms?)

It would also record the exact date and time and log that, so I could try and correlate the issue. Unfortunately it seemed to be utterly random, without any rhyme or reason.
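A minimal sketch of that watchdog logic, in Python rather than the original shell script (which isn't shown here). The 100 ms threshold comes from the description above; the function names, the regex for ping output, and the log format are assumptions for illustration, and the actual power-cycling of the switch is left out.

```python
import re
from datetime import datetime
from statistics import mean
from typing import List

# Matches the "time=12.3 ms" field printed by common ping implementations.
PING_TIME_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_ping_times(ping_output: str) -> List[float]:
    """Extract round-trip times (in ms) from raw ping output."""
    return [float(m) for m in PING_TIME_RE.findall(ping_output)]


def should_cycle_modem(latencies_ms: List[float], threshold_ms: float = 100.0) -> bool:
    """Return True when the average latency exceeds the threshold.

    Getting no replies at all is treated as a hang too (an assumption).
    """
    if not latencies_ms:
        return True
    return mean(latencies_ms) > threshold_ms


def log_line(latencies_ms: List[float], when: datetime) -> str:
    """Format a timestamped record so hangs can be correlated later."""
    avg = mean(latencies_ms) if latencies_ms else float("nan")
    return f"{when.isoformat()} avg_ms={avg:.1f} cycle={should_cycle_modem(latencies_ms)}"
```

Run from cron every minute or so, something like this would feed the output of a short `ping` burst into `parse_ping_times`, append `log_line(...)` to a file, and toggle the modem's power only when `should_cycle_modem` says so.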

Since I worked at Comcast at the time, I tried to gather more data on the issue internally, eventually writing a report that totaled around 10 pages.

From what I gathered: There were no physical signal deviations when the device would "hang". The device would respond normally to SNMP requests etc, everything on Comcast's side appeared normal. The device had some internal fault with its software that was causing problems (kernel bug perhaps?)

I contacted Motorola/Arris for support, and was advised that the warranty specifically excludes Software faults(!) and then kindly recommended to "upgrade" to the newer SB6190 model.

Unfortunately, being a cable modem, the firmware is completely controlled by the ISP. Since there were only ~25,000 SB6183s on Comcast's network at the time, and even fewer on the speed tier that I had, there was not enough data to report the issue back to Motorola/Arris through Comcast.

Eventually a software update was pushed out which corrected the issue, roughly 6 months later.

I used to have a service that was like that, it was third party and only handled a specific function so I had a cron job that would reboot it at 4am every day.

Sometimes the hammer approach is the only solution.

Was the software loading kernel modules? I'm curious why a simpler kill-process routine wouldn't work.

Having been in a similar situation: a lot of enterprise software requires a precise startup and shutdown order, sometimes even depending on certain system services or some arcane stuff that then breaks at unpredictable times, and often the only sane way to restart the environment is to bring the whole thing down.

Exactly this, it was a horrible piece of enterprise software that depended on other similarly horrible pieces of crap.

The only way to get it into a known good state when it shit the bed was to reboot the machine and then let it sort itself out.

We rapidly moved away from using it.

Funnily enough, I'm in a similar situation at my new job with Jasper Reports; god damn if that thing isn't everything bad about Enterprise Java(TM).

> "this is not Windows, it doesn't need constant reboots", totally missing the point that just using Linux doesn't make you a developer of reliable software.

Yes... but no software should require the Linux OS to reboot unless you're running custom or known buggy kernel modules, or you've triggered a spiral of death through swap usage. If your program is misbehaving, kill the program and restart it.

Restarting the OS daily is like noticing that your car uses a lot of gas, and deciding that every time you fill it up you'll get an oil change too. You might need to fill up a lot, but the oil change is overkill and not really affecting the situation in one way or another.

Am I an idealist for wondering why people think adding a cron job is a good solution? Why not try to fix the broken software?

Sometimes it's just not possible or reasonably doable. As seen above, with proprietary drivers...

Jaded developer here: a cronjob `pkill -9 foo` is cheaper than debugging a memory leak for a lot of businesses. Let the process supervisor handle the process death.


There is a common practice for Node.js developers to use one of those tools that automatically reboot the site either whenever the single-threaded-event-loop-server crashes, or periodically...

While this pattern is common in node, it's not node specific. Erlang/OTP uses supervised process trees, Linux has supervisord to support this pattern, and the general idea is called "crash-only" software [1], among other names.

[1] https://en.m.wikipedia.org/wiki/Crash-only_software
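For illustration, the core of that pattern fits in a few lines. This is a toy sketch, not how supervisord or Erlang/OTP are actually implemented; the function name `supervise` and its parameters are made up for the example.

```python
import subprocess
import sys
import time
from typing import List


def supervise(cmd: List[str], max_restarts: int = 3, backoff_s: float = 0.0) -> int:
    """Run cmd, restarting it whenever it exits with a non-zero status.

    Returns the number of times the command was started. A clean exit
    (status 0) or exhausting max_restarts ends supervision.
    """
    starts = 0
    while True:
        starts += 1
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return starts        # child finished cleanly, nothing to do
        if starts > max_restarts:
            return starts        # give up instead of restarting forever
        time.sleep(backoff_s)    # brief pause before bringing it back up
```

Calling `supervise([sys.executable, "-c", "raise SystemExit(1)"], max_restarts=2)` starts the always-crashing child three times (one initial run plus two restarts) and then gives up. Real supervisors add the parts that matter in production: restart-rate limits, logging, and signal handling.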

Assuming you don't mean full "reboot", but reloading/restarting the app runtime every X requests, that's not unusual in many languages. Helps against memory leaks (which are not necessarily the fault of the application and thus not always fixable)

Our site was using hosted libraries, Google Fonts, and Google Analytics. All of which seemed to be behind captchas, throwing CORS errors, and 503ing since this morning. Swapped out the jQuery CDN for now.

For something like jQuery, you can host locally and fall back to it if the CDN fails.


Loading JavaScript libraries synchronously should be avoided if possible, making the above solution not a great one.

Until Google goes down. Which makes it a really great one.

Is there a better/different way to handle this type of fallback?

Yes. What you should do is use an asynchronous module loader. There are many small standalone ones like loadjs [1]. But the more widely used tools such as webpack also support this as code splitting [2].

In general you want to avoid sync loads of JS assets because, depending on how the server serving the asset fails, it can cause the webpage to hang as well. For example, if the server responds with a 404 right away then there are no problems. But if the server does not respond and leaves the connection open, the browser will just wait the max time.

[1] https://github.com/muicss/loadjs/blob/master/README.md

[2] https://webpack.js.org/guides/code-splitting/
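The hang described above comes down to a missing timeout. The sketch below is not browser code; it only illustrates the idea in Python, with a hypothetical `load_first_available` helper and an injected `fetch` callable, so that a silent server costs a bounded amount of time before the next source is tried.

```python
from typing import Callable, Iterable, Optional, Tuple


def load_first_available(
    sources: Iterable[str],
    fetch: Callable[[str, float], bytes],
    timeout_s: float = 2.0,
) -> Optional[Tuple[str, bytes]]:
    """Try each source in order and return (url, body) from the first that works.

    fetch must raise on failure, including when timeout_s elapses, so a
    server that never answers costs at most timeout_s per source.
    """
    for url in sources:
        try:
            return url, fetch(url, timeout_s)
        except Exception:
            # Fast failure (e.g. a 404) and slow failure (a timeout) are
            # handled the same way: move on to the next source.
            continue
    return None
```

A real `fetch` could wrap `urllib.request.urlopen(url, timeout=timeout_s)`. In the browser, the asynchronous loaders mentioned above play the same role, with the added benefit that the page keeps rendering while the request is pending.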

I'm surprised Content-MD5 or a similar spec isn't used by browsers to avoid a web where only Google's hosted solution allows for efficient JS file caching. If you know two files are most likely equal by filename and checksum, you should be able to just load the cached version; if loading the cached file produces too many errors, try downloading the new one or something, instead of forcing everyone to host it all under the same corp (in this case Google). Oh well.

damn, never thought to do this. thanks for the heads up.

Having the same problem here! Glad it wasn't just me...

Is there a good reason to use these things hosted by a third-party source? Libraries are tiny, the fonts can be downloaded from Google Fonts and embedded locally, etc. Even the Google Analytics JS script, I presume, can be stored and run locally.

Shouldn't a goal be to mitigate the number of possible failures which can bring down your site by reducing the number of single points of failure?

2 reasons:

1. If you're still using HTTP 1.x, sharding assets across origins lets the browser load them in parallel (if set up correctly). You can generally load just 6 assets in parallel per origin, and sharding is a way to get around that limit.

2. A library like jQuery is so popular, and is so often served from Google's CDN, that chances are a user already has it in their local cache from when they downloaded it on some other site.

That said, yes - the downside is more surface area that might go down.
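The arithmetic behind point 1 can be sketched roughly as follows. The limit of 6 comes from the comment above; the function name is made up, and the model is deliberately crude (equally sized assets, spread evenly across origins, no HTTP/2).

```python
import math


def download_rounds(asset_count: int, origins: int = 1, per_origin_limit: int = 6) -> int:
    """Rough number of sequential 'waves' needed to fetch all assets.

    Assumes equally sized assets spread evenly across origins, which
    real pages never quite satisfy.
    """
    parallel_slots = origins * per_origin_limit
    return math.ceil(asset_count / parallel_slots)
```

Under those assumptions, 30 assets on one origin need 5 waves, while the same 30 assets sharded across 3 origins need 2.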

> 2. A library like jQuery is so popular, and is so often served from Google's CDN, that chances are a user already has it in their local cache from when they downloaded it on some other site.

Which of these versions do you have cached?

3.2.1, 3.2.0, 3.1.1, 3.1.0, 3.0.0, 2.2.4, 2.2.3, 2.2.2, 2.2.1, 2.2.0, 2.1.4, 2.1.3, 2.1.1, 2.1.0, 2.0.3, 2.0.2, 2.0.1, 2.0.0, 1.12.4, 1.12.3, 1.12.2, 1.12.1, 1.12.0, 1.11.3, 1.11.2, 1.11.1, 1.11.0, 1.10.2, 1.10.1, 1.10.0, 1.9.1, 1.9.0, 1.8.3, 1.8.2, 1.8.1, 1.8.0, 1.7.2, 1.7.1, 1.7.0, 1.6.4, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.4.4, 1.4.3, 1.4.2, 1.4.1, 1.4.0, 1.3.2, 1.3.1, 1.3.0, 1.2.6, 1.2.3

Asking an annoyed rhetorical question doesn't seem productive to the point you're trying to make here.

As an actual answer, it would be variable proportional to the size of the window between releases mentioned here: https://en.wikipedia.org/wiki/JQuery#Release_history

I'm sure a fair number of people serve jQuery from local storage. The chance that the user already has it cached is non-zero, no matter how insignificant you may think it is.

Just two versions of jQuery, 1.12 and 1.11, represent approximately half of all jQuery versions in use. The top four most common versions (1.12, 1.11, 1.7, 1.8) represent close to 3/4 of all versions of jQuery in use. Version 3.x and 2.x are hardly being used by comparison to those.

That narrows your suggested problem down dramatically.

How many do you need to make this worthwhile?

The cost of storing a library is tiny relative to the cost of GETing it, so even a very low chance of already having the library cached can make the EV worthwhile.

Weighing that against the chance of downtime is a bit more complicated, admittedly.

Looking at just minified versions from googleapis.com, I have 1.12.4, 2.2.4, 1.8.2, 1.10.2, 1.8.1, 1.9.1, and 2.1.1

> You can generally load just 6 assets in parallel per origin

This seems to only apply to Chrome, whereas Firefox will happily download everything as fast as possible.

I know this because I fixed a bug recently where chrome was taking so long to download images that other resources on the page were timing out. No problem in Firefox.

Last I checked all the browsers had limits [1] when it came to HTTP 1.X.

[1] http://blog.olamisan.com/max-parallel-http-connections-in-a-...

The limits in Firefox must be really high then, it was almost funny looking at the difference in behaviour between the two browsers in the situation I was testing.

3. Now for 1/3 of the pages I visit I need to make a request to Google because someone wanted to use 5 lines of JS from jQuery.

> If you're still using HTTP 1.x

Not sure why it was phrased that way but...isn't everybody?

I know that HTTP/2 is released and browsers support it, but I'm fairly certain that next to nobody is actually doing anything with it.

Google, FB, Youtube, Wikipedia, and HN :) use HTTP2.

I also use HTTP2 at work, and on every personal project. It's supported by every browser [1], and comes with a slew of benefits. It's usually trivial to set up, if you want to give it a shot.

[1] http://caniuse.com/#feat=http2

It's getting rolled out in a lot of places. I think adoption is somewhere around 17% depending on how it's measured. I started using it for my personal website and it was really easy, the only part I missed was setting a cron job to reconfigure the web server whenever cert renewal went through.

One reason is that by linking to external libraries your browser most likely has them cached. At least it was the case a few years ago when I was doing web dev.

I see this argument a lot but I don't think it's necessarily a good one. If a third party CDN goes down, your site is down.

A few extra ms in initial download isn't so bad compared to having your site be completely inaccessible for reasons outside your control.

When a CDN goes down, that is the time to use local backups. The best policy I've seen is using 3rd party CDNs for scripts like jQuery, i.e. common, unchanging resources, yet with subresource integrity checks [0] to make sure it's actually what you are asking for (in case the site somehow gets compromised or returns an error). On top of that, to cover for failure, you have a locally hosted fallback copy which is loaded only if the CDN version fails.

[0] https://developer.mozilla.org/en-US/docs/Web/Security/Subres...
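For reference, an integrity value is just the hash algorithm name, a dash, and the base64-encoded digest of the file. A small sketch of generating and checking one in Python follows; the helper names are made up, and it handles a single hash, whereas the attribute may list several.

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI integrity value such as 'sha384-<base64 digest>'."""
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def matches_integrity(content: bytes, integrity: str) -> bool:
    """Check content against a single integrity value."""
    algorithm, _, _expected = integrity.partition("-")
    return sri_hash(content, algorithm) == integrity
```

The resulting string goes into the `integrity` attribute of the `<script>` tag (together with `crossorigin="anonymous"` for a third-party CDN). If the CDN serves anything other than the exact bytes that were hashed, the browser refuses to run it, and that is the point where the local fallback takes over.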

Would those resource checks be done at the client or the server? For example, the server pings the CDN at a set interval and, if it's down, the site's code is modified to include links to local copies.

Or would you do something in the browser to fetch the local one in case of failure?

Why doesn't having the resources cached provide a buffer against short-term outages of the CDN?

Because normal Cache-Control is only aimed at reducing the amount of data transferred. With newer immutable Cache-Control[0], a CDN going down wouldn't have an effect if you have the resources cached.

[0]: https://hacks.mozilla.org/2017/01/using-immutable-caching-to...

The browser still has to make a request to the CDN to get back an HTTP 304. The goal is to avoid downloading a potentially large payload, not be resilient against connectivity issues.

Self hosting will probably cause more downtime than if you are using a decent CDN. CDN is just one of many points of failure, I would expect there's a fine balancing act where you could achieve benefits of both.

By definition, if your site is down, you don't need your assets.

Then why are these sites down?

This is most likely never the case, because there are way too many versions of each library.

Yes, Google can change the logic for auth, analytics, etc. at any time and your outdated local copy will be useless. Further, I believe it's possible that Google returns different JS depending on the browser that's requesting it in order to keep payloads down and performance high.

Our GA is set up through GTM now, which is why it's not hardcoded into our head. Really the most important gain we get from hosted libraries is caching. Google-hosted libs are pretty widely used and distributed, so any user who has already hit one can use their cached version instead of sending out another request.

IIRC Google Fonts aren't easy to download because they vary by user agent.

You can download a zip archive of TTFs using the customize tab on the fonts site. Or go directly to the source, and get the git repo.

Works for me. Either quickly down-and-up or isolated to certain users.

I think google is pretty good about engineering no global single points of failure; all updates are rolled out to only a fraction of users/machines at a time, etc.

Refresh google.com without caching. Logo is missing. There's something you don't see everyday.

Edit: Everything working fine for me again.

Clearly a malicious attack by Apple before the iPhone X announcement later </s>

Maybe too many people googling how to watch t or for the latest news?

Or maybe Samsung's announcement of a 'fold out' phone?

Or the Ted Cruz news?

It's a busy morning.

I don't think I've ever read a news story about a website/app being down where it was still offline by the time the news story found me.

Even the jQuery CDN is down intermittently for us here in Chicago. Sometimes CSS for core apps like Calendar is not loading, either. Definitely something amiss.

I was noticing higher latency than normal for static assets such as fonts earlier today here in UK, but nothing was down, just responding slower than normal.

Had about 30 minutes of downtime (UK) - was completely down on one account, and intermittent on another. So guess it wasn't affecting everyone.

Looks like services are back up as of 7 minutes ago

9/12/2017 @ 10:27 AM +MST (Time services reported back up according to status page)

Lets just hope that remains stable. That can cause a whole heck of a lot more problems if it goes down.

> Lets just hope that remains stable.

That's why the smart money defaults to

I'm confused, (and are all owned by google, so I'm assuming a outage could also have the others go out?

And 2001:4860:4860::8888, 2001:4860:4860::8844 too…

Works fine for me in the UK. Must be another of those Google issues that only affect a few % of users.

I'm guessing whatever happened has been fixed, because all of those services are working for me.

Gmail and YouTube are working fine in NC. Maybe they're getting things sorted out now.

Gmail is working fine on east coast.

I did hear of YouTube having 503 errors.

East coast, been on YouTube all day. Didn't notice anything. /shrug

I am in the same boat. No problems with any Google products today.

I haven't had any service disruption with Gmail or Youtube in VA.

App Engine / US Central + East working fine.

Gmail and gsuite working here in NYC

Working fine in SLC

no longer an issue

I was just about to make a snarky remark about the heatmap, referencing that XKCD about heatmaps that just mirror population densities, and then I noticed that the West Coast is barely affected. That's quite strange.

Updated to say AWS and GitHub also affected.

North Korean cyber attack anyone?

Yes, Donald Trump is strategizing a retaliation with Jeff Bezos and other allies at GitHub as we speak.

Cannot upvote this enough.

They were probably aiming at some IT system at Hyundai and they hired the same people to do routing software as they do to do rocket guidance.
