

How a load-balancing bug led to worldwide Chrome crashes - jpdus
http://code.google.com/p/chromium/issues/detail?id=165171#c27

======
hammock
The last discussion was mostly about how the title was linkbait. I want to
hear people's opinions on whether it's appropriate for a browser (Chrome) to
be designed such that it doesn't operate independently, so that it can be
crashed (or made to self-destruct, insert your own word here) by a remote
server at any time.

To my knowledge, Firefox doesn't do that. Safari doesn't do that. Web browsers
are probably the most important app on a computer these days, so browser
reliability is vital.

~~~
DannyBee
Chrome Sync is, AFAIK, not a push service. Something polled a Google server,
it returned a bad answer, it crashed the browser. Why is this important?

Because it's entirely possible that Firefox or Safari, for example, could have
been crashed by contacting the safebrowsing server, and the safebrowsing
server returning an answer that crashes it.

Firefox also does remote Firefox update checks, plugin update checks, etc.

None of the browsers you mention are "independent" of internet servers
anymore. They are meant to function independently, as is Chrome, but exactly
the right remote bug could likely crash all of them.

~~~
lmkg
Why is it possible to receive a response that crashes the browser? I'll allow
(with reservations) your premise that browsers aren't so independent anymore.
But it seems to me that input from an external source (even trusted) ought to
be validated, and malformed input should raise a warning or something. The
fact that crashing is a possible behavior upon receiving unexpected input is
surprising to me.

~~~
polyfractal
Well that's the point of a bug, isn't it? Chrome didn't handle malformed input
appropriately and crashed because of it.

It wasn't a command that shut down Chrome, it was just poor handling of an
edge-case which resulted in unexpected behavior (crash).

It's a sad truth that most programs will explode if you fling garbage at them.
When push comes to shove, many development timelines don't have room to
bulletproof against everything.

------
rachelbythebay
Hmm, a load balancing bug which eats several production services for lunch and
has a bunch of second-order effects. I wonder if it involves running a script
and pushing the output without looking at a "diff" view to see what changed,
and they managed to push a config which sent all of the world's traffic to one
location.

It seems like just the other day when I was thinking about this very thing.
<http://rachelbythebay.com/w/2012/11/19/lb/>

------
hosay123
Previous discussion (perplexingly marked dead):
<http://news.ycombinator.com/item?id=4904125>

~~~
mattmanser
Worryingly marked [dead] in fact, how did it end up dead?

It might not be the greatest article, but it highlights a very real and new
point of failure that cloud apps are introducing. I was hoping for a good
discussion to check out later.

~~~
hosay123
Never ascribe to malice that which is adequately explained by incompetence..
:) HN's code is weird in places, e.g. it's possible to accidentally kill your
own posts by double-clicking submit (I can't remember why it does this, but
there was a sensible-ish explanation)

~~~
jQueryIsAwesome
Yep, the same thing happens with comments (found out thanks to a mouse
malfunction).

------
mintplant
I'm surprised at the level of hyperbole here on this thread.

------
andrewcooke
on the server side, it looks like the problem could have been avoided with
better types - it seems that there was a confusion between status values that
can include 0 and those that cannot (alternatively, perhaps better, there was
no status for the case where the status was undefined?) and then a
hand-written assertion that a particular case could not happen (and so was not
tested for).

the bug report describes all that, roughly (if i've understood) but doesn't
seem to be worried about the higher level issues - the inconsistent types and
need for fragile human assertions about type logic.

(not java bashing - don't see why this couldn't be solved in java)

but i guess this is just a bug report. for an outage like this i suppose
there's going to be a major review? is that all internal? would be interesting
to watch.

~~~
irahul
> (not java bashing - don't see why this couldn't be solved in java)

Chromium is C++. The code diffs in the bug report are C++. Where does Java
come into the picture?

~~~
andrewcooke
oh, sorry - didn't look at code, assumed java from some comment. but either
way, same point - when you talk about solving problems with types, people tend
to assume you're thinking of haskell or similar, but just because you're not
using some super-cool functional language doesn't mean you shouldn't demand
all you can get from your type system.

(ps i was talking about the code on the server side; not chromium)

~~~
irahul
> (ps i was talking about the code on the server side; not chromium)

Oh I see.

> on the server side, it looks like the problem could have been avoided with
> better types - it seems that there was a confusion between status values that
> can include 0 and those that cannot (alternatively, perhaps better, there
> was no status for the case where the status was undefined?)

From what I understood, they are talking about protocol buffer types. The sync
server sent a message telling Chromium to throttle all types, say A, B, and C.
Chromium didn't know about type C, so some code returned 0 for the unspecified
type. Another piece of code calculated an index from that return value, and
because of the 0, a negative index was accessed in the bitset, leading to an
out-of-bounds exception.

The issue is that the server sent all types to clients rather than only the
types known to each client, and the client didn't gracefully handle unknown
types. I don't think a better type system would have helped the server - that
looks like a logic bug, not a typing bug.
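
To make the failure mode concrete, here is a minimal sketch of the pattern as
I read it. It is not the actual Chromium code: the enum, the wire IDs (101,
102) and all function names are made up for illustration. The point is only
that an unknown type collapses to 0, and an unchecked "type minus first real
type" then yields -1 as a bitset index.

```cpp
#include <bitset>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model types; 0 is reserved for "unspecified/unknown".
enum ModelType { UNSPECIFIED = 0, TYPE_A = 1, TYPE_B = 2, MODEL_TYPE_COUNT = 3 };
constexpr int kFirstRealType = TYPE_A;
constexpr std::size_t kNumRealTypes = MODEL_TYPE_COUNT - kFirstRealType;

// Maps a (made-up) wire ID from the server to a local type. IDs this client
// does not know, such as a newer "type C", come back as UNSPECIFIED.
ModelType FromWireId(int wire_id) {
  switch (wire_id) {
    case 101: return TYPE_A;
    case 102: return TYPE_B;
    default:  return UNSPECIFIED;
  }
}

// Returns the bitset index for a type, or nullopt when the type has none.
// An unchecked version would compute `type - kFirstRealType`, which is -1
// for UNSPECIFIED.
std::optional<std::size_t> IndexFor(ModelType type) {
  if (type < kFirstRealType || type >= MODEL_TYPE_COUNT) return std::nullopt;
  return static_cast<std::size_t>(type - kFirstRealType);
}

// Builds the set of throttled types, skipping anything the client
// does not recognise instead of indexing with it.
std::bitset<kNumRealTypes> ThrottledTypes(const std::vector<int>& wire_ids) {
  std::bitset<kNumRealTypes> throttled;
  for (int id : wire_ids) {
    if (auto index = IndexFor(FromWireId(id))) throttled.set(*index);
  }
  return throttled;
}
```

With this shape, a message listing A, B and an unknown C would throttle A and
B and quietly ignore C, which is the "graceful handling" the client lacked.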

~~~
andrewcooke
what i was trying to say is that you can (often) push this kind of logic bug
into the type system - by having, if you like, different types of statuses, or
a special type for unknown values (a "maybe status").

 _if_ you can do that then you can get the "infallible" compiler to provide
the "this branch will not be executed" logic. but there may be efficiency
trade-offs, or it may be so directly tied to low level protocols that it is
impossible.
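
a rough sketch of the "maybe status" idea in C++17, assuming a made-up
`Status` enum and made-up wire values (1, 2, 3); nothing here comes from the
real server code. the unknown case lives outside the enum, in
`std::optional`, so there is no in-band 0 that can pass for a real status.
C++ will not stop you from dereferencing an empty optional, so this is weaker
than what a stricter type system gives you, but the function signatures still
make the unknown case visible.

```cpp
#include <optional>
#include <string>

// Hypothetical status type: every enumerator is a real, known status.
enum class Status { kOk, kThrottled, kError };

// Parsing yields "maybe a status": nullopt stands for an unrecognised value.
std::optional<Status> ParseStatus(int wire_value) {
  switch (wire_value) {
    case 1: return Status::kOk;
    case 2: return Status::kThrottled;
    case 3: return Status::kError;
    default: return std::nullopt;
  }
}

// Code that needs a real status takes `Status`, not `optional<Status>`, so
// the caller has to deal with the unknown case before getting here.
std::string Describe(Status status) {
  switch (status) {
    case Status::kOk: return "ok";
    case Status::kThrottled: return "throttled";
    case Status::kError: return "error";
  }
  return "unreachable";  // Only hit if an out-of-range value is cast in.
}

// The one place where "unknown" is handled explicitly.
std::string DescribeWire(int wire_value) {
  const std::optional<Status> status = ParseStatus(wire_value);
  return status ? Describe(*status) : "unknown (ignored)";
}
```

here `DescribeWire(0)` takes the explicit "unknown" path rather than flowing
into code that assumes a valid status.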

five or ten years ago i would not have thought of the problem in this way - i
would have agreed with you (and the bug report) that it is just logic. but
slowly i am starting to learn to rely more on types. but i don't know enough
here to suggest details...

------
kintamanimatt
Quick question for those that experienced it: did it take down the whole
Chrome process, or just a single tab? I'd be very displeased if that happened
again and I lost work as a result.

~~~
paulgb
It took down the whole Chrome process.

~~~
kintamanimatt
That's grim and completely unacceptable!

~~~
lttlrck
Switch browsers

or

Contribute to Chrome development.

~~~
mh-
you can't contribute to Chrome development unless you work at Google, fwiw.

------
DigitalSea
Bugs are a part of software life. Considering the great track record of Google
Chrome, I'm not concerned and the issue was resolved pretty quickly. Even
Google developers make mistakes as do the rest of us.

------
oscargrouch
I'm pretty uncomfortable with my browser having "hardcoded" code that connects
me to one walled cloud, be it Google, Microsoft or Apple.

Where is the choice? It looks like these days using any software or hardware
from the tech giants means being their pet.

~~~
philfreo
Perhaps you should use WebKit or Chromium as your browser then.

~~~
pcwalton
Chromium has the same code.

~~~
philfreo
Chromium has Chrome Sync??

------
macrael
I don't understand why the crash happened when visiting Gmail if the bug was
in Chrome Sync.

------
RobAley
Currently getting a "500. That’s an error." page. Seems like even the bug has
crashed too...!

------
nfm
Had significant issues with the Chrome Web Store yesterday - images not
loading from some content servers, extensions failing to install ('The
extension file was not a CRX'). I guess this was the cause.

~~~
nfm
"That quota service experienced traffic problems today due to a faulty load
balancing configuration change. That change was to a core piece of
infrastructure that many services at Google depend on. This means other
services may have been affected at the same time, leading to the confounding
original title of this bug."

Why the downvote?

