
Love Your Bugs - mpalme
http://akaptur.com/blog/2017/11/12/love-your-bugs/
======
drostie
If you like bug postmortems like this you may also enjoy the Jepsen
presentations by @aphyr. A video summary is available here:

[https://www.youtube.com/watch?v=eSaFVX4izsQ](https://www.youtube.com/watch?v=eSaFVX4izsQ)

and if you want to get into the weeds with any of them these are largely
published publicly on aphyr.com; see for example:

[http://jepsen.io/analyses](http://jepsen.io/analyses)

There's a lot of "oh, here's how this dirty read from the underlying system
became a much bigger bug in the system we built on top of it!" but what I like
most about Kyle is that he is generally pretty great about having an attitude
of "these are actually really hard problems and it's not surprising that there
are implementation bugs when you let me mess with the clocks and cut off nodes
and whatnot."

~~~
danso
Also, u/luu started a list of interesting bug post-mortems. It hasn't been
updated in awhile but contains a lot of the classics (many which have made the
rounds on HN):

[https://github.com/danluu/debugging-stories](https://github.com/danluu/debugging-stories)

------
aetherspawn
I really doubt they were legitimate bit flips and not just software bugs, to
be honest.

Seeing multiple flips on the same piece of data sounds like a typical
threading bug, not random chance.

~~~
jmharvey
Let's run through the likelihood of seeing multiple flips on the same piece of
data.

Google's research [1] finds a DRAM error rate of "25,000 to 70,000 errors per
billion device hours per Mbit" on hardware in "modern compute clusters." If
there are 100 million Dropbox clients out there, Dropbox clients should
encounter 2,500 to 7,000 errors per Mbit per hour, though factoring in the
"low-end or old hardware" that many Dropbox clients are running on, the error
rate is probably somewhat higher. For the sake of making the math simple, call
it 10k errors per Mbit per hour, or 1 error per 100 bit-hours. So a given bit
should flip on some user's machine on the order of once a week. That seems
pretty firmly in the range of "sometimes we see these weird errors that we
don't really understand," especially if you multiply by the same error
potentially coming from different parts of the program (so "the same piece of
data" is really "a few pieces of data that get collapsed together for purposes
of analysis").
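The arithmetic above can be sketched as a quick back-of-envelope (the 100
million client count and the rounded 10k errors/Mbit/hour figure are the same
assumptions used above, not measured numbers):

```python
# Back-of-envelope for DRAM bit flips across a large client fleet.
# All figures are order-of-magnitude only.

errors_per_billion_device_hours_per_mbit = 70_000  # Google's upper bound
clients = 100_000_000                              # assumed fleet size

# Fleet-wide errors per Mbit per hour:
fleet_errors_per_mbit_hour = (
    errors_per_billion_device_hours_per_mbit * clients / 1_000_000_000
)
print(fleet_errors_per_mbit_hour)  # 7000.0

# Rounding up to 10k errors/Mbit/hour to account for old hardware, and with
# ~1e6 bits in a Mbit, a given bit flips on *some* machine about every:
errors_per_mbit_hour = 10_000
hours_per_flip_of_a_given_bit = 1_000_000 / errors_per_mbit_hour
print(hours_per_flip_of_a_given_bit)  # 100.0 hours, i.e. roughly once a week
```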

Your intuition that a "typical threading bug" is much more likely than a
random bit flip is spot on, but that actually works in favor of the "random
bit flip" thesis. On the Dropbox scale, a threading bug/race condition would
typically show up as a significant, persistent issue, several orders of
magnitude more common than the random oddball errors described in the article.

[1]

~~~
aetherspawn
Citation is missing. Thanks for doing the math, very interesting, but I'm
still very skeptical! The paper says they observed error rates "orders of
magnitude higher than previously reported", so should we expect 1 error per
week, or 1 error per 10 or 100 weeks?

~~~
jmharvey
oops:
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf)

------
teddyh
> _Bitflips are real!_

And yet some people still claim that they don’t need ECC memory.

------
StavrosK
Good article, although I'm a bit surprised at the difficulty the author
describes in rolling back the log bug. Why not simply serve 500s to 95% of
your clients? That way, you get logs sent to you gradually.

~~~
colonelxc
I think they expected it wouldn't have much effect, since the client was also
updated to delete the old (corrupted) log. Because logs were always deleted
after a successful upload, and uploads started with the oldest log, the
corrupted log would necessarily become the oldest.

By the time the DDoS was in effect, the corrupted logs had been deleted by the
client. They would now always succeed (even with the old server code, or old
client code) until they got a new corrupted log.
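The queue dynamic being described can be sketched as a toy model (`drain_logs`
and the fake `upload` callback are hypothetical stand-ins for the client's
real machinery, just to show why a corrupted oldest log blocks everything
behind it):

```python
from collections import deque

def drain_logs(logs: deque, upload) -> None:
    """Upload logs oldest-first, deleting each only after a successful upload.

    If the oldest log is corrupted and the server rejects it, nothing queued
    behind it is ever sent: the bad log permanently heads the queue.
    """
    while logs:
        oldest = logs[0]
        if not upload(oldest):   # server rejected (e.g. corrupted log)
            break                # retry later; queue is now stuck
        logs.popleft()           # delete only after success

# Toy run: a corrupted log blocks everything queued behind it.
queue = deque(["good-1", "CORRUPT", "good-2"])
drain_logs(queue, upload=lambda log: log != "CORRUPT")
print(list(queue))  # ['CORRUPT', 'good-2']
```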

~~~
StavrosK
Yes, I'm saying that they could have just served 500s to clients (even ones
with regular log files), which would have backed off and retried later.
Essentially what the "chillout" method does too, but it doesn't sound like the
author had considered it.
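Probabilistic shedding like that is only a few lines server-side. A
hypothetical sketch (not Dropbox's actual code; the 95% figure is just the
number from my comment above):

```python
import random

SHED_FRACTION = 0.95  # reject ~95% of uploads so clients back off and retry

def handle_log_upload(request, process):
    """Randomly reject most uploads, spreading the retry herd out over time."""
    if random.random() < SHED_FRACTION:
        return 500, "try again later"   # client backs off and retries
    return process(request)             # the lucky ~5% get through
```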

~~~
smaddox
I'm guessing the answer is that inserting a feature to serve 500 to a fraction
of the incoming requests would have required a similar amount of effort as
just enabling the chillout feature.

------
age_bronze
I don't agree with all the "growth" and "fixed" categorization. Intelligence
is fixed; that's pretty much a fact (see the research on IQ, etc.). What
isn't fixed is your state and knowledge in the process of solving a problem.
That has nothing to do with intelligence.

This is the equivalent of talking about car engines and how some can reach
higher speeds than others, when at the end of the day you can probably reach
your destination even with an old engine that can only drive and accelerate
relatively slowly. It didn't magically improve/grow so you could reach your
destination; you reached it because you kept the car moving.

------
henrik_w
My take on why bugs are good for us (mostly because we learn from them):

[https://henrikwarne.com/2012/10/21/4-reasons-why-bugs-are-good-for-you/](https://henrikwarne.com/2012/10/21/4-reasons-why-bugs-are-good-for-you/)

