
Debugging a Docker Heisenbug in production - loginoff
https://medium.com/@loginoff/debugging-a-docker-heisenbug-in-production-586ccb265f7c#.2aukp4th0
======
StavrosK
My favorite Heisenbug was a few years ago, at my current job with Silent
Circle. There was a bug where, when a user tried to buy a new phone number,
they would be shown around 12 messages saying "you can't change your number".
I saw screenshots, but there was no way that the server would show more than
one message, and there was no way to reproduce this reliably (or anywhere
other than production).

I looked at the code, and it was straightforward: the "add number" view had a
check for "ability to add number", and so did the homepage. The homepage would
redirect you to the "add number" view if you could have a number but _didn't_,
and the "add number" view would redirect you to the homepage if you _couldn't_
have a number, with the message "you can't change your number", but there was
no way for this to happen more than once (the checks are complementary).

Because this is a privacy-focused service, logs were minimal, basically only
the path of the request, the time and the backend that served it. We managed
to at least see the path the bug took, and, indeed, it was bounced multiple
times between the two pages.

There was no clue anywhere: the pages used complementary checks on data from
the session, the session was stored in a central Redis cache, all workers were
running the same code, everything.

The bug remained elusive until I noticed that the requests would be served
first by worker 1, then by worker 2, then by 1 again, etc., until the cycle
broke when a request for page 1 was served by worker 2. This could only mean
that the workers couldn't agree on the check, and, sure enough, the
configuration on one of the hosts had mistakenly pointed the cache to local
memory rather than Redis.
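The failure mode can be sketched in a few lines of Python. This is a
hypothetical simulation (all names invented, with plain dicts standing in for
Redis and local memory), not the actual service code: both workers are meant to
read session data from a shared cache, but one is accidentally pointed at its
own local store, so the two complementary checks disagree and the user bounces
between the pages.

```python
shared_cache = {}          # stands in for the central Redis instance
worker2_local_cache = {}   # the misconfigured worker's private memory

class Worker:
    """A web worker that consults a session cache for the redirect check."""

    def __init__(self, cache):
        self.cache = cache

    def can_add_number(self, session_id):
        # Complementary check used by both the homepage and "add number" views.
        return self.cache.get(session_id, {}).get("can_add_number", False)

worker1 = Worker(shared_cache)
worker2 = Worker(worker2_local_cache)  # bug: local memory instead of Redis

# The user's session state is written to the shared cache only.
shared_cache["sess-42"] = {"can_add_number": True}

# Worker 1 sees the flag and sends the user to "add number"; worker 2 sees
# stale local data and bounces them back with "you can't change your number".
print(worker1.can_add_number("sess-42"))  # True
print(worker2.can_add_number("sess-42"))  # False -> redirect ping-pong
```

With a round-robin load balancer, consecutive requests alternate between the
two workers, which is exactly the bouncing pattern visible in the logs.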

That was a pretty interesting bug.

~~~
drvdevd
So basically both hosts (workers) were _supposed_ to be storing session data
in a shared Redis cache, but one host was misconfigured and was storing
(and/or retrieving) session data from local memory instead?

~~~
StavrosK
Yes, exactly.

------
ejholmes
Awesome article, and really well written. I am curious why people choose to go
with overlay networking. To me, it seems like an extremely complex solution to
a non-problem (source: I run a 700+-container cluster without any overlay
networking).

~~~
loginoff
Are you exposing the necessary ports from these 700+ containers on the host in
order to facilitate multi-host communication? How do you solve service
discovery?

For us the main reason to go with overlay networking is that it basically
solves service discovery (in our specific case) over the entire cluster out of
the box (by using Docker internal DNS). No need to change anything in the
services themselves, they just need to connect to each other using DNS names.
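For what that looks like in practice, here is a minimal sketch using the Docker
CLI (image and service names are placeholders, and it assumes swarm mode is
already initialized with `docker swarm init`):

```shell
# Create an attachable overlay network spanning the swarm's hosts.
docker network create --driver overlay --attachable app-net

# Services on that network register with Docker's embedded DNS by name.
docker service create --name api --network app-net my-api-image
docker service create --name web --network app-net my-web-image

# Inside the "web" containers, the "api" service now resolves by name,
# regardless of which host its tasks run on -- no host ports published:
#   curl http://api:8080/health
```

Container-to-container traffic stays on the overlay, so only services that
need outside access ever publish ports on the hosts.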

------
user5994461
> Debugging this issue gave us a chance to understand some of the inner
> workings and really appreciate all the little details that Docker manages
> for us, in order to provide isolated networking between things that are
> essentially glorified Linux processes.

Great report!

When I see debugging like that, I wonder how many people on the planet have
the basic knowledge necessary to understand it (you need OS, process, kernel,
networking knowledge, etc.)? And how many of them could have performed the
debugging AND found the bug?

Then I realize that's the level we need to be at to operate the service, and
I'm like: Ouchhhhh.

------
FLGMwt
I love seeing detailed rundowns like this. Props to the team for having the
forethought to track all of this information during the investigation, or the
organizational maturity to create a retrospective afterwards.

More of these, everyone!

------
bpatel
Very informative and well written post, thanks for sharing.

------
cat199
tl;dr super hip docker overlay network cluster devopsifier learns to ARP cache
after several days of intermittent service fail.

excuse my cynicism, but this is why keeping up with the latest cool things
doesn't necessarily make a better sysadmin^W excuse me, 'cloud devops engineer'

~~~
feinstruktur
Docker allows a mere mortal like myself, with only 20+ years of software
engineering experience and a physics PhD under my belt, to run a setup that
resembles production more than a few hosted boxes cobbled together would. I
vaguely know about ARP, and I can just about follow along with this post. So
what you seem to regard as just a toy is actually what makes my work possible
and/or viable.

I find that pretty super hip.

~~~
StreamBright
Actually, you still need to learn Docker, and once you put your project into
production you need to be able to debug it as described in the article.

~~~
feinstruktur
Oh, for sure. Docker doesn't replace actually knowing what you're doing once
your setup grows. But it gets you further with less upfront knowledge,
especially as a lot of it is really good dev knowledge for setting up test
infrastructure as well. (That's why I started using Docker.)

