
I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone - joshuacc
http://www.kalzumeus.com/2011/11/17/i-saw-an-extremely-subtle-bug-today-and-i-just-have-to-tell-someone/
======
laughinghan
TL;DR:

1\. Rails 2.3.11 introduced two subtle changes:

\- CSRF tokens have to be included in XHR POST requests

\- failing the CSRF check silently resets the session instead of throwing an
exception

2\. A/Bingo (his A/B testing library) checks if visitors are human with an XHR
POST request. He did not notice that he needed to patch it to include the
Rails CSRF token.

3\. Race Condition: When the login/signup page is loaded, usually the A/Bingo
human check will fail the CSRF check and reset the session, and A/Bingo will
mark the visitor as human, all before the visitor logs in. The session won't
be reset again, because A/Bingo will remember that the visitor is human.
However, if the visitor is very fast and logs in/signs up before the A/Bingo
human check goes through, it might not be until later in the session that the
human check missing the CSRF token resets the session, prompting the visitor
to log in again. Now that the session has been reset and the visitor marked
human, it won't happen again.

4\. His analytics indicated referral stats were way below normal because the
referrer was usually getting reset with the session at the login/signup page.
The only time his analytics libraries would log the referrer correctly was
when the very fast visitors logged in/signed up before the human check missing
the CSRF token reset their session.

Lessons:

\- Race conditions are hard to track down.

\- When analytics indicates something is way out of the ordinary, don't
procrastinate tracking down the problem.

\- Don't dismiss bugs because they (seem) irreproducible. Figure out how to
reproduce them.

~~~
mofey
This was probably the best tl;dr I've read.

------
chops
While reading this article, it reminded me of an odd bug I encountered working
with the Nitrogen Web Framework. I was finding that, similar to Patrick, it
was losing sessions (though much more consistently), but only in Chrome and
only with Yaws as the webserver. Switching to the Mochiweb server or Firefox
caused the problem to go away.

Firebug and Chrome's dev tools both reliably stated that the cookie header was
indeed being set. I just didn't know why Firefox was accepting the cookies and
holding the sessions, but Chrome was dropping the sessions.

Oddly, it was only happening on my virtualbox dev environment, and not on any
production machines.

After much time, I noticed that the expiration date for the cookie was in the
past. I hadn't noticed it before because it _looked_ right (it correctly
passed my mental regex for "looks like a good date").

It turned out the problem was being caused by my machine going to sleep,
pausing everything (including the clock timer in the VirtualBox instance,
which I leave on for weeks or months), causing the clock on the virtual server
to get behind by several days.

Then, when setting the cookie expiry date in max-age format, rather than
absolute time, Mochiweb would send the Max-age expiration, which then gets
handled by the browser relative to the receiving time. But Yaws would first
take server time, add the seconds, and send _that_ as the absolute expiration,
effectively sending an past date to the browser as the expiration.

Firefox, apparently, saw the cookie expiration date, and just said something
like "Hey, we'll hold this until the user closes the tab or something", while
chrome saw the expired cookie, and immediately expired it, appropriately.

That's one of the weirdest non-bug bugs I've encountered.

Note: when I say "Yaws" and "Mochiweb", I mean "Nitrogen's SimpleBridge
connector for Yaws and Mochiweb".

~~~
chops
Also (unrelated), My favorite quote from the article:

 _"Race conditions are why sane programmers don’t program with threads or, if
they do, they use shared-nothing architecture and pass all communication
between the threads through a message queue written by someone who knows what
they are doing (if you have to ask, it isn’t you — seriously, multithreaded
programming is hard)."_

Sounds like an advertisement for Erlang :)

~~~
runT1ME
You can absolutely have race conditions in Erlang, along with deadlocks too.

~~~
masklinn
Absolutely, but it tends to be significantly rarer, and easier to debug (in
part due to tools, in other part due to OTP patterns) than multithreading race
conditions.

But yeah you can have race conditions in everything, you can also have race
condition in non-threaded evented code (e.g. heavily async javascript client-
side code)

------
butterfi
I envy the rush this developer probably got when he figured this out. Sweet,
sweet, code rush... ahhh. If someone could bottle that feeling, caffeinated
sodas would go out of business.

------
smokinn
And that's why you should always have great analytics. Monitoring is what
found a similarly subtle bug at my past job once.

In the later afternoon a marketing guy on the traffic team came over and told
me there was a problem with our trial signups. They'd dropped ~7% today for no
apparently reason. We looked at the charts and today, for most of the day,
compared to many other days and the past week there was indeed a dip. There
were no changes to their marketing campaign or traffic levels either so they
concluded it was a software bug.

But we hadn't pushed any new code that day. All that got pushed was some css
changes from a designer so I concluded that it couldn't be a dev problem. I
started looking from an ops perspective but couldn't find anything abnormal
there either. There weren't a lot of memcached evictions, the db was doing
fine, server loads were normal, etc.

I spent easily a couple of hours trying to find the source of the problem when
marketing found it for me. Someone else was going through the metrics and
noticed that we had the same trial signup rates for firefox, chrome and mobile
browsers, it was only ie that had dropped. When filtered by ie and broken down
by browser version he noticed that IE9, IE8 and IE7 were all the same as other
days but IE6 had a 0%.

When you signed up to our site you signed up as a free member but were
immediately offered a 5 day trial for a premium account and we had enough
volume that the signup rates were predicatable and didn't vary much. When I
created an account with IE6 it put me directly into the home page without
showing me the trial offer page first.

Turns out the problem was indeed from the stylesheet change. The designer had
not only changed a couple of buttons, he'd also added a font-face declaration.
This font-face was not yet used anywhere and the font itself hadn't been
uploaded to assets in production. What happened was IE6 would try to download
the font-face and, when logged into our site we don't output 404 for pages
that don't exist we redirect to the home page. So IE6 would get the redirect
inside a stylesheet and follow that redirect in the browser. All the later IEs
and other browsers simply ignored the redirect.

It was a very strange bug that would've resolved itself on its own the next
day when the designer actually used and uploaded the font but it sure gave me
a lot of head scratching and I never would've found it without the analytics.

~~~
justincormack
You shouldn't redirect for assets that dont exist, even if you do for pages...

~~~
adavies42
you shouldn't redirect for pages either, 404s exist for a reason....

~~~
smokinn
Sigh, I know, but non-technical product managers can be very adamant about
certain things. =(

------
blahedo
tl;dr: A Rails upgrade last January caused previously valid code to reset the
session (but only once per user per session). The session reset caused a
logged-in user to be suddenly logged out; but most users' use-flow (and
Patrick's test-flow) made the session reset happen _before_ login, so the bug
only manifested _visibly_ about 1% of the time. Moral: race conditions are
hard! (Alternate moral: read the release notes carefully before upgrading.)

~~~
lambda
Another alternate moral: Something that seems like a very minor patch-level
change which just fixes a bug and doesn't affect your API actually can break
running production code.

Many people like to pretend that they can follow a discipline like Semantic
Versioning (<http://semver.org/>). But minor changes like this can cause
subtle bugs when you are upgrading between patch-level versions of a component
that you're working with.

------
bgraves
I really like this quote:

 _Bingo Card Creator is not terribly complicated software when compared to
most applications, but it sits on top of other pieces of code (Rails, the web
server the browser, the TCP/IP stack, the underlying OS, the hardware on both
ends, etc) which _collectively_ are orders of magnitude more complicated than
any physical artifact ever created by the human race._

You can insert any number of one-off business applications, reports, queries,
etc in place of "Bingo Card Creator" but the concept remains true.

~~~
neutronicus
As a physicist, I really want to argue with that statement.

In particular, computer hardware operates (by design) as far as possible from
any regimes where the laws of physics are well and truly complicated. A
tokamak fusion reactor, by contrast, although it breaks down into orders of
magnitude fewer logical parts, contains a big blob of monolithic, continuous
complexity, namely the plasma itself and all its associated electromagnetic
fields, which you CS types sell short. :p

~~~
jbigelow76
Leave it to a physicist to make everyone aware of the gravity of the previous
comment. :)

~~~
js2
<http://xkcd.com/435/>

------
bryanh
"If my users are inconvenienced, it is my fault, always."

Good quote. I had a customer become irate with me before over untracked sales
in my digital delivery product. I knew that the error stemmed from their web
designer using their own button code and not our specially crafted button
code, and I tried to help them fix it but only upset them more.

Its hard to take responsibility for things seemingly outside your control, but
in the end it is your responsibility.

~~~
sliverstorm
Habitually taking responsibility for things outside your control is probably a
bad thing though.

~~~
patio11
This is very much not the cultural norm over in these parts. I'd be curious to
hear your reason as to why you think that. (Genuinely curious, not passive-
aggressive "curious.")

Particularly with specific regards to engineering, even things which are
theoretically not within our control (e.g. the server going down because a
technician tripped over the cable) are often inside our ability to affect (by
picking a better provider, investing in a redundant setup, etc). This is good:
after we know that things are generally within our capabilities to address,
we'll address them, rather than thinking that it is outside the scope of our
responsibility, capability, or authority.

(Semi-related sidenote: For hackers who enjoy theology there is a doctrine in
Catholicism called subsidiarity which, as a quick gloss, states that
responsibility for solving a problem begins immediately local to the problem
and then bubbles up to the lowest level of societal organization which is
capable of resolving that problem. That level is obligated to solve the
problem rather than passing the buck up, where it will not receive sufficient
attention, or down, where it will not receive sufficient resources.

I always thought, aside from being a good idea, this was virtually tailor made
for programmers. It's like there is a papal encyclical on the canonical right
way to do exception handling.)

~~~
gburt
I can respond to his comment from my own perspective. With my own customers, I
have a policy of accepting blame for everything and always working to fix
pretty much anything.

This, semi-regularly, results in customers who notice this policy and take
advantage of it. You end up in a situation where you either have to piss off a
customer or do mountains of work that have nothing to do with the scope of
your product simply because the customer is convinced its your fault (or he's
a dink that knows if he pretends its your fault, he'll get something out of
you).

My biggest falling is that. And unfortunately, I feel that the customers who
are like that, are also the ones who make the biggest noise. I'm in a
relatively sensitive industry where my customers tend to be doing a lot of
research before picking my product, so I can't afford to have any bad noise
floating around.

Just an anecdote/personal vent very related to the discussion.

~~~
powertower
GBScripts?

Now we know where to go to get more then we pay for!

~~~
gburt
No, that isn't the business I was referring to. Those are usually pretty open
and shut projects -- and I don't do many of them anymore. Thanks for the
Internet sleuthing though.

------
tedunangst
Wow. Had a very similar CSRF problem once. Customer complained they were
getting page expired errors constantly. But only one customer and only with
Safari, not Firefox.

Turned out they were using an RSS reader on OS X that shared the system cookie
jar with Safari. Every ten minutes it would log in with a password and get a
new session cookie. Safari would then use the new session cookie which didn't
match the CSRF form value.

~~~
Figs
How did you resolve the problem?

~~~
tedunangst
We made the CSRF depend only on the user ID, which doesn't change, instead of
the session ID.

------
spydum
Debugging this sort of issue, finding the cause, and coming up with a solution
is like brain candy to me. I love to do it, and I love to read it.

~~~
steve-howard
I don't have much interest in Rails, but the sentiment reminds me greatly of
The Old New Thing (hey, look what works most of the time because sometimes
behavior is only what you expect by accident!).

------
rlpb
"Whereupon I learned that Rails 2.3.11 changed the behavior of CSRF
protection: instead of throwing exceptions, it would silently just clear the
session and re-run the request. For most sensitive operations (e.g. those
which require a signed in user), this would force a signout and then any
potentially damaging operation would be averted."

Doesn't this change a CSRF attempt into a DoS? I don't understand the logic
behind this change. Why not return an error response?

~~~
laughinghan
How is this a DoS? It's not significantly more resource-intensive than any
other request. If you could submit enough of these requests to DoS the server,
you could submit enough of any request to DoS the server, so the server is no
more vulnerable this way than otherwise.

~~~
tedunangst
It allows you to grief the users of any rails site. You can log them out at
will (for some value of at will).

In the dawn of time, you could use a CSRF exploit and really grief the users.
Then CSRF checking made that go away. This is now a step backwards, where 3rd
party sites can affect your users.

~~~
subwindow
You can't log any arbitrary out of their sessions. It only resets _your_
session, as in the session cookie is reset.

~~~
quotemstr
No, you, as an evil website operator, can cause anyone who visits your site to
log out of any arbitrary rails site.

------
jwingy
This is why I feel a sense of dread every time I update versions of software I
use on my stack...even if you read all the release notes diligently, and
everything seems like it'll check out ok, sometimes you just never know what
exactly might break until it does because of the complexity of all the
intermingling code. Good thing I'm not a control freak and have a convenient
memory.....:)

------
mikebo
To be honest, the default behavior of silently resetting the session and not
throwing seems wrong to me. It's reasonable to only store user auth
information in a cookie, so resetting the session doesn't have the intended
effect of logging out the user.

------
drm237
While not directly related to this post, I feel like it's a good time to point
out how great Github is. Patrick hosts everything himself (for the SEO benefit
I would assume) which is fine, but it means we can't post issues, watch the
abingo repo for changes like this, submit pull requests, etc. I can't imagine
that many people searching for ABingo turn into BCC customers so is the SEO
benefit at the expense of a great abingo community really worth it?

~~~
patio11
The SEO benefit from A/Bingo is not that I sell $30 of software to Rails
hackers. The benefit is that A/Bingo convinces, without loss of generality,
DHH to link to bingocardcreator.com despite him not having any interest in
BCC. The fraction of DHH's substantial trust which rubs off on my domain as a
result helps me rank for unrelated searches by elementary school English
teachers. They pay my rent.

I don't feel that I owe "the abingo community" more than the millions I
already made their for-profit companies while working for free, but if they
really want to send pull requests, they can take the MIT licensed source code,
put it in their own Github, and write me an email saying "Hey, pull my stuff",
which happens every once in a while. Github did not invent requesting pulls.

~~~
rmc
_they can take the MIT licensed source code, put it in their own Github, and
write me an email saying "Hey, pull my stuff", which happens every once in a
while. Github did not invent requesting pulls._

You don't even need to put it on GitHub to send patch. Git has alright support
for easily turning changes to a local branch into an easy to email format, cf.
<http://book.git-scm.com/5_git_and_email.html>

------
mattmanser
Great post, good catch. These kind of errors are the ones that always keep me
up at night when I launch a new site, you can't fix what you can't see!

The worst one I've done is when a fairly bad bug was being blackholed for a
month as I forgot to delete a line from boilerplate code. Damn MS and their
stupidly written yet useful templates ([HandleError] attribute for those in
the know).

 _shudders_

------
timinman
You did a great job of breaking down the technical complexity, Patrick. You
are a good teacher.

"So why did this never show up in development and why did it show up only
sporadically in production?" \-- I've also run into production bugs that don't
manifest in development; your post inspires me to keep as many details as
possible in common for the two environments.

------
heresy
I love spelunking problems like this, it's quite the rush when you finally
crack it.

~~~
rhizome
The real challenge is in not using curse words in response to cracking the
problem.

------
sofal
I got bitten by this same exact thing using Bucketwise a few months ago, but
without the race condition: <https://github.com/jamis/bucketwise/pull/27>

------
AznHisoka
I'm probably missing something, but missed sessions in Ruby? Doesn't this
affect almost everyone using Ruby? What makes this a minor bug for most
people?

~~~
stouset
This behavior isn't a bug. It's expected and necessary. It only changed if you
are 1) POSTing in Javascript, and 2) not using Rails' built-in AJAX or
unobtrusive Javascript helpers, which handle these details for you.

On top of that, most sites that use AJAX heavily and have this bug have it
happen frequently enough that they investigate and fix it. In this case, the
author's Javascript ran infrequently and unpredictably enough to make it
difficult to trace (interestingly, having dealt with this issue before, I
guessed the cause by the third paragraph).

It's the same principle behind the ThinkGeek Annoy-a-Tron. A loud, constant
noise will drive you to look for it within seconds, and you'll probably be
able to find it. But a brief chirp that happens semirandomly every fifteen
minutes is incredibly hard to track down.

------
rubergly
@patio11, minor editorial note I thought I should point out: the link to the
Rails 2.3.10 source is broken (it links to back to the blog post itself).

------
loopdoend
The gremlins do not exist! The attribution of causality to gremlins is likely
causing problems for many people and should be avoided.

------
eduadecastro
Although I am just a front-dev developer, I found this post extremely
interesting to read. Need to try it sometime on a Ruby backend I built
earlier. Thanks!

------
TwoBit
Can somebody post the tldr?

~~~
mechanical_fish
From the first sentence:

 _This post will not help you sell more software. If you’re not fascinated by
the inner workings of complex systems, go do something more important. If you
are, grab some popcorn, because this is the best bug I’ve seen in years._

There's no real point in summarizing further. It's like asking someone to
summarize _Hamlet_ for you [1] because you'd rather not sit through all that
acting and dialogue. The point of this story is that _it's a big complex
meandering story that ultimately culminates in an anticlimactic technical
issue_. You know, kind of like most of _my_ job.

\---

[1] "Almost all the peers in Denmark die violently one by one."

~~~
scott_s
"Cesarean section results in regicide, 30 years later."

"Rebellious teenagers escape parental control."

~~~
smh
Macduff was born by Caesarean section, not Macbeth.

~~~
scott_s
Correct, and Macbeth is king by the end of the play, and Macduff kills him.

