Hacker News new | past | comments | ask | show | jobs | submit login
I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone (kalzumeus.com)
413 points by joshuacc on Nov 17, 2011 | hide | past | web | favorite | 78 comments



TL;DR:

1. Rails 2.3.11 introduced two subtle changes:

- CSRF tokens have to be included in XHR POST requests

- failing the CSRF check silently resets the session instead of throwing an exception

2. A/Bingo (his A/B testing library) checks if visitors are human with an XHR POST request. He did not notice that he needed to patch it to include the Rails CSRF token.

3. Race Condition: When the login/signup page is loaded, usually the A/Bingo human check will fail the CSRF check and reset the session, and A/Bingo will mark the visitor as human, all before the visitor logs in. The session won't be reset again, because A/Bingo will remember that the visitor is human. However, if the visitor is very fast and logs in/signs up before the A/Bingo human check goes through, it might not be until later in the session that the human check missing the CSRF token resets the session, prompting the visitor to log in again. Now that the session has been reset and the visitor marked human, it won't happen again.

4. His analytics indicated referral stats were way below normal because the referrer was usually getting reset with the session at the login/signup page. The only time his analytics libraries would log the referrer correctly was when the very fast visitors logged in/signed up before the human check missing the CSRF token reset their session.

Lessons:

- Race conditions are hard to track down.

- When analytics indicates something is way out of the ordinary, don't procrastinate tracking down the problem.

- Don't dismiss bugs because they (seem) irreproducible. Figure out how to reproduce them.


This was probably the best tl;dr I've read.


While reading this article, it reminded me of an odd bug I encountered working with the Nitrogen Web Framework. I was finding that, similar to Patrick, it was losing sessions (though much more consistently), but only in Chrome and only with Yaws as the webserver. Switching to the Mochiweb server or Firefox caused the problem to go away.

Firebug and Chrome's dev tools both reliably stated that the cookie header was indeed being set. I just didn't know why Firefox was accepting the cookies and holding the sessions, but Chrome was dropping the sessions.

Oddly, it was only happening on my virtualbox dev environment, and not on any production machines.

After much time, I noticed that the expiration date for the cookie was in the past. I hadn't noticed it before because it looked right (it correctly passed my mental regex for "looks like a good date").

It turned out the problem was being caused by my machine going to sleep, pausing everything (including the clock timer in the VirtualBox instance, which I leave on for weeks or months), causing the clock on the virtual server to get behind by several days.

Then, when setting the cookie expiry date in max-age format, rather than absolute time, Mochiweb would send the Max-age expiration, which then gets handled by the browser relative to the receiving time. But Yaws would first take server time, add the seconds, and send that as the absolute expiration, effectively sending an past date to the browser as the expiration.

Firefox, apparently, saw the cookie expiration date, and just said something like "Hey, we'll hold this until the user closes the tab or something", while chrome saw the expired cookie, and immediately expired it, appropriately.

That's one of the weirdest non-bug bugs I've encountered.

Note: when I say "Yaws" and "Mochiweb", I mean "Nitrogen's SimpleBridge connector for Yaws and Mochiweb".


Also (unrelated), My favorite quote from the article:

"Race conditions are why sane programmers don’t program with threads or, if they do, they use shared-nothing architecture and pass all communication between the threads through a message queue written by someone who knows what they are doing (if you have to ask, it isn’t you — seriously, multithreaded programming is hard)."

Sounds like an advertisement for Erlang :)


You can absolutely have race conditions in Erlang, along with deadlocks too.


Absolutely, but it tends to be significantly rarer, and easier to debug (in part due to tools, in other part due to OTP patterns) than multithreading race conditions.

But yeah you can have race conditions in everything, you can also have race condition in non-threaded evented code (e.g. heavily async javascript client-side code)


Or Go. Though unlike Erlang processes, Go doesn't make mutating shared memory in multiple threads impossible. Go just makes it really easy to never do it.


Or haskell with STM.


Or ØMQ.


We're just starting a new project this week. It's not that complicated in the big scheme of things, but there are several bits of hardware that need to be coordinated which maintain their own state, plus a GUI, logging, and database updates.

We'd initially talked about what modules and what main threads would be needed, but I started getting paranoid.

So we're going to be using the Python multiprocessing library, and ZeroMQ plus JSON for the communication between the parts.


Similarly, one of the most banging-my-head-on-the-table bugs I've ever encountered caused all SSL connections from my iOS app to fail with an unhelpful 'something is wrong' message.

Turned out my development iPhone's clock had just set itself back a few years, but I was so focused on the code that it took me a long time to figure that out.


You can't be faulted for that; aside from a few date-aware apps like calendar, where is the year displayed?


I envy the rush this developer probably got when he figured this out. Sweet, sweet, code rush... ahhh. If someone could bottle that feeling, caffeinated sodas would go out of business.


And that's why you should always have great analytics. Monitoring is what found a similarly subtle bug at my past job once.

In the later afternoon a marketing guy on the traffic team came over and told me there was a problem with our trial signups. They'd dropped ~7% today for no apparently reason. We looked at the charts and today, for most of the day, compared to many other days and the past week there was indeed a dip. There were no changes to their marketing campaign or traffic levels either so they concluded it was a software bug.

But we hadn't pushed any new code that day. All that got pushed was some css changes from a designer so I concluded that it couldn't be a dev problem. I started looking from an ops perspective but couldn't find anything abnormal there either. There weren't a lot of memcached evictions, the db was doing fine, server loads were normal, etc.

I spent easily a couple of hours trying to find the source of the problem when marketing found it for me. Someone else was going through the metrics and noticed that we had the same trial signup rates for firefox, chrome and mobile browsers, it was only ie that had dropped. When filtered by ie and broken down by browser version he noticed that IE9, IE8 and IE7 were all the same as other days but IE6 had a 0%.

When you signed up to our site you signed up as a free member but were immediately offered a 5 day trial for a premium account and we had enough volume that the signup rates were predicatable and didn't vary much. When I created an account with IE6 it put me directly into the home page without showing me the trial offer page first.

Turns out the problem was indeed from the stylesheet change. The designer had not only changed a couple of buttons, he'd also added a font-face declaration. This font-face was not yet used anywhere and the font itself hadn't been uploaded to assets in production. What happened was IE6 would try to download the font-face and, when logged into our site we don't output 404 for pages that don't exist we redirect to the home page. So IE6 would get the redirect inside a stylesheet and follow that redirect in the browser. All the later IEs and other browsers simply ignored the redirect.

It was a very strange bug that would've resolved itself on its own the next day when the designer actually used and uploaded the font but it sure gave me a lot of head scratching and I never would've found it without the analytics.


You shouldn't redirect for assets that dont exist, even if you do for pages...


You're correct. The immediate fix was to just upload the font but the long-term fix was as you pointed out.

At first we would only treat urls with certain file types (jpg, png, mp4, etc) as files when found on the server, the rest as pages and we switched from a whitelist of file types to anything that had an asset directory path in the url. Our original htaccess rules were definitely deficient. The idea was that we wanted tracking pixels and the like to be able to run code.


you shouldn't redirect for pages either, 404s exist for a reason....


Sigh, I know, but non-technical product managers can be very adamant about certain things. =(


tl;dr: A Rails upgrade last January caused previously valid code to reset the session (but only once per user per session). The session reset caused a logged-in user to be suddenly logged out; but most users' use-flow (and Patrick's test-flow) made the session reset happen before login, so the bug only manifested visibly about 1% of the time. Moral: race conditions are hard! (Alternate moral: read the release notes carefully before upgrading.)


Another alternate moral: Something that seems like a very minor patch-level change which just fixes a bug and doesn't affect your API actually can break running production code.

Many people like to pretend that they can follow a discipline like Semantic Versioning (http://semver.org/). But minor changes like this can cause subtle bugs when you are upgrading between patch-level versions of a component that you're working with.


Technically, the reason it was once per use per session was not because of the Rails upgrade, but because once the A/Bingo human check missing the CSRF token resets the session, the visitor is marked human and not checked again.


I really like this quote:

Bingo Card Creator is not terribly complicated software when compared to most applications, but it sits on top of other pieces of code (Rails, the web server the browser, the TCP/IP stack, the underlying OS, the hardware on both ends, etc) which _collectively_ are orders of magnitude more complicated than any physical artifact ever created by the human race.

You can insert any number of one-off business applications, reports, queries, etc in place of "Bingo Card Creator" but the concept remains true.


As a physicist, I really want to argue with that statement.

In particular, computer hardware operates (by design) as far as possible from any regimes where the laws of physics are well and truly complicated. A tokamak fusion reactor, by contrast, although it breaks down into orders of magnitude fewer logical parts, contains a big blob of monolithic, continuous complexity, namely the plasma itself and all its associated electromagnetic fields, which you CS types sell short. :p


Leave it to a physicist to make everyone aware of the gravity of the previous comment. :)



Is this actually the case though?

Isn't the physics of CPU design (quantum tunneling etc) just as complicated as that of a fusion reactor (or arguably more so?)

If you meant that the behavior of the plasma itself is complex from the mathematical point of view.. yes, that is true. But it _is_ modeled (along with nuclear explosions) on one of those computer thing you physics type sell short. :P


Oh, solid state physics is plenty complicated. But in the regime where computers operate, you can successfully model the chip as a bunch of discrete signals and NAND gates and shit (ignorance showing through here).

And, yes, I swept computational physics under the rug. Ironic, really, because that's my niche.


"If my users are inconvenienced, it is my fault, always."

Good quote. I had a customer become irate with me before over untracked sales in my digital delivery product. I knew that the error stemmed from their web designer using their own button code and not our specially crafted button code, and I tried to help them fix it but only upset them more.

Its hard to take responsibility for things seemingly outside your control, but in the end it is your responsibility.


Habitually taking responsibility for things outside your control is probably a bad thing though.


This is very much not the cultural norm over in these parts. I'd be curious to hear your reason as to why you think that. (Genuinely curious, not passive-aggressive "curious.")

Particularly with specific regards to engineering, even things which are theoretically not within our control (e.g. the server going down because a technician tripped over the cable) are often inside our ability to affect (by picking a better provider, investing in a redundant setup, etc). This is good: after we know that things are generally within our capabilities to address, we'll address them, rather than thinking that it is outside the scope of our responsibility, capability, or authority.

(Semi-related sidenote: For hackers who enjoy theology there is a doctrine in Catholicism called subsidiarity which, as a quick gloss, states that responsibility for solving a problem begins immediately local to the problem and then bubbles up to the lowest level of societal organization which is capable of resolving that problem. That level is obligated to solve the problem rather than passing the buck up, where it will not receive sufficient attention, or down, where it will not receive sufficient resources.

I always thought, aside from being a good idea, this was virtually tailor made for programmers. It's like there is a papal encyclical on the canonical right way to do exception handling.)


I can respond to his comment from my own perspective. With my own customers, I have a policy of accepting blame for everything and always working to fix pretty much anything.

This, semi-regularly, results in customers who notice this policy and take advantage of it. You end up in a situation where you either have to piss off a customer or do mountains of work that have nothing to do with the scope of your product simply because the customer is convinced its your fault (or he's a dink that knows if he pretends its your fault, he'll get something out of you).

My biggest falling is that. And unfortunately, I feel that the customers who are like that, are also the ones who make the biggest noise. I'm in a relatively sensitive industry where my customers tend to be doing a lot of research before picking my product, so I can't afford to have any bad noise floating around.

Just an anecdote/personal vent very related to the discussion.


GBScripts?

Now we know where to go to get more then we pay for!


No, that isn't the business I was referring to. Those are usually pretty open and shut projects -- and I don't do many of them anymore. Thanks for the Internet sleuthing though.


I wasn't thinking of things you have the power to address. I was thinking more of making yourself the fall guy for things you didn't cause and cannot fix. I didn't know how to express exactly why, but scott_s summarizes part of my feelings on the matter quite well.

The rest of my thoughts on the matter center around the sense I get that, in continually taking responsibility (blame) for things outside your ability to influence, you will basically form a habit of it, which will lead you down a dark road of learned helplessness, omega behavior, and never being able to please your clients because everything bad that happens is "your" fault. If you think about it, if you form a habit of it and it begins to internalize, you are basically developing something similar to Imposter Syndrome.

Personally, when something is my fault I try to own it. When something is not, and is outside my sphere of influence, I try to skip past the issue of "fault" by briefly explaining the cause (in an attempt to show it was not incompetence on my part, while doing my best to avoid "pointing fingers") and focus on resolution.

Simple example: Coworker erases your work from centralized revision control (history and all), causing you to miss a deadline. Do you declare, "There was a glitch in revision control, it is my fault for trusting it" or do you state the facts and try to move on? The former certainly appears to be noble, but to me it appears to be the tragic hero sort of noble.

(I will add that I am very conflicted in writing this. I am saying to myself, "You cold heartless bastard!". I would, and do, cover for other people sometimes, but fundamentally I do believe in the untenable nature of regularly making yourself the scapegoat.)


To go just a little bit further, I would like to draw attention to some of the people who, at some point or another, were indirectly involved in some way with someone's death. I think everyone is familiar with the opine of 'If only I had told him to finish his dinner, that car might not have hit him!', or 'If only I had not stopped for a bagel this morning, the poor cashier would not have been shot!'.

Taking responsibility for things that you did not cause and have utterly no power to fix has a long, long and documented history of (at least temporarily) destroying people. Obviously this is an extreme example, but I feel it illustrates rather well.


The bad kinds of stress - the kinds that wear on a person and cause burnout - are often caused by people worrying about things they can't control.


"It's all personal, every bit of business. . . . If a bolt of lightning his a friend of his the old man would take it personal. . . . That's what makes him great. The Great Don. He takes everything personal. Like God. He know every feather that falls from the tail of a sparrow or however the hell it goes. Right? And you know something? Accidents don't happen to people who take accidents as a personal insult." -- Michael Corleone


I think there are two factors. Clients want solutions, not excuses and they can often do without explanations. So, taking responsibility to solve a problem will (generally speaking, on average, etc) work in your favour if you have remotely appreciative clients.

However, beating yourself up over it is best avoided.


Perhaps you should have constructed an API, and tell them "This is what your code should abide to", instead of just giving them code for the button. That is: abstract the spec away from the actual code.

If closer inspection shows that their code isn't following the spec, you've earned the right for some serious fingerpointing.


This is a good point. In my case, it was as simple as a PayPal button with an extra variable to track a product ID... pretty mundane. However, a feature to "check" (scrape) their button and ensure it is proper is probably in order.


Wow. Had a very similar CSRF problem once. Customer complained they were getting page expired errors constantly. But only one customer and only with Safari, not Firefox.

Turned out they were using an RSS reader on OS X that shared the system cookie jar with Safari. Every ten minutes it would log in with a password and get a new session cookie. Safari would then use the new session cookie which didn't match the CSRF form value.


How did you resolve the problem?


We made the CSRF depend only on the user ID, which doesn't change, instead of the session ID.


I'd probably check to see if they had a valid login already and not update the cookie. Not sure if that opens up other holes though.


Debugging this sort of issue, finding the cause, and coming up with a solution is like brain candy to me. I love to do it, and I love to read it.


I don't have much interest in Rails, but the sentiment reminds me greatly of The Old New Thing (hey, look what works most of the time because sometimes behavior is only what you expect by accident!).


"Whereupon I learned that Rails 2.3.11 changed the behavior of CSRF protection: instead of throwing exceptions, it would silently just clear the session and re-run the request. For most sensitive operations (e.g. those which require a signed in user), this would force a signout and then any potentially damaging operation would be averted."

Doesn't this change a CSRF attempt into a DoS? I don't understand the logic behind this change. Why not return an error response?


How is this a DoS? It's not significantly more resource-intensive than any other request. If you could submit enough of these requests to DoS the server, you could submit enough of any request to DoS the server, so the server is no more vulnerable this way than otherwise.


It allows you to grief the users of any rails site. You can log them out at will (for some value of at will).

In the dawn of time, you could use a CSRF exploit and really grief the users. Then CSRF checking made that go away. This is now a step backwards, where 3rd party sites can affect your users.


You can't log any arbitrary out of their sessions. It only resets your session, as in the session cookie is reset.


No, you, as an evil website operator, can cause anyone who visits your site to log out of any arbitrary rails site.


It seems like it would.

The security advisory made it seem like the way they were returning an error was inadequate - perhaps there was an insecurity with cookie stuff being sent back?

But yeah, this seems like too big a hammer to fix that problem...


This is why I feel a sense of dread every time I update versions of software I use on my stack...even if you read all the release notes diligently, and everything seems like it'll check out ok, sometimes you just never know what exactly might break until it does because of the complexity of all the intermingling code. Good thing I'm not a control freak and have a convenient memory.....:)


To be honest, the default behavior of silently resetting the session and not throwing seems wrong to me. It's reasonable to only store user auth information in a cookie, so resetting the session doesn't have the intended effect of logging out the user.


While not directly related to this post, I feel like it's a good time to point out how great Github is. Patrick hosts everything himself (for the SEO benefit I would assume) which is fine, but it means we can't post issues, watch the abingo repo for changes like this, submit pull requests, etc. I can't imagine that many people searching for ABingo turn into BCC customers so is the SEO benefit at the expense of a great abingo community really worth it?


The SEO benefit from A/Bingo is not that I sell $30 of software to Rails hackers. The benefit is that A/Bingo convinces, without loss of generality, DHH to link to bingocardcreator.com despite him not having any interest in BCC. The fraction of DHH's substantial trust which rubs off on my domain as a result helps me rank for unrelated searches by elementary school English teachers. They pay my rent.

I don't feel that I owe "the abingo community" more than the millions I already made their for-profit companies while working for free, but if they really want to send pull requests, they can take the MIT licensed source code, put it in their own Github, and write me an email saying "Hey, pull my stuff", which happens every once in a while. Github did not invent requesting pulls.


they can take the MIT licensed source code, put it in their own Github, and write me an email saying "Hey, pull my stuff", which happens every once in a while. Github did not invent requesting pulls.

You don't even need to put it on GitHub to send patch. Git has alright support for easily turning changes to a local branch into an easy to email format, cf. http://book.git-scm.com/5_git_and_email.html


Great post, good catch. These kind of errors are the ones that always keep me up at night when I launch a new site, you can't fix what you can't see!

The worst one I've done is when a fairly bad bug was being blackholed for a month as I forgot to delete a line from boilerplate code. Damn MS and their stupidly written yet useful templates ([HandleError] attribute for those in the know).

shudders


You did a great job of breaking down the technical complexity, Patrick. You are a good teacher.

"So why did this never show up in development and why did it show up only sporadically in production?" -- I've also run into production bugs that don't manifest in development; your post inspires me to keep as many details as possible in common for the two environments.


I love spelunking problems like this, it's quite the rush when you finally crack it.


The real challenge is in not using curse words in response to cracking the problem.


I got bitten by this same exact thing using Bucketwise a few months ago, but without the race condition: https://github.com/jamis/bucketwise/pull/27


I'm probably missing something, but missed sessions in Ruby? Doesn't this affect almost everyone using Ruby? What makes this a minor bug for most people?


This behavior isn't a bug. It's expected and necessary. It only changed if you are 1) POSTing in Javascript, and 2) not using Rails' built-in AJAX or unobtrusive Javascript helpers, which handle these details for you.

On top of that, most sites that use AJAX heavily and have this bug have it happen frequently enough that they investigate and fix it. In this case, the author's Javascript ran infrequently and unpredictably enough to make it difficult to trace (interestingly, having dealt with this issue before, I guessed the cause by the third paragraph).

It's the same principle behind the ThinkGeek Annoy-a-Tron. A loud, constant noise will drive you to look for it within seconds, and you'll probably be able to find it. But a brief chirp that happens semirandomly every fifteen minutes is incredibly hard to track down.


@patio11, minor editorial note I thought I should point out: the link to the Rails 2.3.10 source is broken (it links to back to the blog post itself).


The gremlins do not exist! The attribution of causality to gremlins is likely causing problems for many people and should be avoided.


Although I am just a front-dev developer, I found this post extremely interesting to read. Need to try it sometime on a Ruby backend I built earlier. Thanks!


Can somebody post the tldr?


From the first sentence:

This post will not help you sell more software. If you’re not fascinated by the inner workings of complex systems, go do something more important. If you are, grab some popcorn, because this is the best bug I’ve seen in years.

There's no real point in summarizing further. It's like asking someone to summarize Hamlet for you [1] because you'd rather not sit through all that acting and dialogue. The point of this story is that it's a big complex meandering story that ultimately culminates in an anticlimactic technical issue. You know, kind of like most of my job.

---

[1] "Almost all the peers in Denmark die violently one by one."


"Cesarean section results in regicide, 30 years later."

"Rebellious teenagers escape parental control."


Macduff was born by Caesarean section, not Macbeth.


Correct, and Macbeth is king by the end of the play, and Macduff kills him.


Just because it's interesting to you or the author doesn't mean it's interesting to everybody else. I've seen many posts by programmers who think they discovered something great and it turns out to be something others have seen 100 times.


It's just a fun read, the info about CSRF in Rails will, for most people, be inconsequential. Thus, the tldr is: bookmark this for when you have ten minutes of spare time. :)


Rails 2.3.11 turned CSRF violations into silent retries, which can be a problem if you track session IDs closely.


tl;dr is a blight upon society


Not sure why you're getting downvoted. People should be able to read well-written interesting articles.


tldr: There were some recent changes in Rails in the portion of the code that handles CSRF protection, specifically around handling AJAX. Users' sessions were getting dropped and it was costing him money.




Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: