1. Rails 2.3.11 introduced two subtle changes:
- CSRF tokens have to be included in XHR POST requests
- failing the CSRF check silently resets the session instead of throwing an exception
2. A/Bingo (his A/B testing library) checks if visitors are human with an XHR POST request. He did not notice that he needed to patch it to include the Rails CSRF token.
3. Race Condition: When the login/signup page loads, the A/Bingo human check usually fails the CSRF check and resets the session, and A/Bingo marks the visitor as human, all before the visitor logs in. The session won't be reset again, because A/Bingo remembers that the visitor is human. However, if the visitor is very fast and logs in/signs up before the A/Bingo human check goes through, the token-less human check may not reset the session until later, prompting the visitor to log in again. Once the session has been reset and the visitor marked human, it won't happen again.
4. His analytics indicated referral stats were way below normal because the referrer was usually getting reset with the session at the login/signup page. The only time his analytics libraries would log the referrer correctly was when the very fast visitors logged in/signed up before the human check missing the CSRF token reset their session.
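The interaction of the two changes in point 1 can be sketched outside of Rails. Below is a minimal Python illustration (not Rails source; all names are invented) of why silently resetting the session turns a forgotten token into an invisible logout:

```python
class InvalidAuthenticityToken(Exception):
    pass

def verify_csrf_pre_2_3_11(session, params):
    """Older behavior: a bad or missing token raises loudly."""
    if params.get("authenticity_token") != session.get("csrf_token"):
        raise InvalidAuthenticityToken()

def verify_csrf_2_3_11(session, params):
    """2.3.11 behavior: a bad or missing token silently resets the session."""
    if params.get("authenticity_token") != session.get("csrf_token"):
        session.clear()

session = {"csrf_token": "abc123", "user_id": 42}
verify_csrf_2_3_11(session, {})  # an XHR POST that forgot the token
print(session)  # {} -- the user is silently logged out, with no error anywhere
```

With the old behavior the forgotten token would have produced a noisy exception in the logs; with the new behavior the only symptom is data (referrers, sessions) quietly going missing.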
- Race conditions are hard to track down.
- When analytics indicates something is way out of the ordinary, don't procrastinate tracking down the problem.
- Don't dismiss bugs because they seem irreproducible. Figure out how to reproduce them.
Firebug and Chrome's dev tools both reliably stated that the cookie header was indeed being set. I just didn't know why Firefox was accepting the cookies and holding the sessions, but Chrome was dropping the sessions.
Oddly, it was only happening on my virtualbox dev environment, and not on any production machines.
After much time, I noticed that the expiration date for the cookie was in the past. I hadn't noticed it before because it looked right (it correctly passed my mental regex for "looks like a good date").
It turned out the problem was caused by my machine going to sleep, which paused everything (including the clock timer in the VirtualBox instance, which I leave running for weeks or months) and let the clock on the virtual server fall behind by several days.
Then, when setting the cookie expiry in max-age format rather than as an absolute time, Mochiweb would send the Max-Age expiration, which the browser handles relative to the time it receives the cookie. But Yaws would first take the server time, add the seconds, and send that as the absolute expiration, effectively sending a past date to the browser as the expiration.
Firefox, apparently, saw the cookie expiration date and just said something like "Hey, we'll hold this until the user closes the tab or something," while Chrome saw the expired cookie and immediately expired it, appropriately.
That's one of the weirdest non-bug bugs I've encountered.
Note: when I say "Yaws" and "Mochiweb", I mean "Nitrogen's SimpleBridge connector for Yaws and Mochiweb".
"Race conditions are why sane programmers don’t program with threads or, if they do, they use shared-nothing architecture and pass all communication between the threads through a message queue written by someone who knows what they are doing (if you have to ask, it isn’t you — seriously, multithreaded programming is hard)."
Sounds like an advertisement for Erlang :)
We'd initially talked about what modules and what main threads would be needed, but I started getting paranoid.
So we're going to be using the Python multiprocessing library, and ZeroMQ plus JSON for the communication between the parts.
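A minimal sketch of that shared-nothing shape, with `multiprocessing` queues standing in for the ZeroMQ sockets so the example stays stdlib-only; the point is that only JSON strings ever cross the process boundary:

```python
import json
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    """Shared-nothing worker: no shared state, only JSON messages."""
    for raw in iter(inbox.get, None):       # None is the shutdown sentinel
        msg = json.loads(raw)
        outbox.put(json.dumps({"id": msg["id"], "result": msg["x"] * 2}))

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    inbox.put(json.dumps({"id": 1, "x": 21}))
    print(json.loads(outbox.get()))         # {'id': 1, 'result': 42}
    inbox.put(None)                         # tell the worker to exit
    p.join()
```

Swapping the queues for ZeroMQ sockets changes the transport, not the discipline: each process owns its state and everything else is a serialized message.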
Turned out my development iPhone's clock had just set itself back a few years, but I was so focused on the code that it took me a long time to figure that out.
In the late afternoon a marketing guy on the traffic team came over and told me there was a problem with our trial signups. They'd dropped ~7% that day for no apparent reason. We compared the day's charts against many other days and the past week, and there was indeed a dip. There had been no changes to their marketing campaign or traffic levels either, so they concluded it was a software bug.
But we hadn't pushed any new code that day. All that got pushed was some css changes from a designer so I concluded that it couldn't be a dev problem. I started looking from an ops perspective but couldn't find anything abnormal there either. There weren't a lot of memcached evictions, the db was doing fine, server loads were normal, etc.
I spent easily a couple of hours trying to find the source of the problem when marketing found it for me. Someone else was going through the metrics and noticed that we had the same trial signup rates for Firefox, Chrome, and mobile browsers; it was only IE that had dropped. When he filtered by IE and broke it down by browser version, he noticed that IE9, IE8, and IE7 were all the same as other days but IE6 was at 0%.
When you signed up for our site you signed up as a free member but were immediately offered a 5-day trial for a premium account, and we had enough volume that the signup rates were predictable and didn't vary much. When I created an account with IE6 it put me directly on the home page without showing me the trial offer page first.
Turns out the problem was indeed from the stylesheet change. The designer had not only changed a couple of buttons, he'd also added a @font-face declaration. The font face was not yet used anywhere and the font itself hadn't been uploaded to assets in production. IE6 would try to download the font, and when you're logged into our site we don't return 404 for pages that don't exist, we redirect to the home page. So IE6 would get the redirect inside a stylesheet and follow that redirect in the browser. All the later IEs and other browsers simply ignored the redirect.
It was a very strange bug that would've resolved itself on its own the next day when the designer actually used and uploaded the font but it sure gave me a lot of head scratching and I never would've found it without the analytics.
At first we would only treat URLs with certain file types (jpg, png, mp4, etc.) as files when found on the server and the rest as pages; we later switched from a whitelist of file types to treating anything with an asset directory path in the URL as a file. Our original htaccess rules were definitely deficient. The idea was that we wanted tracking pixels and the like to be able to run code.
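The two classification rules can be sketched like this (Python rather than htaccess, with invented paths), which also shows how the missing font fell through the old rule:

```python
FILE_EXTENSIONS = {".jpg", ".png", ".mp4"}   # the old whitelist
ASSET_PREFIXES = ("/assets/", "/images/")    # the path-based rule (paths invented)

def is_file_old(path):
    """Old rule: only whitelisted extensions count as files."""
    return any(path.endswith(ext) for ext in FILE_EXTENSIONS)

def is_file_new(path):
    """New rule: anything under an asset directory is a file."""
    return path.startswith(ASSET_PREFIXES)

# The designer's .woff font wasn't on the whitelist, so the missing file
# was treated as a page and redirected to the home page instead of 404ing:
print(is_file_old("/assets/new-font.woff"))  # False -> redirect into the CSS
print(is_file_new("/assets/new-font.woff"))  # True  -> honest 404
```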
Many people like to pretend that they can follow a discipline like Semantic Versioning (http://semver.org/). But minor changes like this can cause subtle bugs when you are upgrading between patch-level versions of a component that you're working with.
Bingo Card Creator is not terribly complicated software when compared to most applications, but it sits on top of other pieces of code (Rails, the web server, the browser, the TCP/IP stack, the underlying OS, the hardware on both ends, etc.) which _collectively_ are orders of magnitude more complicated than any physical artifact ever created by the human race.
You can insert any number of one-off business applications, reports, queries, etc in place of "Bingo Card Creator" but the concept remains true.
In particular, computer hardware operates (by design) as far as possible from any regimes where the laws of physics are well and truly complicated. A tokamak fusion reactor, by contrast, although it breaks down into orders of magnitude fewer logical parts, contains a big blob of monolithic, continuous complexity, namely the plasma itself and all its associated electromagnetic fields, which you CS types sell short. :p
Isn't the physics of CPU design (quantum tunneling etc.) just as complicated as that of a fusion reactor (or arguably more so)?
If you meant that the behavior of the plasma itself is complex from the mathematical point of view... yes, that is true. But it _is_ modeled (along with nuclear explosions) on one of those computer things you physics types sell short. :P
And, yes, I swept computational physics under the rug. Ironic, really, because that's my niche.
Good quote. I had a customer become irate with me before over untracked sales in my digital delivery product. I knew that the error stemmed from their web designer using their own button code and not our specially crafted button code, and I tried to help them fix it but only upset them more.
It's hard to take responsibility for things seemingly outside your control, but in the end it is your responsibility.
Particularly with regard to engineering, even things which are theoretically not within our control (e.g. the server going down because a technician tripped over the cable) are often within our ability to affect (by picking a better provider, investing in a redundant setup, etc.). This is good: once we know that things are generally within our capability to address, we'll address them, rather than thinking they are outside the scope of our responsibility, capability, or authority.
(Semi-related sidenote: For hackers who enjoy theology there is a doctrine in Catholicism called subsidiarity which, as a quick gloss, states that responsibility for solving a problem begins immediately local to the problem and then bubbles up to the lowest level of societal organization which is capable of resolving that problem. That level is obligated to solve the problem rather than passing the buck up, where it will not receive sufficient attention, or down, where it will not receive sufficient resources.
I always thought, aside from being a good idea, this was virtually tailor made for programmers. It's like there is a papal encyclical on the canonical right way to do exception handling.)
This, semi-regularly, results in customers who notice this policy and take advantage of it. You end up in a situation where you either have to piss off a customer or do mountains of work that has nothing to do with the scope of your product, simply because the customer is convinced it's your fault (or he's a dink who knows that if he pretends it's your fault, he'll get something out of you).
My biggest failing is that. And unfortunately, I feel that the customers who are like that are also the ones who make the biggest noise. I'm in a relatively sensitive industry where my customers tend to do a lot of research before picking my product, so I can't afford to have any bad noise floating around.
Just an anecdote/personal vent very related to the discussion.
Now we know where to go to get more than we pay for!
The rest of my thoughts on the matter center around the sense I get that, in continually taking responsibility (blame) for things outside your ability to influence, you will basically form a habit of it, which will lead you down a dark road of learned helplessness, omega behavior, and never being able to please your clients because everything bad that happens is "your" fault. If you think about it, if you form a habit of it and it begins to internalize, you are basically developing something similar to Imposter Syndrome.
Personally, when something is my fault I try to own it. When something is not, and is outside my sphere of influence, I try to skip past the issue of "fault" by briefly explaining the cause (in an attempt to show it was not incompetence on my part, while doing my best to avoid "pointing fingers") and focus on resolution.
Simple example: Coworker erases your work from centralized revision control (history and all), causing you to miss a deadline. Do you declare, "There was a glitch in revision control, it is my fault for trusting it" or do you state the facts and try to move on? The former certainly appears to be noble, but to me it appears to be the tragic hero sort of noble.
(I will add that I am very conflicted in writing this. I am saying to myself, "You cold heartless bastard!". I would, and do, cover for other people sometimes, but fundamentally I do believe in the untenable nature of regularly making yourself the scapegoat.)
Taking responsibility for things that you did not cause and have utterly no power to fix has a long and well-documented history of (at least temporarily) destroying people. Obviously this is an extreme example, but I feel it illustrates the point rather well.
However, beating yourself up over it is best avoided.
If closer inspection shows that their code isn't following the spec, you've earned the right for some serious fingerpointing.
Turned out they were using an RSS reader on OS X that shared the system cookie jar with Safari. Every ten minutes it would log in with a password and get a new session cookie. Safari would then use the new session cookie which didn't match the CSRF form value.
Doesn't this change a CSRF attempt into a DoS? I don't understand the logic behind this change. Why not return an error response?
In the dawn of time, you could use a CSRF exploit and really grief the users. Then CSRF checking made that go away. This is now a step backwards, where 3rd party sites can affect your users.
The security advisory made it seem like the way they were returning an error was inadequate - perhaps there was an insecurity with cookie stuff being sent back?
But yeah, this seems like too big a hammer to fix that problem...
I don't feel that I owe "the abingo community" more than the millions I already made their for-profit companies while working for free, but if they really want to send pull requests, they can take the MIT licensed source code, put it in their own Github, and write me an email saying "Hey, pull my stuff", which happens every once in a while. Github did not invent requesting pulls.
You don't even need to put it on GitHub to send a patch. Git has decent support for easily turning changes on a local branch into an emailable format, cf. http://book.git-scm.com/5_git_and_email.html
The worst one I've done is when a fairly bad bug was blackholed for a month because I forgot to delete a line from boilerplate code. Damn MS and their stupidly written yet useful templates (the [HandleError] attribute, for those in the know).
"So why did this never show up in development and why did it show up only sporadically in production?"
-- I've also run into production bugs that don't manifest in development; your post inspires me to keep as many details as possible in common for the two environments.
It's the same principle behind the ThinkGeek Annoy-a-Tron. A loud, constant noise will drive you to look for it within seconds, and you'll probably be able to find it. But a brief chirp that happens semirandomly every fifteen minutes is incredibly hard to track down.
This post will not help you sell more software. If you’re not fascinated by the inner workings of complex systems, go do something more important. If you are, grab some popcorn, because this is the best bug I’ve seen in years.
There's no real point in summarizing further. It's like asking someone to summarize Hamlet for you because you'd rather not sit through all that acting and dialogue. The point of this story is that it's a big complex meandering story that ultimately culminates in an anticlimactic technical issue. You know, kind of like most of my job.
 "Almost all the peers in Denmark die violently one by one."
"Rebellious teenagers escape parental control."