Debugging file corruption on iOS (facebook.com)
90 points by nspiegelberg on Aug 12, 2014 | 21 comments

This is the challenge of modern software development. When I first started my career, I spent two months looking for a random crashing bug in a video game due to a misplaced free(); this was in code that was 100% ours. Now, there likely isn't an application in existence that does anything interesting where 100% of the code was written for the app itself. As such, you are no longer debugging only your own code, but other people's as well.

Vernor Vinge coined the idea of the software archeologist: someone who can sift through layers of code, all the way back to the beginning of time (Jan 1, 1970 and Unix, of course), to understand how systems work and to make modifications to them. We aren't that far from that point; already, there are people who seem to be specializing in digital spelunking to find and fix bugs in these underlying layers.

While I love building new code, some of the most satisfying moments in my career have been when I've gone back through somebody else's code, untied it (oh, you built a polymorphic type system including class initializers and destructors in a decidedly not object oriented language? Cool.) and fixed it. There's something interesting about getting into somebody else's head and seeing how they approached the problem, then finding out where they were wrong.

Interestingly, in my first job all of the code I worked on in my first 3 years was "in-house" code (everything from the bootloader up to my code was proprietary). I think places like Microsoft also have groups that see the same phenomenon. However, the people who initially wrote or worked on that code were long gone.

I've gone on to work at other places where I had to do more of this archeology, and I have to say it actually felt similar.

In summary, I think there is a new, more combinatorially complex kind of archeology occurring now. The Microsoft, Apple, NetApp, Linux, EMC, VxWorks, etc. OS people have been dealing with some of this for a while within one OS; now people who rely on services, on many processes, on an internet of things or whatever have to deal with it across all of them.

It feels like we'll never have a POSIX of the internet. HTTP is as close as we've gotten, and it's too small to be read/write/exec. We'll never have anything you can "trust" and more and more developers need the patience to wade through everyone else's code as well.

I love the idea of software archaeology in the context of video game software. Usually video game platforms (the earlier the better) force developers to resort to wacky/unique tricks to make the most of their limited resources. And I feel like that ends up giving video games most of their unique flavor.

I chatted with Wired about just this concept: http://www.wired.com/2014/08/facebook_bug/

No offense to you but that article is very poorly written. It seems like someone abducted the author mid-article and pressed the publish button :)

There are some technical inaccuracies that are hard to look past as a coder (mixing up POSIX and UNIX, talking about unexpected behavior as a bug). However, I think the author did an accurate job on the high-level tenor of the article and touched on a critical meta-point about this bug.

> "The SSL layer instead handled a raw file descriptor and, consequently, lifetime handling was not automatically synchronized ... We worked with the networking team and fixed this issue within hours."

Why on earth is Facebook writing their own SSL layer for iOS?

SPDY (pre-Apple's version) + Open Source SSL Library + Perf Mods

I would guess one of NIH or SPDY. And they may not be (probably aren't is my guess) writing their own, just packaging one up that they know works with whatever depends on it.

Stock SSL on iOS doesn't have NPN, which is needed to negotiate SPDY on the internet.

  > // setup a honeypot file
  > int trap_fd = open(…); 
  > // Create new function to detect writes to the honeypot
  > static WRITE_FUNC_T original_write = dlsym(RTLD_DEFAULT, "write");;
  > ssize_t corruption_write(int fd, const void *buf, size_t size) { 
  >   FBFatal(fd != trap_fd, @"Writing to the honeypot file");
  > }
  > return original_write(fd, buf, size);
  > }
  > // Replace the system write with our “checked version”
  > rebind_symbols((struct rebinding[1]){{(char *)"write", (void *)corruption_write}}, 1);
Does this code snippet look fishy to anyone else? First, the mismatched braces are messing with my head. I'm thinking the brace before the return is a typo. Also, the call to the macro looks wrong. Shouldn't they be checking for fd == trap_fd?

1. I added an extra brace after FBFatal in a final revision :( I'll ask for a revision.

2. FBFatal has the same semantics as assert(), so that's correct.

3. The 'rebind_symbols' line is truncated and is missing a horizontal scrollbar. You can view the rest of it if you click and drag.

  > 2. FBFatal has the same semantics as assert(), so that's correct.
Ah, this makes sense to me now. Thanks!
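
For readers following the brace and assert() discussion, here is a minimal sketch of the checked-write idea with balanced braces. It is not the article's code: `CHECK_OR_RECORD` and `fatal_count` are hypothetical stand-ins for FBFatal (they record the failure instead of aborting, so the behavior can be tested), and the symbol rebinding step is left out entirely, so callers must invoke `corruption_write` directly.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef ssize_t (*write_func_t)(int, const void *, size_t);

/* Hypothetical stand-ins: the honeypot descriptor, and a counter that
 * replaces FBFatal's abort so the sketch can be exercised in a test. */
static int trap_fd = -1;
static int fatal_count = 0;

/* Like assert(): the message fires when the condition is FALSE. */
#define CHECK_OR_RECORD(cond, msg)                  \
    do {                                            \
        if (!(cond)) {                              \
            fatal_count++;                          \
            fprintf(stderr, "fatal: %s\n", (msg));  \
        }                                           \
    } while (0)

/* The real write(); the quoted snippet looks it up with dlsym() instead. */
static write_func_t original_write = write;

/* Checked write: flags any write aimed at the honeypot descriptor. */
ssize_t corruption_write(int fd, const void *buf, size_t size) {
    CHECK_OR_RECORD(fd != trap_fd, "Writing to the honeypot file");
    if (fd == trap_fd) {
        /* An aborting check would never reach this point;
         * this sketch refuses the write instead. */
        return -1;
    }
    return original_write(fd, buf, size);
}
```

The condition reads `fd != trap_fd` because, as with assert(), the check states what must be true for the program to continue; it fires only when a write targets the honeypot.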

> Does this code snippet look fishy to anyone else?

If you can still edit your post, try block-indenting the entire code list with four spaces along the left margin -- this allows the code to appear on separate lines as intended and preserves the original indentation.

Thanks for the tip! Done.

I would love to see a standalone testcase demonstrating this bug. Also, what was Facebook's solution to fix it? Was it to change the SSL library to use a CFSocket wrapper?
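
The general failure class (a cached raw descriptor number outliving the file it referred to) can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not Facebook's test case: pipes stand in for both the socket and the database file, the function name is hypothetical, and the process is assumed to be single-threaded so that the freed descriptor number is the next one handed out.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if a write through a stale descriptor number landed in a
 * different, newly opened file; 0 if it did not; -1 on setup failure. */
int stale_write_hits_new_file(void) {
    int sock[2], db[2];
    char buf[4] = {0};

    if (pipe(sock) != 0) return -1;
    if (pipe(db) != 0) return -1;

    int cached_fd = sock[1];   /* raw number cached by another component */
    close(sock[1]);            /* the owner closes the "socket" */

    /* The "database" is opened next. POSIX hands out the lowest free
     * descriptor number, which is the one that was just closed. */
    int db_fd = dup(db[1]);
    int reused = (db_fd == cached_fd);

    int hit = 0;
    if (reused) {
        /* Stale write: intended for the socket, lands in the database. */
        if (write(cached_fd, "TLS", 3) == 3 &&
            read(db[0], buf, 3) == 3 &&
            memcmp(buf, "TLS", 3) == 0) {
            hit = 1;
        }
    }

    close(sock[0]);
    close(db[0]);
    close(db[1]);
    if (db_fd >= 0) close(db_fd);
    return reused ? hit : -1;
}
```

In a real application the reuse depends on timing between threads, which is what makes this kind of corruption so hard to reproduce on demand.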

> It turns out that abandoning manual code analysis was a good strategy.

Wait a minute, wasn't this manual code analysis? They were certainly digging around the codebase and a particular slice of commits to figure out why the crash kept occurring.

Although fixing a bug requires analyzing the offending code, we weren't able to effectively narrow down the area of inspection through manual means (diff analysis, stack trace, git bisect). We instead narrowed down the area of code using sandboxing and non-trivial conditional breakpointing.

To fix the SSL library, we first used dup() to properly refcount the FD and then did more long-term restructuring to properly couple the FD & SSL object lifetime later.
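
The dup() step can be sketched as follows. This is an illustration under stated assumptions rather than the actual patch: `ssl_session` and its functions are hypothetical names, and no real SSL library is involved. The point is only that a component holding its own duplicate keeps a valid descriptor even after the original owner closes theirs.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for an SSL object that needs a descriptor. */
struct ssl_session {
    int fd; /* private duplicate, owned by this session */
};

/* Take a private duplicate instead of caching the caller's number. */
int ssl_session_init(struct ssl_session *s, int socket_fd) {
    s->fd = dup(socket_fd);
    return s->fd < 0 ? -1 : 0;
}

/* Writes always go through the session's own descriptor. */
ssize_t ssl_session_send(struct ssl_session *s, const void *buf, size_t n) {
    return write(s->fd, buf, n);
}

/* The underlying file stays open until every duplicate is closed. */
void ssl_session_close(struct ssl_session *s) {
    if (s->fd >= 0) {
        close(s->fd);
        s->fd = -1;
    }
}
```

Because the kernel keeps the underlying open file alive until the last descriptor referring to it is closed, the caller closing `socket_fd` can no longer turn the session's descriptor into a stale number that another file might reuse.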

Move fast and break things.

With these kinds of Heisenbugs, even the most diligent software tester isn't likely to catch it before release.

How much fire is normal?

I feel they are just overcomplicating things.
