

Debugging file corruption on iOS - nspiegelberg
https://code.facebook.com/posts/313033472212144/debugging-file-corruption-on-ios/

======
SoftwareMaven
This is the challenge of modern software development. When I first started my
career, I spent two months looking for a random crashing bug in a video game
due to misplaced free(); this was in code that was 100% ours. Now, there
likely isn't an application in existence that does anything interesting where
100% of the code was written for the app itself. As such, you are no longer
debugging your own code, but other people's as well.

Vernor Vinge coined the idea of a software archeologist, people who could sift
through layers of code, all the way back to the beginning of time (Jan 1, 1970
and Unix, of course), to understand how systems work and to make modifications
to them. We aren't that far from that point; already, there are people who
seem to specializing in digital spelunking to find and fix bugs in these
underlying layers.

While I love building new code, some of the most satisfying moments in my
career have been when I've gone back through somebody else's code, untied it
(oh, you built a polymorphic type system including class initializers and
destructors in a decidedly not object oriented language? Cool.) and fixed it.
There's something interesting about getting into somebody else's head and
seeing how they approached the problem, then finding out where they were
wrong.

~~~
nspiegelberg
I chatted with Wired about just this concept:
[http://www.wired.com/2014/08/facebook_bug/](http://www.wired.com/2014/08/facebook_bug/)

~~~
general_failure
No offense to you but that article is very poorly written. It seems like
someone abducted the author mid-article and pressed the publish button :)

~~~
nspiegelberg
There are some technical inaccuracies that make it hard to look past as a
coder (mixing up POSIX and UNIX, talking about unexpected behavior as a bug).
However, I think the author did an accurate job on the high-level tenor of the
article and touched on a critical meta-point about this bug.

------
pretz
> "The SSL layer instead handled a raw file descriptor and, consequently,
> lifetime handling was not automatically synchronized ... We worked with the
> networking team and fixed this issue within hours."

Why on earth is Facebook writing their own SSL layer for iOS?

~~~
nspiegelberg
SPDY (pre-Apple's version) + Open Source SSL Library + Perf Mods

------
fdawg4l

      > // setup a honeypot file
      > int trap_fd = open(…); 
      > // Create new function to detect writes to the honeypot
      > static WRITE_FUNC_T original_write = dlsym(RTLD_DEFAULT, "write");;
      > ssize_t corruption_write(int fd, const void *buf, size_t size) { 
      >   FBFatal(fd != trap_fd, @"Writing to the honeypot file");
      > }
      > return original_write(fd, buf, size);
      > }
      > // Replace the system write with our “checked version”
      > rebind_symbols((struct rebinding[1]){{(char *)"write", (void *)corruption_write}}, 1);
    

Does this code snippet look fishy to anyone else? First, the mismatch braces
are messing with my head. I'm thinking the brace before the return is a typo.
Also, the call to the macro looks wrong. Shouldn't they be checking for fd ==
trap_fd?

~~~
nspiegelberg
1\. I added an extra brace after FBFatal in a final revision :( I'll ask for a
revision 2\. FBFatal has the same semantics as assert(), so that's correct.
3\. The 'rebind_symbols' line truncates and is missing a horizontal scroll.
You can view the rest of it if you click drag.

~~~
fdawg4l

      > 2. FBFatal has the same semantics as assert(), so that's correct.
    

Ah, this makes sense to me now. Thanks!

------
shurane
I would love to see a standalone testcase demonstrating this bug. Although,
what was Facebook's solution to fix it? Was it to make a fix to the SSL
library to use a CFSocker wrapper?

| It turns out that abandoning manual code analysis was a good strategy.

Wait a minute, wasn't this manual code analysis? They were certainly digging
around the codebase and a particular slice of commits to figure out why the
crash kept occurring.

~~~
nspiegelberg
Although fixing a bug requires analyzing the offending code, we weren't able
to effectively narrow down the area of inspection through manual means (diff
analysis, stack trace, git bisect). We instead narrowed down the area of code
using sandboxing and non-trivial conditional breakpointing.

To fix the SSL library, we first used dup() to properly refcount the FD and
then did more long-term restructuring to properly couple the FD & SSL object
lifetime later.

------
Thaxll
Move fast and break things.

~~~
kevinchen
With these kinds of Heisenbugs, even the most diligent software tester isn't
likely to catch it before release.

~~~
cbhl
How much fire is normal?

------
carbonr
i feel they are just over complicating.

