Reverse Debugging at Scale (fb.com)
77 points by kiyanwang on May 3, 2021 | 19 comments



From what I can tell, they are just using standard instruction trace rather than a full trace, so they can only inspect execution history rather than the full data history that most other time-travel debugging solutions provide. The advantage of relying on the standard hardware instruction trace functionality is that it works even on shared-memory multithreaded applications at "full" speed, unlike most other time-travel debugging solutions. The disadvantages are that it requires higher storage bandwidth, that Intel does not seem to support data trace, and that even if it did, data trace would require significantly more storage bandwidth (something like 10x-100x).
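To make the distinction concrete, here is a toy sketch (nothing to do with Facebook's actual implementation; the program encoding and trace format are invented): with only the static code plus a record of which branches were taken, you can replay the exact instruction path, but you never learn any data values.

    # Toy illustration of instruction-trace-only replay: the static program
    # plus a record of branch outcomes recovers the exact execution path,
    # but no data values. The "program" and trace format are made up.
    PROGRAM = {
        0: ("op",   None),  # straight-line instruction
        1: ("br",   4),     # conditional branch; taken -> jump to 4
        2: ("op",   None),
        3: ("halt", None),
        4: ("op",   None),
        5: ("halt", None),
    }

    def replay(taken_bits):
        """Walk PROGRAM, consuming one recorded bit per conditional branch."""
        pc, path, bits = 0, [], iter(taken_bits)
        while True:
            kind, target = PROGRAM[pc]
            path.append(pc)
            if kind == "halt":
                return path
            pc = (target if next(bits) else pc + 1) if kind == "br" else pc + 1

    print(replay([True]))   # [0, 1, 4, 5]
    print(replay([False]))  # [0, 1, 2, 3]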


Ok, so how is thousands of servers 0.1%? That implies they have millions of servers or one for every 9000 people on earth - are companies this size really that wasteful in terms of servers needed?


The official public answer is "millions of servers" (see, for example, https://engineering.fb.com/2018/07/19/data-infrastructure/lo...).

Keep in mind that this includes Instagram and WhatsApp too, as far as I know. As for "wasteful", well...

    1.88 billion DAU (Q1 earning report)
      / 86400 
      = 21759 "users per second" (note: I made this up)
Multiply that by N where N is the number of frontend and backend queries it takes to service one user, and you have a lot of QPS. Now add in headroom, redundancy, edge POPs and CDN to put servers as close to users as possible, etc.
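To make that concrete with a placeholder fan-out (the real N is not public; 100 below is just made up):

    21759 users/second
      * 100 internal queries per user action (made-up N)
      ≈ 2.2 million internal QPS, before headroom and redundancy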

It's hard to fathom just how big traffic to FAANG services can be, until you see what it takes to serve it. Is there some waste? Sure, probably, but not as much as you'd think.


If they have two million servers, that would mean about 1000 daily active users per server. Assuming the average user makes 2000 requests (API calls, images, videos, etc.) a day mindlessly browsing the infinite feed, that works out to about 20 requests per second per server.

Facebook makes its money from advertisers, so that's likely where most of the compute resources are going - users just see the ads at the end of all that computation. Combined with the mandatory over-provisioning, the overhead of massive distributed systems, tracing, etc., I'm not surprised those are the numbers.

Assuming each server costs an average of $20k, that's $40 billion, which is roughly two quarters' worth of revenue, but amortized over 5+ years. It's really not all that much.
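Spelling that out with the same back-of-envelope numbers from above:

    1.88 billion DAU / 2,000,000 servers ≈ 940 DAU per server
    940 users * 2000 requests / 86400 s  ≈ 22 requests per second per server
    2,000,000 servers * $20k             = $40 billion, or ~$8 billion/year over 5 years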


Sort of surprised to see VS Code and LLDB mentioned. So Java or C++? Rust?


The technology they are describing is largely language-agnostic, since it is just reconstructing the sequence of hardware instructions that executed. So, in principle, you can apply the underlying technique to any language as long as you can determine the source line that corresponds to a hardware instruction at a point in time. Any standard debugger already does that, at least for AOT-compiled languages: it is how the debugger takes the hardware instruction the processor stopped at and tells you which source line you are stopped at. For JIT-compiled or interpreted languages it is slightly more complex, but still a solved problem.
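As a minimal sketch of that lookup (the line-table contents below are invented; real debuggers read something like this out of DWARF or similar debug info):

    import bisect

    # Hypothetical line table extracted from debug info: each entry is
    # (start_address, source_file, source_line), sorted by address.
    LINE_TABLE = [
        (0x1000, "main.cpp", 10),
        (0x1010, "main.cpp", 11),
        (0x1024, "util.cpp", 42),
        (0x1080, "main.cpp", 15),
    ]
    ADDRS = [entry[0] for entry in LINE_TABLE]

    def source_location(pc):
        """Map an instruction address from the trace back to file:line."""
        i = bisect.bisect_right(ADDRS, pc) - 1
        if i < 0:
            return None
        _, path, line = LINE_TABLE[i]
        return f"{path}:{line}"

    # Replaying the instruction trace then becomes a stream of source lines:
    for pc in (0x1004, 0x1024, 0x1088):
        print(hex(pc), "->", source_location(pc))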


It won't work for anything with a JIT or interpreter, not without significantly more work.


Assuming that a Java debugger can convert a breakpoint to its corresponding source line, it must maintain some sort of source<->assembly mapping, one that transforms over time, to do that lookup. As long as you record those changes, namely the introduction or destruction of any branches that Intel PT would record, the same underlying approach should work. The primary complexities there would be making sure those JIT records are ordered correctly with respect to branches in the actual program, and dealing with the JIT deleting the original program text, since that might require actually reversing the execution and JIT history to recover the instructions as they were at the time of recording. This would require adding some instrumentation to the JIT to record branches that were inserted or deleted, but that seems like something that can be implemented as a post-processing step at a relatively minor performance cost, so it seems quite doable. If there are no deletions, you could just use the final JIT state for the source<->assembly mapping. Is there something I am missing, beyond glossing over the potential difficulties of engaging with a giant code base that might not be amenable to changes?
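Concretely, I am imagining something like a time-ordered log of JIT code installs/removals, so a (timestamp, pc) pair from the trace can be resolved against the mapping as it existed at that moment. All names and numbers below are invented for illustration:

    # Hypothetical JIT event log: each entry records when a code region was
    # installed or removed, so lookups can be done "as of" a trace timestamp.
    # (install_time, remove_time_or_None, start, end, method)
    JIT_LOG = [
        (100, 500,  0x7000, 0x7040, "Foo.bar (tier 1)"),
        (500, None, 0x9000, 0x9080, "Foo.bar (tier 2, recompiled)"),
    ]

    def resolve(pc, t):
        """Find which JIT-compiled region covered `pc` at trace timestamp `t`."""
        for installed, removed, start, end, method in JIT_LOG:
            alive = installed <= t and (removed is None or t < removed)
            if alive and start <= pc < end:
                return method
        return None  # interpreter, stub, or unknown code

    print(resolve(0x7010, 200))  # Foo.bar (tier 1)
    print(resolve(0x7010, 600))  # None -- that region was recycled
    print(resolve(0x9010, 600))  # Foo.bar (tier 2, recompiled)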

As for an interpreter, I have not really thought about it too hard. It might be harder than I was originally considering, because I was thinking in the context of a full data trace, which would just let you re-run the program plus interpreter. With just an instruction trace you might need a lot more support from the interpreter. Alternatively, you might be able to do it if the interpreter internals properly separate out the handling of the interpreted instructions, and you could use that to reverse engineer what the interpreted program executed, though that would probably require a fair bit of language/interpreter-specific work. Also, given that an interpreter probably runs something like 10x slower, it would not be so great anyway, since you get so much less execution per unit of storage.


With just an instruction trace you can't figure out which application code the interpreter executed. AFAIK modern Java VMs all use tiered compilation which means there is likely to be some interpreted code sprinkled around even if the majority is JITted code. This is going to mess you up.

As for the JIT, it's not clear to me that modern Java VMs actually maintain a complete machine-code-to-application-bytecode mapping. It would be good for Pernosco if they did, but I think it's more likely they keep around just enough metadata to generate stack traces, and otherwise rely on tier-down with on-stack replacement to handle debugging with breakpoints, at least for the highest JIT tiers.


It is fairly surprising to me that FB would pay a roughly 5% throughput penalty to get this.


In a car, you get the best speed when you press the pedal to the metal and close your eyes. Yet we pay a performance penalty and drive with our eyes open instead.


The car analogy is not applicable - think instead of lowering the speed of an entire high-speed roadway by 5 mph (e.g. via road material/quality) - that has a qualitative difference at that scale.


The point is that slowing down a bit allows us to see what's happening, so we can make course corrections and crash less.

That's an understandable tradeoff for car driving.

It's an understandable tradeoff for Facebook debugging.

Ergo, we do it.


Where does the 5% number come from? I didn't see it in the article or on the linked page for Intel Processor Trace.


Pulled from my experience. How much stuff needs to be logged depends on the structure of the program, and different programs are more or less sensitive to having some of their memory store bandwidth stolen.


Thank you for sharing!


It is not that surprising. Facebook is a complex ecosystem, and therefore experiences lots of outages/problems.

Looking at their status page (https://developers.facebook.com/status/dashboard/), they seem to have problems every week. These are only the public ones!

I guess they figured they are losing more money due to these problems than the additional 5% they have to spend on infrastructure.


I can see why an org would choose to do this, but the number is still frightening. At Google, we were held to a limit of at most 0.01% cost for sampling LBR events. 5% for debuggability just seems really high.


They must have a reason. Probably helps them resolve otherwise costly failures in good time.



