Parallel Search Speeds Up Time Travel Debugging by 4x

roca · on June 18, 2022

https://pernos.co does something similar, except that instead of searching for watchpoints/breakpoints we're building a database of everything that happened. That involves replaying different segments of the execution in parallel and also applying different kinds of instrumentation to the same segments, also run in parallel. It's great, we can saturate the biggest boxes AWS has :-).
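As a rough illustration of replaying segments in parallel, here is a minimal Python sketch. It is a toy, not Pernosco's implementation: `step`, `take_snapshots`, `index_segment` and `build_database` are hypothetical names, the "program" is a simple deterministic update function, and threads stand in for the separate replay processes a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def step(state):
    """One deterministic step of a toy 'program' (a linear congruential update)."""
    return (state * 1103515245 + 12345) % 2**31

def take_snapshots(initial, total_steps, segment_len):
    """Run once, saving the state at the start of every segment."""
    snapshots = {}
    state = initial
    for t in range(total_steps):
        if t % segment_len == 0:
            snapshots[t] = state
        state = step(state)
    return snapshots

def index_segment(start, state, segment_len, total_steps):
    """Replay one segment from its snapshot, logging (time, state) for each step."""
    rows = []
    for t in range(start, min(start + segment_len, total_steps)):
        rows.append((t, state))
        state = step(state)
    return rows

def build_database(initial, total_steps, segment_len, workers=4):
    """Replay all segments concurrently and merge the per-segment logs."""
    snapshots = take_snapshots(initial, total_steps, segment_len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(index_segment, start, state, segment_len, total_steps)
            for start, state in sorted(snapshots.items())
        ]
        return dict(row for f in futures for row in f.result())

db = build_database(initial=1, total_steps=1000, segment_len=100)
```

Because each segment starts from its own snapshot and the replay is deterministic, the segments do not depend on each other and can be processed in any order.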

acemarke · on June 18, 2022

Reading the article, this sounds _very_ similar to what we do at Replay ( https://replay.io ), which is time-travel debugging for web apps.

I'm a front-end dev, so I'm not nearly as familiar with the magic happening on the backend, but this looks like it matches what I know about how we manage jumping to specific points in time: many forks of the browser process, run from the beginning up to various pause points.

mark_undoio · on June 18, 2022

That makes sense - the principles seem to be quite general. But, as I understand it, replay.io records languages that use a VM by modifying the VM to help capture the non-determinism.

We started by recording languages without a VM - C/C++ - so we've implemented things at a lower level (which has advantages and disadvantages).

Coincidentally that means our tech could probably be used to record yours - and now we're doing more web stuff we ought to be using yours :-D

jasonlaster11 · on June 19, 2022

Suggested this to Greg a couple of months ago! We sometimes use Pernosco to record the Replay Chromium and Firefox. Would love to see how Undo does in comparison.

Likewise, if you're building more web apps, we'd love to hear how Replay.io works for you. We're on a shared journey to bring TTD to the world.

mark_undoio · on June 19, 2022

Hi there - I actually suggested it to Greg recently too! Would definitely be interesting to do this as we've got the same goals in mind.

I also like the technical blog posts, etc you guys produce.

ajb · on June 17, 2022

Reversible debugging is great. Have spent too much time in my life doing "Finnegan" search ("poor old Finnegan had to begin again...")

w0mbat · on June 18, 2022

I'm glad someone else thought of that and finally implemented it. I interviewed with an embedded tools company in the 90s and independently came up with the idea for this during the interview to try to impress them. It didn't work. I immediately forgot about the idea until now.

My thinking was that if you were simulating an embedded device you could record all state changes, and then be able to step forwards AND backwards (or jump to any earlier point).

mark_undoio · on June 18, 2022

Glad you get to see your idea existing - quite a few engineers of a certain low-level inclination tell us they had imagined (or tried) doing this.

There were academic implementations of this sort of thing around quite a while ago - and VMware's record/replay tech - but these days we have Undo LiveRecorder, rr / Pernosco, Microsoft's WinDbg / TTD, replay.io

The idea's time seems to have come...

dataflow · on June 17, 2022

I don't really understand the premise. So presumably you can't "reverse" A = B because you lost the old value of A, right? OK, but why did you lose that? If you took the time to pause before that instruction and write down B, why didn't you just save the value of A as part of that step too? Is it a speed or space issue?

mark_undoio · on June 17, 2022

As seoaeu says, we only have the values at the snapshot times and need to recompute in between.

We're not actually logging changes to variables, we're only logging outside information that changes the course of the computation. The variables are a result of that computation and not something we store directly.

So there's no explicit saving of A when we set it to B - and hence no opportunity to save the previous value of A either.

The reasons are both speed and space, as you suggest, but also fidelity of recording.

* If we recorded all variable values explicitly it could become very slow to run real world programs due to the extra work being done.
* Recording all changes could also use a lot of storage for even trivial behaviours - e.g. for (i = 0; i < 10000000; i++);
* Even if you log normal variable values you still have to worry about uninitialised memory, stray pointers, use after free - i.e. sources or destinations of assignments that aren't captured clearly in C language semantics. If we want to catch arbitrary bugs we do need to act at a level below normal-case language behaviours.

A side effect of this is that the underlying low level engine can record other languages with a different layer on top to handle language-specific semantics - that's what we do for Java.

seoaeu · on June 17, 2022

The problem is that A might be overwritten many times between snapshots. So you may know one value for A, but you have to simulate execution forward to find out the specific value it had when execution reached that line.
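A minimal Python sketch of that "nearest snapshot, then simulate forward" idea follows. Everything here is an illustrative assumption (the toy `program`, `SNAPSHOT_EVERY`, and the helpers `record`, `run` and `state_at`); a real debugger replays machine code, not a list of lambdas.

```python
import bisect

# Toy program: each entry assigns one variable from the current state.
program = [
    ("B", lambda s: 7),
    ("A", lambda s: 1),
    ("A", lambda s: s["A"] + 10),
    ("A", lambda s: s["B"]),          # the "A = B" step: the old A is overwritten
    ("B", lambda s: s["A"] * 2),
    ("A", lambda s: s["A"] + s["B"]),
]

SNAPSHOT_EVERY = 4  # coarse snapshots: only every 4th step is saved

def record(program):
    """Run the program once, keeping a copy of the state every few steps."""
    snapshots, state = {}, {}
    for t, (var, fn) in enumerate(program):
        if t % SNAPSHOT_EVERY == 0:
            snapshots[t] = dict(state)
        state[var] = fn(state)
    return snapshots

def run(program, state, start, stop):
    """Deterministically re-execute steps start..stop-1 on a copy of state."""
    state = dict(state)
    for var, fn in program[start:stop]:
        state[var] = fn(state)
    return state

def state_at(program, snapshots, t):
    """State just before step t: nearest earlier snapshot, replayed forward."""
    times = sorted(snapshots)
    base = times[bisect.bisect_right(times, t) - 1]
    return run(program, snapshots[base], base, t)

snapshots = record(program)
before_assignment = state_at(program, snapshots, 3)  # state just before "A = B"
```

Here the old value of A is never stored anywhere; `state_at` recovers it by replaying steps 0 to 2 from the snapshot taken at step 0.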

dataflow · on June 17, 2022

Ohh, I was thinking of single-stepping. I didn't realize snapshots are so coarse. Makes sense for memory usage, I see. Thanks!

wiradikusuma · on June 18, 2022

Help me understand time travel debugging. If the code has side effects and/or depends on external dependencies, how does it work? E.g. a cron job that parses a CSV, moves the file to a folder called Done, generates a PDF and emails it.

mark_undoio · on June 18, 2022

For our approach, it's effectively a sort of virtualization.

At recording time we're collecting info about all the program's interactions with the outside world. In replay we prevent it from actually running any operations and instead just reinject what happened last time.

So if you had a program that did some IO and compute then we'd have recorded all the system calls it did. When it's reading CSV data in replay we're feeding in the same data it read before. When it's doing things without outside effects we just say "sure, you've done that" and give it the original return code back.
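A toy Python sketch of that record-then-reinject idea, under loud assumptions: `Recorder`, `Replayer`, `job`, `fake_read` and `fake_send` are hypothetical names, and calls are intercepted at the level of Python functions, whereas the approach described above works at the system call level.

```python
class Recorder:
    """Performs each outside-world call for real and logs its result."""
    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        result = fn(*args)
        self.log.append((name, result))
        return result

class Replayer:
    """Never touches the outside world; hands back the logged results in order."""
    def __init__(self, log):
        self._log = iter(log)

    def call(self, name, fn, *args):
        logged_name, result = next(self._log)
        if logged_name != name:
            raise RuntimeError(f"replay diverged: expected {logged_name}, got {name}")
        return result

def job(io, read_csv, send_email):
    """Toy cron job: read CSV text, total the second column, email a report."""
    text = io.call("read_csv", read_csv)
    total = sum(int(line.split(",")[1]) for line in text.splitlines())
    status = io.call("send_email", send_email, f"total={total}")
    return total, status

sent = []

def fake_read():
    return "a,1\nb,2\n"

def fake_send(body):
    sent.append(body)   # stands in for a real side effect
    return 0            # "return code" of the send

def must_not_run(*args):
    raise AssertionError("side effect attempted during replay")

recorder = Recorder()
first = job(recorder, fake_read, fake_send)                       # recording run
second = job(Replayer(recorder.log), must_not_run, must_not_run)  # replay run
```

The replay run computes the same result as the recording run, but the email is only ever "sent" once, because the replayer returns the logged return code instead of calling out.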

wiradikusuma · on June 19, 2022

That makes sense. Thank you!

WhitneyLand · on June 17, 2022

Wish this were available in Xcode.

So many opportunities for improved debugging on iOS.

mark_undoio · on June 17, 2022

We wish it were too, even if we don't have the resources to do it right now.

More time travel debuggers is good for everyone - it spreads awareness and the techniques are pretty applicable across languages / platforms.

sudo_chmod777 · on June 18, 2022

Is it even possible on macOS? I don't know about Undo but iirc rr needs access to perf counters to achieve deterministic interrupt delivery or something, which macOS doesn't allow. Now I basically have a sidekick Linux machine at work cos my company only offers Macs.

Does Undo use different techniques?

mark_undoio · on June 18, 2022

Yes, though the concepts are similar we use JIT instrumentation of the code to count time, instead of hardware performance counters.
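To illustrate why a software counter is enough to pin down "when" something happened, here is a deliberately tiny Python sketch. It is not how Undo's JIT instrumentation works; `execute`, `ops` and the sign-flipping "signal handler" are hypothetical, and the only point is that an asynchronous event logged against an instruction count can be delivered at the same count in replay.

```python
def execute(ops, signal_due):
    """Run ops, bumping a software counter per op (the 'instrumentation').

    signal_due(count) decides whether an asynchronous signal is delivered
    just before the op at that count. Returns (final value, delivery counts).
    """
    value, count, delivered = 0, 0, []
    for op in ops:
        if signal_due(count):
            delivered.append(count)
            value = -value      # toy signal handler: visibly perturbs the state
        value = op(value)
        count += 1              # the counter is the program's notion of time
    return value, delivered

ops = [lambda v: v + 3, lambda v: v * 2, lambda v: v + 1, lambda v: v * 5]

# Recording: the signal happens to arrive at count 2 (an arbitrary choice here).
recorded_value, log = execute(ops, lambda c: c == 2)

# Replay: deliver signals only at the counts written to the log.
replayed_value, _ = execute(ops, lambda c: c in set(log))
```

Delivering the signal at any other count would give a different final value, which is why the count has to be reproduced exactly.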

It has the advantage that we work fine in virtual machines, even where they don't allow performance counters. For what it's worth, we work fine on Mac Docker Linux containers for x86.

For native macOS support it'd be more of a proper porting effort - we'd need to implement a different set of system calls.

jfk13 · on June 17, 2022

I'm not familiar with undo.io's products; would be curious to hear people's thoughts on how they compare to rr-project.org.

mark_undoio · on June 18, 2022

See roca's post earlier for some information from the original rr dev regarding their pernos.co service.

The principles are very similar - each has some advantages over the other but as an Undoer I have an obvious bias as to which I prefer ;-)

rr uses snapshots and deterministic replay in the same way, though they ensure determinism differently. I don't know if rr can do parallel reverse ops but evidently Pernosco does parallel pre-processing to build its database (which is magic as it then allows very fast queries about program state).

khuey · on June 19, 2022

rr doesn't have an equivalent of the feature you're announcing here. reverse-execution only consumes a single core.