Sort of surprised to see VS Code and LLDB mentioned. So is this for Java or C++? Rust?

Veserv · on May 3, 2021

The technology they are describing is largely language-agnostic, as it just reconstructs the sequence of hardware instructions that executed. So, in principle, you can apply the underlying technique to any language as long as you can determine the source line that corresponds to a hardware instruction at a point in time. Any standard debugger already does this, at least for AOT-compiled languages: it is how a debugger uses the address of the instruction the processor stopped at to tell you which source line you are stopped at. For JIT-compiled or interpreted languages it is slightly more complex, but still a solved problem.
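To make the address-to-line step concrete, here is a minimal sketch of the kind of lookup a debugger performs for AOT-compiled code. The line table, addresses, and file names are invented for illustration; a real debugger would read this information from the binary's debug info rather than a hard-coded list.

```python
import bisect
from typing import List, NamedTuple, Optional, Tuple


class LineEntry(NamedTuple):
    start_addr: int  # first instruction address covered by this entry
    file: str
    line: int


# Hypothetical line table, sorted by start address (made-up values).
LINE_TABLE = [
    LineEntry(0x1000, "main.c", 10),
    LineEntry(0x1008, "main.c", 11),
    LineEntry(0x1014, "main.c", 13),
    LineEntry(0x1020, "util.c", 4),
]
END_ADDR = 0x1030  # one past the last address the table covers
_STARTS = [entry.start_addr for entry in LINE_TABLE]


def addr_to_line(addr: int) -> Optional[LineEntry]:
    """Return the entry whose address range contains addr, or None."""
    if addr < _STARTS[0] or addr >= END_ADDR:
        return None
    # Rightmost entry whose start address is <= addr.
    idx = bisect.bisect_right(_STARTS, addr) - 1
    return LINE_TABLE[idx]


def trace_to_lines(trace: List[int]) -> List[Tuple[str, int]]:
    """Collapse a trace of instruction addresses into the source lines visited."""
    visited: List[Tuple[str, int]] = []
    for addr in trace:
        entry = addr_to_line(addr)
        if entry is None:
            continue  # address outside the code we have line info for
        loc = (entry.file, entry.line)
        if not visited or visited[-1] != loc:
            visited.append(loc)
    return visited
```

Feeding a reconstructed instruction trace through `trace_to_lines` gives the sequence of source lines that ran, which is the language-agnostic part of the idea.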

roca · on May 3, 2021

It won't work for anything with a JIT or interpreter, not without significantly more work.

Veserv · on May 3, 2021

Assuming that a Java debugger can convert a breakpoint to its corresponding source line, it must maintain some sort of source<->assembly mapping that changes over time to do that lookup. As long as you record those changes, namely the introduction or destruction of any branches that Intel PT would record, the same underlying approach should work. The primary complexities there would be making sure those JIT records are ordered correctly with respect to branches in the actual program, and handling the case where the JIT deletes the original program text, as that might require actually reversing the execution and JIT history to recover the instructions as they were at the time of recording. This would require adding some instrumentation to the JIT to record branches that were inserted or deleted, but that seems like something that can be implemented as a post-processing step at a relatively minor performance cost, so it seems quite doable. If there are no deletions, then you could just use the final JIT state for the source<->assembly mapping. Is there something that I am missing beyond glossing over the potential difficulties of engaging with a giant code base that might not be amenable to changes?
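A rough sketch of what I mean by recording the mapping changes: log every JIT install/remove event with a timestamp in the same ordering as the recorded branches, then replay the log to find out what lived at an address when a given branch executed. The event format, method names, and addresses here are all hypothetical, and this only resolves to the method level, not to source lines.

```python
from typing import List, NamedTuple, Optional


class JitEvent(NamedTuple):
    timestamp: int  # position in the same ordering as the recorded branches
    kind: str       # "install" or "remove"
    start: int      # first address of the code region
    end: int        # one past the last address of the code region
    method: str     # hypothetical method identifier


def resolve(events: List[JitEvent], addr: int, timestamp: int) -> Optional[str]:
    """Replay JIT events up to timestamp and return the method at addr, if any."""
    live: Optional[str] = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.timestamp > timestamp:
            break  # later events cannot affect the state at this point
        if ev.start <= addr < ev.end:
            live = ev.method if ev.kind == "install" else None
    return live


# Made-up history: the JIT reuses the same address range for two methods.
EVENTS = [
    JitEvent(1, "install", 0x5000, 0x5040, "Foo.run"),
    JitEvent(5, "remove", 0x5000, 0x5040, "Foo.run"),
    JitEvent(7, "install", 0x5000, 0x5020, "Bar.step"),
]
```

With this history, the same address resolves to different methods depending on when the branch happened, which is why the ordering of JIT records relative to program branches matters so much.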

As for an interpreter, I have not really thought about it too hard. It might be harder than I was originally considering, because I was thinking in the context of a full data trace, which would just let you re-run the program + interpreter. With just an instruction trace you might need a lot more support from the interpreter. Alternatively, you might be able to do it if the interpreter internals properly separate out the handling for each interpreted instruction, so that you could use that to reverse engineer what the interpreted program executed. Though that would probably require a fair bit of language/interpreter-specific work. Also, given that interpreted code probably runs ~10x slower than compiled code, it would probably not be so great, since you get so much less application-level execution per unit of trace storage.

roca · on May 4, 2021

With just an instruction trace you can't figure out which application code the interpreter executed. AFAIK modern Java VMs all use tiered compilation, which means there is likely to be some interpreted code sprinkled around even if the majority is JITted code. This is going to mess you up.
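One way to see the ambiguity is a toy interpreter, sketched below. The opcodes and the two "methods" are invented, and appending the opcode name to a list is only a stand-in for the handler dispatch that a control-flow trace would capture. Two different pieces of application code produce the same dispatch sequence because they differ only in data the trace does not contain.

```python
from typing import List, Tuple

Instr = Tuple[str, int]


def run(program: List[Instr]) -> Tuple[int, List[str]]:
    """Run a toy stack program; return its result and the handler dispatch trace."""
    stack: List[int] = []
    trace: List[str] = []
    for op, arg in program:
        trace.append(op)  # stand-in for the recorded branch into the handler
        if op == "PUSH":
            stack.append(arg)  # operand comes from data, not control flow
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack[-1], trace


# Two hypothetical methods with the same shape but different operands.
method_a = [("PUSH", 2), ("PUSH", 3), ("ADD", 0)]
method_b = [("PUSH", 40), ("PUSH", 2), ("ADD", 0)]

result_a, trace_a = run(method_a)
result_b, trace_b = run(method_b)
```

Here `trace_a` and `trace_b` are identical even though the programs and their results differ, so the trace alone does not say which one ran.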

As for the JIT, it's not clear to me that modern Java VMs actually maintain a complete machine-code-to-application-bytecode mapping. It would be good for Pernosco if they did, but I think it's more likely they keep around just enough metadata to generate stack traces, and otherwise rely on tier-down with on-stack replacement to handle debugging with breakpoints, at least for the highest JIT tiers.