
Time Traveling Linux Bug Reporting: Coming in Julia 1.5 - KenoFischer
https://julialang.org/blog/2020/05/rr/
======
roca
This is very cool stuff. Keno is a great rr contributor, and I'm grateful for
his Pernosco shout-out :-).

This sort of capability has always been part of my vision for rr/Pernosco. I
want to see more projects like this doing record-and-replay capture of bugs in
the field. With more projects depending on this sort of infrastructure, we can
get more attention from OS/hardware people to make the record-and-replay
systems better, and then I think we can reach a tipping point where it becomes
best practice. Eventually we should be looking back and wondering why, as an
industry, we didn't go in this direction earlier.

~~~
Veserv
If you don't mind me asking, you mention that you want more attention from
OS/hardware people. What sort of things are you looking for from OS/hardware
people that would make replay systems better?

As for a more technical question on implementation: assuming the post
accurately represents the performance characteristics of rr, what do you think
is keeping the overhead so high in the average case described in the post
(50%)?

Disclaimer: I work for a company that sells similar technology.

~~~
roca
Some small-ish items:

-- Virtualization of rr's required PMU features on Azure and GCP

-- Virtualization of Intel's CPUID faulting on AWS and other cloud providers
(Linux KVM has it, but it's not enabled in AWS)

-- Fix the reliability bug(s) in the retired-conditional-branches counter on
AMD Ryzen so rr works there

-- Implement CPUID faulting in AMD Ryzen

-- A reliable instructions-retired counter on AMD/Intel (for more
reliable/simpler operation)

-- Trap on LL/SC failures on ARM (AArch64, I guess) and other archs (so we can
port rr there), with Linux kernel APIs to expose that to userspace

-- Similar traps for Intel HTM (XBEGIN/XEND) (less important now that it's
mostly been disabled for security reasons)

Bigger items:

-- Invest in kernel-integrated record-and-replay for Linux and other OSes

-- Implement QuickRec or some other hardware support for multicore
record-and-replay on at least some CPU SKUs

(If anyone thinks they can help with any of that, talk to me!)

> what do you think is keeping the overhead so high in the average case
> described in the post (50%)?

I assume you've read
[https://arxiv.org/abs/1705.05937](https://arxiv.org/abs/1705.05937)? For
many workloads the major unavoidable cost is context switching. With rr's
approach, every regular tracee context switch (intra- or inter-process) turns
into at least 2 inter-process context switches (tracee to rr, rr to tracee).
This is bad for thread ping-pong-like behaviour. For many other workloads the
cost is simply the loss of parallelism as we force everything onto a single
core. For other workloads there is a high cost due to system calls that
require context switches to rr, because we don't currently accelerate them
with "syscall buffering". The last one can mostly be engineered around; the
first two are inherent to rr's approach.
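The syscall-buffering cost is easy to observe directly, since rr's recorder
has a switch to turn the buffer off so that every syscall traps out to the
supervisor. A rough sketch of a comparison (`./myworkload` is a hypothetical
syscall-heavy binary, not anything from the thread):

```
$ time rr record ./myworkload      # default: syscall buffering on
$ time rr record -n ./myworkload   # -n / --no-syscall-buffer: every syscall
                                   # context-switches to rr, as described above
```

The gap between the two timings gives a feel for how much of the overhead on a
given workload comes from syscall handling rather than the inherent costs.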

Who do you work for? :-)

~~~
alexeldeib
This is my first time hearing about rr, and I'm very curious. Your comment
below about increasing generality/performance with bpf is very interesting.

Can you share a little bit about how some of these features (looking at the
first one for example) impact rr's ability to run in cloud environments? Does
it work but with performance degradation? Without performance degradation but
missing some features? I don't work at quite a low enough level to look at
this list and quickly grasp the high level takeaway.

~~~
roca
rr works today on cloud instances with hardware performance counters
available, on Intel CPUs. For example it works on a variety of AWS instances
that are large enough that your VM occupies a whole CPU socket. c5(d).9xlarge
or bigger works well.

The performance impact of virtualization on rr seems small on AWS, small
enough that we haven't tried to measure it.

On AWS rr is mostly feature complete. The only missing feature is CPUID
faulting, i.e. the ability to intercept CPUID instructions and fake their
results. This means that taking rr recordings from one machine and replaying
them on an AWS instance with a different CPU does not work. (The other
direction does work.)

(Pernosco uses AWS, but we have a closed-source binary instrumentation
extension to rr replay that lifts this restriction.)

As I mentioned above, there's no technical reason AFAIK why AWS could not
virtualize CPUID faulting; regular Linux KVM supports this.

------
KenoFischer
Hi all, I'm pretty excited to get this out there for you. I published this
this morning so I could be around through the day to answer any questions,
but things were delayed a bit by the HN algorithm gods (thanks dang for
rescuing it ;) ). That said, I'll check in periodically for the next hour or
two if there are any questions I can answer.

------
wallnuss
I recently spent two weeks, on and off, hunting down a bug on a platform that
didn't support `rr`. I am fairly confident that if I had had `rr` available
it would have taken me a couple of hours at most.

Being able to run backwards from the point of failure and understand where a
value is coming from is very powerful.
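For anyone who hasn't seen this workflow, it looks roughly like the following
under rr's gdb interface (`myprog` and `corrupted_var` are stand-in names):

```
$ rr record ./myprog          # record the failing run once
$ rr replay                   # re-run the exact same execution under gdb
(rr) continue                 # run forward until the failure
(rr) watch -l corrupted_var   # hardware watchpoint on the bad value
(rr) reverse-continue         # execute backwards to the write that set it
```

Because the replay is deterministic, the watchpoint lands on the actual write
from the recorded run, not a rerun that may behave differently.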

Having this available in Julia directly is great, and it will make it much
easier to get bug reports from users.

------
tgflynn
I'm surprised I haven't heard of rr before; it sounds like it could be a game
changer for debugging many types of problems. How long has this project
existed/been usable?

Am I correct in understanding that rr can be used with any application (i.e.,
the application doesn't have to be built specifically to support it)? That's
the impression the usage introduction on the website gives:
[https://rr-project.org/](https://rr-project.org/).

~~~
KenoFischer
> How long has this project existed/been usable?

It's been our (the people who work on Julia) daily debugger since 2015 or so.

> Am I correct in understanding that rr can be used with any application

Yes
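
For example, recording an ordinary, unmodified binary is just (sketch; any
regular executable works the same way):

```
$ rr record ls -l     # any unmodified binary, no recompilation needed
$ rr replay           # opens the latest recording under gdb
```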

------
boromi
Is this coming to Windows or Mac?

