This sort of capability has always been part of my vision for rr/Pernosco. I want to see more projects like this doing record-and-replay capture of bugs in the field. With more projects depending on this sort of infrastructure, we can get more attention from OS/hardware people to make the record-and-replay systems better, and then I think we can reach a tipping point where it becomes best practice. Eventually we should be looking back and wondering why, as an industry, we didn't go in this direction earlier.
On a more technical implementation question: assuming the post accurately represents the performance characteristics of rr, what do you think is keeping the overhead so high in the average case described in the post (50%)?
Disclaimer: I work for a company that sells similar technology.
-- Virtualization of rr's required PMU features on Azure and GCP
-- Virtualization of Intel's CPUID faulting on AWS and other cloud providers (Linux KVM has it, but it's not enabled in AWS)
-- Fix the reliability bug(s) in the retired-conditional-branches counter on AMD Ryzen so rr works there
-- Implement CPUID faulting in AMD Ryzen
-- A reliable instructions-retired counter on AMD and Intel (for simpler, more robust operation)
-- Trap on LL/SC failures on ARM (AArch64 I guess) and other archs (so we can port rr there), with Linux kernel APIs to expose that to userspace
-- Similar traps for Intel HTM (XBEGIN/XEND) (less important now that it's mostly been disabled for security reasons)
-- Invest in kernel-integrated record-and-replay for Linux and other OSes
-- Implement QuickRec or some other hardware support for multicore record-replay on at least some CPU SKUs
(If anyone thinks they can help with any of that, talk to me!)
> what do you think is keeping the overhead so high in the average case described in the post (50%)?
I assume you've read https://arxiv.org/abs/1705.05937 ? For many workloads the major unavoidable cost is context switching: with rr's approach, every regular tracee context switch (intra- or inter-process) turns into at least two inter-process context switches (tracee to rr, rr back to tracee). This is bad for thread-ping-pong behaviour. For many other workloads the cost is simply the loss of parallelism, since we force everything onto a single core. For still others there is a high cost from system calls that require context switches to rr because we don't currently accelerate them with "syscall buffering". The last cost can mostly be engineered around; the first two are inherent to rr's approach.
Who do you work for? :-)
Can you share a little bit about how some of these features (looking at the first one for example) impact rr's ability to run in cloud environments? Does it work but with performance degradation? Without performance degradation but missing some features? I don't work at quite a low enough level to look at this list and quickly grasp the high level takeaway.
The performance impact of virtualization on rr seems small on AWS, small enough that we haven't tried to measure it.
On AWS rr is mostly feature complete. The only missing feature is CPUID faulting, i.e. the ability to intercept CPUID instructions and fake their results. This means that taking rr recordings from one machine and replaying them on an AWS instance with a different CPU does not work. (The other direction does work.)
(Pernosco uses AWS, but we have a closed-source binary instrumentation extension to rr replay that lifts this restriction.)
As I mentioned above, there's no technical reason AFAIK why AWS could not virtualize CPUID faulting; regular Linux KVM supports this.
In terms of other publicly available systems, UndoDB claims properties similar to rr's and appears to use the same general approach, based on talks I have seen from its creator, but I have no first-hand information on its actual properties.
If only advertising would help more. Unfortunately for all companies in this business, few companies are willing to invest even a tiny amount in developer productivity, let alone properly value it. For instance, do you know of any company that spends more than $20k/year per developer on tools to enhance developer productivity? To go even further, how many companies would even consider spending the "astronomical" sum of $20k/year per developer on tools?

Yet a fully burdened developer runs ~$200k/year, so, by the most basic business analysis of pure cost savings (which undervalues these tools to a mind-boggling degree), $20k/year is a mere 10% cost increase; for some reason it is nonetheless viewed as a ridiculous price. Other sectors easily spend significant percentages of salary cost on tooling: EDA tools for EEs can run $50k/year, trucks for truck drivers amortize out to $30-50k/year, etc. Obviously you need to justify a 10% cost increase with an appropriate return, but the fact that few companies even bother to do the analysis is, in my opinion, the primary limiting factor on adoption of the techniques and technologies we are discussing: nobody values them.
UndoDB's design is similar to rr's, but they use binary instrumentation at record time, so recording overhead is higher than rr's and the implementation is more complex in some ways. On the other hand, they don't depend on performance counters, so they work on more architectures (AMD, ARM) and in any virtual guest.
It's certainly true that it's hard to get people to pay for debugging tools. On the other hand, some big-name companies spend tons of money on internal tooling, so at least some companies are willing to invest in developer productivity in general. I think certain classes of tools have "traditionally" been free, that mindset is hard to change, and the majority of companies remain reluctant to spend here. I don't have great ideas for tackling this other than "keep making tools better and better until the wall cracks".
If you want to talk more off the record, feel free to contact me.
As Keno says in his blog post, the promise of rr is that if you have an rr recording, you are almost completely assured of having enough information to figure out the bug. That is what we see in practice, and it has big implications for developer workflow.
Being able to run backwards from the point of failure and understand where a value is coming from is very powerful.
Having this available in Julia directly is great, and will make it much easier to get bug reports from users.
Am I correct in understanding that rr can be used with any application (i.e. the application doesn't have to be built specifically to support it)? That's the impression the usage introduction on the website gives: https://rr-project.org/.
It's been our (the people who work on Julia) daily debugger since 2015 or so.
> Am I correct in understanding that rr can be used with any application