New features in Linux 6.10 contributed by Pernosco

Posted by khuey on 19 September 2024

The Linux 6.10 kernel release contains two new features in the perf event subsystem contributed by Pernosco. These features are intended to benefit rr (and thus Pernosco), but they also have broader applications if adopted by other software. In this blog post I will discuss what we added, why it benefits rr, and the broader applications it could have.

Perf events and FASYNC

rr makes extensive use of perf events in both the recording and replay phases. During recording, perf events are used to establish a timeline of program execution, to generate tracee context switches, and to observe when certain tasks are descheduled. During replay, perf events are used to "fast forward" the tracee to close to a certain point of execution, and then more traditional methods like breakpoints are used to advance to the exact point of interest.

All of these use cases except the first require using the perf event to interrupt the tracee at some point. We do this by combining a number of Linux features. First, we use the PERF_EVENT_IOC_PERIOD ioctl to program the perf event to overflow after a certain number of events. The perf event subsystem marks the fd ready for I/O upon overflow, but rather than polling it, we use the FASYNC mechanism to convert that readiness into a signal. Then, we use F_SETOWN_EX to target that signal at the tracee itself. Finally, rr itself ptraces the tracee. The result is that when the perf event counter overflows, it fires a signal at the tracee. That pauses the tracee and traps to the rr supervisor via ptrace. The rr supervisor can then meddle with the tracee as necessary and absorb the signal without passing it on to the tracee.
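To make that chain concrete, here is a minimal sketch of the fd plumbing. It is not rr's actual code: the helper names are made up for illustration, the choice of PERF_COUNT_HW_BRANCH_INSTRUCTIONS as the counted event is a placeholder assumption, and the ptrace side is left out entirely.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for F_SETSIG, F_SETOWN_EX and struct f_owner_ex */
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* Describe a counter that overflows every `period` events.
 * The event type is a placeholder assumption, not rr's choice. */
struct perf_event_attr make_counter_attr(uint64_t period) {
    struct perf_event_attr attr;
    assert(period > 0); /* a zero period would never overflow */
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    attr.sample_period = period;
    attr.exclude_kernel = 1; /* count user-space execution only */
    attr.exclude_hv = 1;
    return attr;
}

/* Open the counter on one thread, on whichever CPU it runs.
 * glibc has no perf_event_open wrapper, so use syscall(2). */
int open_counter(pid_t tid, struct perf_event_attr *attr) {
    return (int)syscall(SYS_perf_event_open, attr, tid, -1, -1, 0UL);
}

/* Reprogram the overflow period of an already-open counter. */
int set_overflow_period(int fd, uint64_t period) {
    return ioctl(fd, PERF_EVENT_IOC_PERIOD, &period);
}

/* Convert "fd became ready" into signal `signo` sent to thread `tid`. */
int route_ready_signal(int fd, pid_t tid, int signo) {
    struct f_owner_ex owner = { .type = F_OWNER_TID, .pid = tid };
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) return -1;
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) < 0) return -1; /* FASYNC */
    if (fcntl(fd, F_SETSIG, signo) < 0) return -1;
    if (fcntl(fd, F_SETOWN_EX, &owner) < 0) return -1;
    return 0;
}
```

A supervisor would call make_counter_attr and open_counter for the tracee's tid, then route_ready_signal on the returned fd. Because the signal is aimed at a ptraced thread, the supervisor sees it as a ptrace stop and can decide whether to forward it.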

Suppressing I/O signals from perf events with BPF

PERF_EVENT_IOC_SET_BPF allows BPF programs to be attached to perf events. Those BPF programs can suppress perf samples by returning 0. Until 6.10, a BPF program could not suppress the I/O signal. We added that capability.

For rr, this speeds up the replay of asynchronous events. To deliver a SIGALRM or a thread context switch, rr has to replay the tracee to a precise point before injecting the event. While we can quickly replay to a specific instruction by simply setting a breakpoint on it, finding the precise execution of that instruction can be cumbersome if it is executed in a tight loop. We could end up checking the state of the tracee millions of times looking for the right iteration of the loop. Each check requires a context switch from the tracee to rr, and multiple ptrace syscalls to retrieve the tracee's register state.

With the new feature we contributed to 6.10, it's instead possible to filter the breakpoint hits in the kernel without ever trapping to rr or using ptrace. We can install a hardware breakpoint via the perf events subsystem and attach a BPF program to it that checks for matching register values and suppresses signals for those iterations that are not of interest. In one pathological trace provided by a customer, which involves frequent context switches interrupting a very tight loop, this reduces the overhead of rr replay by 94%.
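As a rough illustration of the user-space half of this, the sketch below fills in a perf_event_attr for a hardware execute breakpoint and attaches an already-loaded BPF program to it. The helper names and the reg_snapshot struct are hypothetical, and filter_decision only models in plain C the decision a real BPF program would make from the registers in its context argument: return nonzero to let the event (and its signal) through, 0 to suppress it. Which registers are actually compared is an assumption here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>

/* Describe a hardware execute breakpoint at `addr` as a perf event. */
struct perf_event_attr make_exec_breakpoint_attr(uint64_t addr) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_BREAKPOINT;
    attr.size = sizeof(attr);
    attr.bp_type = HW_BREAKPOINT_X;
    attr.bp_addr = addr;
    attr.bp_len = sizeof(long); /* the length expected for execute breakpoints */
    attr.sample_period = 1;     /* treat every hit as an overflow */
    attr.exclude_kernel = 1;
    return attr;
}

/* Attach a loaded BPF program (by fd) to an open perf event fd. */
int attach_bpf_filter(int perf_fd, int bpf_prog_fd) {
    return ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd);
}

/* Hypothetical subset of register state that identifies the one
 * loop iteration of interest. */
struct reg_snapshot {
    uint64_t sp;
    uint64_t counter;
};

/* Plain-C model of the in-kernel filter: 1 means "deliver this hit",
 * 0 means "suppress it and keep running". */
int filter_decision(const struct reg_snapshot *seen,
                    const struct reg_snapshot *wanted) {
    assert(seen && wanted);
    return seen->sp == wanted->sp && seen->counter == wanted->counter;
}
```

The point of the design is that the comparison runs at the breakpoint hit itself, so the millions of non-matching iterations never cause a context switch to the supervisor.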

Generating I/O signals with a perf event's wakeup_watermark

Perf events also have the ability to trigger wakeups when a "watermark" is passed in the data buffer attached to the event. Until 6.10, those watermarks did not trigger I/O signals. We added that capability too.

For rr, this allows us to observe when a tracee is switched out without requiring elevated privileges. Previously rr used the PERF_COUNT_SW_CONTEXT_SWITCHES perf event, but that requires either elevated privileges or a perf_event_paranoid setting low enough that untrusted software can observe kernel perf events. Starting in 6.10, rr can instead simply record a dummy event and configure perf events to emit PERF_RECORD_SWITCH records into the data buffer. Then wakeup_watermark is set to 1 (byte), triggering a wakeup when any data is written to the buffer. Finally, the wakeup is converted into a signal targeted at the tracee and observed by the rr supervisor via ptrace as above.
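A sketch of what such a configuration might look like follows. The function names are hypothetical and this is an illustration of the fields involved rather than rr's exact setup; in particular, the exclude_kernel setting and the buffer size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* Describe a dummy software event whose only job is to carry
 * PERF_RECORD_SWITCH records and wake up once `watermark_bytes`
 * of data have been written. */
struct perf_event_attr make_switch_notify_attr(uint32_t watermark_bytes) {
    struct perf_event_attr attr;
    assert(watermark_bytes > 0);
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_DUMMY;       /* counts nothing itself */
    attr.context_switch = 1;                 /* emit PERF_RECORD_SWITCH */
    attr.watermark = 1;                      /* select wakeup_watermark */
    attr.wakeup_watermark = watermark_bytes; /* 1 = wake on any data */
    attr.exclude_kernel = 1;                 /* assumed, to stay unprivileged */
    return attr;
}

/* Map the event's ring buffer: one metadata page plus `data_pages`
 * data pages (which must be a power of two). Returns MAP_FAILED on error. */
void *map_ring_buffer(int perf_fd, size_t data_pages) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return mmap(NULL, (1 + data_pages) * page, PROT_READ | PROT_WRITE,
                MAP_SHARED, perf_fd, 0);
}
```

Records only land somewhere if the fd has a mapped buffer, so the mmap step is required even when nobody intends to parse the records. The fd then gets the same FASYNC and F_SETOWN_EX treatment as the overflow case described earlier.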

This allows rr to work with perf_event_paranoid == 2 starting with 6.10, where rr previously required perf_event_paranoid == 1.

Broader applications for these features

Both of these features have potential applications beyond rr.

Hardware breakpoints with BPF-accelerated conditions could be used by any debugger for faster conditional breakpoints. The idea that evaluating breakpoint conditions is performance critical is not new. gdb has long been able to convert conditions to agent expressions and ship them to a gdbserver to avoid unnecessary socket or network round trips. Now this same concept can be extended to avoiding unnecessary context switches to the debugger.

Combining wakeup_watermark's new ability to generate I/O signals with a suitable "overrun" area, and targeting the wakeup signal at the program being traced, allows software to be interrupted before it overflows the data buffers of its perf events. This can be used for more accurate data collection (e.g. via Intel Processor Trace) for workloads where interrupting the software is more tolerable than data loss.
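One way to picture the "overrun" area is as headroom subtracted from the buffer size when choosing the watermark. The helper below is a hypothetical sketch of that arithmetic only; how much headroom is enough depends on how quickly the traced program stops once the signal is raised, which is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick a wakeup_watermark that leaves `headroom_bytes` of the data
 * buffer unused when the wakeup fires. Returns 0 if the requested
 * headroom does not fit in the buffer. */
uint32_t watermark_with_headroom(size_t data_bytes, size_t headroom_bytes) {
    if (headroom_bytes >= data_bytes) {
        return 0; /* no usable watermark */
    }
    size_t mark = data_bytes - headroom_bytes;
    assert(mark <= UINT32_MAX); /* wakeup_watermark is a 32-bit field */
    return (uint32_t)mark;
}
```

For example, a 64 KiB data buffer with 4 KiB reserved as overrun space gives a watermark of 61440 bytes, which would then be stored in wakeup_watermark with the watermark bit set.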