Software developers spend a lot of time debugging. We believe debugging technology is in its infancy and improved debugging tools can significantly increase the productivity of even the best developers. Even state-of-the-art debugging tools like rr only scratch the surface of what's possible. We have demonstrated this by building Pernosco, a debugger which dramatically improves the state of the art: powerful new features that leverage omniscient debugging to make debugging faster and more fun; novel workflow integrations delivering "debugging as a service"; and new implementation techniques to make omniscient debugging practical and cost-effective. Yet Pernosco is not just a research project; it is being used by real developers and improved in response to their feedback.
This series of articles will show what the Pernosco debugger can do. Our main goal for this series is to find developers (and their employers) who will pay to use Pernosco. If that's you, please get in touch. Our secondary goal is to persuade people that even if Pernosco is not (yet) appealing to them, debugging tools matter. The limitations of existing debuggers have convinced many in the software industry that specialized debugging tools are not worth using, or even investing in. This has created a vicious cycle of underinvestment and inadequate tools. We hope to change minds on this issue, inspire developers, and break that cycle. To follow our progress, please stay in touch.
Pernosco uses rr as a component and is currently subject to rr's limitations: debugging of x86-64 Linux applications, with a focus on statically compiled languages with DWARF debuginfo such as C, C++, and Rust. However, Pernosco could be extended to support debugging in other languages, architectures, and operating systems.
One of the greatest challenges facing debuggers is deploying them in a way that's compatible with modern workflows. A lot of debugging happens when CI reports test failures, but it's inconvenient, and often impossible, to deploy a debugger in that context. Likewise it is inconvenient or impossible to apply a debugger when you observe a bug in a mobile application or a microservice running in the cloud. Attaching a debugger to a process that must not stop is usually fatal.
Low-overhead record-and-replay systems such as rr can solve some of these problems. For example, we can record test execution in CI and make recordings of failed tests available for debugging. Debugger integration must be as frictionless as possible to overcome developer inertia. Pernosco can watch a GitHub project for CI test failures, automatically reproduce and record the failures, and annotate each failure report with a hyperlink to a Pernosco session to debug that failure. Details and demos are in a later article.
Existing debugger interfaces typically show the program state at a particular point in time, with some ability to shift that point forwards in time (or backwards, for debuggers with reverse execution). They're designed that way not because it's the ideal way for developers to understand bugs, but because that's what can be easily implemented. However, many debugging tasks benefit from integrating information across multiple moments in time (e.g., visualizing control flow). Furthermore, forward or reverse execution typically suffers from noticeable delays while application code actually runs.
An alternative approach is "omniscient debugging": collect all program states into a database indexed for efficient queries (e.g., containing every memory and register write), and implement a debugging interface using those queries. This eliminates delays during debugging, and with them productivity-destroying context switches. It also enables debugger visualizations that seamlessly integrate information across time.
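To make the idea of "a database indexed for efficient queries" concrete, here is a minimal sketch in Python. It is not how Pernosco stores data; the `WriteIndex` class, its method names, and the use of plain integers for time and addresses are all assumptions made for illustration. It keeps every write to each address in time order, so questions such as "what was the value here at time t?" become a binary search rather than a re-execution.

```python
import bisect
from collections import defaultdict


class WriteIndex:
    """Toy index of memory writes (hypothetical; for illustration only)."""

    def __init__(self):
        self._times = defaultdict(list)   # address -> sorted write times
        self._values = defaultdict(list)  # address -> values, parallel to _times

    def record(self, time, address, value):
        # Assumption: writes are recorded in increasing time order.
        self._times[address].append(time)
        self._values[address].append(value)

    def value_at(self, address, time):
        """Value last written to `address` at or before `time`, else None."""
        i = bisect.bisect_right(self._times.get(address, []), time)
        return self._values[address][i - 1] if i else None

    def last_write_before(self, address, time):
        """Time of the most recent write strictly before `time`, else None."""
        i = bisect.bisect_left(self._times.get(address, []), time)
        return self._times[address][i - 1] if i else None

    def history(self, address):
        """All (time, value) writes to `address`, oldest first."""
        return list(zip(self._times.get(address, []),
                        self._values.get(address, [])))


index = WriteIndex()
index.record(10, 0x1000, 1)
index.record(25, 0x1000, 7)
index.record(40, 0x1000, 3)

print(index.value_at(0x1000, 30))           # 7
print(index.last_write_before(0x1000, 40))  # 25
print(index.history(0x1000))                # [(10, 1), (25, 7), (40, 3)]
```

Because the whole history is available at once, the same structure answers both "jump to a moment" and "show how this value evolved" queries, which is the kind of across-time view a single-moment debugger cannot provide directly.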
The obvious barrier to omniscient debugging is scalability: building and storing that database is very expensive. We have made tremendous technical improvements over previous implementations of omniscient debugging, and can demonstrate cost-effective debugging of complex applications with recorded execution times of many minutes (though not yet hours).
We record application execution with rr and then build an omniscient database of CPU-level state by replaying execution with binary instrumentation. Deferring database construction to the replay phase keeps the initial overhead low while the application is interacting with its environment (e.g., avoiding spurious timeouts). We don't waste much effort if tests don't fail. Even more importantly, it lets us speed up database building by processing different sections of a single execution in parallel.
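The parallel build described above can be sketched roughly as follows. This is a simplified model, not Pernosco's implementation: the recorded execution is represented as a list of `(time, address, value)` write events, whereas a real system would replay each section from a saved checkpoint under instrumentation. The function names are hypothetical, and threads stand in for what would more plausibly be separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(events, segment_count):
    """Split a time-ordered event list into contiguous, roughly equal segments."""
    size = max(1, -(-len(events) // segment_count))  # ceiling division
    return [events[i:i + size] for i in range(0, len(events), size)]


def index_segment(segment):
    """Build a partial index (address -> [(time, value), ...]) for one segment."""
    partial = {}
    for time, address, value in segment:
        partial.setdefault(address, []).append((time, value))
    return partial


def build_index(events, segment_count=4):
    """Index each segment concurrently, then merge the results in time order."""
    segments = split_into_segments(events, segment_count)
    with ThreadPoolExecutor(max_workers=segment_count) as pool:
        # pool.map returns results in segment order, so merging stays time-ordered.
        partials = list(pool.map(index_segment, segments))
    merged = {}
    for partial in partials:
        for address, writes in partial.items():
            merged.setdefault(address, []).extend(writes)
    return merged


# Hypothetical recording: ten writes spread over three addresses.
events = [(t, 0x1000 + (t % 3), t * 2) for t in range(10)]
parallel_index = build_index(events, segment_count=3)
sequential_index = index_segment(events)
print(parallel_index == sequential_index)  # True
```

The key property is that each segment can be indexed independently of the others, and the merge only needs to concatenate per-address histories in segment order. That independence is what deferring the work to replay makes possible: a live, running process offers only one position in time to observe.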
We provide our system as a Web service, and run database builds in the cloud. This allows for much more efficient hardware utilization than doing the work on local developer machines. In many deployment scenarios we can build databases using cloud "spot instances" (i.e., deeply discounted excess capacity).
A less obvious issue with omniscient debugging is the challenge of designing a debugger interface once freed from most of the implementation constraints that "traditional" debuggers are subject to: given we can provide almost any desired query efficiently over all program states, what is the best way to convey that data to developers so they can fix bugs in the shortest amount of time? This is a challenging intellectual problem, because existing implementation-constrained interfaces are unreliable guides and the space of possible new interfaces is very large. For the same reasons it is also very exciting!
This series of articles will describe how we have tackled this problem. We recognize that there are many interesting alternative approaches, many of which will occur to readers of these articles. We expect that more usage data, plus time and money, will let us improve our interface significantly. However, we have guiding principles we're confident in.
We believe that many features in "traditional" debuggers are hacks to get around the limitations of being confined to a single moment in time. Single-stepping, for example, is used when developers are afraid of going too far forward in program execution, or when they want to see how control flow or data values evolve over time. Omniscient debugging enables better solutions to these problems.
We also believe that most "cool features" we can imagine are probably not in the ideal set of features needed to understand most bugs in nearly-minimal time, especially when you consider the costs of a large complex interface with many hard-to-discover features. We have tried to avoid the temptation of "wouldn't it be cool if ...?" Instead, we have started with a minimal set of features that seem obviously essential, and incrementally added features to address problems encountered by real users, trying to choose the simplest and most general solution to each problem. To ease Pernosco adoption, we have implemented some not-ultimately-optimal features to make Pernosco more familiar to users, e.g. gdb integration.