Making Debuggers Sad: C++ Identifier Canonicalization

Posted by roc on 24 November 2021

Why do debuggers like gdb take so long to start up on large programs? There are many reasons, but one surprising reason is that gdb spends significant amounts of time parsing C++ identifiers and re-emitting them into a canonical form. This is due to deficiencies in clang++ and g++ (and, arguably, DWARF) — but not everyone agrees. The underlying reasons also apply to Pernosco so we have implemented something similar, although we're able to hide the startup impact more effectively by folding it into our "build the big database of everything" step.

Suppose a debugger user wants to evaluate the expression Foo<short>::FOO, where Foo is declared with

template <typename T> struct Foo {
    enum Enum { FOO = 1 };
};

The DWARF debuginfo for this program contains a DW_TAG_structure_type for Foo<short>, which contains debuginfo for the Enum enum and its values. We'll have to search for this DW_TAG_structure_type by name. Unfortunately, the DW_AT_name in the debuginfo produced by gcc 9 is not Foo<short> — it's Foo<short int> — so we may not find the type with a naive search 😞.

The basic problem here is that there are many valid ways of writing the same template parameters, so the user might pick a different way than the compiler emitted. This applies not just to template parameters that are types, but also values, e.g. given template <unsigned long V> struct Bar { ... } the compiler might emit a type with name Bar<1UL> (as clang++ 12 does), while the user enters Bar<1>.
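To make the "many valid spellings" point concrete, here is a small sketch that asks the compiler itself whether different spellings name the same type. The definitions of Foo and Bar follow the ones above; the empty body of Bar and the extra spellings (signed short int, 0x1) are my own additions for illustration.

```cpp
#include <type_traits>

template <typename T> struct Foo {
    enum Enum { FOO = 1 };
};

// Body left empty here; the text above elides it.
template <unsigned long V> struct Bar {};

// Every spelling of the type argument names the same specialization.
static_assert(std::is_same<Foo<short>, Foo<short int>>::value,
              "short and short int are the same type");
static_assert(std::is_same<Foo<short>, Foo<signed short int>>::value,
              "signed short int is the same type too");

// The same holds for value arguments: the suffix and the base do not matter.
static_assert(std::is_same<Bar<1>, Bar<1UL>>::value,
              "1 and 1UL select the same specialization");
static_assert(std::is_same<Bar<1>, Bar<0x1>>::value,
              "0x1 selects it as well");
```

The compiler treats all of these as identical, but a debugger comparing DW_AT_name strings byte for byte sees four different names.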

The only general solution here is for the debugger to parse all the C++ type names that contain template parameters and store them in a canonical form. When a user enters a type name, it is canonicalized using the same algorithm, so a match will be found if one exists. E.g. in the above examples the debugger could canonicalize the DWARF names Foo<short int> and Bar<1UL> to Foo<short> and Bar<1> respectively and use the latter for lookup. This requires parsing C++ type syntax, which is nasty, but the debugger already needs to do this to handle various forms of user input, so it's not a new problem. Potentially parsing many gigabytes of C++ symbols does subject the parser to increased performance stress, however.
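As a rough illustration of the idea, here is a minimal sketch of such a canonicalizer. It is not what gdb or Pernosco actually do: the function names (tokenize, canonical_int, canonicalize) are hypothetical, the choice of canonical spellings is my own assumption, and it is a token-level rewrite rather than a real C++ type parser. It only normalizes whitespace, runs of integer keywords, and integer-literal suffixes, and it does not handle things like 0x1 versus 1.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

namespace {

bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Split a name into word tokens (identifiers, keywords, numbers) and
// single-character punctuation tokens, dropping all whitespace.
std::vector<std::string> tokenize(const std::string& name) {
    std::vector<std::string> tokens;
    for (std::size_t i = 0; i < name.size();) {
        if (std::isspace(static_cast<unsigned char>(name[i]))) {
            ++i;
        } else if (is_word_char(name[i])) {
            std::size_t start = i;
            while (i < name.size() && is_word_char(name[i])) ++i;
            tokens.push_back(name.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, name[i++]));
        }
    }
    return tokens;
}

bool is_int_keyword(const std::string& t) {
    return t == "signed" || t == "unsigned" || t == "short" ||
           t == "long" || t == "int" || t == "char";
}

// Pick one spelling for a run of integer keywords, e.g.
// {"long", "unsigned", "int"} becomes "unsigned long".
std::string canonical_int(const std::vector<std::string>& run) {
    assert(!run.empty());
    bool is_signed = false, is_unsigned = false, is_short = false, is_char = false;
    int longs = 0;
    for (const std::string& t : run) {
        if (t == "signed") is_signed = true;
        else if (t == "unsigned") is_unsigned = true;
        else if (t == "short") is_short = true;
        else if (t == "char") is_char = true;
        else if (t == "long") ++longs;
    }
    if (is_char) {
        // char, signed char and unsigned char are three distinct types.
        if (is_unsigned) return "unsigned char";
        if (is_signed) return "signed char";
        return "char";
    }
    std::string out = is_unsigned ? "unsigned " : "";
    if (is_short) out += "short";
    else if (longs >= 2) out += "long long";
    else if (longs == 1) out += "long";
    else out += "int";
    return out;
}

}  // namespace

// Rewrite a type name into the (assumed) canonical spelling.
std::string canonicalize(const std::string& name) {
    const std::vector<std::string> tokens = tokenize(name);
    const std::string suffix_chars = "uUlL";
    std::vector<std::string> out;
    for (std::size_t i = 0; i < tokens.size();) {
        if (is_int_keyword(tokens[i])) {
            std::vector<std::string> run;
            while (i < tokens.size() && is_int_keyword(tokens[i])) run.push_back(tokens[i++]);
            out.push_back(canonical_int(run));
        } else if (std::isdigit(static_cast<unsigned char>(tokens[i][0]))) {
            // Drop integer-literal suffixes: 1UL becomes 1.
            std::string literal = tokens[i++];
            while (literal.size() > 1 &&
                   suffix_chars.find(literal.back()) != std::string::npos)
                literal.pop_back();
            out.push_back(literal);
        } else {
            out.push_back(tokens[i++]);
        }
    }
    std::string result;
    for (const std::string& t : out) {
        // Only insert a space where gluing two tokens would merge two words.
        if (!result.empty() && is_word_char(result.back()) && is_word_char(t.front()))
            result += ' ';
        result += t;
    }
    return result;
}
```

With this sketch, canonicalize("Foo<short int>") and canonicalize("Foo<short>") both produce Foo<short>, and canonicalize("Bar<1UL>") produces Bar<1>, so an index keyed on the canonical string would find the DWARF entry whichever spelling the user types. Dropping the literal suffix is only safe because the parameter's type is fixed by the template declaration; a real implementation has many more cases to worry about.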

There are situations where it gets very difficult or impossible to correctly parse C++ type syntax outside the context of a compilation unit, but let's studiously ignore that.

Interaction with demangling

C++ entities with "linkage", i.e. functions and variables, are assigned mangled names in their binaries. Debuggers demangle these into fully-qualified human-readable names. We take advantage of this in Pernosco by ensuring that the demangler always produces names in our canonical form (via options passed to cpp_demangle). This greatly reduces the number of C++ identifiers we would otherwise have to parse and canonicalize.
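For readers who haven't played with demangling, here is a small sketch of why it helps: a mangled name encodes each builtin type as a fixed code, so the demangler decides the spelling, not whoever wrote the source. Note this uses abi::__cxa_demangle from <cxxabi.h> (available with GCC and Clang on Itanium-ABI platforms), not the cpp_demangle crate Pernosco uses, and unlike cpp_demangle it offers no options for choosing the output style. The demangle wrapper and the member function f in the example symbol are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Demangle an Itanium-ABI symbol, returning the input unchanged on failure.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle returns malloc'd memory
    // that the caller must release with free().
    std::unique_ptr<char, void (*)(void*)> text(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    // Success (status 0) and a non-null result always go together.
    assert((status == 0) == (text != nullptr));
    return text ? std::string(text.get()) : mangled;
}
```

For example, _ZN3FooIsE1fEv is the mangled name of a member function f() of Foo<short>, where the s stands for short. With the GCC and LLVM runtimes I'd expect demangle("_ZN3FooIsE1fEv") to give Foo<short>::f(), however the type argument was spelled in the source.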

BTW you would hope that GNU's c++filt demangler at least produces names that are consistent with the names gcc emits into debuginfo, but it does not. Likewise llvm-cxxfilt produces names inconsistent with clang++.

Ideal solution

Ideally the text serialization of C++ names would be standardized, gcc and clang++ would produce standard names in their debuginfo, their demanglers would also emit standard names (at least when the right options are set), and debuggers could detect that this has been done and avoid a lot of work. That isn't likely to ever happen; as far as I can tell, compiler maintainers don't think that the current situation is a problem.

A slightly less ideal approach that would still be a big improvement is the same thing I suggested for structured identifiers: making debuginfo include mangled names for all C++ types. This would let us rely on demangling instead of having to parse C++ type syntax in debuginfo names. But my guess is this won't happen either.

So it looks like we'll just have to get good at parsing gigabytes of C++ type syntax really fast.