ELF symbol interposition and RTLD_LOCAL

Posted by khuey on 19 July 2022

You may be familiar with "the LD_PRELOAD trick". This "trick" is used to implement things like heaptrack. By interposing a third library between an application and libc's malloc/free you can track the state of the heap and recognize errors like double frees and memory leaks. But this doesn't work for libraries loaded with RTLD_LOCAL, which is the default behavior of dlopen. Why not? Let's look at how this sort of linking works normally first, and then we can figure out why it goes wrong with RTLD_LOCAL.

Dynamic linking on Linux

Unless a program is completely statically linked, it will contain undefined symbols in its symbol table. You can see these by running readelf -s on the program. Even the most trivial programs, such as /bin/true, will have some.

~/dev/scratch$ readelf -s /bin/true

Symbol table '.dynsym' contains 59 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND free@GLIBC_2.2.5 (2)
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND abort@GLIBC_2.2.5 (2)
     3: ....

Note the UND under Ndx for the undefined symbols. This output says that /bin/true expects to find the free@GLIBC_2.2.5 symbol in another library at runtime.

These other libraries come from DT_NEEDED entries in the ELF object's .dynamic section. readelf -d will show you these.

~/dev/scratch$ readelf -d /bin/true

Dynamic section at offset 0x8c98 contains 27 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 ....

While /bin/true only lists a single library here, more complicated binaries can list several. Those libraries can have their own DT_NEEDED entries that point to other libraries. libc.so.6, for instance, itself requires the dynamic linker ld-linux-x86-64.so.2. Running ldd on a binary will show you the full list.

~/dev/scratch$ ldd /bin/true
        linux-vdso.so.1 (0x00007fffe6099000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f65a0af3000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f65a0d1f000)

(The linux-vdso.so.1 library is provided by the kernel and is not important here).

ELF objects specify the symbols they need and they specify any additional shared libraries they need, but they don't actually specify which symbols come from which objects. Instead, the dynamic linker searches all the objects starting with the original executable and then proceeding through the DT_NEEDED entries in turn. The LD_DEBUG environment variable can be used to have the dynamic linker explain what it is doing.

For instance, for that abort@GLIBC_2.2.5 symbol at the top of this post, LD_DEBUG=all outputs the following:

     52530:     symbol=abort;  lookup in file=/bin/true [0]
     52530:     symbol=abort;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
     52530:     binding file /bin/true [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `abort' [GLIBC_2.2.5]

We see that the dynamic linker first looked for it in /bin/true, and then when it was not found there, it looked for it in libc.so.6. The symbol is defined there, so that one was used. Here's another symbol:

     52530:     symbol=_dl_argv;  lookup in file=/bin/true [0]
     52530:     symbol=_dl_argv;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
     52530:     symbol=_dl_argv;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     52530:     binding file /lib/x86_64-linux-gnu/libc.so.6 [0] to /lib64/ld-linux-x86-64.so.2 [0]: normal symbol `_dl_argv' [GLIBC_PRIVATE]

This one was not found in /bin/true, nor was it found in libc.so.6, so the search continued to the third object, ld-linux-x86-64.so.2, where it was finally found.
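The list of loaded objects can also be inspected from inside a running process. The following is a minimal sketch that uses glibc's dl_iterate_phdr to print each loaded object in the order the loader reports it. The function names list_loaded_objects and print_object are made up for this illustration, and I'm assuming (rather than relying on a documented guarantee) that for objects loaded at startup this order mirrors what ldd shows.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdio.h>

/* Callback invoked once per loaded ELF object.
 * `data` points at an int counter owned by the caller. */
static int print_object(struct dl_phdr_info *info, size_t size, void *data) {
    (void)size;
    int *count = data;
    const char *name = info->dlpi_name;
    /* glibc reports the main executable with an empty name. */
    printf("%2d: %s\n", *count, (name && name[0]) ? name : "(main executable)");
    (*count)++;
    return 0; /* returning zero means "keep iterating" */
}

/* Hypothetical helper: prints every loaded object and returns how many
 * objects were visited. */
int list_loaded_objects(void) {
    int count = 0;
    dl_iterate_phdr(print_object, &count);
    assert(count >= 1); /* at least the executable itself is loaded */
    return count;
}
```

Calling list_loaded_objects() early in main gives a quick way to compare what is actually loaded against what ldd predicted.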

LD_PRELOAD

The magic of LD_PRELOAD is that it lets you insert additional libraries near the beginning of this search list. Symbols in the LD_PRELOADed libraries will then be preferred to symbols in the normally loaded libraries. For example:

~/dev/scratch$ LD_PRELOAD=~/dev/obj-rr/lib/rr/librrpreload.so ldd /bin/true
        linux-vdso.so.1 (0x00007ffeab1e9000)
        /home/khuey/dev/obj-rr/lib/rr/librrpreload.so (0x00007f2deb0fc000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2deaedd000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2deaed7000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2deaeb4000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2deb11c000)

I've LD_PRELOADed the librrpreload.so library and you can see it appears in the search list ahead of libc.so.6 and ld-linux-x86-64.so.2 (it also brings along its own new dependencies on libdl.so.2 and libpthread.so.0). If librrpreload.so contained a definition for the abort@GLIBC_2.2.5 symbol, then the dynamic linker would use the version in librrpreload.so and not the version in libc.so.6.

Symbol wrapping

While sometimes replacing a symbol is sufficient, often LD_PRELOAD is used to wrap a symbol (as in heaptrack) with some additional logic. This requires a way for the wrapper function in the preload library to find the "original" symbol that would have been used if the preload library had not been present. The dynamic linker exposes a function dlsym that can do exactly that: it takes a handle argument that can have the special value RTLD_NEXT, which roughly means "find the next location of this symbol after me". So, a malloc-wrapping library can call dlsym(RTLD_NEXT, "malloc") to get the "normal" malloc, do any custom processing it wants, and then forward the call to the "normal" symbol.
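Here is a minimal sketch of that wrapping pattern. It wraps getpid rather than malloc, because a malloc wrapper needs extra care to cope with allocations that may happen while the wrapper is still initializing; the forwarding logic is the same. The counter and the wrapped_getpid_calls accessor are hypothetical names for this illustration, and the lazy initialization is not thread-safe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

static unsigned long getpid_calls; /* number of calls seen by the wrapper */

/* Same name and signature as libc's getpid, so this definition wins when
 * its object comes earlier in the search list than libc. */
pid_t getpid(void) {
    static pid_t (*next_getpid)(void);
    if (!next_getpid) {
        /* Ask for the next definition of "getpid" after the calling object. */
        next_getpid = (pid_t (*)(void))dlsym(RTLD_NEXT, "getpid");
        assert(next_getpid != NULL); /* fails if no later object defines it */
    }
    getpid_calls++;       /* the "custom processing" */
    return next_getpid(); /* forward to the "normal" symbol */
}

/* Hypothetical accessor so callers can observe the wrapper's bookkeeping. */
unsigned long wrapped_getpid_calls(void) {
    return getpid_calls;
}
```

Built as a shared library and named in LD_PRELOAD, this definition would be found before libc's. Older glibc versions also need -ldl at link time for dlsym.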

You might wonder how dlsym implements the "after me" part of that description. After all, there's no "me" parameter to dlsym. dlsym actually looks at the return address on the stack to determine what the calling ELF object is. It can then find that object in the search list and resume the search after that object.
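You can approximate that lookup yourself. The sketch below maps a return address back to the ELF object that contains it using dladdr. It is only an analogy for what dlsym does internally, not the dynamic linker's actual code; calling_object is a hypothetical name, and __builtin_return_address is a GCC/Clang extension.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Returns the path of the ELF object whose code called this function, or
 * NULL if the return address cannot be mapped to a loaded object.
 * noinline keeps a real call frame so the return address is meaningful. */
__attribute__((noinline))
const char *calling_object(void) {
    void *ret = __builtin_return_address(0);
    assert(ret != NULL);
    Dl_info info;
    if (dladdr(ret, &info) == 0)
        return NULL; /* address is not inside any loaded object */
    return info.dli_fname;
}
```

Called from a preload library, this would name that library; called from the application, it would name the executable.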

Dynamic loading

The way dynamic symbol lookup works may seem a bit brittle. Libraries have to be careful not to step on each other's toes by using the same symbol names, or the order in which they are searched needs to be managed to ensure that symbol lookups bind to the right values. But if LD_PRELOAD is not used, the symbol search results are all effectively determined at build time, so bugs in dynamic linking tend to be rare. Dynamic loading with dlopen changes that, though.

dlopen allows for the construction of different library load orders at runtime. dlopen also introduces the concept of "scopes". There are RTLD_GLOBAL and RTLD_LOCAL (the default) options for dlopen. RTLD_GLOBAL adds the loaded library to the normal symbol search list (now called the global scope or scope 0) as if it had been loaded with DT_NEEDED at application startup. RTLD_LOCAL, on the other hand, adds the loaded library to a search list that is specific to the current ELF object (called the local scope or scope 1). RTLD_LOCAL is also transitive, meaning that the DT_NEEDED dependencies of a library opened with RTLD_LOCAL will themselves be loaded as RTLD_LOCAL and not added to the global scope.

This is very poorly documented in the man pages but it is visible when using LD_DEBUG to see what the dynamic linker is doing.
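A small probe makes the difference between the two kinds of lookup visible. This sketch opens a library with RTLD_LOCAL and then asks for a symbol twice: once through the returned handle and once through RTLD_DEFAULT. The probe_result struct and probe_local_load function are hypothetical names for this illustration. I'd expect via_default to be false when nothing else has already put the library in the global scope, but that depends on what the process has loaded, so no particular output is claimed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>

struct probe_result {
    bool opened;      /* dlopen succeeded */
    bool via_handle;  /* symbol found through the returned handle */
    bool via_default; /* symbol found through the default search */
};

/* Hypothetical helper: loads `library` with RTLD_LOCAL and reports which
 * kinds of lookup can see `symbol`. */
struct probe_result probe_local_load(const char *library, const char *symbol) {
    assert(symbol != NULL);
    struct probe_result r = { false, false, false };
    void *handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return r; /* library missing or unloadable */
    r.opened = true;
    r.via_handle = dlsym(handle, symbol) != NULL;
    r.via_default = dlsym(RTLD_DEFAULT, symbol) != NULL;
    dlclose(handle);
    return r;
}
```

For example, probe_local_load("libm.so.6", "cos") on a glibc system should find cos through the handle, assuming that library name exists on the machine.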

How things go wrong

The existence of the local scope is fine for most purposes. Typically, when a library is loaded via dlopen, symbols in it will be looked up by using dlsym with the handle to that specific library, so the search order does not matter. RTLD_LOCAL also ensures that symbols loaded as part of one dynamically loaded library don't interfere with another dynamically loaded library loaded later. And since the global scope is searched before the local scope, symbol lookups during the dlopen behave as one would expect.

Where this does cause problems though is in conjunction with symbol wrapping and LD_PRELOAD. Suppose a symbol in one of the RTLD_LOCAL-loaded libraries is interposed by an LD_PRELOADed library (which, by definition, is in the global scope). The symbol search as part of the dlopen will bind to the preloaded library, because that is at the beginning of the global scope and far earlier than anything in the local scope. When that symbol is actually executed, the preloaded library will use dlsym to try to find the "normal" symbol to forward the call to. But the local scope for the preloaded library is different than the local scope for the binary that called dlopen. None of the RTLD_LOCAL-loaded libraries will be in scope for the preloaded library, and the dlsym(RTLD_NEXT, ...) call will fail, leaving the preloaded library unable to forward the call.

This happened to us in rr issue #3304. librrpreload.so wraps certain libstdc++.so symbols (to disable rdrand in std::random_device). In this issue the primary application was python, which is a C, not C++, program, so it does not load libstdc++.so through DT_NEEDED. The user's Python script loads a Python extension, which is a binary that is dlopened with RTLD_LOCAL. That extension, in turn, does use libstdc++.so, which is transitively loaded with RTLD_LOCAL. The conditions for this failure are now present: when the interposed function is called, execution ends up in librrpreload.so's wrapper, and when it attempts to dlsym(RTLD_NEXT, ...) the symbol, that fails, because libstdc++.so is not present in the global scope.

The solution

dlsym determines which binary's scopes to use the same way it determines the current binary for RTLD_NEXT: by using the return address on the stack. This precludes an actual solution because, to do the correct dlsym lookup, the preload library would need to use the scope of the library that loaded the RTLD_LOCAL libraries. But for RTLD_NEXT we need to start searching from the preloaded library, and since both are determined by the same address, there's no way to do both.

Barring a future dynamic linker API, we've settled for recognizing this situation and printing an error message telling the user to force libstdc++.so into the global scope (and ironically enough, the easiest way to do that is via LD_PRELOAD).
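The general shape of that fallback can be sketched as follows. This is not rr's actual code: the function name next_symbol_or_hint and the message text are invented for illustration. The helper has to live in the preload library itself, because RTLD_NEXT is resolved relative to whichever object calls dlsym.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: looks up the next definition of `symbol`. On failure
 * it prints a hint naming the library the user could force into the global
 * scope, and returns NULL so the caller can decide what to do. */
void *next_symbol_or_hint(const char *symbol, const char *library_hint) {
    assert(symbol != NULL && library_hint != NULL);
    void *addr = dlsym(RTLD_NEXT, symbol);
    if (addr == NULL) {
        fprintf(stderr,
                "cannot find the next definition of `%s`; it may live in a "
                "library loaded with RTLD_LOCAL. Try LD_PRELOAD=%s\n",
                symbol, library_hint);
    }
    return addr;
}
```

A wrapper would call this once per wrapped symbol and refuse to forward the call when it gets NULL back, rather than jumping through a null pointer.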