Shared libs, rpath and the runtime linker

Sun 21 April 2024

Linux

If you are like me, maybe you understand that binaries can have statically or dynamically linked dependencies (shared libs) with each strategy having its own set of pros and cons. Maybe you also know that when a binary has shared libs, you can use ldd to learn more about which shared libraries that binary is using, and where they specifically are located in the file system.

This is the story of how, while troubleshooting why a binary refused to run, I ended up learning a ton more about shared libraries.

The problem

I was working on compiling the code for a project. The idea was to check if we could use our own compiling toolchain instead of CMake, which was the one commonly used for that project until then. This would bring a bunch of benefits because of integrations with lots of other tools.

We made good progress and managed to have something that successfully compiled, but the problem was that when we tried to run the binary, we were getting an error like this:

➜  ./run_server
./run_server: error while loading shared libraries: liblz4.so.1: cannot open shared object file: No such file or directory

Naturally, this being a problem with a shared library that couldn't be found, I expected ldd would indeed show that there was a library missing:

➜  ldd ./run_server
        linux-vdso.so.1 (0x00007ffd9dfa5000)
        liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fa0eb31a000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fa0eb300000)
...

Note: ... means there is (irrelevant) output omitted

Instead, it turns out that some of them were not found, but liblz4.so.1 specifically (the one the error was complaining about) was found! I found that very surprising. How come ldd finds it, but when we run the binary, it does not? Remember this, because we will come back to it later!

My initial thought was: okay, maybe the shared lib location is a symlink pointing to nowhere, or maybe it had the wrong permissions. The error very clearly says "No such file or directory" but I guess it doesn't hurt to check:

➜  ls -la /lib64/libz.so.1
lrwxrwxrwx 1 root root 14 Oct 30 02:16 /lib64/libz.so.1 -> libz.so.1.2.11
➜  ls -la /lib64/libz.so.1.2.11
-rwxr-xr-x 1 root root 102672 Oct 30 02:16 /lib64/libz.so.1.2.11

It's all good. Others should have permission to access the dir and read the library, which makes sense. Probably something much bigger would be broken if that wasn't the case.

Since it seems /lib64 does indeed contain that lib, I thought I could use LD_LIBRARY_PATH to force the binary to check in /lib64. I was pretty confident this wouldn't make any difference because that's a pretty standard directory for shared libs, and because ldd said the lib could be found there (which meant that it was already checking that path).

➜  LD_LIBRARY_PATH=/lib64/ ./run_server
./run_server: error while loading shared libraries: libthird-party_zlib_z.so: cannot open shared object file: No such file or directory

Surprisingly, it now complained about a different lib (libthird-party_zlib_z.so), which means /lib64 was not being searched before after all. (Remember this, too!)

Still, as we can see, there are more libraries not found anyway, so that didn't solve much.

I was running out of ideas, so I turned to the internet. I started with some Google searches and eventually went to my favorite LLM for this sort of question. "Maybe it'll help?", I thought.

It required some prompt engineering since the initial suggestions were way too basic, but after a while, it recommended I check if the binary had something called rpath set. It suggested a command to check that.

➜  readelf -d ./run_server | head -n 10

Dynamic section at offset 0xbf93cf8 contains 67 entries:
  Tag        Type                         Name/Value
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/./__libs__]
 0x0000000000000001 (NEEDED)             Shared library: [liblz4.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
...

Oh, so it is set! What is this? A linker flag?

Let's check the man pages:

➜  man ld
...
       -rpath=dir
           Add a directory to the runtime library search path.  This is used when linking an ELF executable with shared objects.  All -rpath arguments are concatenated and passed to the runtime linker, which uses them to locate shared objects at
           runtime.
...
           The tokens $ORIGIN and $LIB can appear in these search directories.  They will be replaced by the full path to the directory containing the program or shared object in the case of $ORIGIN and either lib - for 32-bit binaries - or lib64
           - for 64-bit binaries - in the case of $LIB.
...

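It turns out that when a binary has an rpath set, the runtime linker searches that path in addition to the usual ones (LD_LIBRARY_PATH, the ldconfig cache, the default directories). With glibc, an RPATH entry like this one is even consulted before LD_LIBRARY_PATH. The question is: why did my binary have an rpath? And why couldn't the linker find the lib in that directory?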
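A rough way to picture that search: the runtime linker walks an ordered list of directories and takes the first one that contains the file it needs. The sketch below imitates only that first-match behaviour. find_lib is a name I made up, and it ignores things a real linker does, such as consulting the ldconfig cache or checking that the file is a compatible ELF object.

```shell
# find_lib NAME DIR...
# Prints the first DIR/NAME that exists, imitating a first-match search.
# Hypothetical helper for illustration; not part of any real tool.
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    printf '%s: not found\n' "$name" >&2
    return 1
}
```

Calling it as find_lib liblz4.so.1 "$PWD/__libs__" /lib64 puts the rpath directory first, so a copy in __libs__ would win over the system one.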

I could only guess, but after spending some time learning about rpath and our toolchain, I concluded that all of this had to do with ensuring isolation between what I was building and whatever shared libs were installed in my system.

Instead of simply running my binary against the shared libs installed in the system, the compile toolchain knows exactly which versions of those shared libs to use (since they are marked as dependencies), so it drops them in a folder, and sets rpath to that folder to ensure those are used (instead of the system's).

The value of rpath in my binary was $ORIGIN/./__libs__. The key thing to understand here is $ORIGIN, which expands to the directory containing the binary, so the path is relative to wherever the binary lives.

It turns out I broke all of this because I copied the binary from the location where the toolchain created the binary to somewhere in /tmp, completely unaware of the fact that the binary was (silently, I must say) referencing libraries with a relative path.

Once that was understood, the possible solutions were straightforward: Symlink the file instead of copying it, or simply use it from that original path.
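To see why the copy broke, it helps to expand $ORIGIN by hand for both locations. This is a minimal sketch under a few assumptions: expand_rpath is a hypothetical helper, it only handles the $ORIGIN token (not $LIB), it does not resolve symlinks, and it assumes the directory name contains no | or & characters.

```shell
# expand_rpath BINARY RPATH
# Prints each colon-separated rpath entry with $ORIGIN replaced by the
# directory containing BINARY, followed by "exists" or "missing".
expand_rpath() {
    origin=$(cd "$(dirname "$1")" && pwd) || return 1
    # Split the rpath into one entry per line.
    printf '%s\n' "$2" | tr ':' '\n' | while IFS= read -r entry; do
        # Handle both the ${ORIGIN} and $ORIGIN spellings.
        dir=$(printf '%s\n' "$entry" |
            sed -e "s|\${ORIGIN}|$origin|g" -e "s|\$ORIGIN|$origin|g")
        if [ -d "$dir" ]; then
            printf '%s exists\n' "$dir"
        else
            printf '%s missing\n' "$dir"
        fi
    done
}
```

Running expand_rpath /tmp/run_server '$ORIGIN/./__libs__' (note the single quotes, so the shell leaves $ORIGIN alone) reports /tmp/./__libs__ as missing unless that directory happens to exist, which is the situation I had created by copying the binary.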

The mystery is not fully solved

Having fixed my issue, I was happy I could continue with my original goal.

But there was a question in the back of my mind that was bothering me: Why was the binary complaining about not finding the shared lib, if ldd found it in /lib64?

Even accounting for the issue that I caused by copying the binary somewhere else, surely if ldd was finding the lib in /lib64, the binary should find it as well, and run with that version.

➜  ./run_server
./run_server: error while loading shared libraries: liblz4.so.1: cannot open shared object file: No such file or directory

➜  ldd ./run_server
        linux-vdso.so.1 (0x00007ffd9dfa5000)
        liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fa0eb31a000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fa0eb300000)
[ ... ]

As mentioned, when running the binary, it complains liblz4.so.1 does not exist, but surprisingly ldd can find it in /lib64/liblz4.so.1. Isn't that weird?

To understand what is going on, I had to learn more about how these binaries load these shared libraries, so let's go and do that.

Usually, binaries deployed on Linux are ELF binaries. ELF binaries can be statically linked, or dynamically linked, in which case their shared libraries need to be located and linked at runtime. That runtime linking is done by a piece of software called the "runtime linker" (also known as the dynamic linker; with glibc it is usually ld.so).

The question here is: when executing an ELF binary, how do we know where to find the runtime linker?

It turns out every dynamically linked ELF executable includes a program header of type INTERP (backed by the .interp section) where that is specified. Here's an example using bash:

➜  readelf -l /usr/bin/bash

Elf file type is DYN (Shared object file)
Entry point 0x31d30
There are 13 program headers, starting at offset 64

Program Headers:
[...]
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
[...]

Most binaries specify the default runtime linker, which on x86-64 Linux with glibc is /lib64/ld-linux-x86-64.so.2.
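If you only want the interpreter path rather than the whole list of program headers, a small filter over the readelf output could do. This is just a sketch: parse_interp and interp_of are names I made up, and interp_of assumes readelf (from binutils) is installed.

```shell
# parse_interp: reads `readelf -l` output on stdin and prints the path
# from the "Requesting program interpreter" line, if there is one.
parse_interp() {
    sed -n 's/.*\[Requesting program interpreter: \(.*\)].*/\1/p'
}

# interp_of BINARY: convenience wrapper around readelf.
# Prints nothing for files without an INTERP entry (or non-ELF files).
interp_of() {
    readelf -l "$1" 2>/dev/null | parse_interp
}
```

On a system like the one above, interp_of /usr/bin/bash should print the same /lib64/ld-linux-x86-64.so.2 path that readelf showed.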

Interestingly, you can call this runtime linker directly to get more information:

➜  /lib64/ld-linux-x86-64.so.2
/lib64/ld-linux-x86-64.so.2: missing program name
Try '/lib64/ld-linux-x86-64.so.2 --help' for more information.

➜  /lib64/ld-linux-x86-64.so.2 --help
Usage: /lib64/ld-linux-x86-64.so.2 [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked 'ld.so', the program interpreter for dynamically-linked
ELF programs.  Usually, the program interpreter is invoked automatically
when a dynamically-linked executable is started.

You may invoke the program interpreter program directly from the command
line to load and run an ELF executable file; this is like executing that
file itself, but always uses the program interpreter you invoked,
instead of the program interpreter specified in the executable file you
run.  Invoking the program interpreter directly provides access to
additional diagnostics, and changing the dynamic linker behavior without
setting environment variables (which would be inherited by subprocesses).

  --list                list all dependencies and how they are resolved
  --verify              verify that given object really is a dynamically linked
                        object we can handle
  --inhibit-cache       Do not use /etc/ld.so.cache
  --library-path PATH   use given PATH instead of content of the environment
                        variable LD_LIBRARY_PATH
[ ... ]

This runtime linker is used to find out the dependencies (shared libraries) that are required for a binary to run. Then, depending on how it has been configured (using something like ldconfig) those libraries are resolved (found in the filesystem) and linked at runtime.

We can use this runtime linker to run any binary. For example, this will find bash's dependencies, resolve them, link them, and run bash, which ends up dropping me in a bash shell:

➜  /lib64/ld-linux-x86-64.so.2 /usr/bin/bash
[carlosrdrz@myserver ~]$

The thing is, to avoid having you prepend every command with its runtime linker, the kernel does it for you when it executes an ELF binary. It reads that interpreter path, finds the right runtime linker and hands control to it, so the previous call is effectively equivalent to simply doing:

➜  /usr/bin/bash
[carlosrdrz@myserver ~]$

Alright, so now we know how things run, but how can we print the list of dependencies required for a binary to run? You might have noticed that the runtime linker accepts some arguments for exactly that, for example:

➜  /lib64/ld-linux-x86-64.so.2 --list /usr/bin/bash
        linux-vdso.so.1 (0x00007ffcff56a000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f7cac760000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f7cac400000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f7cac8f4000)

That's useful! And at the same time looks very similar to what ldd provides. So, what is ldd doing?

I was surprised when I learned this:

➜  whereis ldd
ldd: /usr/bin/ldd /usr/share/man/man1/ldd.1.gz
➜  readelf -l /usr/bin/ldd
readelf: /usr/bin/ldd: Error: Not an ELF file - it has the wrong magic bytes at the start
➜  file /usr/bin/ldd
/usr/bin/ldd: Bourne-Again shell script, ASCII text executable

ldd is a script! I had always assumed it was a binary for some reason. We can read its code, or even better, run it with bash -x and see what it does:

➜  bash -x /usr/bin/ldd /usr/bin/bash
+ TEXTDOMAIN=libc
+ TEXTDOMAINDIR=/usr/share/locale
+ RTLDLIST='/lib/ld-linux.so.2 /lib64/ld-linux-x86-64.so.2 /libx32/ld-linux-x32.so.2'
+ warn=
[ ... ] reducing some output here for brevity [ ... ]
+++ LD_TRACE_LOADED_OBJECTS=1
+++ LD_WARN=
+++ LD_BIND_NOW=
+++ LD_LIBRARY_VERSION=
+++ LD_VERBOSE=
+++ /lib64/ld-linux-x86-64.so.2 /usr/bin/bash
        linux-vdso.so.1 (0x00007fff40151000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f3acff3c000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f3acfc00000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3ad00d0000)
+ return 0
+ exit 0

I think there are two things to notice from that (or by analyzing the script directly):

  • There is a hardcoded list of possible runtime linkers.
  • ldd simply calls /lib64/ld-linux-x86-64.so.2 with some env variables set, which makes it print the shared lib dependencies.
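The selection step can be sketched roughly like this. As far as I can tell from reading the script, ldd asks each candidate linker to --verify the file and keeps the first one that answers with exit status 0 or 2. pick_rtld is a hypothetical name, and this is a simplification rather than ldd's actual code.

```shell
# pick_rtld FILE RTLD...
# Prints the first runtime linker that accepts FILE, judged by the exit
# status of `RTLD --verify FILE` (0 or 2 is treated as "can handle it").
# Returns 1 if none of the candidates accepts the file.
pick_rtld() {
    file=$1
    shift
    for rtld in "$@"; do
        # Skip candidates that are not installed on this machine.
        [ -x "$rtld" ] || continue
        rc=0
        "$rtld" --verify "$file" >/dev/null 2>&1 || rc=$?
        case $rc in
            0|2) printf '%s\n' "$rtld"; return 0 ;;
        esac
    done
    return 1
}
```

With the list from the trace, that would be pick_rtld /usr/bin/bash /lib/ld-linux.so.2 /lib64/ld-linux-x86-64.so.2 /libx32/ld-linux-x32.so.2. Note that nothing here looks at the interpreter the binary itself asks for.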

So we've just learned that you can do this to get the list of shared libs for a binary:

➜  LD_TRACE_LOADED_OBJECTS=1 /lib64/ld-linux-x86-64.so.2 /usr/bin/bash
        linux-vdso.so.1 (0x00007ffecb94b000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f181395f000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f1813600000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1813af3000)

Or also this (because, as I said, the right runtime linker will get called anyway):

➜  LD_TRACE_LOADED_OBJECTS=1 /usr/bin/bash
        linux-vdso.so.1 (0x00007ffecb94b000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f181395f000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f1813600000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1813af3000)

So... you might be guessing where I am going with all of this.

What happens if you specify your own .interp header in your binaries?

➜  readelf -l ./run_server

Elf file type is EXEC (Executable file)
Entry point 0x37bb540
There are 12 program headers, starting at offset 64

Program Headers:
[...]
  INTERP         0x00000000000002e0 0x00000000002002e0 0x00000000002002e0
                 0x0000000000000028 0x0000000000000028  R      0x1
      [Requesting program interpreter: /usr/local/platform/lib/ld.so]
[...]

When you specify your own runtime linker, it will be used for resolving all those shared libraries.

What you might not expect is that, if ldd does not have that runtime linker in its hardcoded list, it will simply use the default one, so its output can be completely misleading. ldd will ask /lib64/ld-linux-x86-64.so.2 to print the shared libs, but ultimately, when running the binary, it is a different runtime linker that links those shared libraries!

Remember that weird thing where setting LD_LIBRARY_PATH to /lib64 when running our binary made it find one of the libs? Maybe now you can guess what was going on!

/lib64/ld-linux-x86-64.so.2 is hardcoded to look for libs in some default paths, like /lib64, whereas /usr/local/platform/lib/ld.so (being a company-specific thing) was not (for reasons I won't elaborate on now).

When using ldd, the default runtime linker was being used, so it found the lib in /lib64, but when running the binary, the company runtime linker was used, and therefore the lib couldn't be found. When I set LD_LIBRARY_PATH to /lib64, it added that "default" dir to the company runtime linker's search path, so the lib in it could finally be found.
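A cheap sanity check before trusting ldd is to compare the interpreter a binary requests against the list ldd knows about. The sketch below hardcodes the RTLDLIST value from the bash -x trace earlier, which may differ on your distribution. ldd_knows is a name I made up, and it compares plain strings, so a symlinked path to the same linker would not match.

```shell
# Runtime linkers ldd tried in the trace above (its RTLDLIST value).
RTLDLIST='/lib/ld-linux.so.2 /lib64/ld-linux-x86-64.so.2 /libx32/ld-linux-x32.so.2'

# ldd_knows INTERP
# Succeeds if INTERP is one of the linkers listed in RTLDLIST.
ldd_knows() {
    for known in $RTLDLIST; do
        if [ "$known" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Example use:
#   ldd_knows /usr/local/platform/lib/ld.so ||
#       echo "ldd may not reflect what happens at run time" >&2
```

For my binary this check fails, which would have been a strong hint that ldd's answer and the real behaviour could disagree.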

Learnings

Aside from the satisfaction of resolving a good old troubleshooting mystery, there are some good learnings here that could be useful in the future.

First, that whole thing about rpath is something we might run into again. It's good to know that sometimes binaries reference shared libs using relative paths, and ldd is not very clear about that happening. Copying a binary to a different directory might make it not work!

Also, most importantly, the easiest way to get the source of truth from the runtime linker is to run the binary itself with the LD_* env vars set, since that ensures the right runtime linker is the one giving you that information.

For example:

➜  LD_TRACE_LOADED_OBJECTS=1 $YOUR_BINARY

or:

➜  LD_DEBUG=libs $YOUR_BINARY

Those will print a lot of information about shared library usage in your binary, and you will be sure those are being printed by the real runtime linker.

It's worth clarifying that this assumes a glibc linker, whereas other projects (for example musl) use a different runtime linker that might not support these env vars and might have different mechanisms to figure this out.

You can also find more about these env vars with man ld.so.
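If all you want is the list of libraries that cannot be resolved, the trace output can be filtered. This is a minimal sketch that assumes the glibc-style "name => not found" line format; missing_libs and real_missing are hypothetical names, and a non-glibc runtime linker may print something different or run the program instead of tracing it.

```shell
# missing_libs: reads ldd-style output on stdin and prints the names of
# the libraries reported as "not found", one per line.
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# real_missing BINARY
# Asks the binary's own runtime linker for the trace, instead of the
# linker that ldd would pick.
real_missing() {
    LD_TRACE_LOADED_OBJECTS=1 "$1" 2>&1 | missing_libs
}
```

Something like real_missing ./run_server then reports what the binary's actual runtime linker fails to find.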

Having said all of that, the main issue continued to be the same: I moved my binary to a different folder, it had an rpath, and most of the libraries were in that rpath directory, so they could not be found.

Truth be told, though, using ldd didn't help much with troubleshooting. In fact, it probably made things more difficult, since it wasn't aware of the binary using a different runtime linker, which made it give completely incorrect results.

A couple of warnings would have been nice! Maybe "rpath folder does not exist" or "runtime linker not found"? That would have been useful!
