If you are like me, maybe you understand that binaries can have statically or dynamically linked dependencies (shared libs), with each strategy having its own set of pros and cons. Maybe you also know that when a binary has shared libs, you can use ldd to learn more about which shared libraries that binary is using, and where exactly they are located in the file system.
This is the story of how, while troubleshooting a binary that refused to run, I ended up learning a ton more about shared libraries.
The problem
I was working on compiling the code for a project. The idea was to check if we could use our own compiling toolchain instead of CMake, which was the one commonly used for that project until then. This would bring a bunch of benefits because of integrations with lots of other tools.
We made good progress and managed to have something that successfully compiled, but the problem was that when we tried to run the binary, we were getting an error like this:
➜ ./run_server
./run_server: error while loading shared libraries: liblz4.so.1: cannot open shared object file: No such file or directory
Naturally, this being a problem with a shared library that couldn't be found, I expected ldd would indeed show that there was a library missing:
➜ ldd ./run_server
linux-vdso.so.1 (0x00007ffd9dfa5000)
liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fa0eb31a000)
libz.so.1 => /lib64/libz.so.1 (0x00007fa0eb300000)
...
Note: ... means there is (irrelevant) output omitted.
Instead, it turns out that some of them were not found, but liblz4.so.1 specifically (the one the error was complaining about) was found!
I found that very surprising. How come ldd finds it, but when we run the binary, it does not? Remember this, because we will come back to it later!
My initial thought was: okay, maybe the shared lib location is a symlink pointing to nowhere, or maybe it had the wrong permissions. The error very clearly says "No such file or directory" but I guess it doesn't hurt to check:
➜ ls -la /lib64/libz.so.1
lrwxrwxrwx 1 root root 14 Oct 30 02:16 /lib64/libz.so.1 -> libz.so.1.2.11
➜ ls -la /lib64/libz.so.1.2.11
-rwxr-xr-x 1 root root 102672 Oct 30 02:16 /lib64/libz.so.1.2.11
It's all good. Others should have permission to access the dir and read the library, which makes sense. Probably something much bigger would be broken if that wasn't the case.
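The manual checks above can be bundled into a small helper. This is only a sketch: check_lib_file is a hypothetical name, and it tests nothing more than what I checked by eye with ls (that the path resolves to a real, readable file).

```shell
# Sketch: sanity-check one shared library path.
# check_lib_file is a hypothetical helper, not part of any standard tool.
check_lib_file() {
  local path="$1"
  # A symlink whose target does not exist fails the -e test.
  if [ -L "$path" ] && [ ! -e "$path" ]; then
    echo "dangling symlink: $path"
    return 1
  fi
  if [ ! -e "$path" ]; then
    echo "missing: $path"
    return 1
  fi
  if [ ! -r "$path" ]; then
    echo "not readable: $path"
    return 1
  fi
  # readlink -f prints the final target after following all symlinks.
  echo "ok: $path -> $(readlink -f "$path")"
}
```

Something like `check_lib_file /lib64/libz.so.1` would cover both of the ls calls above in one go.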
Since it seems /lib64 does indeed contain that lib, I thought I could use LD_LIBRARY_PATH to force the binary to check in /lib64. I was pretty confident this wouldn't make any difference because that's a pretty standard directory for shared libs, and because ldd said the lib could be found there (which meant that it was already checking that path).
➜ LD_LIBRARY_PATH=/lib64/ ./run_server
./run_server: error while loading shared libraries: libthird-party_zlib_z.so: cannot open shared object file: No such file or directory
Surprisingly, it now started complaining about a different lib (libthird-party_zlib_z.so), so it seems /lib64 was indeed not being searched before. (Remember this, too!)
Still, as we can see, there are more libraries not found anyway, so that didn't solve much.
I was running out of ideas, so I turned to the internet. Initially, some Google searches, until I eventually went to my favorite LLM for this sort of question. "Maybe it'll help?" - I thought.
It required some prompt engineering since the initial suggestions were way too basic, but after a while, it recommended I check if the binary had something called rpath set. It suggested a command to check that.
➜ readelf -d ./run_server | head -n 10
Dynamic section at offset 0xbf93cf8 contains 67 entries:
Tag Type Name/Value
0x000000000000000f (RPATH) Library rpath: [$ORIGIN/./__libs__]
0x0000000000000001 (NEEDED) Shared library: [liblz4.so.1]
0x0000000000000001 (NEEDED) Shared library: [libz.so.1]
...
Oh, so it is set! What is this? A linker flag?
Let's check the man pages:
➜ man ld
...
-rpath=dir
Add a directory to the runtime library search path. This is used when linking an ELF executable with shared objects. All -rpath arguments are concatenated and passed to the runtime linker, which uses them to locate shared objects at
runtime.
...
The tokens $ORIGIN and $LIB can appear in these search directories. They will be replaced by the full path to the directory containing the program or shared object in the case of $ORIGIN and either lib - for 32-bit binaries - or lib64
- for 64-bit binaries - in the case of $LIB.
...
It turns out that when a binary has an rpath set, that path is added to the places the runtime linker searches (alongside LD_LIBRARY_PATH, the ldconfig cache, etc). The question is: why did my binary have an rpath? And why couldn't the linker find the lib in that directory?
I could only guess, but after spending some time learning about rpath and our toolchain, I concluded that all of this had to do with ensuring isolation between what I was building and whatever shared libs were installed in my system.
Instead of simply running my binary against the shared libs installed in the system, the compile toolchain knows exactly which versions of those shared libs to use (since they are marked as dependencies), so it drops them in a folder, and sets rpath to that folder to ensure those are used (instead of the system's).
The value of rpath in my binary was: $ORIGIN/./__libs__.
The key thing to understand here is $ORIGIN, which means the path is relative to the directory containing the binary.
It turns out I broke all of this because I copied the binary from the location where the toolchain created it to somewhere in /tmp, completely unaware of the fact that the binary was (silently, I must say) referencing libraries with a relative path.
Once that was understood, the possible solutions were straightforward: Symlink the file instead of copying it, or simply use it from that original path.
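To make the $ORIGIN behaviour concrete, here is a minimal sketch that works out which directory an rpath entry points at for a given binary location. Both function names are hypothetical helpers of mine, the parsing assumes the readelf output format shown above, and it only handles a single entry that starts with $ORIGIN.

```shell
# Sketch: which directory does an rpath entry point at for a given binary?
# Both function names are hypothetical helpers for illustration.

# Read `readelf -d` output on stdin and print the RPATH/RUNPATH value.
rpath_from_readelf() {
  sed -nE 's/.*\((RPATH|RUNPATH)\).*\[(.*)\]/\2/p'
}

# Expand a leading $ORIGIN to the directory that holds the binary.
# Assumption: symlinks are resolved first, which is consistent with the
# symlink fix described above.
expand_origin() {
  local binary="$1" entry="$2" origin rest
  origin=$(dirname "$(readlink -f "$binary")")
  case $entry in
    '$ORIGIN'*)
      rest=${entry#\$ORIGIN}          # whatever follows $ORIGIN
      printf '%s\n' "$origin$rest"
      ;;
    *)
      printf '%s\n' "$entry"          # absolute entries are left untouched
      ;;
  esac
}
```

For a copy living at a hypothetical /tmp/demo/run_server, `expand_origin /tmp/demo/run_server "$(readelf -d /tmp/demo/run_server | rpath_from_readelf)"` would point at a __libs__ directory next to the copy, which is not where the toolchain put the libraries.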
The mystery is not fully solved
Having fixed my issue, I was happy I could continue with my original goal.
But there was a question in the back of my mind that was bothering me: Why was the binary complaining about not finding the shared lib, if ldd found it in /lib64?
Even accounting for the issue that I caused by copying the binary somewhere else, surely if ldd was finding the lib in /lib64, the binary should find it as well, and run with that version.
➜ ./run_server
./run_server: error while loading shared libraries: liblz4.so.1: cannot open shared object file: No such file or directory
➜ ldd ./run_server
linux-vdso.so.1 (0x00007ffd9dfa5000)
liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fa0eb31a000)
libz.so.1 => /lib64/libz.so.1 (0x00007fa0eb300000)
[ ... ]
As mentioned, when running the binary, it complains liblz4.so.1 does not exist, but surprisingly ldd can find it in /lib64/liblz4.so.1. Isn't that weird?
To understand what is going on, I had to learn more about how these binaries load these shared libraries, so let's go and do that.
Usually, binaries deployed in Linux are ELF binaries. ELF binaries can be statically linked or dynamically linked; in the latter case, the shared libraries need to be linked at runtime. This linking is done by a piece of software called the "runtime linker" (which usually is ld.so when using glibc).
The question here is: when executing an ELF binary, how do we know where to find the runtime linker?
It turns out every dynamically linked ELF binary includes a program header of type INTERP (backed by a section called .interp) where that is specified. Here's an example using bash:
➜ readelf -l /usr/bin/bash
Elf file type is DYN (Shared object file)
Entry point 0x31d30
There are 13 program headers, starting at offset 64
Program Headers:
[...]
INTERP 0x0000000000000318 0x0000000000000318 0x0000000000000318
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
[...]
Most binaries specify the default runtime linker, which is at /lib64/ld-linux-x86-64.so.2 on x86-64 Linux with glibc.
Interestingly, you can call this runtime linker directly to get more information:
➜ /lib64/ld-linux-x86-64.so.2
/lib64/ld-linux-x86-64.so.2: missing program name
Try '/lib64/ld-linux-x86-64.so.2 --help' for more information.
➜ /lib64/ld-linux-x86-64.so.2 --help
Usage: /lib64/ld-linux-x86-64.so.2 [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked 'ld.so', the program interpreter for dynamically-linked
ELF programs. Usually, the program interpreter is invoked automatically
when a dynamically-linked executable is started.
You may invoke the program interpreter program directly from the command
line to load and run an ELF executable file; this is like executing that
file itself, but always uses the program interpreter you invoked,
instead of the program interpreter specified in the executable file you
run. Invoking the program interpreter directly provides access to
additional diagnostics, and changing the dynamic linker behavior without
setting environment variables (which would be inherited by subprocesses).
--list list all dependencies and how they are resolved
--verify verify that given object really is a dynamically linked
object we can handle
--inhibit-cache Do not use /etc/ld.so.cache
--library-path PATH use given PATH instead of content of the environment
variable LD_LIBRARY_PATH
[ ... ]
This runtime linker is used to find out the dependencies (shared libraries) that are required for a binary to run.
Then, depending on how it has been configured (using something like ldconfig), those libraries are resolved (found in the filesystem) and linked at runtime.
We can use this runtime linker to run any binary. For example, this will find the dependencies of bash, resolve them, link them, and run bash, which ends up dropping me in a bash shell:
➜ /lib64/ld-linux-x86-64.so.2 /usr/bin/bash
[carlosrdrz@myserver ~]$
The thing is, to avoid having you prepend every command with its runtime linker, whatever runs the ELF binary (the kernel, on Linux) does it for you.
It reads that .interp section, finds the right runtime linker and calls it to run the binary, so the previous call is equivalent to simply doing:
➜ /usr/bin/bash
[carlosrdrz@myserver ~]$
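If you only care about which runtime linker a binary asks for, you can pull that one line out of the readelf output. A minimal sketch, assuming the "Requesting program interpreter" format shown above (interp_from_readelf is a hypothetical helper name, and it only parses text):

```shell
# Sketch: print the runtime linker a binary asks for.
# interp_from_readelf is a hypothetical helper; it reads `readelf -l`
# output on stdin and prints nothing if there is no INTERP entry.
interp_from_readelf() {
  sed -n 's/.*\[Requesting program interpreter: \(.*\)\].*/\1/p'
}
```

On the system shown above, `readelf -l /usr/bin/bash | interp_from_readelf` would print /lib64/ld-linux-x86-64.so.2.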
Alright, so now we know how things run, but how can we print the list of dependencies required for a binary to run? You might have noticed there are some arguments we can pass to the runtime linker to get that info, for example:
➜ /lib64/ld-linux-x86-64.so.2 --list /usr/bin/bash
linux-vdso.so.1 (0x00007ffcff56a000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f7cac760000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7cac400000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7cac8f4000)
That's useful! And at the same time, it looks very similar to what ldd provides.
So, what is ldd doing?
I was surprised when I learned this:
➜ whereis ldd
ldd: /usr/bin/ldd /usr/share/man/man1/ldd.1.gz
➜ readelf -l /usr/bin/ldd
readelf: /usr/bin/ldd: Error: Not an ELF file - it has the wrong magic bytes at the start
➜ file /usr/bin/ldd
/usr/bin/ldd: Bourne-Again shell script, ASCII text executable
ldd is a script! I've always assumed it was a binary for some reason.
We can read its code and see what it does, or even better, we could just run it with bash -x and see what it does.
➜ bash -x /usr/bin/ldd /usr/bin/bash
+ TEXTDOMAIN=libc
+ TEXTDOMAINDIR=/usr/share/locale
+ RTLDLIST='/lib/ld-linux.so.2 /lib64/ld-linux-x86-64.so.2 /libx32/ld-linux-x32.so.2'
+ warn=
[ ... ] reducing some output here for brevity [ ... ]
+++ LD_TRACE_LOADED_OBJECTS=1
+++ LD_WARN=
+++ LD_BIND_NOW=
+++ LD_LIBRARY_VERSION=
+++ LD_VERBOSE=
+++ /lib64/ld-linux-x86-64.so.2 /usr/bin/bash
linux-vdso.so.1 (0x00007fff40151000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f3acff3c000)
libc.so.6 => /lib64/libc.so.6 (0x00007f3acfc00000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3ad00d0000)
+ return 0
+ exit 0
I think there are two things to notice from that (or by analyzing the script directly):
- There is a hardcoded list of possible runtime linkers.
- ldd is simply calling /lib64/ld-linux-x86-64.so.2 with some env variables, which makes it print the shared lib dependencies.
So we've just learned that you can do this to get the list of shared libs for a binary:
➜ LD_TRACE_LOADED_OBJECTS=1 /lib64/ld-linux-x86-64.so.2 /usr/bin/bash
linux-vdso.so.1 (0x00007ffecb94b000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f181395f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f1813600000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1813af3000)
Or also this (because, as I said, the right runtime linker will get called anyway):
➜ LD_TRACE_LOADED_OBJECTS=1 /usr/bin/bash
linux-vdso.so.1 (0x00007ffecb94b000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f181395f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f1813600000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1813af3000)
So... you might be guessing where I am going with all of this.
What happens if you specify your own .interp header in your binaries?
➜ readelf -l ./run_server
Elf file type is EXEC (Executable file)
Entry point 0x37bb540
There are 12 program headers, starting at offset 64
Program Headers:
[...]
INTERP 0x00000000000002e0 0x00000000002002e0 0x00000000002002e0
0x0000000000000028 0x0000000000000028 R 0x1
[Requesting program interpreter: /usr/local/platform/lib/ld.so]
[...]
When you specify your own runtime linker, it will be used for resolving all those shared libraries.
What you might not expect is that, if ldd does not have that runtime linker in its hardcoded list, it will simply try to use the default one, so its output will be completely useless to you. ldd will ask /lib64/ld-linux-x86-64.so.2 to print the shared libs, but ultimately, when running the binary, it is a different runtime linker linking those shared libraries!
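A rough way to spot this situation up front is to compare the interpreter a binary requests against the list ldd knows about. This is a sketch under assumptions: the list is copied from the RTLDLIST line in my bash -x trace (yours may differ), interp_known_to_ldd is a hypothetical name, and a plain string comparison is a simplification of what the ldd script really does.

```shell
# Sketch: is this runtime linker one of the ones ldd has hardcoded?
# The list is copied from the RTLDLIST line in the bash -x trace above;
# it may differ on other systems. interp_known_to_ldd is a hypothetical name.
interp_known_to_ldd() {
  local interp="$1" rtld
  for rtld in /lib/ld-linux.so.2 /lib64/ld-linux-x86-64.so.2 /libx32/ld-linux-x32.so.2; do
    if [ "$interp" = "$rtld" ]; then
      return 0
    fi
  done
  return 1
}
```

With the interpreter path taken from readelf -l, something like `interp_known_to_ldd /usr/local/platform/lib/ld.so || echo "do not trust ldd for this binary"` would have saved me a fair bit of confusion.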
Remember that weird thing where setting LD_LIBRARY_PATH to /lib64 when running our binary made it find one of the libs? Maybe now you can guess what was going on!
/lib64/ld-linux-x86-64.so.2 is hardcoded to look for libs in some default paths, like /lib64, whereas /usr/local/platform/lib/ld.so (being a company-specific thing) was not (for reasons I won't elaborate on now).
When using ldd, the default runtime linker was being used, so it found the lib in /lib64, but when running the binary, the company runtime linker was used, and therefore the lib couldn't be found. When I set LD_LIBRARY_PATH to /lib64, it added that "default" dir to the runtime search path, and made the lib available.
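The difference between the two linkers can be pictured with a toy model: the same lookup, run over two different directory lists. This is a deliberately simplified sketch (find_lib is a hypothetical helper, and a real runtime linker also consults rpath, its cache, and its own built-in defaults, none of which are modelled here).

```shell
# Sketch: a toy model of "look for a library in a list of directories".
# find_lib is a hypothetical helper; it takes a library name and a
# colon-separated directory list, like the value of LD_LIBRARY_PATH.
find_lib() {
  local name="$1" rest="$2" dir
  while [ -n "$rest" ]; do
    dir=${rest%%:*}              # first directory in the list
    case $rest in
      *:*) rest=${rest#*:} ;;    # drop it and keep the remainder
      *)   rest= ;;              # that was the last one
    esac
    # Empty entries are skipped in this sketch.
    if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}
```

A linker whose list includes /lib64 behaves like `find_lib liblz4.so.1 /lib64` and succeeds, while one whose list does not include it fails with the same file sitting on disk. Setting LD_LIBRARY_PATH amounts to adding /lib64 to the second list.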
Learnings
Aside from the satisfaction of resolving a good old troubleshooting mystery, there are some good learnings here that could be useful in the future.
First, that whole thing about rpath is something we might find in the future. It's good to know that sometimes binaries reference shared libs using relative paths, and ldd is not very clear about that happening. Copying a binary to a different directory might make it not work!
Also, most importantly, the easiest way to get the source of truth from the runtime linker is to run your binary using LD env vars, since that ensures that the right runtime linker is giving you that information.
For example:
➜ LD_TRACE_LOADED_OBJECTS=1 $YOUR_BINARY
or:
➜ LD_DEBUG=libs $YOUR_BINARY
Those will print a lot of information about shared library usage in your binary, and you will be sure those are being printed by the real runtime linker.
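When the trace is long, it helps to keep only the unresolved entries. A small sketch, assuming the glibc trace format in which unresolved libraries are printed as "name => not found" (missing_libs is a hypothetical helper name):

```shell
# Sketch: list only the libraries the runtime linker could not resolve.
# Assumes the glibc trace format, where unresolved entries look like
# "libfoo.so => not found". Reads the trace on stdin.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}
```

It can be used as `LD_TRACE_LOADED_OBJECTS=1 $YOUR_BINARY | missing_libs`, so the answer still comes from the binary's real runtime linker.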
It's worth clarifying that this assumes a glibc linker, whereas other projects (for example musl) use a different runtime linker that might not support these env vars and might have different mechanisms to figure this out.
You can also find more about these env vars with man ld.so.
Having said all of that, the main issue continued to be the same: I moved my binary to a different folder, it had an rpath, and most of the libraries were in that rpath directory, so they could not be found.
Truth be told, though, using ldd didn't help much with troubleshooting. In fact, it probably made things more difficult, since it wasn't aware of the binary using a different runtime linker, which made it give completely incorrect results.
A couple of warnings would have been nice! Maybe "rpath folder does not exist" or "runtime linker not found"? That would have been useful!