bindep: Strategies for finding binary dependencies
A codebase might depend on another project's source code; or it might depend on another project's compiled binaries. Source code dependency relationships are mostly easy to identify; binary dependency relationships are not. We need to identify binary dependency relationships to ensure the Open Source ecosystem is secure and sustainably funded.
This project aims to provide tools that enable us to identify binary dependency relationships.
General Details
FOSDEM 2026 talk
My initial proposal describing the broad approach — though the technical details are out of date
ecosyste.ms issue with more general details
Results: Finding Needed Dynamic Libraries in Python Wheels
I analysed the most downloaded Python wheels, to see which dynamic libraries those wheels most depend on.
I attempted to download wheels for the 15,000 most downloaded Python packages according to hugovk's top-pypi-packages.
I successfully downloaded 13,874 wheels. I failed to download 1,126 wheels — mostly wheels that did not have builds for Linux available.
I only analysed dependencies originating in extension modules included in these wheels. Unfortunately, other kinds of
binary dependency relationships, like those implemented using libffi, will be more difficult to find. For more details
on this, see my post How Binary Dependencies Work Across Different
Languages.
Of those wheels, 1,531 wheels contained .so files. This is around 11% of the downloaded wheels (1,531 of 13,874), so
this validates the research direction: because we currently cannot reliably identify binary dependencies, it looks
like we have significant holes in the dependency graph of around 11% of the most downloaded Python packages.
I found a total of 12,137 .so files (of which 39 could not be read). Those .so files include both bundled
dependencies, and the .so files of each respective wheel's extension modules.
In those .so files, I looked up items listed as DT_NEEDED in the ELF file's .dynamic section — this gives us the
names of the libraries that each .so file depends on.
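As a rough illustration of that lookup, the sketch below reads the DT_NEEDED names from a 64-bit little-endian ELF image using only the Rust standard library. It is not the code in find_needed_libs.rs: the names `needed_libs` and `read_le` are made up for this example, and it assumes the file still has its section headers, so that the dynamic section and its string table can be located through them.

```rust
// Values from the ELF specification.
const SHT_DYNAMIC: u64 = 6;
const DT_NULL: u64 = 0;
const DT_NEEDED: u64 = 1;

/// Read a little-endian unsigned integer of `len` bytes (at most 8) at `off`.
fn read_le(buf: &[u8], off: usize, len: usize) -> Option<u64> {
    let bytes = buf.get(off..off.checked_add(len)?)?;
    Some(bytes.iter().rev().fold(0u64, |acc, &b| (acc << 8) | b as u64))
}

/// Hypothetical helper: list the DT_NEEDED names of an ELF64 little-endian image.
/// Returns `None` for anything it cannot parse.
fn needed_libs(elf: &[u8]) -> Option<Vec<String>> {
    // e_ident: magic bytes, EI_CLASS == 2 (64-bit), EI_DATA == 1 (little-endian).
    if !elf.starts_with(b"\x7fELF") || elf.get(4) != Some(&2) || elf.get(5) != Some(&1) {
        return None;
    }
    let shoff = read_le(elf, 40, 8)? as usize; // e_shoff
    let shentsize = read_le(elf, 58, 2)? as usize; // e_shentsize
    let shnum = read_le(elf, 60, 2)? as usize; // e_shnum

    // Returns (sh_type, sh_offset, sh_size, sh_link) of section `i`.
    let section = |i: usize| -> Option<(u64, usize, usize, usize)> {
        let start = shoff.checked_add(i.checked_mul(shentsize)?)?;
        let hdr = elf.get(start..start.checked_add(64)?)?;
        Some((
            read_le(hdr, 4, 4)?,
            read_le(hdr, 24, 8)? as usize,
            read_le(hdr, 32, 8)? as usize,
            read_le(hdr, 40, 4)? as usize,
        ))
    };

    let mut needed = Vec::new();
    for i in 0..shnum {
        let (sh_type, dyn_off, dyn_size, link) = section(i)?;
        if sh_type != SHT_DYNAMIC {
            continue;
        }
        // sh_link of the dynamic section is the index of its string table.
        let (_, str_off, str_size, _) = section(link)?;
        let strtab = elf.get(str_off..str_off.checked_add(str_size)?)?;
        let dynamic = elf.get(dyn_off..dyn_off.checked_add(dyn_size)?)?;
        // Each Elf64_Dyn entry is 16 bytes: d_tag followed by d_val.
        for entry in dynamic.chunks_exact(16) {
            let tag = read_le(entry, 0, 8)?;
            let val = read_le(entry, 8, 8)? as usize;
            if tag == DT_NULL {
                break;
            }
            if tag == DT_NEEDED {
                // d_val is an offset to a NUL-terminated name in the string table.
                let name = strtab.get(val..)?;
                let end = name.iter().position(|&b| b == 0)?;
                needed.push(String::from_utf8_lossy(&name[..end]).into_owned());
            }
        }
    }
    Some(needed)
}
```

Calling `needed_libs(&std::fs::read(path)?)` on a .so file would return names such as `libc.so.6`. A real tool would more likely rely on an ELF parsing library and also handle 32-bit and big-endian files, which this sketch deliberately rejects.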
This means we can see:
- libs that extension modules depend on
- libs that bundled dependencies depend on
but we cannot see
- the libs that all of those libs depend on.
This is a significant limitation.
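To make the limitation concrete: the missing part is the transitive closure of the "needs" relation. If the DT_NEEDED lists of the non-bundled libraries were also available (this analysis did not collect them), the full set could be computed with a breadth-first traversal. A minimal sketch, where `transitive_needed` and the shape of the `direct` map are assumptions made for illustration:

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical helper: every library reachable from `root` by following
/// direct "needs" edges. `direct` maps a file name to its DT_NEEDED names.
fn transitive_needed(direct: &HashMap<String, Vec<String>>, root: &str) -> BTreeSet<String> {
    let mut seen = BTreeSet::new();
    let mut queue: VecDeque<&str> = VecDeque::new();
    queue.push_back(root);
    while let Some(lib) = queue.pop_front() {
        // Libraries missing from the map contribute no further edges.
        for dep in direct.get(lib).into_iter().flatten() {
            // Only enqueue a library the first time it is seen, so cycles terminate.
            if seen.insert(dep.clone()) {
                queue.push_back(dep);
            }
        }
    }
    seen
}
```

With only the wheel contents available, `direct` has entries for extension modules and bundled libraries, so the traversal stops as soon as it reaches a library that lives outside the wheel.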
Among all .so files, I found 96,570 instances of a lib being needed. 2,862 unique libs were needed.
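Counts like these can be produced by tallying every DT_NEEDED name across all files. The sketch below is one way to do that, not necessarily how find_needed_libs.rs does it. The result names appear without a `.so` suffix or version (for example `libc` rather than `libc.so.6`), so the `base_name` helper assumes names are cut at the first `.so`; both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Assumed normalisation: drop `.so` and any version suffix after it.
fn base_name(soname: &str) -> &str {
    match soname.find(".so") {
        Some(i) => &soname[..i],
        None => soname,
    }
}

/// Count how often each library is needed, most needed first (ties sorted by name).
fn tally<'a>(needed: impl IntoIterator<Item = &'a str>) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for soname in needed {
        *counts.entry(base_name(soname)).or_insert(0) += 1;
    }
    let mut rows: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(name, count)| (name.to_string(), count))
        .collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows
}
```

The number of rows corresponds to the count of unique libs, and the sum of the per-row counts corresponds to the total number of instances.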
The 10 most required libs are relatively unsurprising:
libc,11927
libpthread,7827
libm,7113
libgcc_s,6619
libstdc++,6267
libdl,3186
ld-linux-x86-64,1835
librt,1434
libGL,899
libQt6Core,699
Some are a little interesting:
libxkbcommon,380
libtensorflow_framework,379
Some I did not expect to be so common:
libvtkfmt,315
libvtksys,314
libvtkscn,314
libvtktoken,314
libvtkCommonCore,313
The full results can be found in
results/260121-libs-found-in-python-wheels.txt.
The results are a little noisy — for example, a bunch of libs have names ending in hashes like -01abcdef. Maybe those
suffixes should be removed; but then again, many packages seem to depend on the same hashes. Anyway, I think this is
enough to get a general idea of the approach for now.
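If removing the suffixes turns out to be useful, it could look something like the sketch below. It assumes a suffix is always a hyphen followed by exactly eight hexadecimal digits, as in the `-01abcdef` example; real suffixes may not all follow that pattern. `strip_hash_suffix` is a hypothetical name.

```rust
/// Hypothetical helper: remove a trailing `-xxxxxxxx` (eight hex digits) from a lib name.
fn strip_hash_suffix(name: &str) -> &str {
    match name.rfind('-') {
        // Strip only when everything after the last hyphen is exactly eight hex digits.
        Some(i)
            if name.len() - i - 1 == 8
                && name[i + 1..].bytes().all(|b| b.is_ascii_hexdigit()) =>
        {
            &name[..i]
        }
        _ => name,
    }
}
```

Since many packages seem to depend on the same hashes, keeping both the raw and the stripped name in the results would preserve that signal while still allowing the counts to be grouped.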
The source code is available here:
find_needed_libs.rs.
My initial proposal mentioned constructing a big map of which dynamic symbols are required by which language package manager packages, and which dynamic symbols are provided by which system package manager packages. I didn't take this approach in this case.
For one thing, the ELF files contain the names of the libraries they depend on, so we can figure that out without the symbols. And for another thing, knowing the filenames means we can examine system package managers to see which packages provide which dynamic library files. This should be relatively reliable.
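Which system package managers would be examined is not decided here. As one assumed example, Debian publishes Contents indices in which each line pairs a file path with the comma-separated `section/package` names that ship it. The sketch below builds a filename-to-packages map from text in that layout; `index_providers` is a hypothetical name and the matching rule for library file names is a simplification.

```rust
use std::collections::HashMap;

/// Hypothetical helper: map shared-library file names to the packages that ship them,
/// given text in the assumed layout `path  section/pkg[,section/pkg...]`.
fn index_providers(contents: &str) -> HashMap<String, Vec<String>> {
    let mut providers: HashMap<String, Vec<String>> = HashMap::new();
    for line in contents.lines() {
        // The package list is the last whitespace-separated column; paths may contain spaces.
        let (path, pkgs) = match line.trim_end().rsplit_once(char::is_whitespace) {
            Some(parts) => parts,
            None => continue,
        };
        let file = path.trim_end().rsplit('/').next().unwrap_or("");
        // Keep only names that look like shared libraries, e.g. `libz.so.1`.
        if !(file.ends_with(".so") || file.contains(".so.")) {
            continue;
        }
        for pkg in pkgs.split(',') {
            // Drop the optional `section/` prefix from `section/package`.
            let pkg = pkg.rsplit('/').next().unwrap_or(pkg);
            providers
                .entry(file.to_string())
                .or_default()
                .push(pkg.to_string());
        }
    }
    providers
}
```

Looking up a DT_NEEDED name such as `libz.so.1` in the resulting map would then give the candidate packages that provide it.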
But — we might still want to mine symbols for some other reason.
Authorship
Vlad-Stefan Harbuz (vlad.website) unless otherwise noted.