Managing Large Debian Repositories with Pulp
Pulp is a free, open-source platform for software repository management. You can fetch, upload, and distribute content from various sources. Repository versioning makes sure that nothing is lost as you can always roll back to previous versions. The pulp_deb plugin adds APT repository support.
Pulp has supported Debian content for a while, and ATIX expanded that support for use with Katello a few years ago. It works well for small to medium-sized repositories. For large repositories, however, performance is not ideal.
Challenge
Around 2019, ATIX consultants wanted to synchronize all of Debian Stretch and Ubuntu Xenial for a demo. Unfortunately, they found that a run typically took about five hours, only to fail with a “Cannot allocate memory” error. What was going on?
To answer this question, they needed to take a closer look at the pulp_deb implementation. The code is organized into several steps. The implementation relies heavily on the python-debpkgr dependency, which in turn relies on deb822 from the python-debian library. python-debpkgr is mainly designed to take a pile of Debian packages and organize them into an APT repository. The structure of a Debian repository looks like this:
/dists/stretch/Release
/dists/stretch/main/binary-amd64/Packages
/dists/stretch/contrib/binary-amd64/Packages
/dists/stretch/non-free/binary-amd64/Packages
/pool/
During a sync, the “MetadataStep” is provided with a list of releases, components, and packages (with metadata) from MongoDB. It then applies the following logic: for every combination of architecture, component, and release, a list of packages is generated. These lists contain the paths to the actual .deb package files on disk. Finally, each list is passed to a debpkgr call as an argument.
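The grouping logic described above can be sketched roughly as follows. This is an illustration only, not the pulp_deb code: the record layout, the field names, and the helper names `group_packages` and `index_path` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records standing in for the package units read from the database.
units = [
    {"release": "stretch", "component": "main", "architecture": "amd64",
     "path": "pool/main/h/hello/hello_1.0_amd64.deb"},
    {"release": "stretch", "component": "contrib", "architecture": "amd64",
     "path": "pool/contrib/e/extra/extra_2.0_amd64.deb"},
    {"release": "stretch", "component": "main", "architecture": "amd64",
     "path": "pool/main/t/tool/tool_0.3_amd64.deb"},
]


def group_packages(records):
    """Map each (release, component, architecture) to its list of .deb paths."""
    groups = defaultdict(list)
    for record in records:
        key = (record["release"], record["component"], record["architecture"])
        groups[key].append(record["path"])
    return dict(groups)


def index_path(release, component, architecture):
    """Relative path of the Packages index that belongs to one combination."""
    return f"dists/{release}/{component}/binary-{architecture}/Packages"


groups = group_packages(units)
for (release, component, architecture), paths in sorted(groups.items()):
    print(index_path(release, component, architecture), len(paths))
```

In the old implementation, each of these per-combination lists would then be handed to debpkgr.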
debpkgr is mainly designed to take a pile of Debian packages and turn them into a repository. So, it does just that: each .deb file is accessed on disk to extract the metadata debpkgr needs. Because the package lists overlap for different architectures, many of these .deb files are actually parsed multiple times.
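To get a feeling for the cost of that overlap, one could count how often each file shows up across the per-index lists. The sketch below assumes the overlap comes from architecture-independent packages being listed once per concrete architecture. The sample paths and the helper name `count_parses` are made up for the example.

```python
from collections import Counter

# Hypothetical per-index package lists, keyed by (release, component, architecture).
package_lists = {
    ("stretch", "main", "amd64"): [
        "pool/main/h/hello/hello_1.0_amd64.deb",
        "pool/main/d/docs/docs_1.0_all.deb",
    ],
    ("stretch", "main", "i386"): [
        "pool/main/h/hello/hello_1.0_i386.deb",
        "pool/main/d/docs/docs_1.0_all.deb",
    ],
}


def count_parses(lists):
    """Count how many times each .deb path would be opened and parsed."""
    counts = Counter()
    for paths in lists.values():
        counts.update(paths)
    return counts


counts = count_parses(package_lists)
# Every occurrence beyond the first is redundant work.
redundant = sum(n - 1 for n in counts.values())
print(f"{redundant} redundant parse(s) out of {sum(counts.values())}")
```

With many architectures and tens of thousands of packages, this redundant file access adds up quickly.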
The solution
Our experts’ first thought was: maybe there’s a quick-and-dirty fix? However, they also considered a complete redesign of the way debpkgr works. Another alternative might be dropping debpkgr (from the MetadataStep) and implementing everything themselves.
The basic idea was to use only information from MongoDB to create the repository structure. The old implementation already had to parse the metadata from MongoDB in order to generate the lists that were then passed to debpkgr. This part remained essentially unchanged. Our experts had to create the desired directory structure themselves. They also had to build the symlinks to the actual .deb files themselves. Finally, they needed the ability to write Packages and Release files. As one always does, they happened upon a few stumbling blocks:
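A minimal sketch of this approach, publishing one Packages index purely from stored metadata, might look like the following. It is not the code from the pull requests: the record fields, the storage location, and the helper names `packages_stanza` and `publish_index` are assumptions for illustration, and a real Packages stanza carries more fields than shown here.

```python
import os
import tempfile


def packages_stanza(unit):
    """Render one Packages stanza from a metadata record, without opening the .deb."""
    fields = [
        ("Package", unit["name"]),
        ("Version", unit["version"]),
        ("Architecture", unit["architecture"]),
        ("Filename", unit["filename"]),
        ("Size", str(unit["size"])),
        ("SHA256", unit["sha256"]),
    ]
    return "".join(f"{key}: {value}\n" for key, value in fields)


def publish_index(repo_root, release, component, architecture, units, storage_root):
    """Create the directories, symlink the .deb files, and write one Packages file."""
    index_dir = os.path.join(repo_root, "dists", release, component,
                             f"binary-{architecture}")
    os.makedirs(index_dir, exist_ok=True)
    for unit in units:
        link = os.path.join(repo_root, unit["filename"])
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            # The link points at the stored file; its content is never read.
            os.symlink(os.path.join(storage_root, unit["storage_path"]), link)
    index_path = os.path.join(index_dir, "Packages")
    with open(index_path, "w", encoding="utf-8") as handle:
        # Stanzas are separated by a single blank line.
        handle.write("\n".join(packages_stanza(unit) for unit in units))
    return index_path


# Hypothetical metadata record and storage location.
unit = {
    "name": "hello", "version": "1.0", "architecture": "amd64",
    "filename": "pool/main/h/hello/hello_1.0_amd64.deb",
    "size": 1234, "sha256": "0" * 64, "storage_path": "ab/cdef.deb",
}
repo_root = tempfile.mkdtemp()
index_path = publish_index(repo_root, "stretch", "main", "amd64", [unit], "/srv/storage")
```

The important property is that nothing in this path opens a .deb file, so the cost depends on the number of database records rather than on the size of the packages.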
- debpkgr generates md5sum, sha1, and sha256 checksums for the metadata, but the existing database model only stored sha256 hashes.
- Actually using the metadata from the database revealed a bug.
- User-defined metadata fields were not stored in the existing database model.
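For the first stumbling block, one way the missing checksums could be computed once and then stored is a single pass over each file. This is only a sketch of the general technique, not necessarily what the pull requests do, and `compute_checksums` is a hypothetical helper name.

```python
import hashlib
import tempfile


def compute_checksums(path, chunk_size=1024 * 1024):
    """Compute md5, sha1, and sha256 of a file while reading it only once."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        # Read in chunks so large packages never have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Demo on a small temporary file standing in for a .deb package.
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write(b"hello")
checksums = compute_checksums(sample.name)
```

Once stored alongside the existing sha256 value, these checksums can be written into the Packages files without touching the package again.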
Our consultants came up with the following results:
- Two major pull requests:
  1. Ensure the db is used consistently by quba42 · Pull Request #61 · pulp/pulp_deb
  2. MetadataStep performance by quba42 · Pull Request #57 · pulp/pulp_deb
- An end to our memory problems
- Syncs for medium-sized repositories (1500 packages) that are more than twice as fast
- Syncing Ubuntu Xenial (main, restricted, universe, multiverse) for amd64 (53837 packages) within 3h36m on the test system
What did everyone learn? It is important to know your tools! Furthermore, you have to take your time to plan the architecture and gain the required domain knowledge.