Package Management for the 21.1st Century
A package manager is something any Linux user is familiar with – it's a piece of software tasked with installing, updating, and uninstalling other software. Linux distributions tend to have battle-tested, production-ready package managers. But with the new trend for "App Stores", a lot of other players have built their own package-manager-alikes, and it's become evident that our experience with package managers for Linux distributions does not translate over.
Two specific blog posts underscore this issue very well. The first, "We've Packaged all of the Free Software… What Now?", describes the flaws in Debian and Ubuntu's model of software packages – the rough edges, so to speak. And "Why Platform-Specific Package Systems Exist and Won't Go Away" makes the case that the different scenarios we want a package system for are different enough that we simply can't take the existing options and extend them slightly to work – we need new thoughts about package systems, and a new approach to building them.
I think all of this talk is highly apropos: Linux-style package managers have some well-known and well-publicized flaws, while the App Stores currently being built solve only very simple cases; and, of course, nowadays every large piece of software has its own copy of a mediocre package system inside. I think we can build something better.
Here, as far as I see it, are the scenarios we need to support:
- Low overhead, for embedded systems. It should be possible to choose optimized collections of packages – say, for example, a statically-compiled `coreutils` for some embedded system, or even a custom-compiled tiny base library.
- A grander notion of dependencies. For example, WordPress may depend on MySQL, but it need not be a locally-installed MySQL.
- Cross-platform and cross-language. Languages like to build their own package systems, and different platforms have different expectations for how to install packages.
- Packaging data. Keep in mind that data can be large mathematical tables (say, GAP's group tables) – infrequently updated and easy to manage – or antivirus signatures, whose timeliness is crucial to their whole operation.
- Compositionality. Emacs is a package with a built-in package manager. There's no reason to install Emacs modes as system packages, and in fact that only leads to confusion when I have two versions of Org-Mode installed, one from the system and one from Emacs's package manager.
- Local or system installation. Emacs modes and Firefox plugins are installed per-user, while antivirus signatures need to be system-wide, as should a `libc` update. Python packages could go either way.
- Virtual environments. It would be neat to install an application-specific VM and then manage packages in that same VM. Or, witness the popularity of tools like `virtualenv` and `rvm` for managing different versions of packages for different projects.
- Purchases and other negotiation. I know this is a sore point for the free software world, but the reality is that if we want a single universal package system, we're going to need to support for-pay software.
- Federated. Each project is going to want to run package repositories for its own software, and no distro wants to rely on a third party to maintain its package lists.
I think this is a tough list of requirements, but a doable one. But if we want this to work, we also need to make the transition possible: we want a solution that other tools can adopt as an alternate backend, then as the primary backend, and only then as the only backend. This is exceedingly difficult to do if we pick our own package format and installer, since there are so many different requirements. So I think it is necessary to start by separating package maintenance from installation. A package system should be able to find packages, search for updates, and buy software, but it should offload the task of installing these packages to other tools. Luckily, most existing package systems have some way of downloading a package as a file and installing it from that local file.
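To make that split concrete, here is a minimal sketch in Python. The `resolve_download_url` and `download` callables are hypothetical stand-ins for the maintenance side of the system; the point is only that the install step is delegated to whatever local tool already owns the package format.

```python
import subprocess

# Maintenance layer finds and fetches a package; installation is
# handed off to an existing local tool for that package format.

INSTALLERS = {
    # package format -> command that installs from a local file
    "deb":   ["dpkg", "--install"],
    "rpm":   ["rpm", "--install"],
    "wheel": ["pip", "install"],
}

def install(name, fmt, resolve_download_url, download):
    """Find a package, download it, and delegate the actual install."""
    url = resolve_download_url(name)      # maintenance: search, update, buy
    local_file = download(url)            # maintenance: fetch to disk
    cmd = INSTALLERS[fmt] + [local_file]  # installation: offloaded entirely
    subprocess.run(cmd, check=True)
```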
Furthermore, since we now have a vastly reduced scope, and given the need to be cross-platform and cross-language, it seems most sensible that the package system of the 21.1st century ought to be a protocol. A well-designed protocol could be federated, and would make it easy to offload the work of installing to local tools.
Federation is a powerful requirement, because by adopting it as a central design feature of our protocol, we can subsume many other requirements. Here's how I imagine a protocol could work:
- The world consists of a large number of independent repositories, each of which contains some number of packages.
- Among these is a repository for your computer, which contains the packages installed on your computer, and repositories for each user, representing their locally-installed software. Virtual environments, like a Python `virtualenv`, could be represented with another repository of packages installed on that system.
- Packages are namespaced. For example, all Python packages live in their own namespace.
- Repositories are connected in a directed graph. Directed, because I don't want to open my computer's system repositories up to other people on the Internet to download from. Though other people of course might. This gives us a good notion of compositionality.
- Repositories can filter or transform packages received from upstream repositories. For example, the `npm2pkgbuild` tool for Arch Linux transforms Node packages into packages manageable by the Arch package manager. This could be represented by a repository that does the same.
- All repositories speak a single, common protocol to each other, which can describe updates to packages, list packages, and query versions. The protocol would also have to handle signing and security, including "revocation lists", where one repository could publish its view that another package is out of date (so that a mirror couldn't continue to offer a known-insecure version).
- Simple libraries to query this graph for packages should exist. This way, `pip`, the Python installer, could query the graph for a package name, download the corresponding package, and install it.
- Ideally, each repository should understand some notion of dependency resolution, so that it does not have to be replicated on the client side. Recall that we're not shipping code (just writing a protocol), so a core part of the protocol should be dependency resolution. A rough code sketch of this repository graph follows the list.
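Here is that sketch: a hypothetical `Repository` class, meant only to illustrate the directed edges, the namespaced packages, and filtering or transforming repositories. None of these names belong to any existing system.

```python
# Hypothetical sketch of the federated repository graph: directed edges,
# namespaced packages, and repositories that transform what they pull
# from upstream (in the spirit of npm2pkgbuild).

class Repository:
    def __init__(self, name, upstreams=(), transform=None):
        self.name = name
        self.packages = {}                # (namespace, pkg) -> {version: metadata}
        self.upstreams = list(upstreams)  # directed edges: who we pull from
        self.transform = transform        # optional filter/rewrite of packages

    def lookup(self, namespace, pkg):
        """Find a package locally, else walk upstream edges (directed)."""
        if (namespace, pkg) in self.packages:
            return self.packages[(namespace, pkg)]
        for upstream in self.upstreams:
            found = upstream.lookup(namespace, pkg)
            if found is not None:
                return self.transform(found) if self.transform else found
        return None

# Example composition: my laptop pulls from a language repository and a
# distro archive; a virtualenv is just one more repository whose only
# upstream is my laptop.
pypi   = Repository("pypi.org")
distro = Repository("archive.example.org")
laptop = Repository("my-laptop", upstreams=[pypi, distro])
venv   = Repository("project-venv", upstreams=[laptop])
```

Note how compositionality falls out of the structure: the virtualenv, the laptop, and the big public archives are all the same kind of object, differing only in who links to whom.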
I think the model above also has the benefit of being simple to implement piece-by-piece, tool-by-tool. If just one language or community adopts the protocol, it could be proven to work, and in time people could set up repositories that mirror, say, the Debian archive, in a manner accessible by this system. Then over time `apt` could switch to using that mirror, or another tool using this mirror could be written that begins to replace `apt`.
The protocol itself needs very careful thought – the problem of shipping software is not trivial. A good first stab, for example, might be a protocol that allows for three operations: listing the packages a repository has; querying for the available versions of a package, along with perhaps other assorted metadata; and finding all the dependencies of a package-version.
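Written as a Python interface rather than a wire format, those three operations might look like the sketch below. The method names and signatures are mine, chosen for illustration.

```python
from typing import Protocol

# Hypothetical interface for the three operations named above.

class PackageRepository(Protocol):
    def list_packages(self) -> list[tuple[str, str]]:
        """All (namespace, name) pairs this repository offers."""
        ...

    def package_versions(self, namespace: str, name: str) -> dict[str, dict]:
        """Available versions of a package, each mapped to assorted metadata."""
        ...

    def package_dependencies(self, namespace: str, name: str,
                             version: str) -> list[tuple[str, str, str]]:
        """Dependencies of one package-version, as
        (namespace, name, version-constraint) triples."""
        ...
```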
But there are flaws to this. For example, there's no good way to check for updates to just the installed packages without downloading the full package list or making multiple queries. Instead, we might imagine each repository keeping a vector clock of its neighbors and being able to list changes to its contents since a given point in time. For large archives that are not updated too often (like OS package archives), this could be a good solution.
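As a sketch of the "changes since a given point" idea, consider the hypothetical code below. A full vector clock (one counter per neighboring repository) is elided; this shows a single repository's monotonic clock and replay log, and a client that checks only its installed packages against it.

```python
# Hypothetical sketch: a repository keeps a monotonically increasing
# clock and can replay changes after a given cursor, so clients need
# not re-download the full package list.

class ChangeLog:
    def __init__(self):
        self.clock = 0
        self.entries = []  # (clock, namespace, name, new_version)

    def record(self, namespace, name, new_version):
        self.clock += 1
        self.entries.append((self.clock, namespace, name, new_version))

    def changes_since(self, cursor):
        """Everything published after the client's last-seen clock value."""
        return [e for e in self.entries if e[0] > cursor], self.clock

def updates_for(installed, log, cursor):
    """Check for updates to just the installed (namespace, name) pairs."""
    changes, new_cursor = log.changes_since(cursor)
    relevant = [(ns, name, ver) for _, ns, name, ver in changes
                if (ns, name) in installed]
    return relevant, new_cursor
```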
I think this aspect needs more work and more thought. But I'm confident a solution can be found that works for a much larger range of scenarios than current package systems do, and I hope the above architecture spurs some thoughts that could lead us to this solution.