Monday, April 12, 2010

Chef and Gentoo Binary Repository

Ever since I watched Seth Chisamore's presentation on Chef at the Atlanta Ruby User's Group, I have had to rethink the idea of Gentoo Linux in production. The first consequence of declaring your infrastructure as source code is that the virtual machine becomes disposable. Although Chef is distribution-agnostic, you do want to standardize on a single distribution to make debugging easier. Beyond that, since Chef -- not the system administrator -- is the primary agent in building your servers, the distribution and package manager that work best for a disposable server win.

Now, I like Gentoo. This feeling of "liking" Gentoo is very much sunk cost fallacy. I cut my teeth on Slackware and Redhat 3 back in the day as a hobbyist, feeding a 486 box 5.25" floppies. Redhat was fun, but hunting down RPMs in the pre-yum days was not. When a friend showed me apt-get for Debian, it solved my problem of having to manually hunt down packages. Very quickly, however, I found out how outdated Debian was. Ubuntu came around and I switched over to that, yet I still found myself wanting to compile from source. At least with RPM-based distributions, source RPMs were easy to work with. There were plenty of examples, and the dev and build environments for RPMs were sensible and practical. The build environment for Debian was horrific nonsense. Ubuntu was no better. To this day, building a .deb package is deep magic that I have never once succeeded in performing. I ended up compiling from source, like the good ol' days of Slackware.

So when I seriously looked at Gentoo, it solved yet another problem for me: I wanted a package manager that lets me build from source. Gentoo's Portage system also has some features not found in its ancestor, BSD ports, which most Rails developers would know better through Darwin and MacPorts. Portage understands "USE" flags, which map to the ./configure --enable-XXX options during the build process. MacPorts has the concept of "variants", but you cannot simply flip them on the way you can with USE flags. Furthermore, Gentoo lets you selectively mask and unmask packages. You can also selectively overlay other repositories, so you do not have to wait for the package maintainers to catch up to your bleeding-edge packages. This worked wonderfully when CouchDB, RabbitMQ, and pals were not readily available anywhere.
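To make the USE-to-configure mapping concrete, here is a minimal sketch in Python. It is loosely modeled on the use_enable helper that ebuilds call; the function names and the flag list below are my own illustration, not Portage code.

```python
def use_enable(active_flags, flag, option=None):
    """Return --enable-X if the flag is active, otherwise --disable-X."""
    option = option or flag
    prefix = "enable" if flag in active_flags else "disable"
    return f"--{prefix}-{option}"


def configure_args(use_flags, supported_flags):
    """Turn a USE list such as ["ssl", "-php"] into ./configure switches.

    Flags written with a leading "-" are treated as switched off, as are
    supported flags that are not mentioned at all.
    """
    active = {flag for flag in use_flags if not flag.startswith("-")}
    return [use_enable(active, flag) for flag in supported_flags]


if __name__ == "__main__":
    args = configure_args(["threads", "ssl", "-php"], ["ssl", "threads", "php"])
    print("./configure " + " ".join(args))
    # prints: ./configure --enable-ssl --enable-threads --disable-php
```

A real ebuild decides per package which flags it supports and which configure option each one maps to; the sketch only shows the shape of the translation.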

USE flags bypass the fundamental flaw of binary-only package managers: the maintainers have to split a platform along feature lines into smaller packages. Debian and Redhat, for example, have to ship separate Ruby and Ruby + SSL packages. For Rubyists, this means splitting the standard library into dependencies outside the control of the Ruby platform.

Masking lets the system administrator control exactly which packages to use. Most people think of Gentoo as a rolling release, and while you can use it this way, what Gentoo actually provides is far more control over the exact package version than any of the binary package managers.

One other common complaint about Gentoo in production is that it requires installing a compiler. This argument is bunk, though. Most early Rails production deployments end up on Ubuntu with build-essential installed, because the binary packages for Ruby are typically old and compiled with pthreads. If one sticks to the principle that a production box should not have a compiler (because of the security issues related to it), then developers-turned-system-administrators installing a build kit on production boxes does not make it right ... but it does mean that this problem is not exclusive to Gentoo.

So the biggest upside to Gentoo is that you can compile from source. You can take advantage of the homogeneous Opteron chips on Slicehost and Rackspace Cloud. The biggest downside of Gentoo, though, is that you have to compile from source. In general, I've found that it works well for what I want to do.

Now, enter the Chef.

If the first implication of using Chef is that the VM becomes disposable, does it matter whether I use Gentoo or not? Certainly, EngineYard deploys their customers' apps on Gentoo, made easier through Chef. But if I wanted to take advantage of having a monitoring system automatically detect down systems and bring up new VMs on the fly, do I want the infrastructure to recompile all the pieces from scratch? If I wanted to avoid the sunk cost fallacy by having disposable VMs built from infrastructure-as-source-code, then wouldn't it make sense to base this on a binary package distribution instead?

For now, the economics say, "Yes, throw away Gentoo."

Say, though, I wanted to take advantage of the benefits of Gentoo and reduce the time it takes to bring up a clone server. I would have to use pre-built binaries. Gentoo offers binary packages, misleadingly saved with the .tbz2 extension. Up until this past weekend, I had the mistaken impression that these .tbz2 packages contain nothing but a bzip2-compressed tarball. In fact, there is metadata appended to the end, stored in the xpak format (yes, it is obscure and not well-known). This is loosely similar to the RPM format, which also pairs a compressed archive with package metadata. The xpak section is essentially a dump of what is in the system's package database, and contains information on things such as the USE flags, build environment, original ebuild, etc.
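Here is a minimal sketch in Python of what "metadata appended to the end" means. It follows the layout as I understand it from the xpak(5) man page (big-endian 32-bit lengths, an XPAKPACK ... XPAKSTOP block, then the block length and the word STOP), so check it against Portage before relying on it. The helper names and the sample values are made up for the demo.

```python
import struct


def build_xpak(entries):
    """Pack a {name: bytes} dict into an xpak block (demo helper only)."""
    index = b""
    data = b""
    for name, value in entries.items():
        encoded = name.encode()
        # Each index entry: name length, name, data offset, data length.
        index += struct.pack(">I", len(encoded)) + encoded
        index += struct.pack(">II", len(data), len(value))
        data += value
    header = b"XPAKPACK" + struct.pack(">II", len(index), len(data))
    return header + index + data + b"XPAKSTOP"


def split_tbz2(blob):
    """Split a binary package into (tarball_bytes, metadata_dict)."""
    if blob[-4:] != b"STOP" or blob[-16:-8] != b"XPAKSTOP":
        raise ValueError("no xpak trailer found")
    (xpak_len,) = struct.unpack(">I", blob[-8:-4])
    xpak_start = len(blob) - 8 - xpak_len
    xpak = blob[xpak_start:-8]
    if xpak[:8] != b"XPAKPACK":
        raise ValueError("xpak header missing")
    index_len, data_len = struct.unpack(">II", xpak[8:16])
    index = xpak[16:16 + index_len]
    data = xpak[16 + index_len:16 + index_len + data_len]
    metadata = {}
    pos = 0
    while pos < len(index):
        (name_len,) = struct.unpack(">I", index[pos:pos + 4])
        name_end = pos + 4 + name_len
        name = index[pos + 4:name_end].decode()
        offset, length = struct.unpack(">II", index[name_end:name_end + 8])
        metadata[name] = data[offset:offset + length]
        pos = name_end + 8
    return blob[:xpak_start], metadata


if __name__ == "__main__":
    xpak = build_xpak({"USE": b"threads ssl\n", "CHOST": b"x86_64-pc-linux-gnu\n"})
    # Stand-in bytes instead of a real bzip2 tarball.
    package = b"<tarball bytes>" + xpak + struct.pack(">I", len(xpak)) + b"STOP"
    tarball, meta = split_tbz2(package)
    print(meta["USE"].decode().strip())
    # prints: threads ssl
```

On a real system the same information is what tools like qtbz2 and qxpak from portage-utils pull out of a package.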

Gentoo binary packages are not only often machine-specific; they also tend to have dependencies based on USE flags. Figuring out a repository scheme for these binaries is a more difficult problem than it is for Debian .debs and Redhat RPMs. You can see an example of an attempt in 2006. The author proposes: [arch]/[CHOST]/[gcc version]/[glibc version]/[package group]/[package name]/[package version]/file

We're in 2010 now. We're in the year of Rails 3. The Rails community has seen a couple of years of Service Oriented Architectures, and lightweight web service frameworks such as Sinatra have a strong showing. JSON is supplanting XML for lightweight apps. Git and Mercurial made a strong case for using SHA1 hashes. Trying to push all of the unique identifiers for a Gentoo binary package into a path like the one above sounds laughable today, when we can easily do this:

POST [arch]/[CHOST]/[package group]/[package name]/[package version]
{
  "USE": [ "threads", "ruby", "-php" ],
  "gcc": ">~4.2",
  "glibc": "~2.10.1-r1"
}

and that gets back

{ "url": "[arch]/[CHOST]/[package group]/[package name]/[package version]/SHA1HASH.tbz2" }

And if I let go of the idea that a human being is going to manage this, we can compact the query URL to just a POST against and get back the binary hash.
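One way the SHA1HASH part could be derived is to hash a canonical form of the query itself. This is a minimal sketch in Python, assuming the hash covers only the JSON build spec; the function names and the ruby example are hypothetical, and a real repository might well hash the xpak metadata or the package contents instead.

```python
import hashlib
import json


def spec_hash(spec):
    """SHA1 over a canonical JSON form of the build spec.

    USE flags are sorted and keys are ordered, so two requests that
    differ only in ordering map to the same hash.
    """
    canonical = dict(spec, USE=sorted(spec.get("USE", [])))
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def package_url(arch, chost, group, name, version, spec):
    """Build the repository path for one binary package."""
    return f"{arch}/{chost}/{group}/{name}/{version}/{spec_hash(spec)}.tbz2"


if __name__ == "__main__":
    spec = {"USE": ["threads", "ruby", "-php"], "gcc": ">~4.2", "glibc": "~2.10.1-r1"}
    reordered = {"glibc": "~2.10.1-r1", "USE": ["-php", "ruby", "threads"], "gcc": ">~4.2"}
    print(spec_hash(spec) == spec_hash(reordered))
    # prints: True
    print(package_url("amd64", "x86_64-pc-linux-gnu", "dev-lang", "ruby", "1.8.7", spec))
```

Canonicalizing before hashing matters here: without it, the same USE flags in a different order would look like a different binary.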

The same query format can be used to request a build. Maybe by using the PUT verb to indicate "build if it isn't in the repository".
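The difference between the two verbs can be sketched with an in-memory stand-in for the repository. This is Python and purely illustrative: the BinaryRepo class, its method names, and the fake builder are assumptions of mine, and nothing here talks HTTP or runs emerge.

```python
import hashlib
import json


class BinaryRepo:
    """In-memory stand-in for the binary repository service."""

    def __init__(self, builder):
        self._builder = builder   # callable(path, spec) -> package location
        self._packages = {}       # hash -> package location

    @staticmethod
    def key(path, spec):
        """Hash the package path and build spec into one lookup key."""
        canonical = json.dumps({"path": path, "spec": spec}, sort_keys=True)
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def post(self, path, spec):
        """Lookup only: return the cached package, or None on a miss."""
        return self._packages.get(self.key(path, spec))

    def put(self, path, spec):
        """Lookup, building and caching the package on a miss."""
        k = self.key(path, spec)
        if k not in self._packages:
            self._packages[k] = self._builder(path, spec)
        return self._packages[k]


if __name__ == "__main__":
    builds = []

    def fake_builder(path, spec):
        # A real builder would compile the package; this one only records the call.
        builds.append(path)
        return f"{path}/{BinaryRepo.key(path, spec)}.tbz2"

    repo = BinaryRepo(fake_builder)
    path = "amd64/x86_64-pc-linux-gnu/dev-lang/ruby/1.8.7"
    spec = {"USE": ["threads"], "gcc": ">~4.2"}

    print(repo.post(path, spec))                         # prints: None
    print(repo.put(path, spec) == repo.put(path, spec))  # prints: True
    print(len(builds))                                   # prints: 1
```

The second put is served from the cache, which is the behavior a clone server would rely on.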

Once a setup is built from Chef, the next clone made with the same recipes would be able to find the cached binaries built with the same compiler flags and USE flags. Set up right, you would not need a compiler on the production system.

I think this would go a long way towards having a usable Gentoo system configured by Chef. We gain the advantage of being able to specify precise versions and USE flags, in addition to architecture-specific machine code, without having to wait for each machine to compile everything itself when the binaries can be cached.


  1. I've been using Gentoo for years - on my laptop (friends told me I am a masochist because of it), on my desktop, on my servers. Actually it's the distro I started with - and the one that allowed me to learn enough about Linux to be a sysadmin - and manage Suse, RedHat, Ubuntu ...etc.

    And having had experience maintaining a bunch of distributions - Gentoo sucks because of the need to compile everything (thank God for openoffice-bin and firefox-bin). On the other hand, other distributions suck even worse - once you have really tried Gentoo, you can never go back :)

    For a while, I've had this idea of the binary packages for Gentoo. And I think you just might be onto something with this - even for wide scale - general usage...

  2. So how many different binary combinations of php and apache could we come up with from USE flags? This is where I see a problem ... we might as well just enable all the USE flags to solve it.

  3. Furthermore, I guess that is why RPM-based distros split certain features of a program off into specific packages. We kind of have no choice but to compile everything in if we want binary packages to be feasible.