Platform changes and Bazel rebuilds

11 Jan 2024

Bazel is a build system from Google which uses a strong change detection model to solve a number of build correctness problems that make-like systems struggle with. While it handles most cases of rebuilds correctly out of the box, one recurrent gap is that if glibc changes, Bazel doesn’t notice and may produce broken results. I’d like to talk about how we hacked around this problem at work, since there aren’t a lot of well documented solutions out there.

A bit of background

In make and similar build systems, a build product is “up to date” if it is newer than all the input files by which that product is defined. The logic is seemingly elegant. If a depends on b and b depends on c, when c changes then c will be newer than b. Thus b must be rebuilt, which will make b newer than a and cause a to rebuild. Easy.

One problem that make struggles with is incomplete dependency specifications. What if a also depends on d, but this dependency isn’t listed as part of the build definition? In this case, running make a will fail to update a when d changes.
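To make the timestamp rule and its blind spot concrete, here is a minimal shell sketch of the check described above. The is_stale helper and the file names are made up for illustration, and real make does considerably more than this.

```shell
#!/bin/sh
# Sketch of make's staleness rule: a target is stale if it is missing
# or if any *declared* dependency is newer than it.
# is_stale is a hypothetical helper, not part of make.
is_stale() {
    target=$1
    shift
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        # -nt ("newer than") is a widely supported extension to test
        [ "$dep" -nt "$target" ] && return 0
    done
    return 1
}

# Demo in a scratch directory: a is declared to depend on b only,
# but in reality it also reads d.
demo=$(mktemp -d)
touch -t 202401010000 "$demo/b"
touch -t 202401010001 "$demo/a"   # a was built after b
touch -t 202401010002 "$demo/d"   # d changed after a was built

if is_stale "$demo/a" "$demo/b"; then
    echo "a: rebuild"
else
    echo "a: up to date"          # this branch runs, even though d is newer than a
fi
```

Only when d is added to the declared dependency list does the check notice the change, which is exactly the discipline make leaves up to you.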

make and friends are vulnerable to this class of problem because they run build actions directly in your shell. Any file in your working directory is visible to build actions and could in reality be a dependency of your build.

make also struggles with build configuration. To make, any build step (action) is an opaque shell script which will execute if the file dependencies aren’t up to date. It has no understanding of build configurations, and requires user discipline to convey all such options as input files rather than as environment variables. Hence in part the conventional ./configure scripts.

bazel solves these problems in two related ways. At its core, bazel is a tool for building a plan of actions (basically fork() calls) including input files, input command line and input shell environment, along with other configuration. Each one of these actions is fingerprinted using a content hash of the input files and all the input configuration. The action fingerprint serves as a repeatable identifier for (theoretically) the unique build output defined by all the provided inputs.
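As a rough illustration of the fingerprinting idea, the sketch below hashes a command line, a declared environment and the contents of the input files into a single identifier. The action_fingerprint helper is hypothetical, and Bazel’s real action keys are computed differently in detail; the point is only that content, not timestamps, drives the result.

```shell
#!/bin/sh
# Sketch of an action fingerprint: a hash over the command line, the
# declared environment, and the *contents* of the input files.
# action_fingerprint is a hypothetical helper, not Bazel's implementation.
action_fingerprint() {
    cmdline=$1   # the command the action would run
    envspec=$2   # the declared environment, e.g. "CC=gcc"
    shift 2      # remaining arguments are input files
    {
        printf 'cmd:%s\n' "$cmdline"
        printf 'env:%s\n' "$envspec"
        for input in "$@"; do
            # sha256sum prints the content hash followed by the path
            sha256sum "$input"
        done
    } | sha256sum | cut -d' ' -f1
}

demo=$(mktemp -d)
echo 'int main(void) { return 0; }' > "$demo/main.c"

before=$(action_fingerprint "cc -c main.c" "CC=gcc" "$demo/main.c")
touch "$demo/main.c"   # timestamp changes, content does not
same=$(action_fingerprint "cc -c main.c" "CC=gcc" "$demo/main.c")
after=$(action_fingerprint "cc -c main.c -O2" "CC=gcc" "$demo/main.c")

[ "$before" = "$same" ] && echo "touch: fingerprint unchanged"
[ "$before" != "$after" ] && echo "new flag: fingerprint changed"
```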

One neat property this model provides is that it allows bazel to cache and recycle build products. If a previously built artifact is requested for rebuild, bazel can just fetch the previous artifact from a cache – even a remote cache! – and save the actual build work. While this isn’t always valuable in building small applications, it can be profitable when caching the results of running a test suite or caching the results of enormous builds (such as a web browser).

The fundamental risk of this model is that – as with make and file timestamps – if the build depends on something which Bazel doesn’t know is a build input, then Bazel can’t detect change. Should that input change, the action(s) and their fingerprint(s) for the build won’t change, since Bazel is unaware of it, and you may get a stale or incorrect result because of the incomplete dependencies.

To help ensure that build actions produce repeatable results, Bazel executes actions in (somewhat) isolated chroot-like environments called sandboxes. Sandboxes contain a view of the source code, narrowed to the explicit dependencies of the current build action. This makes it hard for sandboxed actions to use dependencies which are not stated.

The default sandboxing isn’t perfect, but it’s pretty good and even trying to use sandboxes provides a lot of soundness pressure make and friends lack. You can even opt into running your entire build in a Docker sandbox, or on an entirely different machine via remote execution if you want to get paranoid about sandboxing!

$ bazel build \
    --spawn_strategy=docker \
    --experimental_docker_image=ubuntu@sha256:... \
    //your:target

Now there’s a big huge large asterisk on the default sandboxing machinery, and that’s the system on which you’re doing builds. For instance if your build uses clang, which clang are you getting? When your build links against a library using ld, what version of the library are you linking against?

The answer is that you’re usually doing a non-hermetic build using the host C compiler and linking against the host glibc.

So really your entire OS install state is a dependency of the build, but not one which is an explicit input. If you apt-get upgrade, suddenly your cc and glibc could change without Bazel noticing. Such a change should force rebuilds, but by default won’t since system files aren’t inputs to your build.

Using a workstation you’d probably never notice that you upgraded glibc and Bazel didn’t do rebuilds. glibc is good at backwards compatibility: binaries built against an older glibc keep running on a newer one, so the old build artifact will keep working fine for a long time.

But when working with Docker containers it’s easy to run into old glibcs and hit the opposite case, since a binary built against a newer glibc is not guaranteed to run on an older one. A /lib64/libc.so.6: version 'GLIBC_2.39' not found error because your Bazel cache contains entries from a too-new build can ruin your whole day.
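When you do hit that error, it helps to see which glibc symbol versions a binary actually asks for. The following is a small sketch, assuming binutils’ objdump and GNU sort (for -V) are installed; max_glibc_required is a made-up helper name and the binary path in the comment is a placeholder.

```shell
#!/bin/sh
# Print the newest GLIBC_x.y symbol version mentioned in `objdump -T`
# output read from stdin. max_glibc_required is a hypothetical helper.
# Assumes GNU sort, whose -V flag sorts version numbers naturally.
max_glibc_required() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sort -u -V | tail -n 1
}

# Usage against a real binary (the path is a placeholder):
#   objdump -T bazel-bin/your/target | max_glibc_required
#
# Demo with two lines in the shape objdump prints for undefined dynamic symbols:
printf '%s\n' \
    '0000000000000000      DF *UND*  0000000000000000 (GLIBC_2.2.5) puts' \
    '0000000000000000      DF *UND*  0000000000000000 (GLIBC_2.34) __libc_start_main' \
    | max_glibc_required
```

For the demo input this prints GLIBC_2.34, meaning the binary would need a glibc of at least that version at runtime.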

Unfortunately, Bazel doesn’t natively understand glibc or have a feature well suited to doing so. There’s a whole mess of GH issues associated with this; https://github.com/bazelbuild/bazel/issues/16976 and https://github.com/bazelbuild/bazel/issues/8766 are two of the most recently active.

How could we work around this?

Building in known contexts

We already talked about Dockerized execution, which is certainly one way to fully lock down the build context and make changes to that context explicit.

Another option is to take an entire sysroot as a dependency, so you’re always using a fixed compiler and a glibc out of a sysroot. Essentially this is doing builds inside an explicitly managed container to ensure that no build dependencies can accidentally change.

I’m given to understand this is somewhat like how builds work at Google, which is part of why there isn’t a better story for glibc “in the box”. Google’s internal build architecture apparently “has no such thing as an external dependency”, in which sysroots appear to be part of the story. I’m sure that works if you can fund a team to maintain a sysroot build matrix. Must be nice.

The Aspect folks provide a hermetic GCC toolchain which uses basically the same sysroot strategy.

A parallel fix would be using Nix environment/shell definitions to stabilize the tools used when building. For instance by wrapping Bazel with a tool that boots a Nix defined shell before calling out to Bazel, or by using rules_nixpkgs to try and embed Nix within Bazel. This doesn’t directly solve the caching issue since the Nix environment isn’t strictly a build input, but it will at least get you reproducible builds after a fashion.

The advantage of these strategies is that they’re reproducible, and some can work with Bazel’s remote execution capabilities. The downside is that since you’re doing essentially containerized builds, they’re heavyweight.

Workspace status?

Taking a step back, there are two possible solutions to the glibc versioning problem. One is to make sure that the build as configured always runs in the same environment. Nix, Docker and sysroot images are all ways to achieve that.

The other would be to make the glibc version an explicit build input so that if it changes that is visible to Bazel. This sounds a lot like Bazel’s workspace status feature.

Workspace status allows Bazel to capture information about the host and repository state for the build. This state is divided into “stable” keys which are expected to change infrequently and cause build artifacts to invalidate, and “volatile” keys which are expected to change and don’t cause rebuilds.

These keys are generated by running an arbitrary shell script (or other executable) at the start of each build. This seems to align nicely with wanting to inspect the glibc version, since we can just run a shell script to inspect that and other details before every build. Right?

Unfortunately workspace status is only a build input of “stamped” (--stamp) builds, and then only an input to executable rules (*_binary); it will not cause intermediate library products to be rebuilt for stamping. So that doesn’t actually align with the semantics we want, which is that if glibc changes the entire cache gets busted.
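For reference, a status script of the kind described here might look like the sketch below. Bazel expects one “KEY value” pair per line and treats keys prefixed with STABLE_ as stable; the particular key names, the glibc_rev helper and the script path in the usage line are my own inventions.

```shell
#!/bin/sh
# Sketch of a workspace status script. Bazel reads "KEY value" lines from
# stdout; STABLE_-prefixed keys are stable, everything else is volatile.
# The key names and helper functions are made up for illustration.
glibc_rev() {
    # Last field of the first line of `ldd --version`, e.g. "2.35".
    # Falls back to "unknown" if ldd is missing or prints nothing on stdout.
    rev=$(ldd --version 2>/dev/null | awk '{print $NF; exit}')
    echo "${rev:-unknown}"
}

print_status() {
    echo "STABLE_GLIBC_REV $(glibc_rev)"
    echo "HOST_KERNEL $(uname -r)"
}

print_status
```

Saved as, say, tools/status.sh, it would be wired in with bazel build --stamp --workspace_status_command=tools/status.sh //your:target. As noted above, only stamped actions ever see these values, which is why this doesn’t give us the cache busting we’re after.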

It’d be nice if there were a status command/state for semi-permanent build inputs such as the OS and glibc version which did have these global cache busting semantics, but there isn’t so we have to look elsewhere.

Salting the build

Since we can’t make the glibc version a build input via the workspace status machinery, are there other channels we could use to achieve the same result?

The primary thing Bazel considers as a build input is files, but environment variables and global cc flags are also supported. For instance bazel build --copt=<something> would apply a compiler option to all sub-builds, and a change to this flag is a build input change which triggers rebuilds. bazel build --action_env=VAR=val specifies an environment variable value for any build action which depends on the variable VAR as an input.

If our goal is simply to ensure that Bazel performs rebuilds on context changes, we could wrap Bazel with a script which adds a flag such as --copt=-DPLATFORM_FINGERPRINT=$(ldd --version | sha256sum | cut -d' ' -f1). (The cut drops the file name marker which sha256sum prints after the hash.) This allows us to create a build input with global scope which will cause any cc task to rebuild if it changes.

In a sense we’re just salting the fingerprints of all actions in the build by adding inputs that happen to change when the external environment changes.

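#!/bin/sh

# A sketch of a Bazel wrapper

# Find the next bazelisk or bazel on the PATH which isn't this script
next=$(
    (
        which -a bazelisk
        which -a bazel
    ) 2>/dev/null | grep -v -x -F "$(realpath "$0")" | head -n 1
)

platform_args() {
    # The last field of the first line is the glibc version, e.g. 2.35
    glibc=$(ldd --version | awk '{print $NF; exit}')
    echo "--action_env=PLATFORM_GLIBC_REV=$glibc"
    echo "--copt=-DPLATFORM_GLIBC_REV=$glibc"
}

# With no arguments there is no command to decorate
[ $# -gt 0 ] || exec "$next"

# HACK: this doesn't quite work because Bazel accepts startup args before the command
command="$1"
shift

# platform_args is deliberately unquoted so that each flag becomes its own argument
exec "$next" "$command" $(platform_args) "$@"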
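One possible way around the startup argument problem flagged in the HACK comment, offered as a sketch rather than as what we actually run: have the wrapper write the flags into a generated rc file and hand that to Bazel with --bazelrc. Since --bazelrc is itself a startup option it can always go first, and flags scoped to build in an rc file only apply to commands which inherit from build. This assumes a Bazel version which reads a --bazelrc file in addition to the default rc files; write_platform_rc and the temp file location are made up.

```shell
#!/bin/sh
# write_platform_rc REV FILE: write an rc file which salts every command
# that inherits from `build` (build, test, run, ...) with the glibc revision.
# Hypothetical helper, not part of Bazel.
write_platform_rc() {
    salt_rev=$1
    salt_rcfile=$2
    {
        echo "build --action_env=PLATFORM_GLIBC_REV=$salt_rev"
        echo "build --copt=-DPLATFORM_GLIBC_REV=$salt_rev"
    } > "$salt_rcfile"
}

# Demo: generate the rc file for this host and show it.
rc=$(mktemp)
host_rev=$(ldd --version 2>/dev/null | awk '{print $NF; exit}')
write_platform_rc "${host_rev:-unknown}" "$rc"
cat "$rc"

# The wrapper would then end with something like (not run here):
#   exec "$next" --bazelrc="$rc" "$@"
```

A side benefit is that commands such as bazel version, which don’t accept --copt, are left alone.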

Rather than just passing through the glibc version, one could imagine a more generalized platform fingerprint hash value incorporating factors like the libc version, the OS release, an actual salt counter for busting the cache and anything else your build may be conditional on.
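A generalized fingerprint along those lines could be sketched as follows. The platform_fingerprint function and the PLATFORM_SALT variable are hypothetical names, and the choice of inputs is only a starting point; add whatever else your build is quietly conditional on.

```shell
#!/bin/sh
# Sketch of a generalized platform fingerprint. platform_fingerprint and
# PLATFORM_SALT are hypothetical names; choose inputs to suit your build.
platform_fingerprint() {
    {
        # libc version banner; empty if ldd is unavailable
        ldd --version 2>/dev/null | head -n 1
        # OS release description, if the file exists
        cat /etc/os-release 2>/dev/null
        # CPU architecture
        uname -m
        # manual counter: bump it to bust the cache on demand
        echo "salt:${PLATFORM_SALT:-0}"
    } | sha256sum | cut -c1-16
}

echo "--action_env=PLATFORM_FINGERPRINT=$(platform_fingerprint)"
```

Truncating the hash to 16 characters is purely cosmetic; it keeps the flag readable in logs while still changing whenever any of the inputs do.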

Some adjustments to Bazel WORKSPACE rules may be required to make this tack work, as --action_env values don’t flow into workspace actions by default. Likewise rules wrapping non-cc compilers such as go would require minor adjustments to make the PLATFORM_GLIBC_REV a global environment input.

Thankfully, Bazel’s git_repository support for pulling in rulesets makes forking or vendoring rules easy, so the prospect of running a patched rules_go and rules_python isn’t that daunting.

A last word

The best solution would be for sysroot images under Bazel to be as easy as other kinds of 3rd party dependencies. Since a sysroot image is “just” a normal build input file, it plays nicely with all of Bazel’s change detection machinery and with remote execution.

rules_nixpkgs achieves this by using Nix definitions to create the required context. I wish them great success and want to kick the tires one of these days, because that seems like a good general strategy for locking down the build context without having to do a ton of extra work.

However, given that the alternatives (Nix, sysroots) are heavyweight and there isn’t a way to use workspace status to achieve what we want, build salting may be an acceptable solution. It isn’t a perfect solution, because from Bazel’s perspective the implication of --action_env=PLATFORM_GLIBC_REV=2.21 is that as long as the option is set you’ll get the same build results. Bazel takes that claim on trust; nothing checks that the machine running the action really has that glibc.

Really it’d be better to be able to express glibc as a platform constraint that could inform remote execution worker selection, but Bazel doesn’t support that. And we haven’t deployed remote execution yet anyway, so I guess I’ll be using a salting shim script for a while.