20,000 Layers Under the Sea

tldr; My attempt to distill the hundreds of hours of research, experimentation, implementation, testing and bug fixing that went into the OCI Publish feature into a single post.

Phase 0: The Problem

To set the stage, publishing / storing Zarf packages within Docker registries is a concept the Zarf team had been kicking around for a while and in early January it became a critical feature request due to UDS needing a "package manager". The initial details of this request can be found in this discussion. I volunteered to take on the task of understanding + implementing the interaction between Zarf and OCI compliant registries (due to my previous experience re-designing Zarf's Rust injector system).

There were a few things I knew going in:

Publishing would need to work with every registry compliant with the OCI Distribution Spec
cosign/helm were already tools that were capable of publishing non images to registries
Publishing would need authentication
Publishing would need to to handle large layers (potentially 10+ GB/layer) (looking at you, Kubevirt)
Published packages would need able to be pulled/deployed/inspected after the fact

Phase 1: Research

The first thing I did was re-read the OCI Manifest Spec and OCI Artifact Spec to re-familiarize myself with what actually goes into an image and artifact respectively (yes there is a substantial difference between an OCI image and an OCI artifact).

Following this, I looked at the tools that exist today that I knew were doing something similar to what I was trying to accomplish: cosign and helm. After a cursory look at their imports, I discovered they were both using ORAs, a low-level CLI + Go library for interacting with OCI registries.

Sadly (at the time I started writing the feature), most libraries using ORAs were on the older v1 release, but v2 was released and stable. This meant that I did not have a lot of examples to work from as the ORAs v2 docs were lacking. Not to worry, I just simply read the entire library codebase, then read the ORAs CLI codebase to see how they were implementing the library. For good measure I went back and read how cosign + helm were using ORAs v1 to see how things had been done in the past. I like reading.

Phase 2: Experimentation

Armed with my new conceptual understanding of how ORAs worked, I started experimenting with the library. I started by spinning up a local Docker registry with simple authentication using the below docker-compose.yaml

services:
  registry:
    image: registry:2.8.1
    container_name: registry
    ports:
      - "666:5000" # expose the registry on port 666 because Macs don't like 5000
    volumes:
    - ./mnt/registry:/var/lib/registry # mount the registry volume to my local file system so I can inspect the layers
    - ./auth.htpasswd:/etc/docker/registry/auth.htpasswd # mount the auth file created from myuser:mypass
    environment:
      REGISTRY_AUTH: "{htpasswd: {realm: localhost, path: /etc/docker/registry/auth.htpasswd}}"
    restart: always
    networks:
      - default

I then wrote some simple and small Go code to figure out how to push a single layer to the registry using ORAs v2, once successful I turned my attention to pushing multiple layers, then finally creating an artifact manifest. This is where I encountered my first (of many) snags. I discovered that just because a registry supports the OCI Distribution Spec, it doesn't mean it supports the OCI Artifact Spec. Like the ORAs CLI, the Publish feature would need to attempt an artifact push, and if it failed, fall back to a regular image push. It was also at this time that I learned how to use Docker's Moby library to interact with Docker's authentication system to retrieve credentials for a given registry (that way I would not have to write my own).

I ran extensive tests on the Publish feature to ensure it would work with the following registries:

Docker Hub
distribution/distribution (registry:2.8.1)
ECR (Amazon Elastic Container Registry) (had a real fun time debugging this one that eventually culminated in me doing a packet capture to see what was actually being sent to the registry)
GHCR (GitHub Container Registry)

and with the following package types:

Regular Zarf packages
Zarf packages with large layers (Kubevirt)
Zarf packages with many layers (Big Bang) (the Big Bang extension was not 100% ready while I was creating this feature, but I did some manual testing to ensure it would work)

Phase 3: Implementation

Once I was confident that the Publish feature was working as expected, I started the process of integrating it into Zarf. A great deal of refactoring and rewriting went on during this time, with some amazing code reviews from Wayne Starr, and pair programming from Jon Perry (ie they ripped my garbage code to shreds and helped me rebuild from the ground up). In order to provide an improved user experience during publishes/pulls I drilled down into ORA's HTTP client in order to provide a byte by byte progress bar and status updates per layer. The following commands were added to the Zarf CLI from this effort:

zarf package publish <package> <registry> # publish a built package to a registry

zarf package inspect oci://<package>

zarf package deploy oci://<package>

zarf package pull oci://<package>

A great (if I do say so myself) walkthrough of using these commands can be found in the Zarf docs.

It was also at this time that I merged with Jon's package re-structuring work (discussion here) and the feature was nearly complete! (I also accidentally nuked the PR branch and had to re-create it, but that's a story to be told over drinks)

Phase 4: Testing

The final phase of the project was testing. I added a few end-to-end CLI tests to ensure the Publish feature was working as expected. During this I encountered a bug on our Windows CI runner that was caused by syft not properly closing file handles after it created them (there went twelve hours debugging that one, also another story requiring alcohol) pr here. Future testing of this feature will be done in a nightly CI job that will test publishing to all the registries listed preceding and will handle more complex edge cases (like Big Bang).

Conclusion

While this post doesn't do the Zarf team justice in the true amount of energy spent building and refining this feature, I do hope it showed a small peek behind the curtain about some of the work and effort that goes into creating quick, stable, two-week turnaround Zarf features.

As always, the team welcomes and encourages your PRs, issues, and feedback. If you have any questions about this feature or Zarf in general, please feel free to reach out to us in the #zarf channel on Slack or DM me @razzlegpt.