Ken Muse

Strategies for Upgrading ARC


So you’ve installed the Actions Runner Controller (ARC). You have your images properly configured and optimized. You’re finally scaled up and running, so it seems like the right time to call this project “done”. Glancing at the ARC Releases, you notice that a new version has been released. Now what do you do?

Read the docs

It should go without saying, but start by making sure you’ve read the latest documentation. The GitHub team does a great job of keeping the documentation up-to-date. They’ll provide you with the most current information on how to upgrade ARC. This is important, since there can be breaking changes between versions.

Why is it so hard?

The ARC codebase went through a major redesign last year. The redesign had two goals:

  • Simplify the experience by making ARC do just one thing and do it well: scale runners.
  • Eliminate the use of webhooks to increase security and maintainability.

The complicating factor is that ARC is a controller that manages Custom Resources. The Custom Resource Definitions (CRDs) are updated frequently. Without a mutation webhook, the versioning strategy becomes more complex (and resolving that would require some significant changes to the controller). As a result, you always need to make sure you’re running the latest CRDs.
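
If you want to see which CRDs (and which versions) a cluster is currently serving, kubectl can report that directly. This is a minimal sketch; actions.github.com is the API group ARC uses for its resources:

# List the ARC CRDs and the API versions each one serves
kubectl get crd -o jsonpath='{range .items[?(@.spec.group=="actions.github.com")]}{.metadata.name}{"\t"}{.spec.versions[*].name}{"\n"}{end}'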

Helm is used for deploying ARC, but it doesn’t actually help with this problem. Helm made an explicit decision (HIP-0011) not to handle upgrading or deleting CRDs. In fact, the best practices documentation notes:

There is no support at this time for upgrading or deleting CRDs using Helm. This was an explicit decision after much community discussion due to the danger for unintentional data loss. Furthermore, there is currently no community consensus around how to handle CRDs and their lifecycle.

This leaves you with the responsibility of maintaining the CRDs yourself.

The GitOps approach

One of the most common strategies teams try to use is the GitOps approach. This is where you have a Git repository that contains the configuration for your cluster. When you want to upgrade ARC, you simply update the configuration in the repository. The GitOps tool then applies the changes to the cluster. On the surface, this seems to solve the problem. Unfortunately, it can actually introduce a few challenges:

  • The components of ARC need to be installed and uninstalled in a specific order
  • The Helm charts can change between versions, requiring you to understand what changes are needed to multiple resources
  • ARC itself is continuously creating and removing resources, a process that can interfere with tools that are trying to manage and commit changes made to the cluster

In short, these tools aren’t right for the job in this particular circumstance.

The preferred approach

Use Helm! That’s how ARC was designed to be installed, upgraded, uninstalled, and reconfigured. Let it uninstall the scale sets and the controller. Don’t try to manage it by hand. Also, don’t try to set the minimum and maximum runners to zero. The scale set won’t create new runners, but it may keep accepting jobs, preventing other clusters or scale sets from being able to accept them.

First, delete the scale sets by uninstalling those charts. This deletes the associated custom resources, roles, and bindings. This also removes the listener (so no new runners are created). Before you remove the controller, you’ll want to give the existing runners a chance to finish and get cleaned up. Thankfully, we can automate that part of the process. For example, a script like this can help identify when the runners are done:

function wait_for_runnerset() {
  local namespace=$1
  local runnerset=$2
  while true; do
    echo "Waiting for runners $namespace/$runnerset to be removed ..."

    # Has the AutoscalingRunnerSet record been removed?
    local response=$(kubectl get "AutoscalingRunnerSet.actions.github.com/$runnerset" -n "$namespace" 2>&1)
    if echo "$response" | grep -E -i -q -w 'error|not found'; then
      break
    fi

    # Give the system some time to clean up managed resources
    sleep 5
  done
}

This script queries the AutoscalingRunnerSet (ARS) custom resource every 5 seconds until it is removed. In the past, I waited on the runners themselves. ARC developer @nikola-jokic recently pointed out a better approach: the AutoscalingRunnerSet is cleaned up after the last runner finishes, so the removal of that resource is a clear indication that the controller has finished cleaning up everything related to the runners. Once the resource is gone, kubectl returns an error message indicating that the resource does not exist. Alternatively, if you have isolated the runners in individual namespaces, you could list all of the ARS resources in the namespace; that will typically return a message indicating that no resources were found.
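
Putting it together, tearing down a single scale set might look like the following. The release and namespace names (arc-runner-set, arc-runners) are just examples; substitute the values from your own installation. By default, the AutoscalingRunnerSet name matches the Helm release name:

# Uninstall the scale set chart; this removes the listener and the custom resources
helm uninstall arc-runner-set --namespace arc-runners

# Block until the controller has finished cleaning up the runners
wait_for_runnerset arc-runners arc-runner-set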

It’s very important not to delete the namespace or the ARC controller manager until all of the runner pods have been cleaned up. The namespace contains secrets that are being used by the runners. If it is deleted before the controller finishes its cleanup, it can cause errors. This is especially true when the controller tries to unregister the scale set. The controller is responsible for cleaning up all of the resources related to the scale sets and for communicating with GitHub. If you remove it prematurely, you could be left with a lot of resources to clean up.

Once you’ve removed all of the runners and scale sets, you can safely uninstall the ARC controller using Helm. If you have followed the previous steps, this step will finish almost instantly. Helm is able to remove ARC, but by design it doesn’t remove any CRDs. Removing those can also be automated. Because they all share the same group (actions.github.com), they are easy to remove. For example, you can use a Bash script to find all of the CRDs and delete them:

function delete_crd() {
  local group_name=$1
  # Find every CRD that belongs to the specified API group
  local crds=$(kubectl get crd -o jsonpath="{ .items[?(@.spec.group==\"${group_name}\")].metadata.name }")
  for crd in $crds; do
    kubectl delete crd "$crd"
  done
}

delete_crd actions.github.com

With everything removed and cleaned up, you can now install the ARC controller followed by each of your scale sets. The only thing to keep in mind is that all of the components must be running the same version. You can’t mix different controller, CRD, or scale set versions. At that point the cluster is upgraded to the newest version of ARC.
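
As a sketch, the reinstallation might look like the following. The charts shown are the ones GitHub publishes; the version number, namespaces, organization URL, and secret name are placeholders for your own values:

# Example only; use the latest release
ARC_VERSION=0.9.3

# Install the controller first (this also installs the matching CRDs)
helm install arc \
  --namespace arc-systems --create-namespace \
  --version "$ARC_VERSION" \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# Then install each scale set at the same version
helm install arc-runner-set \
  --namespace arc-runners --create-namespace \
  --version "$ARC_VERSION" \
  --set githubConfigUrl="https://github.com/my-org" \
  --set githubConfigSecret=my-github-secret \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set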

Personally, I like to use helmfile for managing the installation of multiple charts. It provides an easy way to coordinate and install everything in the proper order. It can even be used to track the configuration for the entire Kubernetes cluster. Being able to create cluster configurations using infrastructure-as-code also minimizes the operational overhead.
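
To illustrate, a minimal helmfile.yaml for ARC might look something like this (assuming a recent helmfile with OCI registry support; the release names, namespaces, version, and values are placeholders). The needs entry ensures the controller is installed before the scale set:

repositories:
  - name: arc-charts
    url: ghcr.io/actions/actions-runner-controller-charts
    oci: true

releases:
  # The controller must exist before any scale sets are installed
  - name: arc
    namespace: arc-systems
    chart: arc-charts/gha-runner-scale-set-controller
    version: 0.9.3
  - name: arc-runner-set
    namespace: arc-runners
    chart: arc-charts/gha-runner-scale-set
    version: 0.9.3
    values:
      - githubConfigUrl: https://github.com/my-org
        githubConfigSecret: my-github-secret
    needs:
      - arc-systems/arc

Running helmfile apply then installs (or upgrades) both releases in the proper order.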

Avoiding downtime

If you want to minimize downtime while upgrading, make sure you’re running at least two clusters. At the moment, available jobs are assigned to the first ARC controller that asks for them, which means the clusters are constantly competing for work. When one cluster is down or unavailable, it stops requesting jobs, and all new work is assigned to the other cluster. With a minimum of two clusters, you can upgrade one while the other is still running.

Another common pattern for upgrading is to uninstall the scale sets on one cluster. Then, set up a new cluster with the latest versions of Kubernetes and ARC. Finish by deploying the scale sets. If the new cluster is correctly processing jobs, you can take down the old cluster once it finishes running its assigned jobs. After that, repeat the process with the second cluster. This follows a Kubernetes best practice: frequently eliminate nodes (and clusters) and replace them with new ones. This ensures you have the latest fixes and security patches and that your environments can always be recreated from code. Amazon actually recommends avoiding long-running instances. They suggest:

Replacing nodes regularly keeps your cluster healthy by avoiding configuration drift and issues that only happen after extended uptime (e.g. slow memory leaks). Automated replacement will give you good process and practices for node upgrades and security patching. If every node in your cluster is replaced regularly then there is less toil required to maintain separate processes for ongoing maintenance.

Remember that if you’re in Azure, you should use paired regions. You want to make sure that in a worst case outage scenario, you are always in a region that is prioritized for recovery.

When should I upgrade?

How often should you upgrade? Can you skip a release? The short version: always install and use the latest version. ARC does not follow a fixed release schedule. Updates are released as needed to improve performance, resolve known issues, and ensure compatibility with changes and improvements to the backend services.

It’s also important to understand that fixes are not backported to earlier versions. They are only available in the latest release. If you’re experiencing an issue, upgrading to the latest version may resolve it.
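
If you’re not sure which version a cluster is currently running, Helm can tell you. The namespace names here are placeholders:

# Show the deployed ARC releases and their chart versions
helm list --namespace arc-systems
helm list --namespace arc-runners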

There is one situation where it’s acceptable to use an older version. In rare cases, a release may contain a defect that impacts your runners. In that case, you should roll back to the previous version until the issue is resolved. Since ARC doesn’t release on a fixed schedule, fixes for critical issues are usually released quickly.

Avoiding the work

To be able to upgrade this frequently, it’s important to manage the environment using infrastructure as code and automation. It makes upgrading (or rolling back) a fast, simple process for any release of ARC. In fact, the part that takes the longest is usually waiting for your cloud provider to create the Kubernetes cluster. Properly automated, ARC (and any supporting services) can be upgraded or redeployed in a matter of minutes. Top-performing teams can typically create a fully configured environment in ten minutes or less.

Setting up and managing ARC environments (or, more specifically, Kubernetes) isn’t always easy. With the right tools and processes, you can make it a lot easier. There’s a saying in the DevOps community: “if it’s painful, do it more.” This applies to ARC as well. The more often you upgrade, the more you’ll appreciate how much automation eases the maintenance process. It will also simplify the process of testing and verifying changes.

And remember: the more you do it, the easier it gets.

Happy DevOp’ing!