Since December, the GitHub team behind Actions Runner Controller (ARC) has been hard at work, and they have some exciting updates. The latest release of ARC includes several new features and improvements that enhance the functionality and usability of the tool. As always, you need to be on the latest version of ARC to have full access to GitHub support and the latest features.
In December, the team released 0.10.0 and 0.10.1. The major feature in these releases is that some key rate limiters are now configurable. I’ll cover some of these details in a future post, but for now the key takeaway is that k8sClientRateLimiterQPS, k8sClientRateLimiterBurst, and runnerMaxConcurrentReconciles are now configurable. But what does this actually mean? The setting runnerMaxConcurrentReconciles configures how many threads the EphemeralRunnerController uses for reconciliation. Instead of being limited to a single thread, it now defaults to two, and you can raise that number. This allows the controller to do more in parallel, including making requests to create and delete runner resources. This provides a significant performance boost, especially in larger clusters with many resources.
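To make the idea of a concurrency cap concrete, here is a minimal Python sketch of a worker pool that processes items with a fixed maximum number of workers. It is an illustration only, not ARC's implementation: the function name reconcile_all, the sleep standing in for API calls, and the item list are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def reconcile_all(items, max_concurrent):
    """Process every item with at most `max_concurrent` workers.

    Returns the processed items (in input order) and the peak number
    of reconciles that were observed running at the same time.
    """
    lock = threading.Lock()
    active = 0
    peak = 0

    def reconcile(item):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # hypothetical stand-in for Kubernetes API calls
        with lock:
            active -= 1
        return item

    # The pool size plays the role of the concurrency setting.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        done = list(pool.map(reconcile, items))
    return done, peak


if __name__ == "__main__":
    done, peak = reconcile_all(list(range(8)), max_concurrent=2)
    print(f"processed {len(done)} items, peak concurrency {peak}")
```

With a cap of one, the eight items run strictly one after another; raising the cap lets several run side by side, which is where the throughput gain comes from.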
The other two settings allow you to configure the throttling behavior that determines how quickly the controller can make requests to the Kubernetes API. When ARC is creating or deleting runners and their resources, it needs to make calls to the API server. These requests are throttled to avoid overwhelming the server. However, in some cases, you want to increase the rate at which requests can be made to allow more changes to happen in parallel. The first value, k8sClientRateLimiterQPS, configures the sustained number of queries per second (QPS) that the controller can make to the API server. The second value, k8sClientRateLimiterBurst, configures the burst size, which is the maximum number of requests that can be sent back to back in a short spike before the QPS limit takes over. Between these three settings, ARC can process more resources for runners in parallel. Just be careful not to overwhelm your API server!
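The relationship between QPS and burst is easier to see in a small simulation. The sketch below assumes the limiter behaves like a standard token bucket, where QPS is the refill rate and burst is the bucket capacity. The TokenBucket class and the values 20 and 30 are illustrative choices, not ARC code or recommended settings.

```python
class TokenBucket:
    """Toy token bucket: refills at `qps` tokens per second, holds at most `burst`."""

    def __init__(self, qps: float, burst: int) -> None:
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0  # time of the last refill, in seconds

    def try_acquire(self, now: float) -> bool:
        """Return True if a request at time `now` is allowed."""
        # Refill for the elapsed time, never exceeding the burst capacity.
        elapsed = now - self.last
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def allowed_requests(limiter: TokenBucket, now: float, attempts: int) -> int:
    """Count how many of `attempts` simultaneous requests get through."""
    return sum(limiter.try_acquire(now) for _ in range(attempts))


if __name__ == "__main__":
    limiter = TokenBucket(qps=20, burst=30)  # illustrative values
    print(allowed_requests(limiter, now=0.0, attempts=50))  # initial spike
    print(allowed_requests(limiter, now=1.0, attempts=50))  # one second later
```

In this model, a spike can use up the whole bucket at once, after which throughput settles at the refill rate. Raising burst helps with short spikes of runner creation; raising QPS helps with sustained load.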
The final change worth highlighting concerns the errors ARC occasionally encounters when registering new runners. Rather than continuously retrying, the controller now performs an exponential backoff to give the system time to recover. This avoids the “thundering herd” problem, where the controller sends lots of requests close together for all of the scale sets and they all end up throttled.
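As a rough picture of what exponential backoff means, the following Python sketch computes a retry schedule in which each wait is a multiple of the previous one, up to a cap. The function name and the base, factor, and cap values are assumptions chosen for illustration; the actual values ARC uses are not stated here.

```python
def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> list[float]:
    """Return the wait (in seconds) before each retry.

    The n-th delay is base * factor**n, limited to `cap` so that
    waits do not grow without bound.
    """
    return [min(cap, base * factor ** n) for n in range(attempts)]


if __name__ == "__main__":
    # Hypothetical settings: start at 1s, double each time, never wait more than 30s.
    for attempt, delay in enumerate(backoff_delays(1.0, 2.0, 30.0, 6), start=1):
        print(f"retry {attempt}: wait {delay:.0f}s")
```

Compared with retrying immediately, the growing gaps mean a struggling service sees fewer and fewer requests from each failing caller while it recovers.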
So what did 0.11.0 bring to the table? First, it had a number of bug fixes to improve performance and eliminate some known issues. It also further improved how ARC processes resource cleanups by trying to detect and eliminate unnecessary reconcile passes. Normally, ARC will queue a new reconcile pass while waiting for Kubernetes to acknowledge that resources have been deleted. In some cases, ARC doesn’t need this acknowledgement and can proceed directly to the next step. In other cases, ARC was a bit too aggressive. For example, it might report readiness slightly before the pod is actually ready. That was also patched.
The real change, however, is the introduction of configurable metrics. In release 0.8.2, a metrics bug was introduced that caused certain metrics to have “high cardinality.” Basically, the system was creating an excessive number of metrics for startup and execution times. The root cause was the accidental re-inclusion of the labels runner_id and runner_name in the duration metrics. This could cause performance issues, and it made certain calculations for Grafana more challenging.
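To see why per-runner labels inflate cardinality, consider that every distinct combination of label values becomes its own time series. The Python sketch below counts those combinations for a made-up set of job observations. The data is fabricated for illustration, and the repository label is used here only as an example of a low-cardinality label.

```python
def distinct_series(observations: list[dict[str, str]], labels: list[str]) -> int:
    """Count the distinct label-value combinations, i.e. the number of time series."""
    return len({tuple(obs[label] for label in labels) for obs in observations})


# Hypothetical workload: 1000 ephemeral runners spread over 3 repositories.
# Each ephemeral runner has a unique name and ID and runs a single job.
observations = [
    {"repository": f"repo-{i % 3}", "runner_name": f"runner-{i}", "runner_id": str(i)}
    for i in range(1000)
]

if __name__ == "__main__":
    print(distinct_series(observations, ["repository"]))
    print(distinct_series(observations, ["repository", "runner_name", "runner_id"]))
```

Because ephemeral runners are never reused, a per-runner label produces a new series for every job, so the series count grows with the number of jobs rather than with the number of repositories.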
Rather than simply fix this issue, the team decided to make the metrics more configurable. This means that you have some control over the metrics that are generated and their cardinality. You can now customize the buckets and configure which of the available labels are included in the metrics. This can improve performance and make it easier to use some of the metrics in Grafana.
This change does have a consequence, however. Enabling the metrics is no longer enough. You must also configure the new listenerMetrics field in the scale set to specify the metrics you want to collect. The values.yaml includes a commented-out example of how to configure these metrics, so make sure to review that. If you are upgrading to this version and you use metrics, you’ll need to update the values.yaml for your scale sets to include these new settings. You’ll notice in reviewing the sample that some labels (like job_workflow_ref) are no longer available; you may also need to update your Grafana dashboards to reflect this change. Because of these modifications, I’m working on an updated version of my dashboard that uses the new metrics approach. I’m basing it on the default configuration, so it should be a good starting point. I hope to have more news on that shortly.
As you can see, there have been a few interesting changes to ARC over the last few months to improve its performance and metrics. If you haven’t updated to the latest version, now is a great time to do it!