Happy New Year! I hope that you had a wonderful holiday season, and wish you the very best for the coming year!
Welcome to my first post of 2025! This time, we’re going to talk about managing pod resource limits in Kubernetes. And in keeping with the theme, we’re going to discuss a new feature that was added to Kubernetes in version 1.32 (Penelope) in December, Pod-level resource specifications. The feature is in Alpha, so it’s not yet available by default and could see some refinement before it graduates to Beta and then GA.
The problem with Pods
Today, Kubernetes pods have a long-standing management challenge. To illustrate it, let’s consider a portion of the GitHub Actions Runner Controller’s (ARC) Docker-in-Docker (DinD) configuration:
```yaml
spec:
  ⋮
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      ⋮
    - name: dind
      image: docker:dind
      ⋮
```
This template creates two containers: one for the runner itself and one for its DinD sidecar, which together support the activities this pod is designed to perform. The next step in the tuning process is to set resource requests and limits. This prevents the pod from consuming excessive resources and makes it easier to plan and predict how pods will be scheduled on the cluster. Teams will start by deciding on some amount of memory (such as 8 GiB) and CPU (such as 4 cores) that they want to allocate to the pod. This is where the problem starts.
Defining resource requests and limits
While the intention is to restrict the total resources used by the containers in the pod, the reality is that Kubernetes doesn’t actually support that level of configurability. Resource requests and limits can only be configured at the namespace or container level. As a result, we can only constrain the resources used across the pods in the namespace or the resources used by individual containers in the pod.
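The namespace-level option is a `ResourceQuota`, which caps the totals across all pods in a namespace rather than within a single pod. A minimal sketch (the name, namespace, and values here are illustrative, not taken from ARC):

```yaml
# Namespace level: a ResourceQuota caps the aggregate requests/limits
# of every pod in the namespace -- it cannot budget a single pod.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: runner-quota        # illustrative name
  namespace: arc-runners    # illustrative namespace
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```

Useful for multi-tenant guardrails, but too coarse for the per-pod budgeting we want here.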
In this case, we’re not really concerned about the memory and CPU requirements of the `runner` container or the `dind` container individually. We’re more concerned about the total resources used by the pod. In fact, we know that when the `dind` container is running, it will consume memory and CPU, but during that time the `runner` will be mostly idle. The reverse is also true, since only one of the two is generally executing a given task at a time.
Unfortunately, we can only set the requests and limits at the container level, so we have to decide how to split the resources between the two containers. We might request 2 cores and 4 GiB of memory for each container, recognizing that the balance will be off at any given time. Worse, we might give each container more resources than it needs in order to ensure that each one can support a larger burst workload.
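With only container-level settings, the split might look something like this (a sketch; the even 2-core/4 GiB division is illustrative):

```yaml
# Container-level only: the 4-core / 8 GiB pod budget must be
# divided up front, even though the two containers rarely peak together.
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      resources:
        requests:
          memory: 4Gi
          cpu: 2
        limits:
          memory: 4Gi
          cpu: 2
    - name: dind
      image: docker:dind
      resources:
        requests:
          memory: 4Gi
          cpu: 2
        limits:
          memory: 4Gi
          cpu: 2
```

Neither container can borrow the other’s idle capacity: each is capped at half the budget even while its peer is doing nothing.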
Redefining the pod
The latest release of Kubernetes includes the Alpha implementation of KEP-2837, pod-level resource specifications. This new feature is designed to overcome the challenges described above by allowing us to define resource requests and limits at the pod level, which are then applied to the containers in the pod. This is a significant improvement: it lets us express the actual intent, allocating resources to the pod as a whole.
As an example, the new specification might look like this:
```yaml
spec:
  resources:
    requests:
      memory: 8Gi
      cpu: 4
    limits:
      memory: 8Gi
      cpu: 4
  ⋮
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      ⋮
    - name: dind
      image: docker:dind
      ⋮
```
This new feature makes it possible to define the resource requests and limits at the pod level. Those settings then flow down to the containers, allowing the containers to share the defined resources. At the moment, this functionality is limited to CPU and memory, but in most cases that’s exactly what we need.
Under the covers, this will essentially give each container a request/limit that matches the pod’s request/limit. Kubernetes will also schedule based on the pod’s request, if provided, while ensuring that the containers in the pod don’t collectively exceed the pod’s limits. This makes it very easy to configure our desired state. In addition, the containers can still set their own requests or limits, but the totals can’t exceed anything declared at the pod level. This allows containers to request resources unevenly while still handling bursts appropriately.
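For example, a container can still declare its own floor while the pod-level limit bounds the total (a sketch; the 2 GiB/1-core values for `dind` are illustrative):

```yaml
# Pod-level limit bounds the whole pod; dind additionally guarantees
# itself a minimum, which must fit within the pod-level budget.
spec:
  resources:
    limits:
      memory: 8Gi
      cpu: 4
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
    - name: dind
      image: docker:dind
      resources:
        requests:
          memory: 2Gi   # illustrative floor for dind
          cpu: 1
```

Here `runner` can burst up to whatever `dind` isn’t using, and vice versa, without either exceeding the 4-core/8 GiB pod ceiling.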
Since it’s in Alpha, there are still a number of limitations. For example, the autoscaler does not yet have handling for the pod-level definitions. There are a number of limitations documented in the KEP that are proposed to be addressed for the Beta:
- Surface a new field in the pod status, `DesiredResources`, to make clear the amount of resources required for a pod.
- HugeTLB handling (for large memory page usage).
- Topology handling.
- Memory manager and CPU manager support for enforcing both pod-level and container-level memory limits. At the moment, the Alpha can’t enforce the total pod-level limits.
- VPA and Autoscaler support.
- Support for Windows containers.
There’s also another major limitation for cloud-hosted clusters. Generally speaking, Alpha feature gates are not enabled on managed offerings, which can limit your ability to explore the functionality until it reaches at least Beta maturity. At the same time, local clusters can experiment with the feature and provide feedback to the Kubernetes community. This is a critical part of the Kubernetes development process: feedback from the community is used to refine the feature and ensure it aligns with actual usage and needs.
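For local experimentation, a tool like kind can turn on the gate at cluster creation. A minimal sketch, assuming Kubernetes 1.32 node images and the `PodLevelResources` feature gate from KEP-2837:

```yaml
# kind cluster config enabling the Alpha PodLevelResources feature gate
# for local experimentation (not for production use).
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  PodLevelResources: true
nodes:
  - role: control-plane
  - role: worker
```

Create the cluster with `kind create cluster --config` pointing at this file, then try a pod spec with pod-level `resources` as shown above.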
Coming soon to a cluster near you
As a reminder, Alpha means that the feature may have bugs or issues and that changes are likely in the future. It is not recommended for production use or on long-lived clusters. There’s also a chance for an Alpha feature to be dropped at any time. This usually happens if the feature is not meeting the needs of the community or has technical issues that would prevent it from reaching its goals. I think that is less likely in this case, but it’s always a possibility.
Like all Alpha features, it has a set of graduation criteria that must be met before it can move to Beta. While there is no guarantee when it will move into Beta, the current target is 1.33 (March/April 2025). Once a feature moves to Beta, it has 3 releases (approximately 9 months) to either graduate to General Availability (GA) or release a new Beta version (incorporating changes to the REST API). If it can’t meet these requirements, the feature is deprecated in the next release.
Knowing that this is coming, I’m looking forward to seeing how it evolves. This is another small change that could have a big impact on the way we manage, maintain, and tune clusters going forward. It has the potential to make it easier to define the resource requirements for pods, which in turn makes it easier to predict and plan for the resource needs of the cluster.