Ken Muse

Planning Kubernetes Cloud Deployments


To remain stable and performant, Kubernetes relies on having resources reserved for its services and components. Last week we built an understanding of how that works and the impact on a node. This week we’ll look at how to plan for those reservations in the cloud.

With a managed Kubernetes instance, the cloud provider typically takes ownership of configuring most of the key components of the cluster. This includes the API Server, etcd, and the node configurations. The cloud provider also sets the resource reservations for the Kubernetes components. This directly impacts how you use their environment and how it scales. The key thing to understand is that each cloud provider has a different way of configuring their clusters and supporting your scaling decisions. That includes how they calculate the resource reservations.

This post is not intended to compare the providers, their relative strengths and weaknesses, or the impact of their decisions on your workload. It is also not intended as a comparison for determining which provider to use. Instead, it’s intended to help you make educated decisions when planning a deployment with a given cloud provider. If you don’t understand the limitations and configurations, it’s incredibly hard to predict how your cluster will behave.

The control plane

The cloud providers are not completely transparent with the details of how they configure the control plane components. We often get some insights when they announce performance improvements, like this announcement from AWS that touts a 4x scaling improvement. Details like this also make the importance of those settings very visible. As another example, Google documents that they automatically create non-configurable resource quotas for clusters with under 100 nodes, and an Autopilot cluster dynamically configures the pods-per-node.

Another way to understand the control plane is to look at the troubleshooting guides. For example, Azure’s guide for the API server and etcd (a common source of issues) provides guidance for diagnosing problems with those components. While not revealing the configuration or logic, it does help you identify the services and components that are impacted by those decisions.

Finally, each provider generally offers best practice guidance for using their cluster. These documents also include general guidance that can improve your overall experience. The Azure guidance for large workloads reveals that the cluster’s control plane is scaled based on the cluster size and API server utilization. It also highlights how to monitor control plane metrics and some of the challenges with node metrics after scaling.

The AWS best practices guide includes insights such as customizing the scan interval for the autoscaler, pointing out that the default scan interval is 10 seconds on AWS and that each scan makes many API calls, which can result in rate limiting or challenges with control plane availability. Increasing that interval reduces the API calls at the cost of how quickly the autoscaler recognizes scaling needs.

Google provides best practices through blogs and targeted guides, such as Best practices for GKE networking. This guide mentions that if you create a GKE cluster with the --enable-dataplane-v2 option, the cluster will use eBPF and configure a NetworkLogging CRD to log allowed and denied connections. It also documents some of their built-in network restrictions and how to work within them.

Resource reservations

Each cloud provider takes a very different approach to resource reservations in order to maximize stability and uptime. Some are more aggressive to ensure a stable cluster, while others are less restrictive and rely on your Kubernetes expertise to avoid issues. Last week, I showed how you can query the reservations on a given node manually. If you’re using a cloud provider, they each provide a way to understand and calculate those reservations.
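As a quick refresher, the combined effect of those reservations is visible by comparing each node's capacity to its allocatable resources. Below is a minimal sketch using the official Kubernetes Python client (assuming a working kubeconfig and the kubernetes package installed); the gap between the two values is what the provider has reserved, plus the eviction threshold.

```python
# Compare capacity vs. allocatable for each node to see what the provider reserves.
# Assumes a working kubeconfig and `pip install kubernetes`.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    cap = node.status.capacity
    alloc = node.status.allocatable
    print(node.metadata.name)
    for resource in ("cpu", "memory", "ephemeral-storage"):
        print(f"  {resource}: capacity={cap[resource]}, allocatable={alloc[resource]}")
```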

Google provides documentation on their calculations. Azure also details their Kubernetes reservations in their online documentation. I haven’t found the same documentation from Amazon, but that doesn’t mean they don’t provide those details. They actually provide their bootstrap script, which documents their reservations.

To save you some time with the calculations, I’ve compiled some key details for each provider.

CPU Reservations

This is the easiest value to calculate for each cloud provider. Google and Amazon use the same metrics for CPU reservations. Azure takes a different approach and, at the moment, only publishes specific values per core count rather than a formula. The CPU reservations (in millicores, or 1/1000 of a core) are:

Provider      1 core    2 cores    4 cores    8 cores    16 cores    32 cores    64 cores
Amazon EKS    60        70         80         90         110         150         230
Google GKE    60        70         80         90         110         150         230
Azure AKS     60        100        140        180        260         420         740

Or, as a percentage of the total CPU capacity available on the node:

Provider      1 core    2 cores    4 cores    8 cores    16 cores    32 cores    64 cores
Amazon EKS    6%        3.5%       2%         1.1%       0.7%        0.5%        0.4%
Google GKE    6%        3.5%       2%         1.1%       0.7%        0.5%        0.4%
Azure AKS     6%        5%         3.5%       2.3%       1.6%        1.3%        1.2%

In addition, most providers typically allocate 100 millicores as a system reservation (for OS system daemons, such as udev). Overall, the default CPU reservations are fairly small, especially if you are using 8 cores or more (which is generally recommended). For Amazon and Google, there is a greater reliance on the administrator managing the commitment of CPU resources. Azure tends to reserve slightly more to ensure that the Kubernetes core services remain performant (trading off a level of control and required expertise).
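If you’d rather compute these values than memorize the table, here’s a small Python sketch of the tiered formula that Google documents (and that the EKS numbers above match): 6% of the first core, 1% of the second, 0.5% of cores three and four, and 0.25% of every core after that. Treat it as an illustration of the calculation, not an official implementation.

```python
def gke_eks_cpu_reservation_millicores(cores: int) -> int:
    """Approximate kube-reserved CPU (in millicores) for GKE/EKS nodes.

    Tiers: 6% of the first core, 1% of the next core, 0.5% of the next
    two cores, and 0.25% of any cores beyond four.
    """
    millicores = 1000 * cores
    reserved = 0.06 * min(millicores, 1000)
    if millicores > 1000:
        reserved += 0.01 * (min(millicores, 2000) - 1000)
    if millicores > 2000:
        reserved += 0.005 * (min(millicores, 4000) - 2000)
    if millicores > 4000:
        reserved += 0.0025 * (millicores - 4000)
    return round(reserved)

# Reproduces the table above: 60, 70, 80, 90, 110, 150, 230
print([gke_eks_cpu_reservation_millicores(c) for c in (1, 2, 4, 8, 16, 32, 64)])
```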

Memory reservations

Memory reservations differ significantly between the providers and tend to change the most over time. Google uses a formula to calculate the memory to reserve based on the available node memory. For Azure AKS running Kubernetes versions before 1.29, the algorithm is almost identical to Google GKE’s. The biggest difference is that those versions of AKS use an eviction threshold of 750 MiB, while Google’s threshold is only 100 MiB. That means Azure is more aggressive in evicting pods to ensure the system remains stable (again, trading off a level of control and required expertise).

Beyond that, the memory both providers reserve, based on the available node memory, can be calculated as:

Provider   < 1 GiB    4 GiB    8 GiB      16 GiB     32 GiB      64 GiB      128 GiB     256 GiB
GKE        255 MiB    1 GiB    1.8 GiB    2.6 GiB    3.56 GiB    5.48 GiB    9.32 GiB    11.88 GiB
Azure      N/A        1 GiB    1.8 GiB    2.6 GiB    3.56 GiB    5.48 GiB    9.32 GiB    11.88 GiB
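The tiers behind these numbers (as Google documents them) are 255 MiB for nodes with less than 1 GiB of memory; otherwise 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, and 2% of anything above 128 GiB. A quick Python sketch of that formula (illustrative only) reproduces the table:

```python
def gke_memory_reservation_gib(memory_gib: float) -> float:
    """Approximate kube-reserved memory (GiB) using GKE's documented tiers."""
    if memory_gib < 1:
        return 255 / 1024  # flat 255 MiB for very small nodes
    tiers = [          # (size of tier in GiB, fraction reserved)
        (4, 0.25),     # first 4 GiB
        (4, 0.20),     # next 4 GiB (up to 8 GiB)
        (8, 0.10),     # next 8 GiB (up to 16 GiB)
        (112, 0.06),   # next 112 GiB (up to 128 GiB)
        (float("inf"), 0.02),  # anything above 128 GiB
    ]
    reserved, remaining = 0.0, memory_gib
    for size, fraction in tiers:
        portion = min(remaining, size)
        reserved += portion * fraction
        remaining -= portion
        if remaining <= 0:
            break
    return reserved

# Reproduces the table above: 1, 1.8, 2.6, 3.56, 5.48, 9.32, 11.88 GiB
print([round(gke_memory_reservation_gib(g), 2) for g in (4, 8, 16, 32, 64, 128, 256)])
```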

If you’re running Azure AKS with Kubernetes 1.29 or later, the logic changes. Azure adopts a memory reservation formula that is similar to the one used by AWS EKS. With this approach, the memory reservation is based on the maximum number of pods supported by the node. The eviction threshold for both is 100 MiB, relying more on expertise and the reservations themselves.

For Azure, it’s the lesser of 25% of the memory or (20 MB * max_pods) + 50 MB. The maximum number of pods is configurable. The default value for the pods per node depends on the selected networking, so it’s especially important to understand your pod density. The default max_pods is typically 110 pods, creating a default memory reservation of 2250 MB (or 25% of memory on 4 GiB and 8 GiB nodes). If you only need 20 pods per node, then the reservation is (20 * 20 MB) + 50 MB = 450 MB. Quite the savings!
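Here’s that Azure calculation as a quick Python sketch, using the units from the documentation (MB) and reproducing the 110-pod and 20-pod examples above:

```python
def aks_memory_reservation_mb(node_memory_mb: int, max_pods: int = 110) -> float:
    """AKS (Kubernetes 1.29+) memory reservation: the lesser of 25% of node
    memory or (20 MB * max_pods) + 50 MB."""
    return min(0.25 * node_memory_mb, 20 * max_pods + 50)

print(aks_memory_reservation_mb(16_384, max_pods=110))  # 2250 MB on a 16 GiB node
print(aks_memory_reservation_mb(16_384, max_pods=20))   # 450 MB with a 20-pod limit
```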

The formula for AWS is different: (11 MiB * max_pods) + 255 MiB. The maximum number of pods is calculated based on the node’s instance type and the CNI version using a script provided by AWS (with instructions!). The script limits the maximum to 110 if there are fewer than 30 vCPUs; otherwise the maximum is 250. As an example, an m5.large supports 29 pods, so the memory reservation is (11 * 29) + 255 = 574 MiB. For an instance type supporting 110 pods, the reservation is 1465 MiB. Unlike Azure, this number does not consider user-provided limits. That said, the reservations can often be less than Azure’s default reservation.
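And the EKS equivalent, following the formula above (a sketch, where max_pods comes from the AWS-provided script for your instance type and CNI version):

```python
def eks_memory_reservation_mib(max_pods: int) -> int:
    """EKS memory reservation: (11 MiB * max_pods) + 255 MiB, where max_pods
    is derived from the instance type and CNI configuration."""
    return 11 * max_pods + 255

print(eks_memory_reservation_mib(29))   # 574 MiB (e.g., an m5.large)
print(eks_memory_reservation_mib(110))  # 1465 MiB
```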

Storage reservations

Generally speaking, each provider has an eviction threshold set to 10% of the file system and 5% of the inodes for Kubernetes. Unfortunately, Azure and Amazon don’t directly document these decisions. They are, however, available by examining the nodes themselves. Google is more transparent. In addition to the eviction threshold, GKE reserves system storage, using the lesser of:

  • 50% of the boot disk capacity
  • 35% of the boot disk capacity, plus 6 GiB
  • 100 GiB

That means on a typical 300 GiB disk, Azure and EKS will reserve 30 GiB for Kubernetes. The remainder is available as ephemeral storage for containers and images (and managed by the Kubernetes administrator). By comparison, GKE will reserve 100 GiB for the system and 30 GiB for Kubernetes. The remaining 170 GiB is available for ephemeral storage. In this case, Google is trying to ensure that the system has enough guaranteed storage to remain stable and performant.
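A small sketch makes the comparison easier to see. It applies the 10% eviction threshold everywhere and GKE’s lesser-of-three system reservation, reproducing the 300 GiB example above (illustrative only):

```python
def gke_system_storage_gib(boot_disk_gib: float) -> float:
    """GKE system storage reservation: the lesser of 50% of the boot disk,
    35% of the boot disk plus 6 GiB, or 100 GiB."""
    return min(0.50 * boot_disk_gib, 0.35 * boot_disk_gib + 6, 100)

def ephemeral_storage_gib(boot_disk_gib: float, system_gib: float = 0) -> float:
    """Storage left for pods after the 10% eviction threshold and any
    system reservation."""
    return boot_disk_gib - 0.10 * boot_disk_gib - system_gib

disk = 300
print(ephemeral_storage_gib(disk))                                # 270 GiB on AKS/EKS
print(ephemeral_storage_gib(disk, gke_system_storage_gib(disk)))  # 170 GiB on GKE
```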

There is some additional logic that Google uses if the storage is backed by local SSDs. They set the eviction threshold to 37.5 GiB * num_ssd. The system reservation is 25 GiB + (25 GiB * num_ssd), with a max reservation of 100 GiB. Now you understand why every configuration decision in Kubernetes – including storage – is so critical.
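Following the same pattern, the local SSD variant described above can be sketched as (num_ssd being the number of attached local SSDs):

```python
def gke_local_ssd_reservations_gib(num_ssd: int):
    """Returns (eviction_threshold, system_reservation) in GiB for
    local-SSD-backed ephemeral storage on GKE, per the formulas above."""
    eviction = 37.5 * num_ssd
    system = min(25 + 25 * num_ssd, 100)
    return eviction, system

print(gke_local_ssd_reservations_gib(2))  # (75.0, 75)
```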

Start your planning

While they are similar, each provider has different approaches to reserving resources for Kubernetes. For planning pod density, you need to know what’s available on your nodes. For planning scale, you need to understand how their decisions impact the reservations and managed control plane. Each of these components helps to improve the performance, stability, and responsiveness of the cluster.

As a final thought, be aware that limitations on the control plane frequently lead to a requirement to build more clusters instead of larger clusters. Limitations on the control plane can also lead to unexpected performance issues, especially when the cluster is managing (or requesting) large numbers of pods or new nodes. This is why we often have to use larger nodes for scaling – to minimize the need to add new nodes and maximize the number of pods a single node is supporting. The tradeoff is that we lose the ability to use small nodes to minimize cost while achieving isolation and resiliency. In those cases, a strong infrastructure-as-code approach is the most effective way to manage the clusters.

You now have a better understanding of how the cloud providers reserve resources. Armed with this knowledge, you can make better decisions about how to scale your clusters and manage your workloads.