How to upgrade nodes in a kubernetes cluster?
For keeping the nodes of a kubernetes cluster (e.g. AWS EKS) up to date with kernel bugfixes etc., how should the existing nodes be replaced once an updated image becomes available (i.e. once the ASGs have been reconfigured so that new nodes spawn from a more recent AMI)?
For example:

- Is there a command to cordon off all existing (or old) nodes from pod scheduling?
- Is there a benefit to performing rolling restarts of deployments etc. before otherwise attempting to drain the nodes?
- Will fresh nodes automatically spin up in proportion to pod scheduling demand, even while other remaining nodes (cordoned or partway drained) sit nearly idle? Or is it better to disengage the autoscaler and perform manual scaling?
- How soon after a node is drained would the instance be automatically terminated? Will manually deleting a node (in kubernetes) cause the aws cluster autoscaler to terminate that instance immediately? Or should termination be done with the aws cli?
- Will persistent data be lost if a node is deleted (or terminated) before it has fully drained?
- Can some pods be granted exemptions from eviction (e.g. stateful long-running interactive user sessions, such as jupyterhub), while still ensuring their host node does get refreshed as soon as those pods finish? And if so, can this be overridden (when there is an urgent security patch)?
Given how kubernetes is intended for high-availability online services, I'm surprised the topic isn't more clearly consolidated in the kubernetes documentation (or more prominent in the CKA and CKS curricula). I did find docs covering different strategies (and technicalities) for node pool upgrades, from https://docs.aws.amazon.com/eks/latest/userguide/update-workers.html and https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies . There are also related ongoing kubernetes feature requests (e.g. https://github.com/kubernetes/kubernetes/issues/4301 ) to move pods more intelligently.
OS patches do not strictly necessitate a new AMI (machine image), but kernel and glibc patches do require a restart (disrupting all pods anyway), and a fresh AMI ensures consistency between nodes.
Draining a node has the effect of cordoning it off (that is, marking it unschedulable, via a taint, to prevent scheduling of new pods) and (gracefully) evicting its current pods. Note that it does not affect static/mirror pods (which are outside of API control) or daemonset pods (since their controller ignores the taint). The `kubectl drain` command blocks until complete (so draining multiple nodes requires parallel invocations), and requires a `--force` argument if the node hosts any bare pods (pods with no controller to replace them).
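Cordoning all the old nodes up front (so evicted pods cannot land on another soon-to-be-drained node), then draining them one by one, might look like the sketch below; the nodegroup label is an assumption, so substitute whatever selector identifies your old nodes:

```shell
# Mark every old node unschedulable first; the label selector is hypothetical.
kubectl cordon -l eks.amazonaws.com/nodegroup=old-workers

# Drain each node in turn; kubectl drain blocks until the node is empty.
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=old-workers -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```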
Pod disruption budgets (PDBs) are objects created to specify how many of an application's pods must be kept available during voluntary disruptions. For a long-running stateful application, a restrictive PDB can be used to prohibit certain pods from being evicted (which prevents a drain from succeeding).
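As a sketch, a PDB like the following (name and label are hypothetical) would block all voluntary evictions of matching pods, so neither `kubectl drain` nor the cluster autoscaler can remove them:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: user-sessions-pdb
spec:
  maxUnavailable: 0         # forbid evicting even one matching pod
  selector:
    matchLabels:
      app: jupyterhub-user  # hypothetical label on the protected pods
```

For the urgent-patch scenario, deleting the PDB (`kubectl delete pdb user-sessions-pdb`) lifts the protection so the drain can proceed.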
The cluster autoscaler deployment is responsible for instructing ASGs (AWS EC2 autoscaling groups) to launch additional (suitable) instances whenever there exist pending pods with no node to be scheduled to. It also drains and terminates instances, but only after several conditions have all been satisfied for an extended period (10 min by default): the node has low utilisation (less than half its capacity requested by its hosted pods); there is suitable space available on other nodes (accounting for affinities, etc.); the autoscaler would dare to evict the pods (no inhibiting disruption budgets or annotations; no `kube-system` pods other than static pods, daemonset pods, or those with explicit disruption budgets; no bare pods; no local storage); and the cluster did not need more nodes in the meantime. (Note that cluster autoscaling is intentionally not concerned with actual CPU load. It also differs from default ASG behaviour in that it does not preferentially terminate the instances with the oldest launch templates.)
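Individual pods can also opt out of (or back into) autoscaler-driven eviction via the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation; the pod name below is a placeholder:

```shell
# Tell the cluster autoscaler never to evict this pod (so its node will not be scaled down).
kubectl annotate pod jupyter-alice cluster-autoscaler.kubernetes.io/safe-to-evict="false"

# Remove the annotation again, e.g. when an urgent security patch must go ahead.
kubectl annotate pod jupyter-alice cluster-autoscaler.kubernetes.io/safe-to-evict-
```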
You could manually remove an instance with the command `aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <instance-id> --should-decrement-desired-capacity`. (Note, you may have to use `--no-should-decrement-desired-capacity` instead if the ASG is already at its MinSize.) The kubelet should attempt (using systemd) to detect whenever an instance is shutting down, and to delay the shutdown by a grace period so that remaining pods have an opportunity to shut down cleanly.
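Putting drain and termination together might look like this sketch; the node name is hypothetical, and the instance id is parsed from the node's `providerID` (which on EKS has the form `aws:///<az>/<instance-id>`):

```shell
NODE=ip-10-0-1-23.ec2.internal   # hypothetical node name

# Extract the EC2 instance id from the node object's providerID field.
INSTANCE_ID=$(kubectl get node "$NODE" -o jsonpath='{.spec.providerID}' | cut -d/ -f5)

# Evict the pods first, then remove the instance from its ASG.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id "$INSTANCE_ID" \
  --should-decrement-desired-capacity
```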
The node controller is responsible for creating node objects to represent the currently available instances, and for deleting those node objects when the corresponding instance is terminated. Note that deletions in kubernetes generally cascade. There may be cases where node deletion could discard data that a pod had claimed from a dynamically provisioned "persistent" volume. Obviously any data stored locally on the node will be jettisoned.
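Whether volume data actually survives also depends on each persistent volume's reclaim policy (a `Delete` policy removes the underlying storage along with the claim), which can be inspected before touching any nodes:

```shell
# List each persistent volume with its reclaim policy and the claim using it.
kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name
```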
When a set of replica pods is scaled down (such as for deployments using horizontal pod autoscalers), the replicaset controller preferentially deletes pods that have been ready for a shorter time (or are not ready at all), while also trying to keep the remaining pods evenly spread between nodes. This means that load fluctuations tend to relinquish only the youngest nodes, and will not organically migrate pods from cordoned older nodes to newer ones unless the older pods are manually annotated with negative deletion costs.
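The deletion-cost annotation in question is `controller.kubernetes.io/pod-deletion-cost`; giving the pods on old nodes a negative cost (the pod name below is a placeholder) makes the replicaset controller prefer to delete them first on the next scale-down:

```shell
# Pods with lower deletion cost are removed first when the replicaset scales down.
kubectl annotate pod myapp-7d4b9-x2x1z controller.kubernetes.io/pod-deletion-cost="-1"
```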
During a rollout restart, the deployment controller maintains pod availability within a tolerance (by default a surge of 25% of the replica count, rounded up, and an unavailability of 25%, rounded down). Thus, for deployments with fewer than 4 replicas, a rollout restart will ready a healthy replacement pod before it deletes each original pod. This surge mechanism avoids disruption to the application (whereas draining relies on rectification by other controllers after a disruption). A rollout restart is therefore an alternative approach for relocating pods from cordoned nodes onto other (new) nodes.
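So one low-disruption recipe is: cordon the old nodes, then surge-restart each deployment so its replacement pods can only schedule onto new nodes. The label and deployment name here are placeholders:

```shell
kubectl cordon -l eks.amazonaws.com/nodegroup=old-workers   # hypothetical node label
kubectl rollout restart deployment my-app                   # recreate pods via surge
kubectl rollout status deployment my-app                    # wait for the rollout to finish
```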