Kubernetes fails to do garbage collection on images
For quite some time now I have occasionally been getting an alert that the disk of one of my Kubernetes cluster nodes is filling up. I pretty quickly found out that it was the Docker images. As I did not have time to analyze the issue more deeply, I just used
docker system prune
to get rid of everything not running (yeah, I know this is highly discouraged).
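In case it helps, the manual workaround looks roughly like this (a sketch; the exact flags may differ from what I actually ran):

docker system df       # show how much space images, containers and volumes use
docker system prune    # remove stopped containers, dangling images and unused networks
docker image prune -a  # additionally remove all images not used by a running container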
Now I have started to look into this issue, and it seems that the garbage collector has issues with the images:
max@nb [~] -> % k get events -owide
LAST SEEN   TYPE      REASON          OBJECT           SUBOBJECT   SOURCE               MESSAGE                                                           FIRST SEEN   COUNT   NAME
3m15s       Warning   ImageGCFailed   node/cluster00               kubelet, cluster00   failed to get imageFs info: non-existent label "docker-images"   69d          20073   cluster00.16fa70fb01b64a69
4m15s       Warning   ImageGCFailed   node/cluster02               kubelet, cluster02   failed to get imageFs info: non-existent label "docker-images"   69d          20085   cluster02.16fa6da5f5b30c31
54s         Warning   ImageGCFailed   node/cluster04               kubelet, cluster04   failed to get imageFs info: non-existent label "docker-images"   69d          20082   cluster04.16fa6ea64df694c5
3m57s       Warning   ImageGCFailed   node/cluster05               kubelet, cluster05   failed to get imageFs info: non-existent label "docker-images"   69d          20087   cluster05.16fa6d1f167fe7f4
48s         Warning   ImageGCFailed   node/cluster06               kubelet, cluster06   failed to get imageFs info: non-existent label "docker-images"   69d          20077   cluster06.16fa700540142542
2m21s       Warning   ImageGCFailed   node/cluster11               kubelet, cluster11   failed to get imageFs info: non-existent label "docker-images"   69d          20074   cluster11.16fa70c17857fef8
max@nb [~] -> %
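To look at a single node directly instead of going through events, one can check the kubelet logs and what the kubelet reports for its image filesystem; a rough sketch (the node name is just an example):

# on the affected node: recent image GC errors from the kubelet
journalctl -u kubelet --since "1 hour ago" | grep -i imagegc

# via the API server: what the kubelet's Summary API reports for imageFs
kubectl get --raw "/api/v1/nodes/cluster00/proxy/stats/summary" | grep -A 6 imageFs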
I have two other nodes in the cluster that are newer than the others and do not seem to have the same issue:
max@nb [~] -> % kubectl get nodes
NAME        STATUS   ROLES                  AGE      VERSION
cluster00   Ready    control-plane,master   4y107d   v1.23.8
cluster02   Ready    <none>                 2y270d   v1.23.8
cluster04   Ready    <none>                 3y267d   v1.23.8
cluster05   Ready    <none>                 4y107d   v1.23.8
cluster06   Ready    <none>                 3y171d   v1.23.8
cluster11   Ready    <none>                 592d     v1.23.8
cluster12   Ready    <none>                 139d     v1.23.8
cluster13   Ready    <none>                 139d     v1.23.8
max@nb [~] -> %
The cluster was initially set up manually with kubeadm more than four years ago. If I remember correctly, the version back then was 1.9.6, and I have upgraded it from there all the way to the current 1.23.8. When the last two nodes (the working ones) were added, the cluster was on 1.19.8.
It is very much possible that I missed some configuration change in one of the upgrades. 69 days ago (the FIRST SEEN time of the events shown by get events) I did the upgrade from 1.19.8 to 1.23.8 (stepping through all versions in between, of course). It could be that the issue started then; it could also be that the upgrade simply cleared the event log. To be honest, I am not sure whether the issue was already there before.
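One way to narrow this down would be to diff the kubelet configuration between a broken node and one of the working ones; a sketch, assuming the standard kubeadm file locations and SSH access (node names are just examples):

# kubeadm-managed kubelet config from a broken and a working node
scp cluster00:/var/lib/kubelet/config.yaml /tmp/kubelet-broken.yaml
scp cluster12:/var/lib/kubelet/config.yaml /tmp/kubelet-working.yaml
diff /tmp/kubelet-broken.yaml /tmp/kubelet-working.yaml

# the extra flags passed to the kubelet can differ as well
scp cluster00:/var/lib/kubelet/kubeadm-flags.env /tmp/flags-broken.env
scp cluster12:/var/lib/kubelet/kubeadm-flags.env /tmp/flags-working.env
diff /tmp/flags-broken.env /tmp/flags-working.env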
Does anyone have an idea where I could start looking for the cause of this issue?
Thanks in advance!
I think I fixed the issue.
It seems to come up when the kubelet service starts before the docker service is up and running. So I added the corresponding ordering directive to the [Unit] block of the kubelet service and restarted it. The issue seems to be gone now.
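For anyone hitting the same thing, a systemd drop-in along these lines should do it (the drop-in file name and the exact directives here are an illustration, not necessarily verbatim what I used):

# create a drop-in so the kubelet only starts once docker is up
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/20-after-docker.conf
[Unit]
After=docker.service
Wants=docker.service
EOF

sudo systemctl daemon-reload
sudo systemctl restart kubelet

After= only affects start ordering; Wants= additionally pulls docker in if it is not already scheduled to start.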