Kubernetes fails to do garbage collection on images



  • For quite some time now, every once in a while I get an alert that the disk of one of my Kubernetes cluster nodes is filling up. I quickly found out that it was the Docker images. As I did not have time to analyze the issue more deeply, I just used docker system prune to get rid of everything not running (yes, I know this is highly discouraged).

    Now I have started to look into this issue, and it seems that the garbage collector has trouble with the images:

    max@nb [~]
    -> % k get events -owide
    LAST SEEN   TYPE      REASON          OBJECT          SUBOBJECT   SOURCE                MESSAGE                                                             FIRST SEEN   COUNT   NAME
    3m15s       Warning   ImageGCFailed   node/cluster00              kubelet, cluster00    failed to get imageFs info: non-existent label "docker-images"      69d          20073   cluster00.16fa70fb01b64a69
    4m15s       Warning   ImageGCFailed   node/cluster02              kubelet, cluster02    failed to get imageFs info: non-existent label "docker-images"      69d          20085   cluster02.16fa6da5f5b30c31
    54s         Warning   ImageGCFailed   node/cluster04              kubelet, cluster04    failed to get imageFs info: non-existent label "docker-images"      69d          20082   cluster04.16fa6ea64df694c5
    3m57s       Warning   ImageGCFailed   node/cluster05              kubelet, cluster05    failed to get imageFs info: non-existent label "docker-images"      69d          20087   cluster05.16fa6d1f167fe7f4
    48s         Warning   ImageGCFailed   node/cluster06              kubelet, cluster06    failed to get imageFs info: non-existent label "docker-images"      69d          20077   cluster06.16fa700540142542
    2m21s       Warning   ImageGCFailed   node/cluster11              kubelet, cluster11    failed to get imageFs info: non-existent label "docker-images"      69d          20074   cluster11.16fa70c17857fef8
    max@nb [~]
    -> %
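
    If you only want to see these specific warnings, events can also be filtered server-side by reason instead of scanning the full list (a sketch; `reason` is one of the field selectors supported for events):

    ```shell
    # List only the ImageGCFailed events across all namespaces (sketch)
    kubectl get events --all-namespaces --field-selector reason=ImageGCFailed -o wide
    ```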
    

    I have two other nodes in the cluster that are newer than the others and do not seem to have the same issue:

    max@nb [~]
    -> % kubectl get nodes
    NAME        STATUS   ROLES                  AGE      VERSION
    cluster00   Ready    control-plane,master   4y107d   v1.23.8
    cluster02   Ready    <none>                 2y270d   v1.23.8
    cluster04   Ready    <none>                 3y267d   v1.23.8
    cluster05   Ready    <none>                 4y107d   v1.23.8
    cluster06   Ready    <none>                 3y171d   v1.23.8
    cluster11   Ready    <none>                 592d     v1.23.8
    cluster12   Ready    <none>                 139d     v1.23.8
    cluster13   Ready    <none>                 139d     v1.23.8
    max@nb [~]
    -> %
    

    The cluster was initially set up manually with kubeadm more than four years ago. If I remember correctly, the version was 1.9.6 back then, and I have upgraded it from there up to the current 1.23.8. When the last two nodes (the working ones) were added, the cluster was on 1.19.8.

    It is very much possible that I forgot some configuration change in one of the upgrades. 69 days ago (the time the events first appeared according to get events) I did the upgrade from 1.19.8 to 1.23.8 (and all versions in between, of course). It could be that the issue started then. It could also be that the upgrade cleared the events log. To be honest, I am not sure whether the issue was already there before.

    Does anyone have any idea where I could start looking for the cause of this issue?

    Thanks in advance!



  • I think I fixed the issue.

    It seems to come up when the kubelet service starts before the docker service is up and running. So I added the directive After=docker.service to the [Unit] block of the kubelet service and restarted it. The issue seems to be gone now.
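
    For reference, that ordering dependency can be added without editing the packaged unit file by using a systemd drop-in (a sketch; the file name and path below are hypothetical, but drop-ins under kubelet.service.d survive package upgrades, unlike edits to the unit itself):

    ```ini
    # /etc/systemd/system/kubelet.service.d/10-after-docker.conf  (hypothetical drop-in path)
    [Unit]
    # Start kubelet only after docker, so cadvisor can query the Docker image filesystem at startup
    After=docker.service
    Wants=docker.service
    ```

    After creating the drop-in, run sudo systemctl daemon-reload && sudo systemctl restart kubelet for it to take effect. Wants= (in addition to After=, which the post above used) also pulls docker in if it is not already scheduled to start.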



