Draining node is called multiple times (30+) while deleting a machine #130

@nasusoba

I created a k3s cluster with a CAPZ ClusterClass. While deleting a machine, CAPI keeps draining the same node 30+ times, even though the node drain is successful. More investigation is needed to determine whether this is specific to CAPI k3s.

Log:

# 1st time
I0606 09:56:31.424772       1 machine_controller.go:379] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:56:33.831664       1 machine_controller.go:669] "WARNING: ignoring DaemonSet-managed Pods: kube-system/cloud-node-manager-877r7, kube-system/etcd-proxy-dq28c, kube-system/svclb-traefik-9d54adee-jmxgh\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
I0606 09:56:34.146008       1 machine_controller.go:927] "evicting pod kube-system/traefik-f4564c4f4-fhqsl\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
...
E0606 09:56:54.324252       1 machine_controller.go:686] "Drain failed, retry in 20s" err="error when waiting for pod \"local-path-provisioner-84db5d44d9-lhvh6\" in namespace \"kube-system\" to terminate: global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
....
# 2nd time (repeated 30+ times in under 2 minutes)
I0606 09:57:05.236640       1 machine_controller.go:379] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="a706ce50-e0f7-443c-bcd7-33a3168a85ab" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:57:08.328683       1 machine_controller.go:669] "WARNING: ignoring DaemonSet-managed Pods: kube-system/cloud-node-manager-877r7, kube-system/etcd-proxy-dq28c, kube-system/svclb-traefik-9d54adee-jmxgh\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="a706ce50-e0f7-443c-bcd7-33a3168a85ab" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
I0606 09:57:08.328722       1 machine_controller.go:690] "Drain successful" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="a706ce50-e0f7-443c-bcd7-33a3168a85ab" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
...
# 37th time -> machine got deleted
I0606 09:59:42.458023       1 machine_controller.go:379] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:59:42.765445       1 machine_controller.go:643] "Could not find node from noderef, it may have already been deleted" err="nodes \"demok3s-tbbs6-hmxp4\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:59:42.765538       1 machine_controller.go:710] "Could not find node from noderef, it may have already been deleted" err="Node \"demok3s-tbbs6-hmxp4\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
I0606 09:59:42.766135       1 machine_controller.go:468] "Deleting node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:59:43.098869       1 controller.go:329] "Reconciler error" err="failed to patch Machine default/demok3s-jtx7s-tpjl2: machines.cluster.x-k8s.io \"demok3s-jtx7s-tpjl2\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654"

This could be worked around by setting nodeDrainTimeout to something short, like 30s. But it is still a potential issue if we keep sending lots of unnecessary requests to the API server.
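
To put a number on the repeated drains, a rough sketch like the one below could count how many distinct reconciles logged "Draining node" for each Machine. It assumes the key="value" log format shown in the log above. The function name `count_drain_attempts` is made up for illustration and is not part of CAPI.

```python
import re
from collections import defaultdict

# Matches the key="value" pairs appended to each controller log line.
# Escaped quotes inside a value (e.g. in err="...") are tolerated.
KV_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def count_drain_attempts(lines):
    """Return {machine: number of distinct reconciles that logged "Draining node"}.

    Assumption: every drain attempt emits one "Draining node" message carrying
    Machine and reconcileID fields, as in the log excerpt in this issue.
    """
    reconciles = defaultdict(set)
    for line in lines:
        if '"Draining node"' not in line:
            continue
        fields = dict(KV_RE.findall(line))
        machine = fields.get("Machine")
        reconcile_id = fields.get("reconcileID")
        if machine and reconcile_id:
            # A set per machine so a duplicated log line is counted once.
            reconciles[machine].add(reconcile_id)
    return {machine: len(ids) for machine, ids in reconciles.items()}
```

Passing an open log file (or any iterable of lines) returns a dict keyed by the Machine field. For the run above, the entry for default/demok3s-jtx7s-tpjl2 would be the number of drain attempts seen in the captured log.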
