Node labels removed and re-added on nfd-master restart #2408

@yuyue9284

What happened:

When nfd-master restarts, node labels are removed and then re-added during startup, even though the node features have not changed.

What you expected to happen:

The node labels should stay untouched.

How to reproduce it (as minimally and precisely as possible):

We have a prod cluster with 2k nodes. Nearly every time nfd-master restarts, we see nodes' labels get removed and then added back.

Anything else we need to know?:

nfd version: v0.18.2. This looks very similar to issue #1802.

We have only observed this behavior on several large clusters (1k+ nodes). We can confirm that the NodeFeature objects do not change: not every node gets its labels added back, but if we add some irrelevant label to such a node, nfd-master picks up the change and adds the NFD labels back to the node.

Not sure if it is related to listing from etcd directly.

Timeline of nfd-master:

2026-01-07 19:17:21.765 - "starting the nfd api controller"
2026-01-07 19:20:53.166 - "informer caches synced" duration="3m31.4009872s"
2026-01-07 19:20:53.181 - "starting the NFD master updater pool" parallelism=10
2026-01-07 19:20:53.181 - "http server starting" port=":8080"
2026-01-07 19:20:53.181 - attempting to acquire leader lease kube-system/nfd-master.nfd.kubernetes.io...
2026-01-07 19:21:10.536 - successfully acquired lease kube-system/nfd-master.nfd.kubernetes.io
2026-01-07 19:21:14.186 - "will process all nodes in the cluster"
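
To make the gaps in this timeline easier to read, here is a small sketch that computes the time between consecutive events. The timestamps are copied from the log above; the `events` list and the `gaps` helper are hypothetical names used only for illustration and are not part of NFD.

```python
from datetime import datetime

# Timestamps copied from the nfd-master timeline above.
events = [
    ("2026-01-07 19:17:21.765", "starting the nfd api controller"),
    ("2026-01-07 19:20:53.166", "informer caches synced"),
    ("2026-01-07 19:21:10.536", "successfully acquired lease"),
    ("2026-01-07 19:21:14.186", "will process all nodes in the cluster"),
]


def gaps(events):
    """Return (message, seconds since the previous event) for each event after the first."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"), msg) for ts, msg in events]
    return [
        (msg, (cur - prev).total_seconds())
        for (prev, _), (cur, msg) in zip(parsed, parsed[1:])
    ]


for msg, secs in gaps(events):
    print(f"{secs:8.3f}s until {msg!r}")
```

The first gap matches the `duration="3m31.4009872s"` reported for the informer cache sync, so most of the startup time is spent in the initial list.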

Requests as logged by the api-server:

1/7/2026, 7:17:21.795 PM - /nodefeatures?limit=200 → 200
1/7/2026, 7:17:31.774 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:17:44.843 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:17:53.568 PM - /nodefeatures?continue=...&limit=200 → 429 (RV 615561940)
1/7/2026, 7:18:10.124 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:18:36.574 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:18:40.567 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:18:57.726 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:19:03.211 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:19:21.653 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:19:36.705 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:19:46.684 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:20:09.740 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:20:13.360 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:20:35.855 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:20:40.265 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:20:53.123 PM - /nodefeatures?...&resourceVersion=615567111&watch=true → 200
1/7/2026, 7:20:54.261 PM - /nodefeatures?limit=200&resourceVersion=615567111 → 200
1/7/2026, 7:21:03.284 PM - /nodefeatures?continue=...&limit=200 → 410 Expired (RV 615567111)
1/7/2026, 7:21:03.526 PM - /nodefeatures?limit=200&resourceVersion=615567111 → 410 Expired
1/7/2026, 7:21:03.652 PM - /nodefeatures?limit=200 → 200 (fresh list, RV 615638081)
1/7/2026, 7:21:04.285 PM - /nodefeatures?continue=...&limit=200 → 410 Expired (RV 615561940)
1/7/2026, 7:21:04.358 PM - /nodefeatures?limit=200 → 200 (fresh list)
1/7/2026, 7:21:12.952 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615638081)
1/7/2026, 7:21:15.662 PM - /nodefeatures?...&resourceVersion=615638308&watch=true → 200
1/7/2026, 7:21:17.160 PM - /nodefeatures?...&resourceVersion=615639272&watch=true → 200
1/7/2026, 7:21:18.912 PM - /nodefeatures?...&resourceVersion=615641654&watch=true → 200
1/7/2026, 7:21:26.771 PM - /nodefeatures?continue=...&limit=200 → 500 ServerTimeout (RV 615638081)
1/7/2026, 7:28:25.931 PM - /nodefeatures?...&resourceVersion=615797355&watch=true → 200
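
As a rough aid for reading the log above, the sketch below tallies the continuation requests issued before the first watch (7:17:31 to 7:20:40) by resource version and by status code. The log lines are copied from above; `LOG` and `tally` are hypothetical names for illustration. The continuation requests carry two different resource versions, which might mean that two paginated lists were in flight at the same time. That is only an interpretation; the log itself does not say which client issued which request.

```python
import re
from collections import Counter

# Continuation requests from the initial sync, copied from the log above.
LOG = """\
1/7/2026, 7:17:31.774 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:17:44.843 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:17:53.568 PM - /nodefeatures?continue=...&limit=200 → 429 (RV 615561940)
1/7/2026, 7:18:10.124 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:18:36.574 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:18:40.567 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:18:57.726 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:19:03.211 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:19:21.653 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:19:36.705 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:19:46.684 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:20:09.740 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
1/7/2026, 7:20:13.360 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:20:35.855 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615561940)
1/7/2026, 7:20:40.265 PM - /nodefeatures?continue=...&limit=200 → 200 (RV 615567111)
"""

# Captures the HTTP status code and the resource version of each request line.
LINE_RE = re.compile(r"→ (\d{3}).*?\(RV (\d+)\)")


def tally(log):
    """Count requests per resource version and per HTTP status code."""
    by_rv, by_status = Counter(), Counter()
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            status, rv = match.groups()
            by_rv[rv] += 1
            by_status[status] += 1
    return by_rv, by_status


by_rv, by_status = tally(LOG)
print("requests per resource version:", dict(by_rv))
print("requests per status code:", dict(by_status))
```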

Environment:

  • Kubernetes version (use kubectl version): 1.33.2
  • Cloud provider or hardware configuration: azure
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Labels: kind/bug (categorizes issue or PR as related to a bug)
