Skip modprobe in peermem container when module is already loaded#683

Merged
karthikvetrivel merged 1 commit into main from fix/peermem-fast-path-skip-modprobe
Apr 10, 2026

Conversation

@karthikvetrivel
Member

@karthikvetrivel karthikvetrivel commented Apr 3, 2026

Problem

When a driver pod is deleted and recreated, the fast-path optimization skips kernel module compilation/installation since the modules are already loaded in memory. However, the peermem sidecar unconditionally runs chroot /run/nvidia/driver modprobe nvidia-peermem, which fails because /lib/modules/<kernel>/ has no .ko files on the fast path.

Fix

Add an early check for /sys/module/nvidia_peermem/refcnt in reload_nvidia_peermem() before attempting modprobe. This mirrors the pattern used in nvidia-validator's isNvidiaModuleLoaded().
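
The check described above can be sketched in shell roughly as follows. This is an illustration, not the merged diff: the function names, the optional sysfs-root argument (added so the check can be exercised against a temporary directory), and the exact log wording are assumptions. Only the /sys/module/nvidia_peermem/refcnt path and the modprobe command come from the description.

```shell
#!/bin/sh
# Sketch only: function names, the optional sysfs-root argument, and the log
# text are hypothetical and are not taken from the merged change.

# Succeeds (exit 0) when the refcnt file for nvidia_peermem exists under the
# given sysfs root (default: /sys), which indicates the module is loaded.
peermem_module_loaded() {
    sysfs_root="${1:-/sys}"
    [ -f "${sysfs_root}/module/nvidia_peermem/refcnt" ]
}

# Skips modprobe when the module is already loaded; otherwise falls back to
# the modprobe invocation quoted in the PR description.
load_peermem_if_needed() {
    if peermem_module_loaded "${1:-/sys}"; then
        echo "nvidia-peermem already loaded, skipping modprobe"
        return 0
    fi
    chroot /run/nvidia/driver modprobe nvidia-peermem
}
```

Checking sysfs first means the fast path never touches /lib/modules/<kernel>/, so the missing .ko files no longer matter when the module is already resident.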

Test plan

  • Deleted a running driver pod (kubectl delete pod )
  • Verified new pod reaches Ready state with all containers running
  • Verified k8s-driver-manager logs show "skipping the uninstallation"
  • Verified nvidia-peermem-ctr logs show "already loaded, skipping modprobe"

Comment thread rhel8/nvidia-driver
Contributor

@tariq1890 tariq1890 left a comment

Thanks @karthikvetrivel ! Let's get one more approve before merging

@JunAr7112 JunAr7112 self-requested a review April 8, 2026 21:53
@JunAr7112
Contributor

This LGTM.

@cdesiniotis
Contributor

Do we also need to make this change in the rhel10 directory?

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel karthikvetrivel force-pushed the fix/peermem-fast-path-skip-modprobe branch from 26ab94a to 446061a Compare April 8, 2026 23:54
@karthikvetrivel
Member Author

> Do we also need to make this change in the rhel10 directory?

Good catch @cdesiniotis, fixed.

@karthikvetrivel karthikvetrivel merged commit 8734c26 into main Apr 10, 2026
42 checks passed
@tariq1890 tariq1890 deleted the fix/peermem-fast-path-skip-modprobe branch April 10, 2026 19:42


4 participants