What steps did you take and what happened?
Description
When deploying a Cluster API management cluster using CAPMox (Cluster API Provider for Proxmox), VM provisioning fails if the cluster spans multiple Proxmox nodes — even when CephFS shared storage is used for Cloud-Init ISOs.
With a single node (the one hosting the template and management cluster), everything works fine.
When multiple nodes are allowed, CAPMox fails to inject the Cloud-Init ISO and VM provisioning stops with VMProvisionFailed.
Environment
Cluster API setup
- Guide followed: [Quick Start — Cluster API](https://cluster-api.sigs.k8s.io/user/quick-start)
- Infrastructure provider: CAPMox (Proxmox)
- Cluster API version: v1.11.2 (per the `kubectl get providers` output below)
- Kubernetes version (target cluster): v1.34.0
- Management cluster: kind (running inside a VM)
- OS: Ubuntu 22.04
- Shell: fish
- Network: 10.3.35.0/24 (university network, restricted internet access)
Proxmox environment
- Proxmox cluster: 4 nodes (`server-citic-5`, `server-citic-6`, `server-citic-7`, `server-citic-8`)
- Version: Proxmox VE 8.x
- Shared storage: CephFS (mounted at `/mnt/pve/cephfs`)
- VM storage pool for disks: RBD (`vms`)
- Template VM:
  - ID: `103`
  - OS: Ubuntu 22.04 (built with [image-builder](https://image-builder.sigs.k8s.io/capi/providers/proxmox))
  - QEMU Guest Agent enabled
  - Cloud-init enabled
  - Disk located on the `vms` pool
  - Cloud-init config stored in `cephfs`
- Permissions setup:
  - User: `capmox@pve`
  - API Token: `capmox@pve!capi`
  - Token permissions (confirmed in the GUI): `Administrator` on `/`, `/storage/cephfs`, `/storage/vms`, etc., with `Propagate: true`
- Shared storage CephFS enabled across all nodes.
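These details can also be double-checked from a node shell; hypothetical invocations of the stock PVE tools (I confirmed the equivalents in the GUI):

```sh
qm config 103                      # template: guest agent, cloud-init drive, disk on 'vms'
pvesm status                       # storage pools; 'cephfs' and 'vms' both active
pveum user token list capmox@pve   # the 'capi' token exists
pveum acl list | grep -i capmox    # Administrator on / with propagate
```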
Environment Variables
Used in the fish shell (`.clusterctl.fish`), sourced into the session before running `clusterctl`:
```fish
set -x CLUSTER_TOPOLOGY true
set -x PROXMOX_URL "https://10.3.35.98:8006"
set -x PROXMOX_TOKEN "capmox@pve!capi"
set -x PROXMOX_SECRET "••••••••••••••••••••••••••••••••"
set -x PROXMOX_SOURCENODE "server-citic-5"
set -x TEMPLATE_VMID 103
set -x CONTROL_PLANE_ENDPOINT_IP 10.3.35.150
set -x NODE_IP_RANGES "[10.3.35.151-10.3.35.170]"
set -x GATEWAY "10.3.35.1"
set -x IP_PREFIX 24
set -x DNS_SERVERS "[8.8.8.8,8.8.4.4]"
set -x ALLOWED_NODES "[server-citic-5,server-citic-6,server-citic-7,server-citic-8]"
set -x BRIDGE "vmbr0"
set -x BOOT_VOLUME_DEVICE "scsi0"
set -x PROXMOX_ISO_POOL "cephfs"
set -x PROXMOX_STORAGE_POOL "vms"
set -x PROXMOX_ISO_STORAGE_POOL "cephfs"
```

Reproduction steps
- Create the management cluster:
  ```sh
  kind create cluster
  ```
- Initialize CAPI with CAPMox and the in-cluster IPAM provider:
  ```sh
  clusterctl init --infrastructure proxmox --ipam in-cluster
  ```
- Generate and apply the cluster manifest:
  ```sh
  clusterctl generate cluster capi-quickstart \
    --kubernetes-version v1.34.0 \
    --control-plane-machine-count=3 \
    --worker-machine-count=3 > capi-quickstart.yaml
  kubectl apply -f capi-quickstart.yaml
  ```
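Provisioning progress can then be followed with the standard CAPI tooling, which surfaces the failure shown below:

```sh
# Watch cluster and machine objects as they reconcile
kubectl get clusters,machines -A -w

# Drill into the failing machine's conditions
clusterctl describe cluster capi-quickstart
kubectl describe proxmoxmachines -A
```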
Observed behavior
When `ALLOWED_NODES` includes multiple Proxmox nodes:
```fish
set -x ALLOWED_NODES "[server-citic-5,server-citic-6,server-citic-7,server-citic-8]"
```
cluster creation fails during VM provisioning with this error:
```
unable to inject CloudInit ISO:
Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF
```
The `clusterctl describe cluster` output shows:
```
│ ProxmoxMachine - VMProvisionFailed
│ unable to inject CloudInit ISO:
│ Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF
```
Proxmox task log:
```
TASK ERROR: failed to stat '/var/tmp/pveupload-XXXXXXXXXXXXXX'
```
Despite CephFS being configured as shared storage, the upload request to the remote node fails.
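The full error should also be visible in the provider controller logs. The deployment name below is an assumption based on the `capmox-system` namespace; adjust to whatever `kubectl get deploy -n capmox-system` reports:

```sh
# Assumed deployment name; adjust if your install differs
kubectl logs -n capmox-system deployment/capmox-controller-manager --tail=200 \
  | grep -iE "cloudinit|upload"
```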
Important findings
- When `ALLOWED_NODES=[server-citic-5]` (the same node that hosts the management cluster and the VM template), the cluster provisions successfully.
- When multiple nodes are allowed, CAPMox sometimes tries to clone the template from another node (e.g., `server-citic-8`) and fails when uploading the Cloud-Init ISO to CephFS.
- CephFS is shared and mounted on all nodes (`/mnt/pve/cephfs`, `Shared: Yes`).
- The token `capmox@pve!capi` has full `Administrator` privileges on all storages and paths.
This behavior indicates that the CAPMox controller is attempting to use the Proxmox API “upload” operation even for shared storage, leading to “failed to stat /var/tmp/pveupload-XXXX” errors when the API call is made to remote nodes.
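For what it's worth, the shared flag can be confirmed both from a node shell and through the same API the controller uses (token secret redacted):

```sh
# Storage definition; a shared storage reports "shared": 1
pvesh get /storage/cephfs

# The same query over the HTTP API
curl -sk -H "Authorization: PVEAPIToken=capmox@pve!capi=<secret>" \
  "https://10.3.35.98:8006/api2/json/storage/cephfs"
```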
What did you expect to happen?
Expected behavior
When using shared storage (CephFS):
- CAPMox should not attempt to re-upload the Cloud-Init ISO to the target node.
- Instead, it should write the ISO directly in the shared path (available to all cluster nodes) or reuse the template’s existing storage pool.
Logs / outputs
From `kubectl describe cluster`:
```
VMProvisionFailed
unable to inject CloudInit ISO:
Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF
```
From the Proxmox task viewer:
```
TASK ERROR: failed to stat '/var/tmp/pveupload-XXXXXXXXXXXXXX'
```
Working scenario (single node):
- No CloudInit injection errors.
- Cluster successfully reaches the `Provisioned` state.
Possible root cause
In CAPMox's VM creation flow (`pkg/services/proxmoxmachine`), the controller always issues an HTTP upload request to the selected node's `/api2/json/nodes/<target>/storage/<pool>/upload`, even when the storage is marked as `shared: true`.
When Proxmox receives this call on a node different from the one where the `/var/tmp/pveupload-*` file exists, the stat operation fails because the temporary upload file only exists locally.
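If this theory is right, the failure should be reproducible by hand against the documented upload endpoint. A sketch, assuming a small `test.iso` on the machine running curl (token secret redacted):

```sh
# Upload an ISO to a node other than the one terminating the API
# connection, mirroring the call CAPMox makes. If the root cause above
# is correct, this should hit the same EOF / pveupload stat failure.
curl -sk -X POST \
  -H "Authorization: PVEAPIToken=capmox@pve!capi=<secret>" \
  -F "content=iso" \
  -F "filename=@test.iso" \
  "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload"
```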
Suggested fix or enhancement
Before uploading the Cloud-Init ISO:
- Detect whether the selected storage pool is shared (`storage.shared=true` from `/api2/json/storage`).
- If shared, write the ISO to the shared path directly instead of triggering an "upload" operation against another node (see the sketch after this list).
- Optionally, add a configuration parameter or environment variable to force CAPMox to reuse shared storage for Cloud-Init ISO injection.
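To illustrate the second bullet: because the pool is shared, writing the ISO once under the mount point makes it visible everywhere. A minimal sketch with hypothetical file names, assuming the standard `template/iso` content directory:

```sh
# Write once on any node...
ssh root@server-citic-5 "cp /tmp/example-cloudinit.iso /mnt/pve/cephfs/template/iso/"
# ...and it is immediately visible from every other node, no upload needed.
ssh root@server-citic-8 "ls -l /mnt/pve/cephfs/template/iso/example-cloudinit.iso"
```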
Cluster API version
```
❯ kubectl get providers -A
NAMESPACE                           NAME                    AGE    TYPE                     PROVIDER      VERSION
capi-ipam-in-cluster-system         ipam-in-cluster         3h5m   IPAMProvider             in-cluster    v1.0.3
capi-kubeadm-bootstrap-system       bootstrap-kubeadm       3h5m   BootstrapProvider        kubeadm       v1.11.2
capi-kubeadm-control-plane-system   control-plane-kubeadm   3h5m   ControlPlaneProvider     kubeadm       v1.11.2
capi-system                         cluster-api             3h5m   CoreProvider             cluster-api   v1.11.2
capmox-system                       infrastructure-proxmox  3h5m   InfrastructureProvider   proxmox       v0.7.4
```
Kubernetes version
```
❯ clusterctl version
clusterctl version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"a3139c21c0dbd6d7a930abbe6bd2050c60f328bc", GitTreeState:"clean", BuildDate:"2025-10-07T16:09:56Z", GoVersion:"go1.24.7", Compiler:"gc", Platform:"linux/amd64"}

❯ kubectl version
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.34.0
```
Anything else you would like to add?
- The issue does not occur when using only one node (`allowedNodes=[server-citic-5]`).
- CephFS storage is fully functional across all nodes.
- This problem prevents CAPMox from distributing control planes or workers across multiple Proxmox nodes, effectively breaking multi-node scheduling.
Label(s) to be applied
kind/bug
area/infrastructure-proxmox
triage/accepted
needs-investigation
/help