
[CAPMox] Cloud-Init ISO injection fails on shared CephFS storage when deploying across multiple Proxmox nodes #569

@angelcruzlasso


What steps did you take and what happened?

Description

When deploying a Cluster API management cluster using CAPMox (Cluster API Provider for Proxmox), VM provisioning fails if the cluster spans multiple Proxmox nodes — even when CephFS shared storage is used for Cloud-Init ISOs.

With a single node (the one hosting the template and management cluster), everything works fine.
When multiple nodes are allowed, CAPMox fails to inject the Cloud-Init ISO and VM provisioning stops with VMProvisionFailed.

Environment

Cluster API setup


Proxmox environment

  • Proxmox cluster: 4 nodes
    server-citic-5, server-citic-6, server-citic-7, server-citic-8

  • Version: Proxmox VE 8.x

  • Shared storage: CephFS (mounted at /mnt/pve/cephfs)

  • VM storage pool for disks: RBD (vms)

  • Template VM:

Permissions setup:

  • User: capmox@pve
  • API Token: capmox@pve!capi
  • Token permissions (confirmed in GUI; a quick API check is sketched below):
      • Administrator on /, /storage/cephfs, /storage/vms, etc.
      • Propagate: true
  • Shared CephFS storage enabled across all nodes.
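To double-check that the token really carries these ACLs, the effective permissions can be queried straight from the API (a rough sketch, assuming curl and jq are available and PROXMOX_SECRET is set as below):

    # Dump the effective permissions of the capmox@pve!capi token itself
    curl -sk \
      -H "Authorization: PVEAPIToken=capmox@pve!capi=$PROXMOX_SECRET" \
      "https://10.3.35.98:8006/api2/json/access/permissions" | jq .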

Environment Variables

Used in Fish shell (.clusterctl.fish):

set -x CLUSTER_TOPOLOGY true
set -x PROXMOX_URL "https://10.3.35.98:8006"
set -x PROXMOX_TOKEN "capmox@pve!capi"
set -x PROXMOX_SECRET "••••••••••••••••••••••••••••••••"
set -x PROXMOX_SOURCENODE "server-citic-5"
set -x TEMPLATE_VMID 103
set -x CONTROL_PLANE_ENDPOINT_IP 10.3.35.150
set -x NODE_IP_RANGES "[10.3.35.151-10.3.35.170]"
set -x GATEWAY "10.3.35.1"
set -x IP_PREFIX 24
set -x DNS_SERVERS "[8.8.8.8,8.8.4.4]"
set -x ALLOWED_NODES "[server-citic-5,server-citic-6,server-citic-7,server-citic-8]"
set -x BRIDGE "vmbr0"
set -x BOOT_VOLUME_DEVICE "scsi0"
set -x PROXMOX_ISO_POOL "cephfs"
set -x PROXMOX_STORAGE_POOL "vms"
set -x PROXMOX_ISO_STORAGE_POOL "cephfs"

Reproduction steps

  1. Create management cluster:

    kind create cluster
  2. Initialize CAPI with CAPMox and IPAM:

    clusterctl init --infrastructure proxmox --ipam in-cluster
  3. Generate and apply the cluster manifest:

    clusterctl generate cluster capi-quickstart \
      --kubernetes-version v1.34.0 \
      --control-plane-machine-count=3 \
      --worker-machine-count=3 > capi-quickstart.yaml
    
    kubectl apply -f capi-quickstart.yaml
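To surface the failure described below, the machines can be watched after applying the manifest (sketch; the describe flag is optional, and proxmoxmachines is the lower-cased plural of the ProxmoxMachine CRD shown in the error output):

    # Show per-machine conditions, including VMProvisionFailed
    clusterctl describe cluster capi-quickstart --show-conditions all
    kubectl get proxmoxmachines -A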

Observed behavior

When ALLOWED_NODES includes multiple Proxmox nodes:

set -x ALLOWED_NODES "[server-citic-5,server-citic-6,server-citic-7,server-citic-8]"

Cluster creation fails during VM provisioning with this error:

unable to inject CloudInit ISO:
Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF

The clusterctl describe cluster output shows:

│ ProxmoxMachine - VMProvisionFailed
│ unable to inject CloudInit ISO:
│ Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF

Proxmox task log:

TASK ERROR: failed to stat '/var/tmp/pveupload-XXXXXXXXXXXXXX'

Despite CephFS being configured as shared storage, the upload request to the remote node fails.
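For reference, the shared flag that Proxmox reports for the pool can be checked from any node (sketch; jq is only used for filtering):

    # Per-node storage status; 'shared' should be 1 for cephfs on every node
    pvesh get /nodes/server-citic-8/storage --output-format json \
      | jq '.[] | select(.storage == "cephfs") | {storage, type, shared, active}'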


Important findings

  • When ALLOWED_NODES=[server-citic-5] (the same node that hosts the management cluster and the VM template), the cluster provisions successfully.
  • When multiple nodes are allowed, CAPMox sometimes tries to clone the template from another node (e.g., server-citic-8) and fails when uploading the cloud-init ISO to CephFS.
  • CephFS is shared and mounted on all nodes (/mnt/pve/cephfs, Shared: Yes).
  • The token capmox@pve!capi has full Administrator privileges on all storages and paths.

These findings suggest that the CAPMox controller uses the Proxmox API “upload” operation even for shared storage, which leads to “failed to stat /var/tmp/pveupload-XXXX” errors when the API call targets a remote node (a standalone curl reproduction is sketched below).
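A minimal reproduction outside of CAPMox, mimicking the same upload call against a remote node (a sketch; dummy.iso is just a throwaway file, and the token is the one from the environment above):

    # Same endpoint CAPMox hits, pointed at a node other than the API host
    curl -sk \
      -H "Authorization: PVEAPIToken=capmox@pve!capi=$PROXMOX_SECRET" \
      -F "content=iso" \
      -F "filename=@dummy.iso" \
      "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload"

If the root-cause hypothesis below is right, this should fail the same way whenever server-citic-8 is not the node answering on 10.3.35.98.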


What did you expect to happen?

Expected behavior

When using shared storage (CephFS):

  • CAPMox should not attempt to re-upload the Cloud-Init ISO to the target node.
  • Instead, it should write the ISO directly to the shared path (available to all cluster nodes) or reuse the template’s existing storage pool (see the sketch below).
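To illustrate the second point: /mnt/pve/cephfs is the same filesystem on every node, so an ISO written once is visible everywhere without any per-node upload (sketch; the file name is only an example):

    # On server-citic-5: place a file in the shared ISO directory
    cp dummy-cloudinit.iso /mnt/pve/cephfs/template/iso/

    # The same content is then listed through any other node's API
    pvesh get /nodes/server-citic-8/storage/cephfs/content --content iso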

Logs / outputs

From kubectl describe cluster:

VMProvisionFailed
unable to inject CloudInit ISO:
Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF

From Proxmox task viewer:

TASK ERROR: failed to stat '/var/tmp/pveupload-XXXXXXXXXXXXXX'

Working scenario (single node):

  • No CloudInit injection errors.
  • Cluster successfully reaches Provisioned state.

Possible root cause

In CAPMox’s VM creation flow (pkg/services/proxmoxmachine), the controller always issues an HTTP upload request to the selected node’s /api2/json/nodes/<target>/storage/<pool>/upload, even when the storage is marked as “shared: true”.

When Proxmox receives this call on a node different from where the /var/tmp/pveupload-* file exists, the stat operation fails because the temporary upload file only exists locally.


Suggested fix or enhancement

Before uploading Cloud-Init ISO:

  1. Detect whether the selected storage pool is shared (storage.shared=true from /api2/json/storage).
  2. If shared, use the same CephFS path without triggering an “upload” operation to another node (a rough sketch of this check follows the list).
  3. Optionally, add a configuration parameter or environment variable to force CAPMox to reuse shared storage for cloud-init ISO injection.
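A minimal sketch of that pre-flight check, written in bash against the Proxmox API (the shared field comes from the per-node storage status; the branch bodies only stand in for what the controller would do):

    # Hypothetical pre-flight check before Cloud-Init ISO injection
    POOL=cephfs
    NODE=server-citic-8   # node selected for the new VM
    SHARED=$(curl -sk \
      -H "Authorization: PVEAPIToken=capmox@pve!capi=$PROXMOX_SECRET" \
      "https://10.3.35.98:8006/api2/json/nodes/$NODE/storage" \
      | jq -r ".data[] | select(.storage == \"$POOL\") | .shared")

    if [ "$SHARED" = "1" ]; then
      echo "shared pool: write the ISO once to the shared path and skip the per-node upload"
    else
      echo "local pool: per-node upload to /nodes/$NODE/storage/$POOL/upload is still needed"
    fi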

Cluster API version

❯ kubectl get providers -A

NAMESPACE                           NAME                     AGE    TYPE                     PROVIDER      VERSION
capi-ipam-in-cluster-system         ipam-in-cluster          3h5m   IPAMProvider             in-cluster    v1.0.3
capi-kubeadm-bootstrap-system       bootstrap-kubeadm        3h5m   BootstrapProvider        kubeadm       v1.11.2
capi-kubeadm-control-plane-system   control-plane-kubeadm    3h5m   ControlPlaneProvider     kubeadm       v1.11.2
capi-system                         cluster-api              3h5m   CoreProvider             cluster-api   v1.11.2
capmox-system                       infrastructure-proxmox   3h5m   InfrastructureProvider   proxmox       v0.7.4

Kubernetes version

❯ clusterctl version

clusterctl version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"a3139c21c0dbd6d7a930abbe6bd2050c60f328bc", GitTreeState:"clean", BuildDate:"2025-10-07T16:09:56Z", GoVersion:"go1.24.7", Compiler:"gc", Platform:"linux/amd64"}

❯ kubectl version
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.34.0

Anything else you would like to add?

  • The issue does not occur when using only one node (allowedNodes=[server-citic-5]).
  • CephFS storage is fully functional across all nodes.
  • This problem prevents CAPMox from distributing control planes or workers across multiple Proxmox nodes, effectively breaking multi-node scheduling.

Label(s) to be applied

kind/bug
area/infrastructure-proxmox
triage/accepted
needs-investigation
/help
