
[CAPMox] Cloud-Init ISO injection fails on shared CephFS storage when deploying across multiple Proxmox nodes #569

@angelcruzlasso


What steps did you take and what happened?

Description

When deploying a Cluster API management cluster using CAPMox (Cluster API Provider for Proxmox), VM provisioning fails if the cluster spans multiple Proxmox nodes — even when CephFS shared storage is used for Cloud-Init ISOs.

With a single node (the one hosting the template and management cluster), everything works fine.
When multiple nodes are allowed, CAPMox fails to inject the Cloud-Init ISO and VM provisioning stops with VMProvisionFailed.

Environment

Cluster API setup


Proxmox environment

  • Proxmox cluster: 4 nodes
    server-citic-5, server-citic-6, server-citic-7, server-citic-8

  • Version: Proxmox VE 8.x

  • Shared storage: CephFS (mounted at /mnt/pve/cephfs)

  • VM storage pool for disks: RBD (vms)

  • Template VM:

Permissions setup:

  • User: capmox@pve
  • API Token: capmox@pve!capi
  • Token permissions (confirmed in GUI; a quick API check is sketched below):
      • Administrator on /, /storage/cephfs, /storage/vms, etc.
      • Propagate: true
  • Shared CephFS storage enabled across all nodes.
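To double-check that the token really carries these ACLs, the effective permissions can be queried straight from the API (a rough sketch, assuming curl and jq are available and PROXMOX_SECRET is set as below):

    # Dump the effective permissions of the capmox@pve!capi token itself
    curl -sk \
      -H "Authorization: PVEAPIToken=capmox@pve!capi=$PROXMOX_SECRET" \
      "https://10.3.35.98:8006/api2/json/access/permissions" | jq .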

Environment Variables

Used in Fish shell (.clusterctl.fish):

set -x CLUSTER_TOPOLOGY true
set -x PROXMOX_URL "https://10.3.35.98:8006"
set -x PROXMOX_TOKEN "capmox@pve!capi"
set -x PROXMOX_SECRET "••••••••••••••••••••••••••••••••"
set -x PROXMOX_SOURCENODE "server-citic-5"
set -x TEMPLATE_VMID 103
set -x CONTROL_PLANE_ENDPOINT_IP 10.3.35.150
set -x NODE_IP_RANGES "[10.3.35.151-10.3.35.170]"
set -x GATEWAY "10.3.35.1"
set -x IP_PREFIX 24
set -x DNS_SERVERS "[8.8.8.8,8.8.4.4]"
set -x ALLOWED_NODES "[server-citic-5,server-citic-6,server-citic-7,server-citic-8]"
set -x BRIDGE "vmbr0"
set -x BOOT_VOLUME_DEVICE "scsi0"
set -x PROXMOX_ISO_POOL "cephfs"
set -x PROXMOX_STORAGE_POOL "vms"
set -x PROXMOX_ISO_STORAGE_POOL "cephfs"

Reproduction steps

  1. Create management cluster:

    kind create cluster
  2. Initialize CAPI with CAPMox and IPAM:

    clusterctl init --infrastructure proxmox --ipam in-cluster
  3. Generate and apply the cluster manifest:

    clusterctl generate cluster capi-quickstart \
      --kubernetes-version v1.34.0 \
      --control-plane-machine-count=3 \
      --worker-machine-count=3 > capi-quickstart.yaml
    
    kubectl apply -f capi-quickstart.yaml
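To surface the failure described below, the machines can be watched after applying the manifest (sketch; the describe flag is optional, and proxmoxmachines is the lower-cased plural of the ProxmoxMachine CRD shown in the error output):

    # Show per-machine conditions, including VMProvisionFailed
    clusterctl describe cluster capi-quickstart --show-conditions all
    kubectl get proxmoxmachines -A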

Observed behavior

When ALLOWED_NODES includes multiple Proxmox nodes:

set -x ALLOWED_NODES "[server-citic-5,server-citic-6,server-citic-7,server-citic-8]"

Cluster creation fails during VM provisioning with this error:

unable to inject CloudInit ISO:
Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF

The clusterctl describe cluster output shows:

│ ProxmoxMachine - VMProvisionFailed
│ unable to inject CloudInit ISO:
│ Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF

Proxmox task log:

TASK ERROR: failed to stat '/var/tmp/pveupload-XXXXXXXXXXXXXX'

Despite CephFS being configured as shared storage, the upload request to the remote node fails.
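For reference, the shared flag that Proxmox reports for the pool can be checked from any node (sketch; jq is only used for filtering):

    # Per-node storage status; 'shared' should be 1 for cephfs on every node
    pvesh get /nodes/server-citic-8/storage --output-format json \
      | jq '.[] | select(.storage == "cephfs") | {storage, type, shared, active}'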


Important findings

  • When ALLOWED_NODES=[server-citic-5] (the same node that hosts the management cluster and the VM template), the cluster provisions successfully.
  • When multiple nodes are allowed, CAPMox sometimes tries to clone the template from another node (e.g., server-citic-8) and fails when uploading the cloud-init ISO to CephFS.
  • CephFS is shared and mounted on all nodes (/mnt/pve/cephfs, Shared: Yes).
  • The token capmox@pve!capi has full Administrator privileges on all storages and paths.

These findings suggest that the CAPMox controller uses the Proxmox API “upload” operation even for shared storage, which leads to “failed to stat /var/tmp/pveupload-XXXX” errors when the API call targets a remote node (a standalone curl reproduction is sketched below).
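A minimal reproduction outside of CAPMox, mimicking the same upload call against a remote node (a sketch; dummy.iso is just a throwaway file, and the token is the one from the environment above):

    # Same endpoint CAPMox hits, pointed at a node other than the API host
    curl -sk \
      -H "Authorization: PVEAPIToken=capmox@pve!capi=$PROXMOX_SECRET" \
      -F "content=iso" \
      -F "filename=@dummy.iso" \
      "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload"

If the root-cause hypothesis below is right, this should fail the same way whenever server-citic-8 is not the node answering on 10.3.35.98.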


What did you expect to happen?

Expected behavior

When using shared storage (CephFS):

  • CAPMox should not attempt to re-upload the Cloud-Init ISO to the target node.
  • Instead, it should write the ISO directly to the shared path (available to all cluster nodes) or reuse the template’s existing storage pool (see the sketch below).
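To illustrate the second point: /mnt/pve/cephfs is the same filesystem on every node, so an ISO written once is visible everywhere without any per-node upload (sketch; the file name is only an example):

    # On server-citic-5: place a file in the shared ISO directory
    cp dummy-cloudinit.iso /mnt/pve/cephfs/template/iso/

    # The same content is then listed through any other node's API
    pvesh get /nodes/server-citic-8/storage/cephfs/content --content iso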

Logs / outputs

From kubectl describe cluster:

VMProvisionFailed
unable to inject CloudInit ISO:
Post "https://10.3.35.98:8006/api2/json/nodes/server-citic-8/storage/cephfs/upload": EOF

From Proxmox task viewer:

TASK ERROR: failed to stat '/var/tmp/pveupload-XXXXXXXXXXXXXX'

Working scenario (single node):

  • No CloudInit injection errors.
  • Cluster successfully reaches Provisioned state.

Possible root cause

In CAPMox’s VM creation flow (pkg/services/proxmoxmachine), the controller always issues an HTTP upload request to the selected node’s /api2/json/nodes/<target>/storage/<pool>/upload, even when the storage is marked as “shared: true”.

When Proxmox receives this call on a node different from where the /var/tmp/pveupload-* file exists, the stat operation fails because the temporary upload file only exists locally.


Suggested fix or enhancement

Before uploading Cloud-Init ISO:

  1. Detect whether the selected storage pool is shared (storage.shared=true from /api2/json/storage).
  2. If shared, use the same CephFS path without triggering an “upload” operation to another node (a rough sketch of this check follows the list).
  3. Optionally, add a configuration parameter or environment variable to force CAPMox to reuse shared storage for cloud-init ISO injection.
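A minimal sketch of that pre-flight check, written in bash against the Proxmox API (the shared field comes from the per-node storage status; the branch bodies only stand in for what the controller would do):

    # Hypothetical pre-flight check before Cloud-Init ISO injection
    POOL=cephfs
    NODE=server-citic-8   # node selected for the new VM
    SHARED=$(curl -sk \
      -H "Authorization: PVEAPIToken=capmox@pve!capi=$PROXMOX_SECRET" \
      "https://10.3.35.98:8006/api2/json/nodes/$NODE/storage" \
      | jq -r ".data[] | select(.storage == \"$POOL\") | .shared")

    if [ "$SHARED" = "1" ]; then
      echo "shared pool: write the ISO once to the shared path and skip the per-node upload"
    else
      echo "local pool: per-node upload to /nodes/$NODE/storage/$POOL/upload is still needed"
    fi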

Cluster API version

❯ kubectl get providers -A

NAMESPACE                           NAME                     AGE    TYPE                     PROVIDER      VERSION
capi-ipam-in-cluster-system         ipam-in-cluster          3h5m   IPAMProvider             in-cluster    v1.0.3
capi-kubeadm-bootstrap-system       bootstrap-kubeadm        3h5m   BootstrapProvider        kubeadm       v1.11.2
capi-kubeadm-control-plane-system   control-plane-kubeadm    3h5m   ControlPlaneProvider     kubeadm       v1.11.2
capi-system                         cluster-api              3h5m   CoreProvider             cluster-api   v1.11.2
capmox-system                       infrastructure-proxmox   3h5m   InfrastructureProvider   proxmox       v0.7.4

Kubernetes version

❯ clusterctl version

clusterctl version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"a3139c21c0dbd6d7a930abbe6bd2050c60f328bc", GitTreeState:"clean", BuildDate:"2025-10-07T16:09:56Z", GoVersion:"go1.24.7", Compiler:"gc", Platform:"linux/amd64"}

❯ kubectl version
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.34.0

Anything else you would like to add?

  • The issue does not occur when using only one node (allowedNodes=[server-citic-5]).
  • CephFS storage is fully functional across all nodes.
  • This problem prevents CAPMox from distributing control planes or workers across multiple Proxmox nodes, effectively breaking multi-node scheduling.

Label(s) to be applied

kind/bug
area/infrastructure-proxmox
triage/accepted
needs-investigation
/help
