
Failover Test

This example demonstrates a Deployment with a single pod that uses a ReadWriteOnce (RWO) PVC. If the node running the pod fails, the pod can be rescheduled onto another node; the PVC remains available and its data stays intact.

Step 1: Create a PVC

Define a PersistentVolumeClaim requesting 1Gi of storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-opennebula
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: opennebula-fs

Apply the PVC:

kubectl apply -f pvc.yaml
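
You can optionally confirm that the claim exists. Depending on the StorageClass's volumeBindingMode, the PVC may bind immediately or stay Pending until the first pod consumes it (WaitForFirstConsumer); the output below is illustrative:

$ kubectl get pvc test-pvc-opennebula
NAME                  STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
test-pvc-opennebula   Bound    pvc-<XX>   1Gi        RWO            opennebula-fs   15s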

Step 2: Create a Deployment

Create a Deployment with a single replica that mounts the PVC:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pvc-failover-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pvc-failover
  template:
    metadata:
      labels:
        app: pvc-failover
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sh", "-c", "echo $HOSTNAME >> /data/example && sleep infinity"]
          volumeMounts:
            - mountPath: /data
              name: pvc-storage
      volumes:
        - name: pvc-storage
          persistentVolumeClaim:
            claimName: test-pvc-opennebula

Apply the Deployment:

kubectl apply -f deployment.yaml
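
Optionally, wait for the rollout to finish before moving on:

$ kubectl rollout status deployment/pvc-failover-deployment
deployment "pvc-failover-deployment" successfully rolled out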

Step 3: Verify Pod and Node

Check that the pod is running and see which node it is on:

$ kubectl get pods -l app=pvc-failover -o wide
NAME                           READY   STATUS   NODE
pvc-failover-deployment-<XX>   1/1     Running  capone-workload-md-0-dmlwv-mvdpm

Check all nodes in the cluster:

$ kubectl get nodes
NAME                               STATUS   ROLES           AGE     VERSION
capone-workload-g4rrh              Ready    control-plane   10m     v1.31.4
capone-workload-md-0-dmlwv-mvdpm   Ready    <none>          8m5s    v1.31.4
capone-workload-md-0-dmlwv-t65nc   Ready    <none>          8m22s   v1.31.4

Step 4: Simulate Node Failure

Using the OpenNebula CLI, hard-terminate the VM that backs the node where the pod is running:

onevm terminate capone-workload-md-0-dmlwv-mvdpm --hard

Verify that the node has been removed from the cluster:

$ kubectl get nodes
NAME                               STATUS   ROLES           AGE     VERSION
capone-workload-g4rrh              Ready    control-plane   10m     v1.31.4
capone-workload-md-0-dmlwv-t65nc   Ready    <none>          8m54s   v1.31.4
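
Because the failed node has been removed, the Deployment creates a replacement pod on the surviving node. It stays in ContainerCreating until the stale VolumeAttachment is cleaned up (next step); kubectl describe on the new pod typically reports a Multi-Attach warning in the meantime. Illustrative output:

$ kubectl get pods -l app=pvc-failover -o wide
NAME                           READY   STATUS              NODE
pvc-failover-deployment-<YY>   0/1     ContainerCreating   capone-workload-md-0-dmlwv-t65nc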

Step 5: Handle VolumeAttachment

When a node fails, Kubernetes keeps the VolumeAttachment object, since it cannot confirm that the volume has been detached from that node. The attach/detach controller handles the cleanup automatically:

  • After ~6 minutes (the default force-detach timeout), it concludes the node is not coming back.
  • It issues a detach call to the CSI driver (the ControllerUnpublishVolume RPC).
  • The VolumeAttachment is deleted, and the PVC becomes available to attach to a new node.

This delay is intentional: it prevents volumes from being falsely detached from nodes that are only temporarily unreachable, for example during a partial network outage.
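
You can observe this from the cluster. The commands below are a sketch: the first lists the VolumeAttachment objects (attachment, volume, and attacher names are illustrative), and the second force-deletes the stale attachment if you would rather not wait for the ~6-minute timeout in a test environment:

$ kubectl get volumeattachments
NAME       ATTACHER            PV         NODE                               ATTACHED   AGE
csi-<XX>   <csi-driver-name>   pvc-<XX>   capone-workload-md-0-dmlwv-mvdpm   true       9m

$ kubectl delete volumeattachment csi-<XX>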

Step 6: Verify Pod Failover

Check that the pod has been rescheduled on a new node and is running:

$ kubectl get pods -l app=pvc-failover -o wide
NAME                           READY   STATUS   NODE
pvc-failover-deployment-<XX>   1/1     Running  capone-workload-md-0-dmlwv-t65nc
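
Finally, verify that the data survived the failover. Each pod appends its hostname to /data/example at startup, so after failover the file should contain one line per pod instance (illustrative output):

$ kubectl exec deploy/pvc-failover-deployment -- cat /data/example
pvc-failover-deployment-<XX>
pvc-failover-deployment-<YY>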