This repository was archived by the owner on May 16, 2024. It is now read-only.
Everything seems ok but no vhca device in test Pod. #9
Open
Hi, I've run into a problem and have no idea how to fix it. I have several nodes deployed with the rdma sriov device plugin and the sriov cni. Everything was working, and pods could communicate with each other via the vhca device whether or not they were launched on the same node. But one day one of the nodes went bad: a new pod launched on it fails to acquire a vhca device (the pod launches normally and is in the Running phase), even though everything else seems ok.
Here is what I've checked:
- Checking the rdma/vhca resource on the node:
# kubectl describe node 10.128.2.30
...
Capacity:
  cpu: 48
  ephemeral-storage: 52399108Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 131747876Ki
  nvidia.com/gpu: 8
  pods: 110
  rdma/vhca: 8
Allocatable:
  cpu: 48
  ephemeral-storage: 48291017853
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 131645476Ki
  nvidia.com/gpu: 8
  pods: 110
  rdma/vhca: 8
...
- When creating a new pod on the node, I get the following log from the rdma sriov device plugin:
2018/08/07 07:33:33 allocate request: &AllocateRequest{ContainerRequests:[&ContainerAllocateRequest{DevicesIDs:[16:5f:e4:4f:a7:28],}],}
2018/08/07 07:33:33 allocate response: {[&ContainerAllocateResponse{Envs:map[string]string{},Mounts:[],Devices:[&DeviceSpec{ContainerPath:/dev/infiniband,HostPath:/dev/infiniband,Permissions:rwm,}],Annotations:map[string]string{},}]}
- I use test-sriov-pod.yaml to create a test pod. The pod launches normally and is in the Running phase, but its network interface is not a vhca device, and no vhca devices are found with show_gids:
# ethtool -i eth0
driver: veth
version: 1.0
firmware-version:
expansion-rom-version:
bus-info:
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
# show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
n_gids_found=0
- The sriov cni configuration is as below and it's the only cni on that node:
{
  "name": "mynet",
  "type": "sriov",
  "if0": "ens5f0",
  "ipam": {
    "type": "host-local",
    "subnet": "10.55.206.0/24",
    "rangeStart": "10.55.206.11",
    "rangeEnd": "10.55.206.19",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ],
    "gateway": "10.55.206.1"
  }
}
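To rule out a malformed configuration file, a small Python sketch like the one below could confirm that the config parses as JSON and that the host-local range and gateway fall inside the subnet. The function name `check_cni_config` and the inline config string are illustrative assumptions, not part of the sriov cni; only the standard `json` and `ipaddress` modules are used.

```python
import ipaddress
import json

# Illustrative inline copy of the sriov cni config (hypothetical; on a node
# it would normally be read from the CNI config file instead).
CNI_CONFIG = """
{
  "name": "mynet",
  "type": "sriov",
  "if0": "ens5f0",
  "ipam": {
    "type": "host-local",
    "subnet": "10.55.206.0/24",
    "rangeStart": "10.55.206.11",
    "rangeEnd": "10.55.206.19",
    "routes": [{ "dst": "0.0.0.0/0" }],
    "gateway": "10.55.206.1"
  }
}
"""


def check_cni_config(text):
    """Return a list of problems found in a CNI config string (empty if none)."""
    try:
        conf = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    problems = []
    if conf.get("type") != "sriov":
        problems.append(f"unexpected plugin type: {conf.get('type')!r}")

    ipam = conf.get("ipam", {})
    try:
        subnet = ipaddress.ip_network(ipam["subnet"])
    except (KeyError, ValueError) as exc:
        return problems + [f"bad or missing subnet: {exc}"]

    # Every address the IPAM plugin hands out should lie inside the subnet.
    for key in ("rangeStart", "rangeEnd", "gateway"):
        if key not in ipam:
            continue
        try:
            inside = ipaddress.ip_address(ipam[key]) in subnet
        except ValueError as exc:
            problems.append(f"{key}: {exc}")
            continue
        if not inside:
            problems.append(f"{key} {ipam[key]} is outside {subnet}")
    return problems


if __name__ == "__main__":
    print(check_cni_config(CNI_CONFIG) or "config looks consistent")
```

An empty result only means the file is well-formed and self-consistent; it says nothing about whether the sriov cni is actually being invoked for the pod.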
Besides, I found that all the vhca interfaces are in the down state according to ip a. I brought them up manually with ifconfig <eth-name> up, but nothing changed.
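To check the link state of every interface at once instead of reading the ip a output by eye, a sketch like the following could be run on the node. It assumes the standard Linux sysfs layout (/sys/class/net/<iface>/operstate); `link_states` is just an illustrative name. Note that operstate reflects the operational state of the link, which can stay down even after the interface has been set administratively up.

```python
import os


def link_states(sysfs_net="/sys/class/net"):
    """Map each interface name to its operstate string ("up", "down", ...)."""
    states = {}
    if not os.path.isdir(sysfs_net):
        # Not a Linux host, or sysfs is not mounted: nothing to report.
        return states
    for name in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, name, "operstate")
        try:
            with open(path) as f:
                states[name] = f.read().strip()
        except OSError:
            # Entry is not an interface directory or is unreadable; skip it.
            continue
    return states


if __name__ == "__main__":
    for name, state in link_states().items():
        print(f"{name}: {state}")
```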
Thanks for your help!