
k8s-rdma-sriov-dev-plugin's Introduction

No longer maintained

[DEPRECATED] This repository is no longer maintained

You are still welcome to explore, learn from, and use the code provided here. Alternatively, the k8s-rdma-shared-dev-plugin repository can be used.

k8s-rdma-sriov-dev-plugin

Docker Hub: https://hub.docker.com/r/rdma/k8s-rdma-sriov-dev-plugin

This is a simple RDMA device plugin that supports IB and RoCE SRIOV vHCAs and HCAs. It also supports DPDK applications for Mellanox NICs. The plugin runs as a DaemonSet, and its container image is available at rdma/k8s-rdma-sriov-dev-plugin.

How to use SRIOV mode?

1. Create per-node SRIOV configuration

Edit example/sriov/rdma-sriov-node-config.yaml to describe the SRIOV PF netdevice(s). In this example they are eth0 and eth1.

Note: (a) Do not add any VFs. (b) Do not enable SRIOV manually.

The plugin enables SRIOV on each given PF and performs the necessary configuration for the IB and RoCE link layers.
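
For reference, below is a minimal sketch of what the per-node ConfigMap typically contains, matching the config.json format the plugin logs at startup ({"mode":"sriov","pfNetdevices":[...]}). The ConfigMap name and namespace here are assumptions for illustration; keep whatever example/sriov/rdma-sriov-node-config.yaml already uses.

apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices        # assumed name; keep the name from the example file
  namespace: kube-system
data:
  config.json: |
    {
        "mode": "sriov",
        "pfNetdevices": [
                "eth0",
                "eth1"
        ]
    }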

2. Create ConfigMap

Create a ConfigMap that holds the SRIOV netdevice information from this config YAML file. This is per-node configuration. In the example below, the netdevices are eth0 and eth1 in rdma-sriov-node-config.yaml.

kubectl create -f example/sriov/rdma-sriov-node-config.yaml

3. Deploy device plugin

kubectl create -f example/device-plugin.yaml
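
Once the DaemonSet pods are up, you can check that the resource is advertised on each node (in SRIOV mode the resource name is rdma/vhca, as seen in the node capacity output in the issues below); <node-name> is a placeholder:

kubectl get pods -n kube-system | grep rdma-sriov-dp-ds
kubectl describe node <node-name> | grep rdma/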

4. Create Test pod

Create a test pod which requests 1 rdma/vhca resource.

kubectl create -f example/sriov/test-sriov-pod.yaml
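
The key part of the test pod spec is the resource limit. A trimmed sketch is shown below; the pod name and image are illustrative (the image is one mentioned in the issues on this page) and may differ from example/sriov/test-sriov-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: sriov-test-pod
spec:
  containers:
  - name: sriov-test
    image: "mellanox/centos_7_2_mofed_4_4_0_1_9_0"   # any MOFED-capable test image
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/vhca: 1                                 # request one VF-backed RDMA device
    command: ["/bin/bash", "-c", "sleep 2000000000000"]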

How to use HCA mode?

1. Use a CNI plugin such as Contiv or Calico

Make sure to configure ib0 or an appropriate IPoIB netdevice as the parent netdevice for creating overlay/virtual netdevices.
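
For example, with Calico users in the issues below pin node IP autodetection to the RDMA-capable interface in the calico-node DaemonSet; the interface name here is illustrative:

            - name: IP_AUTODETECTION_METHOD
              value: "interface=ib0"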

2. Create ConfigMap

Create a ConfigMap that sets the mode to "hca". This is per-node configuration.

kubectl create -f example/hca/rdma-hca-node-config.yaml
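
A minimal sketch of the HCA-mode config.json carried by that ConfigMap, following the same format the plugin logs at startup (ConfigMap metadata omitted):

  config.json: |
    {
        "mode": "hca"
    }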

3. Deploy device plugin

kubectl create -f example/device-plugin.yaml

4. Create Test pod

Create a test pod which requests 1 rdma/hca resource.

kubectl create -f example/hca/test-hca-pod.yaml
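
The relevant part of the test pod spec is the rdma/hca limit plus the IPC_LOCK capability, mirroring the pod YAMLs shown in the issues below (a trimmed, illustrative sketch):

    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1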

k8s-rdma-sriov-dev-plugin's People

Contributors

abdallahyas, harryge00, mmduh-483, moshe010, paravmellanox


k8s-rdma-sriov-dev-plugin's Issues

VF is not allocated after recreating a pod

I'm facing an issue similar to #14. The device plugin worked once, but after recreating the pod it no longer allocates a VF to it, failing with insufficient rdma/vhca. I then tried disabling and re-enabling SR-IOV and reloading the driver, but that didn't help.

Here is the log on the device plugin.

$ kubectl logs --namespace=kube-system rdma-sriov-dp-ds-9t5gs
2020/03/05 08:07:13 Starting K8s RDMA SRIOV Device Plugin version= 0.2
2020/03/05 08:07:13 Starting FS watcher.
2020/03/05 08:07:13 Starting OS watcher.
2020/03/05 08:07:13 Reading /k8s-rdma-sriov-dev-plugin/config.json
2020/03/05 08:07:13 loaded config:  {"mode":"sriov","pfNetdevices":["enp96s0f0"]}
2020/03/05 08:07:13 sriov device mode
Configuring SRIOV on ndev= enp96s0f0 9
max_vfs =  4
cur_vfs =  4
vf = &{2 virtfn2 true false}
vf = &{0 virtfn0 false false}
Fail to config vfs for ndev = enp96s0f0
Fail to configure sriov; error =  Link not found
2020/03/05 08:07:13 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2020/03/05 08:07:13 Registered device plugin with Kubelet
exposing devices:  []

Kubernetes version is

Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

The device plugin should be the latest version, but I don't know why the digest ID is no different from the one on Docker Hub, and it doesn't start when I specify that same Docker Hub digest ID in the manifest.

docker.io/rdma/k8s-rdma-sriov-dev-plugin:latest                                                                  application/vnd.docker.distribution.manifest.list.v2+json sha256:9071e25d277d2c4cdb83443d57fecf0d98fe1d49b8bd873ba3d6eda131d12181 25.9 MiB  linux/amd64,linux/ppc64le                                   io.cri-containerd.image=managed  

rdma sriov device plugin returns device or resource busy

Hi, I've run into a problem. I have several nodes deployed with the RDMA SRIOV device plugin and the SRIOV CNI. When I run the RDMA SRIOV device plugin, it returns an error.
I've checked the log, shown below:

[root@localhost bin]# kubectl logs rdma-sriov-dp-ds-mzv9m  -n kube-system
2018/09/18 03:21:57 Starting K8s RDMA SRIOV Device Plugin version= 0.2
2018/09/18 03:21:57 Starting FS watcher.
2018/09/18 03:21:57 Starting OS watcher.
2018/09/18 03:21:57 Reading /k8s-rdma-sriov-dev-plugin/config.json
2018/09/18 03:21:57 loaded config:  {"mode":"sriov","pfNetdevices":["ens5f0"]}
2018/09/18 03:21:57 sriov device mode
Configuring SRIOV on ndev= ens5f0 6
max_vfs =  8
cur_vfs =  0
Fail to enable sriov for netdev = ens5f0
Fail to configure sriov; error =  write /sys/class/net/ens5f0/device/sriov_numvfs: device or resource busy
2018/09/18 03:21:57 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2018/09/18 03:21:57 Registered device plugin with Kubelet
exposing devices:  []

And I checked the /sys/class/net/ens5f0/device/sriov_numvfs file with echo:

[root@localhost bin]# echo 0 > /sys/class/net/ens5f0/device/sriov_numvfs
[root@localhost bin]# echo 8 > /sys/class/net/ens5f0/device/sriov_numvfs 
-bash: echo: write error: Device or resource busy

Environment:

[root@localhost bin]# mst version
mst, mft 4.10.0-104, built on Jul 01 2018, 17:14:32. Git SHA Hash: 9999fe7

[root@localhost bin]# kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
[root@localhost bin]# kubelet --version
Kubernetes v1.10.4

Thanks for your help!

Could the device plugin support K8S v1.9 ?

Hi

Could the device plugin support K8s v1.9?
If not, could you tell me where to get a plugin package/docker image that supports K8s v1.9?
Thanks!

kubectl logs rdma-sriov-dp-ds-9v7k6 -n kube-system shows:

2020/08/19 09:28:31 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2020/08/19 09:28:31 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
2020/08/19 09:28:31 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2020/08/19 09:28:31 shared hca mode
2020/08/19 09:28:31 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2020/08/19 09:28:31 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
2020/08/19 09:28:31 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2020/08/19 09:28:31 shared hca mode
2020/08/19 09:28:31 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2020/08/19 09:28:31 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
2020/08/19 09:28:31 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2020/08/19 09:28:31 shared hca mode
2020/08/19 09:28:31 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
root@inspur013:~# 


Unable to Connect the HCA's through the link

I deployed the RDMA device plugin in HCA mode in a Kubernetes cluster. When I tried to run a connection test using "ib_read_bw", the output was as follows:

                RDMA_Write BW Test

Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 0
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet


local address: LID 0000 xxx
GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
remote address: LID 0000 xxx
GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00

The commands I used are simply 'ib_write_bw -d mlx5_0 [target_ip]' and 'ib_read_bw -d mlx5_0'. Could anyone please help with this issue? I appreciate your help.

What is the meaning of configuring ib0 as the parent netdevice?

hi ~

I use k8s-rdma-sriov-dev-plugin in HCA mode and I haven't configured ib0 as the parent netdevice.
So what does configuring ib0 as the parent netdevice mean? Additionally, my Kubernetes CNI is Calico.

In issue #11 I deployed k8s-rdma-sriov-dev-plugin successfully and can use RDMA in the pod.

I'd appreciate any suggestions.

Is liveness probe a consideration?

I don't find any liveness probe configuration in the device-plugin.yaml.

Is there any concern with using such a probe with the RDMA device plugin?

Failed to pull image "mellanox/rdma-sriov-dev-plugin"

  1. When I create the DaemonSet from the file device-plugin.yaml, I get the following errors, which mean the image "mellanox/rdma-sriov-dev-plugin" failed to pull:
    Events: Type Reason Age From Message
    ---- ------ ---- ---- -------
    Normal SuccessfulMountVolume 6h kubelet, ln01 MountVolume.SetUp succeeded for volume "device-plugin"
    Normal SuccessfulMountVolume 6h kubelet, ln01 MountVolume.SetUp succeeded for volume "default-token-5stpq"
    Normal SuccessfulMountVolume 5h kubelet, ln01 MountVolume.SetUp succeeded for volume "config"
    Warning Failed 5h kubelet, ln01 Error: ImagePullBackOff
    Warning Failed 5h (x2 over 5h) kubelet, ln01 Failed to pull image "mellanox/rdma-sriov-dev-plugin": rpc error: code = Unknown desc = Error response from daemon: pull access denied for mellanox/rdma-sriov-dev-plugin, repository does not exist or may require 'docker login'
    Warning Failed 5h (x2 over 5h) kubelet, ln01 Error: ErrImagePull
    Normal Pulling 1h (x60 over 5h) kubelet, ln01 pulling image "mellanox/rdma-sriov-dev-plugin"
    Normal BackOff 2m (x1492 over 5h) kubelet, ln01 Back-off pulling image "mellanox/rdma-sriov-dev-plugin"

I ran the command
docker pull mellanox/rdma-sriov-dev-plugin
and got the same error.

  2. Then I tried to build the image from the Dockerfile, but got the following errors:

git clone https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin
[root@ln01 k8s-rdma-sriov-dev-plugin]# docker build - < Dockerfile
Sending build context to Docker daemon 2.048kB
Step 1/10 : FROM golang:1.10.1 as build ---> 1af690c44028
Step 2/10 : WORKDIR /go/src/k8s-rdma-sriov-dp ---> Using cache ---> ba4c5f087dcd
Step 3/10 : RUN go get github.com/golang/dep/cmd/dep ---> Using cache ---> be138283c818
Step 4/10 : COPY Gopkg.toml Gopkg.lock ./ COPY failed: stat /var/lib/docker/tmp/docker-builder174256260/Gopkg.toml: no such file or directory.
I need help, thank you!

How to config the ConfigMap

Hi, thanks for your work! I've been working on this recently and I'm confused about the node ConfigMap. I have 7 nodes in my cluster, each with a Mellanox ConnectX-4 Lx device and 7 SR-IOV VFs. The systems are all Ubuntu 16.04. Running ifconfig, I get this:

ens5f0    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1c:8c  
          inet addr:10.128.1.5  Bcast:10.128.1.255  Mask:255.255.255.0
          inet6 addr: fe80::526b:4bff:fe2f:1c8c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3438707 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2557173 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3391158743 (3.3 GB)  TX bytes:2311580366 (2.3 GB)

ens5f1    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1c:8d  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ens5f2    Link encap:Ethernet  HWaddr b6:96:dc:55:e9:df  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18515 errors:0 dropped:0 overruns:0 frame:0
          TX packets:487 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2850062 (2.8 MB)  TX bytes:83849 (83.8 KB)

ens5f3    Link encap:Ethernet  HWaddr 72:96:21:22:9e:ed  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19140 errors:0 dropped:0 overruns:0 frame:0
          TX packets:522 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2953775 (2.9 MB)  TX bytes:88006 (88.0 KB)

ens5f4    Link encap:Ethernet  HWaddr 3e:f9:0f:af:df:9e  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18753 errors:0 dropped:0 overruns:0 frame:0
          TX packets:497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2890892 (2.8 MB)  TX bytes:83641 (83.6 KB)
ens5f5    Link encap:Ethernet  HWaddr da:2a:71:b3:9e:19  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19218 errors:0 dropped:0 overruns:0 frame:0
          TX packets:530 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2969260 (2.9 MB)  TX bytes:88424 (88.4 KB)

ens5f6    Link encap:Ethernet  HWaddr 4e:eb:0e:d5:bb:05  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18793 errors:0 dropped:0 overruns:0 frame:0
          TX packets:512 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2897280 (2.8 MB)  TX bytes:86343 (86.3 KB)

ens5f7    Link encap:Ethernet  HWaddr 6e:39:97:e5:bc:4e  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19338 errors:0 dropped:0 overruns:0 frame:0
          TX packets:523 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2990496 (2.9 MB)  TX bytes:89079 (89.0 KB)

So, how should I configure the node ConfigMap? I have tried this:

  config.json: |
    {
        "pfNetdevices": [
                "ens5f0",
                "ens5f1",
                "ens5f2",
                "ens5f3",
                "ens5f4",
                "ens5f5",
                "ens5f6",
                "ens5f7",
        ]
    }

But when I start the device plugin DaemonSet, there is no rdma/vhca resource in the node description. And when I try to start test-pod.yaml, the Pod stays in Pending status because no node has sufficient resources to schedule it.
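
For comparison, the config.json format that the plugin logs at startup elsewhere on this page includes a "mode" field and must be valid JSON (no trailing comma). A sketch using the netdevices from this report, under the assumption that ens5f0/ens5f1 are the PFs and the remaining interfaces are VFs (which, per the README note above, should not be listed):

  config.json: |
    {
        "mode": "sriov",
        "pfNetdevices": [
                "ens5f0",
                "ens5f1"
        ]
    }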

Error Message "Operation Not Supported" when setting max_tx_rate/rate or min_tx_rate

Environment

  • ConnectX-5 Mellanox Card
  • OS: Ubuntu 16.04.6 LTS
  • Kernel: 4.4.0-87-generic
  • Mellanox OFED driver version: 4.5-1.0.10
  • Mellanox Modules Loaded Include:
    • mlx5_ib
    • mlx5_core
    • mlxfw
    • mlx4_ib
    • mlx4_en
    • mlx4_core
    • mlx4_compat

Setup

  • The setup followed documentation from both the 4.5 OFED manual and Mellanox community articles, including https://community.mellanox.com/s/article/howto-configure-rate-limit-per-vf-for-connectx-4-connectx-5.
  • We enabled SRIOV via the following command, and the Virtual Functions are working correctly: mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4.
  • We are able to communicate with other hosts via the ib_send_bw command.
  • Following the 2019 guide presented here, we made sure to enable the mlx4_core parameters in the file /etc/modprobe.d/mlx4.conf by adding the following line and restarting the machine:
options mlx4_core debug_level=1 enable_qos=1 enable_vfs_qos=1

Problem

After issuing the following command to change the max bandwidth, we received an error.

Command

ip link set dev enp4s0f0 vf 1 rate 1000

Error

RTNETLINK answers: Operation not supported

The same error occurs for both of the following commands as well:

ip link set dev enp4s0f0 vf 1 min_tx_rate 1000
ip link set dev enp4s0f0 vf 1 max_tx_rate 1000

an error "No such device" is reported, when using hca mode with RoCE adapter

Environment:

  • I deployed k8s-rdma-sriov-dev-plugin in HCA mode for RoCE.
  • Multus and Calico are installed, and I added the settings below for Calico:
         - name: IP_AUTODETECTION_METHOD
           value: "interface=ens3f0"

What happened:
Details are as follows:

Run two pods for testing; their names are test-hca1 and test-hca2.

  1. Executing lspci in pod/test-hca1, the RoCE device can be found:
$ lspci -v | grep Mella
5e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
5e:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
  2. Then, executing rdma_server in pod/test-hca1 and rdma_client in pod/test-hca2, an error "No such device" is reported:
$ kubectl exec -it test-hca1 -- rdma_server
rdma_server: start
$ kubectl exec -it test-hca2 -- rdma_client -s 10.244.0.8
rdma_client: start
rdma_create_ep: No such device
rdma_client: end -1
  3. Next, executing show_gids in pod/test-hca1, the GID of the RoCE device cannot be found:
$ kubectl exec -it test-hca1 bash
$ show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
n_gids_found=0

---

The YAML file used in the above test is as follows:

apiVersion: v1
kind: Pod
metadata:
  name: test-hca1
spec:
  containers:
  - name: test-hca
    image: "asdfsx/mofed_benchmark"
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    stdin: true
    tty: true

The Dockerfile of docker image used in the above test is : https://github.com/Mellanox/mofed_dockerfiles/blob/master/Dockerfile.centos7.2.mofed-4.4

How to write Dockerfile to use vhca in container?

Hi, I want to communicate over RDMA between containers. I created two Pods with the image mellanox/centos_7_2_mofed_4_4_0_1_9_0, then tested with the ib_send_bw command. Everything went fine. So could you please share the Dockerfile for mellanox/centos_7_2_mofed_4_4_0_1_9_0? Thanks!

Cannot use rdma_client when using HCA mode with Calico

To use HCA mode with Calico, I added these settings:

            - name: IP_AUTODETECTION_METHOD
              value: "interface=enp175s0"
            - name: IP6_AUTODETECTION_METHOD
              value: "interface=enp175s0"

After creating the whole network, I tried a connectivity test using rdma_server/rdma_client.
So I created 2 pods first:

apiVersion: v1
kind: Pod
metadata:
  name: iperf-server
spec:  # specification of the pod's contents
  containers:
  - name: iperf-server
    image: "asdfsx/mofed_benchmark"
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    stdin: true
    tty: true
---
apiVersion: v1
kind: Pod
metadata:
  name: iperf-client-1
spec:  # specification of the pod's contents
  containers:
  - name: iperf-client-1
    image: "asdfsx/mofed_benchmark"
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    stdin: true
    tty: true

The image asdfsx/mofed_benchmark is built using this Dockerfile.

Then I started rdma_server:

$ kubectl exec -it iperf-server -- rdma_server
rdma_server: start

Then I started rdma_client, but got an error:

$ kubectl exec -it iperf-client-1 -- rdma_client -s 10.244.0.8
rdma_client: start
rdma_create_ep: No such device
rdma_client: end -1
command terminated with exit code 255

I want to know why this happens. I'm totally confused.

Daemonset logs says `Link not found`

I have followed the README to create the ConfigMap and DaemonSet, but it seems the device plugin does not work correctly:

# kubectl logs xxxxx -n kube-system
2018/10/17 08:13:49 Reading /k8s-rdma-sriov-dev-plugin/config.json
2018/10/17 08:13:49 loaded config:  {"mode":"sriov","pfNetdevices":["enp97s0f0","enp97s0f1","enp97s0f2","enp97s0f3"]}
2018/10/17 08:13:49 sriov device mode
Configuring SRIOV on ndev= enp97s0f0 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f0
Fail to configure sriov; error =  Link not found
Configuring SRIOV on ndev= enp97s0f1 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f1
Fail to configure sriov; error =  Link not found
Configuring SRIOV on ndev= enp97s0f2 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f2
Fail to configure sriov; error =  Link not found
Configuring SRIOV on ndev= enp97s0f3 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f3
Fail to configure sriov; error =  Link not found
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-3+18f677ae060806", GitCommit:"18f677ae0608064799c7e7f2bc2732d37f22efe3", GitTreeState:"clean", BuildDate:"2018-10-16T12:37:20Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"linux/amd64"}

got all VF with port state 'DOWN'

I deployed rdma-sriov-dev-plugin in SRIOV mode for IB. The VFs were created, but when running ibstat I found that only mlx5_0 had port state "Active", while all the others (mlx5_1 through mlx5_10, for 10 VFs) had port state "DOWN" (see the attached image). I have also checked that the policy for all VFs is "Follow".

I'm using Ubuntu 16.04 and Kubernetes v1.15.
What should I do to make all VF ports active? Could something be wrong with the SM? I have an SM running on the IB switch with virtualization turned on.

Failed to Create QP

I tried to deploy the RDMA device plugin in HCA mode in my Kubernetes cluster. I followed the instructions and the device plugin registered successfully. If I run "kubectl describe node [node_name]", I can find the rdma/hca resource. If I run "ibstat" in the pods, the InfiniBand information shows up and the status is active/up.

However, when I tried to run a connection test using "ib_read_bw", it threw the following error: "Couldn't get device attribute.
Unable to create QP.
Failed to create QP.
Couldn't create IB resource."

I simply ran the test by running "ib_read_bw" in one pod and "ib_read_bw [target_pod_ip_addr]" in another pod. Could anyone please help with this issue? I appreciate your help.

some question about HCA mode

1. What is the meaning of pfNetdevices? The rdma-hca-node-config.yml doesn't have this field.

2. When I deploy the plugin as follows:

kubectl create -f example/hca/rdma-hca-node-config.yaml
kubectl create -f example/device-plugin.yaml
kubectl create -f example/hca/test-hca-pod.yaml

I see these logs from the test pod:

[root@Mellanox]# kubectl logs mofed-test-pod
/dev/infiniband:
total 0
crw-------. 1 root root 231,  64 Sep  3 12:37 issm0
crw-rw-rw-. 1 root root  10,  57 Sep  3 12:37 rdma_cm
crw-rw-rw-. 1 root root 231, 224 Sep  3 12:37 ucm0
crw-------. 1 root root 231,   0 Sep  3 12:37 umad0
crw-rw-rw-. 1 root root 231, 192 Sep  3 12:37 uverbs0

/sys/class/net:
total 0
-rw-r--r--. 1 root root 4096 Sep  3 12:37 bonding_masters
lrwxrwxrwx. 1 root root    0 Sep  3 12:37 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx. 1 root root    0 Sep  3 12:37 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx. 1 root root    0 Sep  3 12:37 tunl0 -> ../../devices/virtual/net/tunl0

This is the output of the test pod, right?

Here is my node info:

Capacity:
 cpu:                56
 ephemeral-storage:  569868560Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             527877808Ki
 nvidia.com/gpu:     8
 pods:               110
 rdma/hca:           1k
Allocatable:
 cpu:                56
 ephemeral-storage:  525190864027
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             527775408Ki
 nvidia.com/gpu:     8
 pods:               110
 rdma/hca:           1k

3. On my host node I can use InfiniBand with ib0.
Besides setting rdma/hca: 1, how can I use InfiniBand in a pod, since I don't see ib0 in the logs of test-hca-pod?

I'd appreciate any suggestions.

Using a Mellanox ConnectX-4 Lx, the command 'show_gids' in the k8s pod returns nil when using the Calico CNI.

[root@rdma-test-wv8kj tmp]# show_gids
DEV PORT INDEX GID IPv4 VER DEV


n_gids_found=0

Other info:
[root@rdma-test-wv8kj tmp]# ethtool eth0
Settings for eth0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
MDI-X: Unknown
Cannot get wake-on-lan settings: Operation not permitted
Link detected: yes
[root@rdma-test-wv8kj tmp]# ethtool -i eth0
driver: veth
version: 1.0
firmware-version:
expansion-rom-version:
bus-info:
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@rdma-test-wv8kj tmp]# ofed_
ofed_info ofed_rpm_info ofed_uninstall.sh
[root@rdma-test-wv8kj tmp]# ofed_info -s
MLNX_OFED_LINUX-5.0-2.1.8.0:
[root@rdma-test-wv8kj tmp]# ibv_devices
device node GUID
------ ----------------
mlx5_0 98039b03003b0346
mlx5_1 98039b03003b0347
[root@rdma-test-wv8kj tmp]# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 14.27.1016
node_guid: 9803:9b03:003b:0346
sys_image_guid: 9803:9b03:003b:0346
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110004
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 14.27.1016
node_guid: 9803:9b03:003b:0347
sys_image_guid: 9803:9b03:003b:0346
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110004
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

The performance of vhca?

Is the performance (bandwidth and latency) the same as the physical HCA, or is it just 1/num_vfs of the physical HCA's performance?

Driver doesn't support SRIOV configuration via sysfs

We have a Mellanox ConnectX-3 dual-port NIC.

We're following this guide: https://community.mellanox.com/s/article/reference-deployment-guide-for-k8s-cluster-with-mellanox-rdma-device-plugin-and-multus-cni-plugin-with-two-network-interfaces--flannel-and-mellanox-sr-iov---draft-x

Everything runs smoothly until I activate the device plugin. The plugin installs fine, but when I look at the logs I see:

/sys/class/net/eth2/device/sriov_numvfs: Function not implemented

I manually did what I think the plugin does on each physical node running k8s. For example:

echo 8 | sudo tee /sys/class/net/eth2/device/sriov_numvfs

I always get the same error:

/sys/class/net/eth2/device/sriov_numvfs: Function not implemented

Also, if I do

dmesg | grep -i mlx

I see this error:

mlx4_core 0000:03:00.0: Driver doesn't support SRIOV configuration via sysfs.

As an experiment, I've also tried to activate VFs through the mlx4_core driver configuration (as described, for example, here: https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-3-with-kvm--ethernet-x).
In this case the VFs come up and everything works fine, but unfortunately this doesn't seem to be compatible with the SRIOV device plugin.
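
For reference, a rough sketch of that module-option approach for ConnectX-3, placed in a file such as /etc/modprobe.d/mlx4_core.conf (the values below are illustrative assumptions, not taken from this issue, and this is the path the report says is not compatible with the plugin):

options mlx4_core num_vfs=8 probe_vf=8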

This is really blocking us. Any suggestion would be highly appreciated!

Thanks.

Restart the plugin failed on some nodes.

Hi, I started a new node and created the SRIOV device plugin, and everything went fine. Then I deleted the plugin and tried to create it again, but it failed.

Here is the Pod log on that node:

# kubectl logs rdma-sriov-dp-ds-p8gvx -n kube-system
2018/07/20 06:26:27 Starting K8s RDMA SRIOV Device Plugin version= 0.2
2018/07/20 06:26:27 Starting FS watcher.
2018/07/20 06:26:27 Starting OS watcher.
2018/07/20 06:26:27 Reading /k8s-rdma-sriov-dev-plugin/config.json
2018/07/20 06:26:27 loaded config:  {"mode":"sriov","pfNetdevices":["ens5f0"]}
2018/07/20 06:26:27 sriov device mode
Configuring SRIOV on ndev= ens5f0 6
max_vfs =  9
cur_vfs =  9
vf = &{0 virtfn0 false false}
Fail to config vfs for ndev = ens5f0
Fail to configure sriov; error =  Link not found
2018/07/20 06:26:27 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2018/07/20 06:26:27 Registered device plugin with Kubelet
exposing devices:  []

but the network interface ens5f0 actually exists. Here is the ifconfig result:

ens5f0    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1a:44  
          inet addr:10.128.1.17  Bcast:10.128.1.255  Mask:255.255.255.0
          inet6 addr: fe80::526b:4bff:fe2f:1a44/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2448992848 errors:0 dropped:61776 overruns:0 frame:0
          TX packets:1031347927 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2935127310369 (2.9 TB)  TX bytes:590907008963 (590.9 GB)

ens5f1    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1a:45  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ens5f4    Link encap:Ethernet  HWaddr 3a:2f:bb:84:34:85  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:554052 errors:0 dropped:36597 overruns:0 frame:0
          TX packets:32977 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:117147687 (117.1 MB)  TX bytes:5181028 (5.1 MB)

......

Besides ens5f0 and ens5f1, the other network interfaces were created by the plugin, and even after I deleted the plugin, they still exist.

Everything seems ok but no vhca device in test Pod.

Hi, I've run into a problem and have no idea how to fix it. I have several nodes deployed with the RDMA SRIOV device plugin and the SRIOV CNI; everything worked fine and pods could communicate with each other via the vhca device whether or not they were launched on the same node. But one day, one of the nodes went bad: a new pod launched on it fails to acquire a vhca device (the pod is launched normally and is in the Running phase), yet everything else seems OK.
I've checked the log as below:

  1. Checking the rdma/vhca resource on the node:
# kubectl describe node 10.128.2.30  
...
Capacity:
 cpu:                48
 ephemeral-storage:  52399108Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             131747876Ki
 nvidia.com/gpu:     8
 pods:               110
 rdma/vhca:          8
Allocatable:
 cpu:                48
 ephemeral-storage:  48291017853
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             131645476Ki
 nvidia.com/gpu:     8
 pods:               110
 rdma/vhca:          8
...
  2. When creating a new pod on the node, I get this log from the RDMA SRIOV device plugin:
2018/08/07 07:33:33 allocate request: &AllocateRequest{ContainerRequests:[&ContainerAllocateRequest{DevicesIDs:[16:5f:e4:4f:a7:28],}],}
2018/08/07 07:33:33 allocate response:  {[&ContainerAllocateResponse{Envs:map[string]string{},Mounts:[],Devices:[&DeviceSpec{ContainerPath:/dev/infiniband,HostPath:/dev/infiniband,Permissions:rwm,}],Annotations:map[string]string{},}]}
  3. I used test-sriov-pod.yaml to create the test pod; the pod launches normally and is in the Running phase, but the network interface is not a vhca device and no vhca devices are found with show_gids:
# ethtool -i eth0
driver: veth
version: 1.0
firmware-version: 
expansion-rom-version: 
bus-info: 
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

# show_gids 
DEV	PORT	INDEX	GID					IPv4  		VER	DEV
---	----	-----	---					------------  	---	---
n_gids_found=0
  4. The SRIOV CNI configuration is as below, and it's the only CNI on that node:
{
    "name": "mynet",
    "type": "sriov",
    "if0": "ens5f0",
    "ipam": {
        "type": "host-local",
        "subnet": "10.55.206.0/24
	"rangeStart": "10.55.206.11",
	"rangeEnd": "10.55.206.19",
        "routes": [
            { "dst": "0.0.0.0/0" }
        ],
        "gateway": "10.55.206.1"
    }
}

Besides, I found that all the vhca interfaces are in down status with the command ip a, and I brought them up manually with ifconfig <eth-name> up, but nothing changed.

Thanks for your help!

Configure ib0 for overlay/virtual netdevice

Can you elaborate on the steps necessary to "...configure ib0 or appropriate IPoIB netdevice as the parent netdevice for creating overlay/virtual netdevices."?

Is this supposed to work if you have multiple networks on the host compute nodes? For instance, my k8s runs over Ethernet and I have IB installed on a few of the compute nodes. Am I able to launch pods and use the IB network between pods if it is not the default? Is it possible to change the CNI to default to InfiniBand if present?

CUDA direct access between GPUs over PCI doesn't work with the plugin.

Hi, have you tested the plugin with NVIDIA GPUs? I found that when using NCCL to test GPU communication with the plugin, the test program hangs. Below are the details of my test:

Environment:
OS: Ubuntu 16.04
kubelet: 1.10.4
NCCL version: 2.2

Test:
test code: https://github.com/NVIDIA/nccl-tests
test command: NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128 -f 2 -g 2 (-g 2 means using 2 GPUs in the test thread).
With the environment variable NCCL_DEBUG=INFO, you can find lines like INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer, which means NCCL uses CUDA direct access between GPUs, via NVLink or PCI.

Result:

  • When the test Pod is launched on a node without the k8s-rdma-sriov-dev-plugin, the test program runs normally and produces a log like this:
...
caffe:34:34 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
caffe:34:34 [1] INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
# NCCL Tests compiled with NCCL 2.2
# Using devices
#   Rank  0 on      caffe device  0 [0x04] GeForce GTX 1080 Ti
#   Rank  1 on      caffe device  1 [0x05] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
caffe:34:34 [0] INFO Launch mode Group/CGMD
           8             2   float     sum    0.017   0.00   0.00    0e+00    0.017   0.00   0.00    0e+00
          16             4   float     sum    0.017   0.00   0.00    0e+00    0.017   0.00   0.00    0e+00
          32             8   float     sum    0.017   0.00   0.00    0e+00    0.017   0.00   0.00    0e+00
          64            16   float     sum    0.017   0.00   0.00    0e+00    0.017   0.00   0.00    0e+00
         128            32   float     sum    0.017   0.01   0.01    0e+00    0.017   0.01   0.01    0e+00
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 0.00292487 
  • When the test Pod is launched on a node with the k8s-rdma-sriov-dev-plugin, the test program hangs after printing the log:
...
caffe:8833:8833 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
caffe:8833:8833 [1] INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
# NCCL Tests compiled with NCCL 2.2
# Using devices
#   Rank  0 on      caffe device  0 [0x05] GeForce GTX 1080 Ti
#   Rank  1 on      caffe device  1 [0x08] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
caffe:8833:8833 [0] INFO Launch mode Group/CGMD

RDMA_CM failure in sriov mode

Based on the documentation, RDMA_CM should work in SRIOV mode. However, I am not able to run the ib_write_bw -R test, while the normal ib_write_bw test works fine.
Below is the message I got when running ib_write_bw -R.

#Container1
ib_write_bw -R
#ethtool -i eth0
driver: mlx5_core
version: 4.2-1.2.0
...

#Container2(on the same node as Container1's)
ib_write_bw -R 10.16.190.11(IP addr of eth0 in Container1)
rdma_resolve_route failed
Unable to perform rdma_client function
Unable to init the socket connection

What could be the cause?
