Comments (9)

paravmellanox commented on June 3, 2024

You only need to add the PF netdevice information here.
Device plugin automatically detects its child VF and uses them.
Please remove child VF netdevices from the list.
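As a rough illustration of the PF-only rule, the sketch below filters a candidate netdevice list down to physical functions. It relies on the Linux sysfs convention that a VF's PCI device directory contains a `physfn` link to its parent PF. The helper names and the `sysfs_root` parameter are hypothetical and are not part of the device plugin.

```python
import os


def is_virtual_function(netdev, sysfs_root="/sys/class/net"):
    """Return True if `netdev` looks like an SR-IOV VF.

    Assumption: a VF exposes <sysfs_root>/<netdev>/device/physfn,
    while a PF does not.
    """
    physfn = os.path.join(sysfs_root, netdev, "device", "physfn")
    return os.path.lexists(physfn)


def keep_physical_functions(netdevs, sysfs_root="/sys/class/net"):
    """Drop VF netdevices so only PF names remain for pfNetdevices."""
    return [n for n in netdevs if not is_virtual_function(n, sysfs_root)]
```

On a real node, `keep_physical_functions` could be called with the interface names intended for the `pfNetdevices` list; whatever it returns is the subset that should stay there.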

from k8s-rdma-sriov-dev-plugin.

paravmellanox commented on June 3, 2024

Also, please don't enable SR-IOV yourself. This device plugin enables SR-IOV and does the necessary VF configuration for InfiniBand and RoCE, depending on whether the upstream kernel or MOFED is in use.

paravmellanox commented on June 3, 2024

@flymark2010 I have updated the documentation for this. Let me know how it goes with only PFs in the list.

flymark2010 commented on June 3, 2024

Sorry for the late reply. We were waiting for the new OFED driver. We have now installed OFED 4.4 and tried again, but it still fails.

First, I'm not sure what "don't enable sriov by yourself" means. I used the command mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=9 and then rebooted the system. Then I got the HCA info with the command ibv_devinfo:

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				14.23.1000
	node_guid:			506b:4b03:002f:1a3d
	sys_image_guid:			506b:4b03:002f:1a3c
	vendor_id:			0x02c9
	vendor_part_id:			4117
	hw_ver:				0x0
	board_id:			MT_2420110034
	phys_port_cnt:			1
	Device ports:
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				14.23.1000
	node_guid:			506b:4b03:002f:1a3c
	sys_image_guid:			506b:4b03:002f:1a3c
	vendor_id:			0x02c9
	vendor_part_id:			4117
	hw_ver:				0x0
	board_id:			MT_2420110034
	phys_port_cnt:			1
	Device ports:
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

and the Ethernet interface info with command ifconfig:

ens5f0    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1a:3c  
          inet addr:10.128.1.16  Bcast:10.128.1.255  Mask:255.255.255.0
          inet6 addr: fe80::526b:4bff:fe2f:1a3c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:34153 errors:0 dropped:231 overruns:0 frame:0
          TX packets:8405 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7302014 (7.3 MB)  TX bytes:7965863 (7.9 MB)

ens5f1    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1a:3d  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

We have a card with two ports on the node, but only one is in use: the HCA mlx5_0 and the Ethernet interface ens5f0 are active.

The content of rdma-sriov-node-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
        "mode" : "sriov",
        "pfNetdevices": [ "ens5f0" ]
    }

Then I created the device plugin DaemonSet; all the DaemonSet Pods run normally with status Running.
I can then see 9 virtual HCA devices, and every port state is PORT_ACTIVE. The same goes for the Ethernet interfaces.

But there is still no rdma/vhca resource in the node description, and the test Pod stays in the Pending state with the message: Warning FailedScheduling 50s (x91 over 25m) default-scheduler 0/7 nodes are available: 7 Insufficient rdma/vhca.
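One way to confirm whether a node advertises the resource is to read `status.allocatable` from the node object. The sketch below assumes the node has been fetched with `kubectl get node <node-name> -o json` and parsed into a dict; the function name is hypothetical, and only the resource name `rdma/vhca` comes from this thread.

```python
def allocatable_count(node, resource="rdma/vhca"):
    """Return how many units of `resource` the node advertises.

    `node` is the dict parsed from `kubectl get node <name> -o json`.
    Extended resources are reported as integer strings; a missing
    entry is treated as zero.
    """
    value = node.get("status", {}).get("allocatable", {}).get(resource)
    return int(value) if value is not None else 0
```

A result of 0 on every node would match the "Insufficient rdma/vhca" scheduling message above: the scheduler only places the Pod on a node whose allocatable count covers the request.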

paravmellanox commented on June 3, 2024

np @flymark2010.
I will make the documentation clearer than "don't enable sriov by yourself".
Basically, the RDMA device plugin enables SR-IOV and does the necessary RDMA configuration.
Therefore, the user should not enable it by writing to sysfs files.
What you did to enable it at the HCA (firmware/hardware) level is correct.

Can you please share the output of

ip link show ens5f0

and

kubectl logs --namespace=kube-system <pod_of_device_plugin_ds>

This will help to understand why the vhca resources are not published, or whether something else went wrong.

flymark2010 commented on June 3, 2024

Output of ip link show ens5f0:

4: ens5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 50:6b:4b:2f:1a:3c brd ff:ff:ff:ff:ff:ff
    vf 0 MAC fa:8e:ae:13:f6:8f, spoof checking off, link-state auto
    vf 1 MAC da:46:61:b7:b2:2f, spoof checking off, link-state auto
    vf 2 MAC 6a:f6:3c:f9:75:69, spoof checking off, link-state auto
    vf 3 MAC ee:f9:5e:0a:c9:1e, spoof checking off, link-state auto
    vf 4 MAC fe:8c:fe:4a:af:bb, spoof checking off, link-state auto
    vf 5 MAC 9a:4c:c5:74:7f:75, spoof checking off, link-state auto
    vf 6 MAC a2:8a:40:ee:a1:89, spoof checking off, link-state auto
    vf 7 MAC 0e:d1:77:26:c3:68, spoof checking off, link-state auto
    vf 8 MAC 72:ff:98:e1:54:9c, spoof checking off, link-state auto

The device plugin log keeps repeating the following:

2018/07/11 05:45:51 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2018/07/11 05:45:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
2018/07/11 05:45:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/07/11 05:45:51 sriov device mode
Configuring SRIOV on ndev= ens5f0 6
max_vfs =  9
cur_vfs =  9
vf = &{0 virtfn0 true false}
vf = &{1 virtfn1 true false}
vf = &{2 virtfn2 true false}
vf = &{3 virtfn3 true false}
vf = &{4 virtfn4 true false}
vf = &{5 virtfn5 true false}
vf = &{6 virtfn6 true false}
vf = &{7 virtfn7 true false}
vf = &{8 virtfn8 true false}

I'm sure the device plugin feature gate is set for k8s; here is the ps result:

# ps -ef | grep kubelet
root      2082     1  5 11:10 ?        00:08:09 /usr/local/kubernetes/kubelet --address=10.128.1.16 --hostname-override=10.128.1.16 --pod-infra-container-image=10.128.2.6/kube-system/pause-amd64:3.0 --experimental-bootstrap-kubeconfig=/etc/kubernetes/bootstrap.kubeconfig --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --cert-dir=/etc/kubernetes/ssl --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/usr/local/kubernetes --cluster-dns=10.0.0.2 --cluster-domain=cluster.cloudwalk. --hairpin-mode hairpin-veth --feature-gates=DevicePlugins=true --allow-privileged=true --fail-swap-on=false --logtostderr=true --v=2
root     16674 10575  0 13:46 pts/2    00:00:00 grep --color=auto kubelet

paravmellanox commented on June 3, 2024

@flymark2010
The plugin seems to configure the VFs correctly. The feature gate is also enabled.
Which kubeadm and kubelet versions are you using? 1.10.3 or higher should work.

flymark2010 commented on June 3, 2024

The kubelet version is 1.9.0. I'll try higher kubelet version.

flymark2010 commented on June 3, 2024

After upgrading the kubelet version to 1.10.4, I can see the resource rdma/vhca in the node description, and the test Pod can run normally.
Thanks a lot!
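The fix here was a kubelet upgrade, and the `unknown service v1beta1.Registration` error in the earlier log is consistent with a kubelet too old to serve the v1beta1 device plugin API. A minimal sketch of a version check follows, using the "1.10.3 or higher" threshold quoted in this thread as an assumption; the function names are hypothetical. Versions are compared as integer tuples because plain string comparison would rank "1.9.0" above "1.10.4".

```python
import re

# Assumption taken from the thread: "1.10.3 or higher should work".
MIN_KUBELET = (1, 10, 3)


def parse_version(text):
    """Turn 'v1.10.4' or '1.10.4' into the tuple (1, 10, 4)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def kubelet_is_new_enough(version, minimum=MIN_KUBELET):
    """Compare numerically, component by component."""
    return parse_version(version) >= minimum
```

With these definitions, the 1.9.0 kubelet from this thread fails the check and the upgraded 1.10.4 kubelet passes it.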
