
Intel® Technology Enabling for OpenShift*

Overview

The Intel Technology Enabling for OpenShift project provides Intel Data Center hardware feature-provisioning technologies for the Red Hat OpenShift Container Platform (RHOCP). The project also includes the technology to deploy and manage End-to-End (E2E) solutions, as well as the related reference workloads, for these features.

These Intel Data Center hardware features currently include:

  • Intel® Software Guard Extensions (Intel® SGX)
  • Intel® Data Center GPU Flex Series
  • Intel® Data Center GPU Max Series
  • Intel® QuickAssist Technology (Intel® QAT)

The following features will be included in future releases.

  • Intel® Data Streaming Accelerator (Intel® DSA)
  • Intel® In-Memory Analytics Accelerator (Intel® IAA)
  • Intel® FPGA N6000

See details about Supported Intel Hardware features and Supported RHOCP Versions.

For detailed information about releases, please refer to Release Information.

Figure-1 shows the architecture and working scope of the project.


Figure-1 Intel Technology Enabling for OpenShift Architecture

Supported platforms

This section describes the RHOCP infrastructure and Intel hardware features supported by this project. The project lifecycle and support channels can also be found here.

Getting started

See reference BIOS Configuration required for each feature.

Provisioning RHOCP cluster

Use one of the two options described in the provisioning documentation to provision an RHOCP cluster.

In this project, we provisioned RHOCP 4.14 on a bare-metal multi-node cluster. For details about the supported RHOCP infrastructure, see the Supported Platforms page.

Provisioning Intel hardware features on RHOCP

Follow the steps below to provision the hardware features:

  1. Setting up Node Feature Discovery
  2. Setting up Machine Configuration
  3. Setting up Out of Tree Drivers
  4. Setting up Device Plugins
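
As a quick sanity check after step 1, you can confirm that NFD has labeled the nodes. For example, the GPU label used later in this project can be queried as follows (other features use their own labels):

$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true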

Verifying hardware feature provisioning

Use the instructions in the linked document to verify hardware feature provisioning.
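
As one example of such a check (a sketch, assuming the GPU device plugin is deployed and exposes the gpu.intel.com/i915 resource referenced elsewhere in this project), confirm that the resource appears in the node's allocatable list:

$ oc describe node <gpu-node-name> | grep gpu.intel.com/i915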

Upgrade (To be added)

Reference end-to-end solution

The reference end-to-end solution is based on Intel hardware feature provisioning provided by this project.

Intel AI Inferencing Solution with OpenVINO and RHOAI

Reference workloads

Here are some reference workloads built on the end-to-end solution and Intel hardware feature provisioning in this project.

  • Large Language Model (To be added)
  • Open Federated Learning (To be added)

Advanced Guide

This section discusses architecture and other technical details that go beyond getting started.

Release Notes

Check the link for the Release Notes.

Support

If you encounter any issues or have questions regarding Intel Technology Enabling for OpenShift, we recommend seeking support through the following channels:

Commercial support from Red Hat

This project relies on features developed and released with the latest RHOCP release. Commercial RHOCP release support is outlined in the Red Hat OpenShift Container Platform Life Cycle Policy and Intel collaborates with Red Hat to address specific requirements from our users.

Open-Source Community Support

Intel Technology Enabling for OpenShift is run as an open-source project on GitHub. Project GitHub issues can be used as the primary support interface for users to submit feature requests and report issues to the community when using Intel technology provided by this project. Please provide detailed information about your issue and steps to reproduce it, if possible.

Contribute

See CONTRIBUTING for more information.

Security

To report a potential security vulnerability, please refer to the security.md file.

License

Distributed under an open source license. See LICENSE for more information.

Code of Conduct

Intel has adopted the Contributor Covenant as the Code of Conduct for all of its open source projects. See the CODE_OF_CONDUCT file.

Project Issues

Build fails when trying to build the dGPU driver image via KMM on OpenShift 4.12

Hello Team,

Since we are unable to use the pre-built mode container image, which targets a kernel version different from ours (4.18.0-372.71.1.el8_6.x86_64), we are trying the on-premise build mode approach via KMM to build the dGPU driver image on OpenShift 4.12, as documented at the GitHub link below.

[https://github.com/intel/intel-technology-enabling-for-openshift/tree/main/kmmo#managing-intel-dgpu-driver-with-kmm-operator]

The build is triggered, but unfortunately we are seeing the error below (snippet follows).

We are also sharing the complete logs of the build pod for analysis and would appreciate the team's support in taking this forward.
intel-dgpu-on-premise-build-mode-r8hjv-undefined.log

Log snippet

/build/intel-gpu-i915-backports/drivers/gpu/drm/i915/fabric/netlink.c:1333:14: note: (near initialization for 'nl_iaf_cmds[19].start')
cc1: some warnings being treated as errors
make[6]: *** [scripts/Makefile.build:318: /build/intel-gpu-i915-backports/drivers/gpu/drm/i915/fabric/netlink.o] Error 1
make[5]: *** [scripts/Makefile.build:558: /build/intel-gpu-i915-backports/drivers/gpu/drm/i915/fabric] Error 2
make[4]: *** [scripts/Makefile.build:558: /build/intel-gpu-i915-backports/drivers/gpu/drm/i915] Error 2
make[3]: *** [Makefile:1584: _module_/build/intel-gpu-i915-backports] Error 2
make[2]: *** [Makefile.build:13: modules] Error 2
make[1]: *** [Makefile.real:105: modules] Error 2
make: *** [Makefile:50: modules] Error 2
error: build error: error building at STEP "RUN git clone -b ${I915_RELEASE} --single-branch https://github.com/intel-gpu/intel-gpu-i915-backports.git && cd intel-gpu-i915-backports && install -D COPYING /licenses/i915/COPYING && export LEX=flex; export YACC=bison && export OS_TYPE=rhel_9 && export OS_VERSION="9.2" && cp defconfigs/i915 .config && make olddefconfig && make modules -j $(nproc) && make modules_install": error while running runtime: exit status

v1.2.0 Release Checklist for RHOCP 4.14.x

About v1.2.0 Release

The 1.2.0 release is the maintenance release for OCP 4.14.x.

  • In this maintenance release, we intend to cut a quick release and make sure all existing key features continue to work properly on the new OCP y-stream release.
  • We will not introduce any major changes, including major component upgrades or complicated bug fixing.
  • No new features will be added to the maintenance release.
  • The maintenance release will rely on the CI/CD pipeline to make sure the project complies with the OpenShift Operator aggressive-mode expectation: an Operator released on an older OpenShift version should, by default, continue to work on the new OpenShift release.

v1.2.0 Release Checklist

  • #113
  • #107
  • Quick bug fixes if needed (not a bug: the SGX + NFD rule was not updated in 1.2.0 and will be added in release 1.2.1)
  • Update Documentation
  • Cut the release
  • #194

Intel Device Plugins return error "permission denied" on RHOCP 4.15.3

Summary:

Intel Device Plugins return the error "permission denied" on RHOCP 4.15.3. Kubelet is running with the wrong label. The same issue was observed and fixed on 4.14.10; see this for more details: #113. The SELinux regression fix is not integrated properly into RHOCP 4.15.

Error:

Failed to serve gpu.intel.com/i915: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"

Root Cause:

Kubelet should run with the kubelet_exec_t label, not unconfined_service_t.

sh-5.1# ps -AZ | grep kubelet
system_u:system_r:unconfined_service_t:s0 34373 ? 01:59:27 kubelet
sh-5.1#

Remove RunAsAny (root) for qatlib container

Summary

Currently, the qatlib workload runs with a custom SCC using IPC_LOCK and root permissions.

Detail

The qatlib workload needs the IPC_LOCK capability, which is added via a custom SCC based on the default restricted-v2 SCC. According to the qatlib documentation, the container also needs to run as root; this is granted via the RunAsAny permission in the custom SCC, which also enables the container to access devices as root.
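
For illustration, a minimal sketch of what such a custom SCC might look like (the name and several field values are assumptions based on the description above, not the project's actual SCC):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: qatlib-scc                 # hypothetical name
allowPrivilegedContainer: false
allowedCapabilities:
- IPC_LOCK                         # needed by qatlib for userspace DMA
runAsUser:
  type: RunAsAny                   # the root permission this issue aims to remove
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:
- configMap
- downwardAPI
- emptyDir
- projected
- secret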

Possible solutions

  1. To avoid accessing host devices as root, follow #35. Figure out how to enable the CRI-O flag on every host, possibly via a privileged container DaemonSet.
  2. Investigate whether the qatlib container can run as non-root or as a specific user.

VAAPI can't use the GPU's other rate control profiles

When deploying in OpenShift, only one rate control profile is available (see attached screenshot).

We have installed the GPU driver from release 1.1.0 (with firmware), version intel-data-center-gpu-driver-container:2.0.0-5.14.0-284.28.1.el9_2.x86_64, and can access the GPU with OpenCL and other utilities with no issue.

It seems that one possible cause of the limitation may be listed here: https://github.com/intel/media-driver?tab=readme-ov-file#known-issues-and-limitations

The GPU configuration sets enable_guc to 3 (see the attached doc syskerneldebugdri1i915_.txt), but according to https://wiki.archlinux.org/title/intel_graphics#Enable_GuC_/_HuC_firmware_loading we should see the message i915 0000:00:02.0: [drm] HuC firmware i915/icl_huc_9.0.0.bin version 9.0 authenticated:yes, yet a different message appears instead (see attached screenshot).

It may be that the HuC firmware is not loaded correctly, making the other rate control profiles unavailable.

How to enable GPU SRIOV in KubeVirt pod?

There are use cases, such as Windows cloud gaming, that may run in a VM (KubeVirt). At the same time, vGPU (SR-IOV) may also be needed to support multiple instances of the workload.

Is there a BKM to enable GPU SR-IOV in KubeVirt for Intel dGPUs (Flex 140 or 170)?

Regression: OCP 4.14 KMM v2 unable to pull pre-built certified driver container image

Summary:

Regression: KMM v2.0.0/v2.0.1 on OCP 4.14 is unable to pull the certified driver container image from the Red Hat registry once the pre-built mode module is deployed in the default openshift-kmm namespace. This behavior is unexpected, as it works on KMM v1 deployments. See the issue filed in the KMM downstream repo: rh-ecosystem-edge/kernel-module-management#992

Note: Only KMM v2.0.0/v2.0.1 is available on OCP 4.14+. KMM v1 is unavailable.

Analysis:

In KMM v1, the node successfully pulled the image with the default OCP cluster global pull secret. In KMM v2, the worker pod pulls the image, but the global pull secret is not mounted on the pod, so the pull fails.

Workaround:

  1. Use the command below to copy the global pull secret pull-secret from the openshift-config namespace to the openshift-kmm namespace:
$ oc get secrets pull-secret -n openshift-config -o json  | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid","annotations"])'  | oc apply -n openshift-kmm -f -
  2. Set module.spec.imageRepoSecret.name to pull-secret in the pre-built mode KMM Module intel-dgpu.yaml, as sketched below.
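
A minimal sketch of step 2 (the imageRepoSecret field follows the KMM Module API; the other values are illustrative placeholders rather than the exact contents of intel-dgpu.yaml):

apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: intel-dgpu
  namespace: openshift-kmm
spec:
  imageRepoSecret:
    name: pull-secret              # the secret copied in step 1
  moduleLoader:
    container:
      modprobe:
        moduleName: i915           # illustrative; see intel-dgpu.yaml for the real spec
  selector:
    intel.feature.node.kubernetes.io/gpu: "true"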

Impact:

The above workaround is an additional nontrivial step that impacts the user experience. Pre-built mode is intended to be as seamless as possible.

Tentative Proposal:

Request KMM to use the global pull secret and mount it on the worker pod.

Update:

The fix is to be included in KMM 2.0.2, with an official release target of Feb 27.

Upstream & OCP Tasks Tracking List

Summary

This issue serves as a tracking list for issues and tasks filed in other projects. After we triage and root-cause a bug, new feature, or enhancement that belongs to a specific project, we file an issue in that project and add an item to the checklist here to track it. We then work with the specific upstream project to resolve these issues, finishing the tasks one by one upstream and downstream into OCP.

Issue & Task List

P1-Blocker: GPU workloads cannot access GPU devices from the container environment without setsebool container_use_devices on

Updated according to @mregmi's and @vbedida79's comments.

Summary

GPU workloads cannot access GPU devices from the container environment without setsebool container_use_devices on.

Detail

GPU workload pods requesting the gpu.intel.com/i915 resource cannot run until they have access to /dev/drm on the GPU node.
This can be achieved by running setsebool container_use_devices on on the host node, but that is not feasible when a cluster has multiple GPU nodes, because the permission has to be set on each node manually.

Root cause

The /dev/drm access permission has not been added to the container_device_t policy, so access to /dev/drm is blocked by SELinux, which prevents workload apps from accessing the GPU device node files from the container environment.

Solution

  • Work with the container-selinux upstream to add the needed permission, and make sure the new container-selinux with the fix is merged into an OCP release.
  • Until it is merged into an OCP release, we have to distribute the new policy through the user-container-policy project.

Workaround

To ensure all GPU workloads (clinfo, AI inference) work properly, run the following commands on the GPU nodes.

  1. Find all nodes with an Intel Data Center GPU card using the following command:
$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true

Example output:

NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
  2. Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following commands in the terminal. Repeat this step for any other nodes with an Intel Data Center GPU card.
$ chroot /host
$ setsebool container_use_devices on
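
Until the container-selinux fix lands, one possible way to avoid the per-node manual step (a sketch only, along the lines of the privileged DaemonSet idea mentioned under #35 above; the image, names, and namespace are placeholders) is a DaemonSet that sets the boolean on every GPU node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-container-use-devices        # hypothetical name
  namespace: intel-gpu-config            # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: set-container-use-devices
  template:
    metadata:
      labels:
        app: set-container-use-devices
    spec:
      nodeSelector:
        intel.feature.node.kubernetes.io/gpu: "true"
      containers:
      - name: setsebool
        image: registry.access.redhat.com/ubi9/ubi   # placeholder; any image that can chroot works
        securityContext:
          privileged: true
        command: ["/bin/sh", "-c"]
        args:
        - chroot /host setsebool container_use_devices on && sleep infinity
        volumeMounts:
        - name: host
          mountPath: /host
      volumes:
      - name: host
        hostPath:
          path: /

Note that the pod would still need a service account bound to a privileged SCC.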

Inconsistent out-of-memory issue on node with RHODS notebooks

Summary

On OCP 4.13, using RHODS (Red Hat OpenShift Data Science) with OpenVINO notebooks, the notebook kernel restarts inconsistently with out-of-memory messages.

Details

OCP 4.13 cluster with an Intel Data Center GPU Flex 170 and a notebook with memory requests and limits of 56GB.
When using RHODS with the OpenVINO notebook image, specifically while executing the stable diffusion notebook, the Python notebook kernel restarts inconsistently; dmesg on the node shows:

[    0.019134] Early memory node ranges
[    0.023751] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.023753] PM: hibernation: Registered nosave memory: [mem 0x0009d000-0x000fffff]
[    0.023755] PM: hibernation: Registered nosave memory: [mem 0x59039000-0x59039fff]
[    0.023757] PM: hibernation: Registered nosave memory: [mem 0x590fb000-0x590fbfff]
[    0.023758] PM: hibernation: Registered nosave memory: [mem 0x5ee4e000-0x5ee4efff]
[    0.023760] PM: hibernation: Registered nosave memory: [mem 0x5ee85000-0x5ee85fff]
[    0.023760] PM: hibernation: Registered nosave memory: [mem 0x5ee86000-0x5ee86fff]
[    0.023762] PM: hibernation: Registered nosave memory: [mem 0x5eebd000-0x5eebdfff]
[    0.023764] PM: hibernation: Registered nosave memory: [mem 0x5ef0b000-0x5efecfff]
[    0.023765] PM: hibernation: Registered nosave memory: [mem 0x66d71000-0x6866dfff]
[    0.023766] PM: hibernation: Registered nosave memory: [mem 0x6866e000-0x69897fff]
[    0.023766] PM: hibernation: Registered nosave memory: [mem 0x69898000-0x69dfdfff]
[    0.023768] PM: hibernation: Registered nosave memory: [mem 0x6f800000-0x8fffffff]
[    0.023769] PM: hibernation: Registered nosave memory: [mem 0x90000000-0xfdffffff]
[    0.023769] PM: hibernation: Registered nosave memory: [mem 0xfe000000-0xfe010fff]
[    0.023770] PM: hibernation: Registered nosave memory: [mem 0xfe011000-0xfed1ffff]
[    0.023770] PM: hibernation: Registered nosave memory: [mem 0xfed20000-0xfed44fff]
[    0.023771] PM: hibernation: Registered nosave memory: [mem 0xfed45000-0xffffffff]
[    0.237871] Freeing SMP alternatives memory: 36K
[    3.572274] Non-volatile memory driver v1.3
[    3.653525] Freeing initrd memory: 89312K
[    4.228204] Freeing unused decrypted memory: 2036K
[    4.232827] Freeing unused kernel image (initmem) memory: 2788K
[    4.247331] Freeing unused kernel image (text/rodata gap) memory: 2040K
[    4.251702] Freeing unused kernel image (rodata/data gap) memory: 60K
[   11.014980] i2c i2c-0: 16/32 memory slots populated (from DMI)
[   11.014982] i2c i2c-0: Systems with more than 4 memory slots not supported yet, not instantiating SPD
[   12.964055] EDAC i10nm: No hbm memory
[ 1357.676966] i915 0000:33:00.0: [drm] Local memory IO size: 0x000000037a800000
[ 1357.676968] i915 0000:33:00.0: [drm] Local memory available: 0x000000037a800000
[407440.611017]  out_of_memory+0xed/0x2e0
[407440.611029]  mem_cgroup_out_of_memory+0x13a/0x150
[407440.611116] memory: usage 58720252kB, limit 58720256kB, failcnt 23
[407440.611117] memory+swap: usage 58720252kB, limit 58720256kB, failcnt 17987903
[407440.611133] Tasks state (memory values in pages):
[407440.612317] Memory cgroup out of memory: Killed process 1535268 (python3.8) total-vm:1209308992kB, anon-rss:41459744kB, file-rss:466276kB, shmem-rss:4kB, UID:1000750000 pgtables:151104kB oom_score_adj:778
[408339.735618]  out_of_memory+0xed/0x2e0
[408339.735629]  mem_cgroup_out_of_memory+0x13a/0x150
[408339.735686] memory: usage 58720256kB, limit 58720256kB, failcnt 23
[408339.735687] memory+swap: usage 58720256kB, limit 58720256kB, failcnt 21385997
[408339.735703] Tasks state (memory values in pages):
[408339.736085] Memory cgroup out of memory: Killed process 2725201 (python3.8) total-vm:132980172kB, anon-rss:41961444kB, file-rss:304524kB, shmem-rss:4kB, UID:1000750000 pgtables:90372kB oom_score_adj:778
[457794.119151]  out_of_memory+0xed/0x2e0
[457794.119162]  mem_cgroup_out_of_memory+0x13a/0x150
[457794.119215] memory: usage 58720256kB, limit 58720256kB, failcnt 23
[457794.119217] memory+swap: usage 58720256kB, limit 58720256kB, failcnt 24769451
[457794.119234] Tasks state (memory values in pages):
[457794.119591] Memory cgroup out of memory: Killed process 2740651 (python3.8) total-vm:132968056kB, anon-rss:41960760kB, file-rss:305636kB, shmem-rss:4kB, UID:1000750000 pgtables:90380kB oom_score_adj:778

Todo/Solutions

Confirm the root cause: whether the issue stems from the CPU, the GPU, or memory pressure on the node itself.
Also execute other OpenVINO notebooks and verify whether the issue reproduces.

Facing issues when trying to build the dGPU driver image via KMM on OKD (OpenShift upstream)

Hello Team,

We are working with the on-premise build mode approach via KMM to build the dGPU driver image, as documented at the GitHub link below, because we are using a 6.x kernel version with OKD.

[https://github.com/intel/intel-technology-enabling-for-openshift/tree/main/kmmo#managing-intel-dgpu-driver-with-kmm-operator]

Using on-premise build mode:
Prior to using this mode, run the following commands to create a ConfigMap that includes the Dockerfile used to build the driver container image:

$ git clone https://github.com/intel/intel-data-center-gpu-driver-for-openshift.git && cd intel-data-center-gpu-driver-for-openshift/docker

$ oc create -n openshift-kmm configmap intel-dgpu-dockerfile-configmap --from-file=dockerfile=intel-dgpu-driver.Dockerfile

To use this mode, run the following command:

$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/kmmo/intel-dgpu-on-premise-build-mode.yaml

We made slight changes to the ConfigMap because the Driver Toolkit is not available for OKD, and we have built a custom image. The build is triggered, but we are seeing the error below (snippet follows).

We are also sharing the complete logs of the build pod for analysis and would appreciate the team's support in taking this forward.

Turn off this advice by setting config variable advice.detachedHead to false

Generating local configuration database from kernel ...Kernel version parse failed!
make: *** [Makefile:45: olddefconfig] Error 1
error: build error: error building at STEP "RUN git clone ...odules_install": error while running runtime: exit status 2
[intel-dgpu-on-premise-build-mode-build-5mc65-undefined.log](https://github.com/intel/intel-technology-enabling-for-openshift/files/11845798/intel-dgpu-on-premise-build-mode-build-5mc65-undefined.log)

Firmware path can be tuned via KMM 2.0

NFD Feature Rules for GPU type aren't deployed

Per https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/labels.md,

nodes can label the available cards using these rules: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/nfd/overlays/node-feature-rules/platform-labeling-rules.yaml

These rules should probably be installed by default, since this is deployed via the OpenShift Operator; otherwise admins have to create the rules themselves.
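
In the meantime, the rules can be applied manually; for example (assuming the raw form of the URL above):

$ oc apply -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/nfd/overlays/node-feature-rules/platform-labeling-rules.yaml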

Support heterogeneous (different types of) Intel GPU cards in the same OCP cluster

Summary

Support the heterogeneous (different) Intel GPU cards in the same OCP cluster.

Detail

In the scenario where different Intel GPU cards, such as Max 1100, Flex 140, and Flex 170, are provisioned in the same cluster, a mechanism should be provided for users to pick the GPU card they want to run their workloads on.
To align with the taints/tolerations mechanism from the Red Hat OpenShift AI accelerator profile, we will use the same taints/tolerations mechanism for this feature (see the sketch after the note below).

To label (taint) the nodes in the cluster automatically, we will rely on the NFD node-tainting feature.

So this feature relies on issue openshift/cluster-nfd-operator#356

Note

This feature covers heterogeneous (different) Intel GPU cards in the same OCP cluster.
Different Intel dGPU cards in the same node are not supported.
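
As an illustration of the taints/tolerations direction (a sketch only; the taint key, label, and values are hypothetical, since the actual scheme depends on the NFD node-tainting feature above), a workload targeting Max 1100 nodes might look like:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                       # hypothetical
spec:
  nodeSelector:
    gpu.intel.com/product: max-1100        # hypothetical label
  tolerations:
  - key: gpu.intel.com/type                # hypothetical taint key
    operator: Equal
    value: max-1100
    effect: NoSchedule
  containers:
  - name: app
    image: quay.io/example/gpu-app:latest  # placeholder
    resources:
      limits:
        gpu.intel.com/i915: 1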

Support SHA digest images in the device plugin CRD

Summary

SHA digests should be supported by the Intel device plugins operator to protect the integrity of the device plugin image.

Priority

Detail

By default on OCP, a SHA256 digest is used to point to the device plugin image and secure the integrity of the image, but the Intel device plugins operator cannot support SHA256 digests.
See #03220971.
See Image SHA Digests vs. Image Tags.

Suggestion Solutions

  • To work around this issue, when creating the DevicePlugin CR (using SgxDevicePlugin as an example), an error message like "intel-sgx-plugin@sha256. Make sure you use '/intel-sgx-plugin:'" might be shown to suggest that users replace the SHA digest with an image tag like 0.24.0 (see the sketch below).
  • To resolve the issue, the Intel device plugins operator should support SHA digests in the device plugin CRD.
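
For reference, the workaround amounts to pointing the CR at a tag instead of a digest; a minimal sketch using SgxDevicePlugin (the image path below is a placeholder):

apiVersion: deviceplugin.intel.com/v1
kind: SgxDevicePlugin
metadata:
  name: sgxdeviceplugin-sample
spec:
  image: docker.io/intel/intel-sgx-plugin:0.24.0   # tag form; the @sha256:<digest> form is currently rejected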

Setting kernel command line parameters during Day 1 OCP cluster provisioning

Summary:

The goal of this issue is to explore how to set kernel command line parameters during Day 1 provisioning of an OCP cluster.

Kernel command line parameters like intel_iommu are required to enable feature provisioning for QAT, DSA, and IAA. The current approach is to use machine configuration, but that leads to a reboot. It may be possible to set these parameters while the user is provisioning the cluster on Day 1; pathfinding and exploration are needed to determine feasibility.
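
For context, the current Day 2 approach looks roughly like the sketch below (a MachineConfig with kernel arguments; applying it is what triggers the reboot this issue wants to avoid). In principle the same manifest could be injected at Day 1 among the installer-generated manifests (openshift-install create manifests), which is one avenue the exploration could evaluate.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 100-worker-intel-iommu                 # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - intel_iommu=on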

v1.1.0 Release Checklist

v1.1.0 Release Checklist

  • Adding support for OCP 4.13.10+

  • Decide the upstream version of Intel Device Plugins. The current leading candidate is v0.28.0.

  • Move to UBI 9-based images for all plugin, init-container, and GPU driver container images.

  • Release GPU driver version 2.0.0 for OpenShift at https://github.com/intel/intel-data-center-gpu-driver-for-openshift.

  • Review SELinux requirements and whether any patches are needed in the container-selinux project.

  • Review NFD configuration and add node feature rules.

  • Review KMM configuration and Module CR updates for KMM version 1.1.1.

  • Machine Configuration

  • Build init-container and plugin images

  • Build operator bundle and operator image

  • Update libraries and images to UBI 9 and verify SGX, QAT, and GPU workloads

  • Update Documentation

  • Cut the release

QAT end-to-end solution: run and develop qatlib-based applications seamlessly in the OCP container environment

Summary

Run and develop qatlib-based applications seamlessly in the OCP container environment with RH-distributed qatlib packages.

Detail

The 1.0.1 release already provides Intel QAT provisioning for the OCP platform, and end users can access QAT resources in the OCP container environment. However, end users cannot easily and seamlessly run and develop qatlib-based applications with the RH-distributed qatlib packages.

Currently, RH qatlib packages are distributed through specific repos, and end users have to install the related QAT libraries with the proper subscriptions. There is no proper documentation for end users on how to configure the repos and install the related RPM packages.

So it is not easy for users to run and develop qatlib-based applications on the OCP platform.

Suggested Solutions

  1. The RH-distributed qatlib package should be carefully tested and verified in the OCP environment (a sketch of the implied build configuration follows this list):
  • the lib should be built without the --disable-fast-crc-in-assembler option
  • the lib should be built with the --enable-systemd=no option for the container environment
  • for other requirements, see the qatlib install README
  2. qatlib should be very easy for users to install and use in the OCP container environment:
  • qatlib packages should be integrated into a UBI base image so that end users can install directly from the default repo without adding any other repo
  • the steps to configure and install qatlib should be well documented
  3. Some RH-distributed qatlib-based reference workloads should be provided.
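
A sketch of the build configuration implied by point 1, assuming qatlib's standard autotools flow (exact steps may differ by version; see the qatlib install README):

$ git clone https://github.com/intel/qatlib.git && cd qatlib
$ ./autogen.sh
$ ./configure --enable-systemd=no    # for containers; note: do not pass --disable-fast-crc-in-assembler
$ make -j $(nproc)
$ make install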

Action

A case has been filed with Red Hat.

P1-Blocker: Cannot run QAT workloads in a non-privileged container

Updated according to the comments from @vbedida79, @mregmi, and @mythi.

Summary

To run a containerized qatlib-based app in OCP, we currently have to run it as a privileged container. However, privileged containers are not allowed for user application workloads in OCP because of the security risk.

Detail

In the OCP containerized environment, when running the test cases from qatlib's cpa_sample_code to validate functionality using a non-privileged container, the tests fail during initialization because memory cannot be allocated. All tests pass in a privileged container.

dma_map_slab:200 VFIO_IOMMU_MAP_DMA failed va=7f8709cca000 iova=200000 size=200000 -- errno=12
[error] SalCtrl_ServiceInit() - : Failed to initialise all service instances
[error] SalCtrl_ServiceEventStart() - : Private data is NULL
qaeMemInit started
ADF_UIO_PROXY err: adf_init_ring: unable to get ringbuf(v:(nil),p:(nil)) for rings in bank(0)
ADF_UIO_PROXY err: icp_adf_transCreateHandle: adf_init_ring failed
ADF_UIO_PROXY err: adf_user_subsystemInit: Failed to initialise Subservice SAL
ADF_UIO_PROXY err: adf_user_subsystemStart: Failed to start Subservice SAL
ADF_UIO_PROXY err: icp_adf_subsystemUnregister: Failed to shutdown subservice SAL.
quickassist/lookaside/access_layer/src/sample_code/performance/cpa_sample_code_main.c, main():479 Could not start sal for user space
[error] SalCtrl_AdfServicesStartedCheck() - : Sal Ctrl failed to start in given time

Root cause

In OCP, the IPC_LOCK capability is needed in the SCC (security context constraints) to enable DMA from userspace for the QAT VFIO device.

Solution
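
A minimal sketch of the direction implied by the root cause above: grant the workload only the IPC_LOCK capability (allowed via a custom SCC) instead of running it privileged. The names and the QAT resource name below are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: qatlib-sample                            # hypothetical
spec:
  containers:
  - name: cpa-sample-code
    image: quay.io/example/qatlib-tests:latest   # placeholder image
    securityContext:
      capabilities:
        add:
        - IPC_LOCK                               # enables userspace DMA for the QAT VFIO device
    resources:
      limits:
        qat.intel.com/cy: 1                      # illustrative QAT resource name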

Intel SGX Device Plugin returns error "permission denied" on OpenShift 4.13

Summary

During installation of the Intel SGX Device Plugin, an error occurs indicating a lack of access permissions to the kubelet.sock socket from the intel-sgx-plugin pod. This error happens on OpenShift 4.13 and was not present on OpenShift 4.12.

Detail

During installation of the Intel SGX Device Plugin, the error below occurs:

oc -n openshift-operators logs pod/intel-sgx-plugin-ng262 -c intel-sgx-plugin
...
E0823 14:04:16.980975       1 manager.go:146] Failed to serve sgx.intel.com/provision: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"

As a workaround, I added privileged access rights for the DaemonSet/pod using the command line below:

oc -n openshift-operators edit ds/intel-sgx-plugin

After replacing:

securityContext:
    allowPrivilegeEscalation: false

with:

securityContext:
    privileged: true

it started working. Most probably, such privilege escalation is not needed and access can be limited to only the necessary privileges.

Resolving this issue would be very helpful because 4.13 is the current version of OpenShift, and this plugin works without any issues on OpenShift 4.12, the previous version.

Also, it would be great to make sure that this issue does not occur in the upcoming OpenShift version 4.14. Many thanks in advance!

Update as of Dec 14, 2023, from @mregmi's latest comment:
Still waiting on the fix to propagate to OCP 4.13 and 4.14 (https://issues.redhat.com/browse/OCPBUGS-20022).

  • Issue root cause: kubelet is running with the wrong label on OCP 4.13 and higher.

Workaround:

Since kubelet runs with the wrong label on OCP 4.13 and beyond, we need to run SELinux in permissive mode as a workaround. To do this, run the following commands on all nodes.

  1. Find all nodes in the OCP cluster:
$ oc get nodes

Example output:

NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
  2. Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following commands in the terminal. Repeat this step for any other nodes in the cluster.
$ chroot /host
$ setenforce Permissive

Avoiding node rebooting for machine configurations

Summary:

Node rebooting presents several challenges. Certain machine configurations require a reboot of one or more nodes in an OpenShift cluster. Typically, machine configuration (MachineConfig) updates or changes on an OpenShift cluster are facilitated by the Machine Config Operator (MCO). From a cluster administrator or end-user perspective, reboots may not be preferred in a production environment for a variety of reasons.

Process:

A reboot of a node typically involves cordoning the node (preventing the scheduler from placing new pods onto it). Then the node is drained, meaning all running pods are removed from it. When possible, the scheduler attempts to reschedule pods evicted from node A onto another node B; this scenario can prove challenging. During the reboot the node state is NotReady, and the node becomes Ready again if the reboot succeeds gracefully. Finally, the node is uncordoned (marked as schedulable), meaning new pods can be scheduled on it. If multiple nodes are targeted by a specific MachineConfig, they are typically rebooted sequentially.

Examples:

  • Since the default firmware directory /lib/firmware is read-only on OCP cluster nodes, a MachineConfig is used to set an alternative firmware path via firmware_class.path=/var/lib/firmware so that out-of-tree (OOT) firmware can be loaded on an RHCOS node. The Kernel Module Management (KMM) Operator copies the firmware from the driver container to the alternative firmware path after the driver container is deployed. This approach is used to load OOT GPU firmware and provision the Intel GPU card on OpenShift.

  • Similarly, for QAT, the kernel parameter intel_iommu is turned on via the MCO. All MCO operations trigger a one-time reboot per node to reach the desired configuration.

Goal:

When possible, the goal is to perform the configuration operations at runtime to avoid disruption to the cluster and workloads.

Possible Solutions to Certain Scenarios:

In certain scenarios, it may be possible to facilitate a configuration change at runtime.

  • For the alternative firmware path, it may be possible to have KMM configure the lookup path at runtime before loading any module.
    The lookup path is configured on the node with the following command: echo /var/lib/firmware > /sys/module/firmware_class/parameters/path (see also the sketch after this list).
    For more details on firmware search paths, review the details here.

  • Another option is to deploy a privileged DaemonSet that configures the lookup path at runtime and then sleeps forever.
    Note that if the node is rebooted, the lookup path has to be configured again. With the above two options, the lookup path should always be configured prior to the load of any module; this should be guaranteed by design.

  • Here is a successful example of facilitating a node configuration change at runtime: KMM 1.1 facilitates removal of an in-tree module prior to loading the OOT module at runtime.
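
For reference, the runtime lookup-path change from the first option can be applied and verified on a node as follows (a sketch; run from a node debug shell or a privileged pod):

$ echo /var/lib/firmware > /sys/module/firmware_class/parameters/path
$ cat /sys/module/firmware_class/parameters/path
/var/lib/firmware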

Using Intel GPUs with Microshift

Identify Intel GPUs that support MicroShift, i.e., Intel hardware suitable for edge devices.
Try out the installation and document the steps needed (e.g., whether any special driver installation is required).

v1.2.1 Release Checklist

v1.2.1 Release

  • The 1.2.1 release is GA for OCP 4.14.11
  • Support for the Intel Data Center GPU Max Series
  • All features verified in the previous release will be supported
  • Upgrades and major bug fixes, if any

Release Checklist

  • Machine Configs for GPU review (reboot-free GPU provisioning supported; MCO no longer used)
  • KMM review
  • NFD review (for SGX and NFR)
  • Plugin images and changes
  • #214
  • Update tests
  • Build operator bundle and certify, if needed
  • Update documentation, READMEs, and the release table
  • Cut the release

Device Plugin CR webhook error

Summary:

When creating a [Sgx/Qat/Gpu]DevicePlugin CR after the Intel Device Plugins operator is installed, a webhook error is observed. The error appears in the OpenShift web console as well as through the CLI.

Error message:

Error "failed calling webhook "vsgxdeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-controller-manager-service.openshift-operators.svc:443/validate-deviceplugin-intel-com-v1-sgxdeviceplugin?timeout=10s": dial tcp 10.131.1.204:9443: connect: connection refused" for field "undefined".

Analysis:

A webhook is called to validate the [Sgx/Qat/Gpu]DevicePlugin CR YAML input parameters when the user creates the DevicePlugin. Initial evidence suggests this issue is network- or environment-related.

Impact:

The user is unable to consistently create a [Sgx/Qat/Gpu]DevicePlugin CR on an OpenShift cluster, which impacts the user experience. The error is not consistent: it may appear on the first try and continue to appear on subsequent retries, but if the user keeps clicking the "Create" button in the web console, the DevicePlugin CR is eventually created without error.
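
When this occurs, a reasonable first check (standard Kubernetes triage, not a confirmed fix) is whether the webhook's backing pod and service endpoints are up; the service name comes from the error message above:

$ oc get pods -n openshift-operators | grep inteldeviceplugins-controller-manager
$ oc get endpoints inteldeviceplugins-controller-manager-service -n openshift-operators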
