azure / azhpc-diagnostics
Scripts that run on Azure VMs and gather a variety of diagnostic information for debugging common VM, GPU, and InfiniBand issues.
License: MIT License
We need a way of verifying which version of the tool was run to produce a given output.
This is important for determining whether oddities in the output stem from the system it was run on or from defects in an outdated (or current) version of the script.
The tool does not currently support collecting InfiniBand diagnostics on non-SR-IOV-enabled SKUs.
The script should recover from dcgmi hangs by enforcing an appropriate timeout on each run level and continuing gracefully, as sketched below.
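A minimal sketch using the coreutils timeout watchdog (the 600-second limit and the run-level list are illustrative assumptions, not the tool's actual values):
```bash
# Run each dcgmi diagnostic level under a watchdog so a hang cannot
# stall the whole collection.
for level in 1 2 3; do
    if ! timeout --kill-after=30s 600s dcgmi diag -r "$level" > "dcgmi-diag-r${level}.out" 2>&1; then
        echo "dcgmi diag -r $level timed out or failed; skipping deeper levels" >&2
        break
    fi
done
```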
It would be useful to collect 'lsmod' output.
There should be a check for GPUs experiencing this type of issue:
nvidia-smi nvlink -s
....
GPU 3: Tesla V100-SXM2-32GB (UUID: ...)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
Link 2: 25.781 GB/s
Link 3: <inactive>
Link 4: 25.781 GB/s
Link 5: <inactive>
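A simple check along these lines could match the literal `<inactive>` marker seen in the output above (a sketch, not the tool's actual logic):
```bash
# Warn when any NVLink lane reports as inactive.
if nvidia-smi nvlink -s 2>/dev/null | grep -q '<inactive>'; then
    echo "WARNING: one or more NVLink lanes are reported as <inactive>" >&2
fi
```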
I was investigating an incident involving IB & PKEYs. In the collected data files, the information in /$node/Infiniband/$device/pkey0 and pkey1 is missing:
No pkey found
I checked the other files; the IB drivers are installed correctly, so it's unlikely that these two files are actually absent from the VM.
Then I checked diagnostics.sh and figured out that the file path might be wrong.
The correct path for pkey (in my case) is:
/sys/class/infiniband/mlx5_0/ports/1/pkeys/0
The script is looking for:
Line 254 if [ -f "$device/ports/pkeys/0" ]; then
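A possible fix is to iterate over the per-port pkeys directories instead of the nonexistent flat path (a sketch, not the exact patch; $device comes from the script's existing device loop, and $outdir is a hypothetical stand-in for wherever the pkey files are written):
```bash
# Look under each port's pkeys directory rather than $device/ports/pkeys/0,
# which does not exist in sysfs.
for port in "$device"/ports/*/; do
    for idx in 0 1; do
        if [ -f "${port}pkeys/${idx}" ]; then
            cp "${port}pkeys/${idx}" "$outdir/pkey${idx}"
        fi
    done
done
```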
Hyper-V KVP data contains several useful pieces of information, and it would be good to collect it through the diagnostic scripts.
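Assuming the standard hv_kvp_daemon layout, a sketch of collecting the pools ($outdir is a hypothetical output directory):
```bash
# The hv_kvp_daemon stores its pools as .kvp_pool_N files under
# /var/lib/hyperv; copy whatever is present.
if ls /var/lib/hyperv/.kvp_pool_* >/dev/null 2>&1; then
    cp /var/lib/hyperv/.kvp_pool_* "$outdir/"
fi
```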
There should be a file (general.log is currently the closest thing to it) that holds the most general info or highlights about the system.
sudo ${HPCDIAG_EXE_DIR}/gather_azhpc_vm_diagnostics.sh -d . --gpu-level=3 --mem-level=1
etc
Running Nvidia GPU Diagnostics
Querying Nvidia GPU Info, writing to {output}/Nvidia/nvidia-smi-q.out
Running plain nvidia-smi, writing to {output}/Nvidia/nvidia-smi.out
Dumping Nvidia GPU internal state to {output}/Nvidia/nvidia-debugdump.zip
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
(the same three ERROR lines repeat seven more times)
Temporarily starting nv-hostengine
etc
Checking for common issues
Checking for GPUs with corrupted infoROM
Checking for GPUs with row remapping failures
Checking for GPUs that don't appear in nvidia-smi
Checking PCIe speed training for GPUs
Could not find pkey 0 for device mlx5_ib0
Could not find pkey 1 for device mlx5_ib0
Could not find pkey 0 for device mlx5_ib1
Could not find pkey 1 for device mlx5_ib1
Could not find pkey 0 for device mlx5_ib2
Could not find pkey 1 for device mlx5_ib2
Could not find pkey 0 for device mlx5_ib3
Could not find pkey 1 for device mlx5_ib3
Could not find pkey 0 for device mlx5_ib4
Could not find pkey 1 for device mlx5_ib4
Could not find pkey 0 for device mlx5_ib5
Could not find pkey 1 for device mlx5_ib5
Could not find pkey 0 for device mlx5_ib6
Could not find pkey 1 for device mlx5_ib6
Could not find pkey 0 for device mlx5_ib7
Could not find pkey 1 for device mlx5_ib7
Are these errors normal on NDv4?
I ran this on multiple NDv4 VMs and got the same errors shown above.
The script fails to identify standard_nc64as_t4_v3 as an Nvidia GPU SKU and therefore won't gather GPU data.
https://github.com/Azure/azhpc-diagnostics/blob/main/Linux/src/gather_azhpc_vm_diagnostics.sh#L190
This line should include the 64-vCPU size as well.
Line 196 is missing a case:
[[ "$clean" =~ ^standard_nd96i?sr(_h100)?_v5$ ]] ||
The current implementation of checking for GPU/IB drivers relies on checking to see if CLI tools that come installed with the drivers are functioning (e.g. nvidia-smi, ibstat).
Notably, this relies on those binaries being on the PATH, which can be a source of confusion in containerized environments, and in general it's not as direct a method as it could be.
For instance, checking for known kernel modules via lsmod could give us a clearer picture of what drivers have been successfully installed.
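A sketch of that module-based check (the module names here are the common ones for the Nvidia and Mellanox stacks):
```bash
# Check for the driver kernel modules directly instead of relying on
# CLI tools being on PATH.
lsmod | grep -qw nvidia  && echo "nvidia kernel module loaded"
lsmod | grep -qw mlx5_ib && echo "mlx5_ib kernel module loaded"
lsmod | grep -qw ib_core && echo "ib_core kernel module loaded"
```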
If we see in nvidia-smi or another GPU check that there are currently pages with ECC errors that are pending retirement, we can print a message to let the user know that this is fixable with a reboot and doesn't require hardware servicing.
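A hedged sketch using nvidia-smi's retired_pages.pending query field:
```bash
# If any GPU reports pages pending retirement, a reboot (not hardware
# servicing) is what clears them; surface that to the user.
pending=$(nvidia-smi --query-gpu=retired_pages.pending --format=csv,noheader 2>/dev/null)
if echo "$pending" | grep -qi yes; then
    echo "NOTE: GPU pages are pending retirement; a VM reboot should resolve this."
fi
```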
To support CycleCloud deployments, we need to collect whatever logs we can that relate to CycleCloud and whichever scheduler is in use.
Some sample CycleCloud Logs:
A flag to set at runtime to instruct the script to skip any activities that require Internet access.
Intended for use in sensitive or air-gapped environments.
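A minimal sketch of such a flag (the --no-internet name is hypothetical):
```bash
# Hypothetical --no-internet flag: when present, skip anything that
# reaches out to the network (version checks, downloads, etc.).
OFFLINE=false
for arg in "$@"; do
    [ "$arg" = "--no-internet" ] && OFFLINE=true
done
if [ "$OFFLINE" = false ]; then
    :  # network-dependent steps would go here
fi
```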
We need to add IB health check coverage for older-generation RDMA SKUs (e.g., H16r).
Can you please update the support matrix?
Thank you!
It would be useful for the tool to collect package manager logs such as dpkg.log or yum.log.
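A minimal sketch ($outdir is a hypothetical name for the tool's output directory):
```bash
# Grab whichever package manager log the distro actually has.
for log in /var/log/dpkg.log /var/log/yum.log /var/log/dnf.log; do
    [ -f "$log" ] && cp "$log" "$outdir/"
done
```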
Compare the set of devices showing up on the PCI bus to the expected set for the current VM size.
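A hedged sketch that gets the VM size from IMDS and compares GPU counts (the lookup table is an illustrative assumption, not real reference data):
```bash
# Look up the VM size from the Azure Instance Metadata Service, then
# compare against an expected device count for that size.
vm_size=$(curl -s -H Metadata:true \
    "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text")
case "$vm_size" in
    Standard_ND96asr_v4) expected_gpus=8 ;;
    *)                   expected_gpus="" ;;
esac
actual_gpus=$(lspci | grep -ci nvidia)
if [ -n "$expected_gpus" ] && [ "$actual_gpus" -ne "$expected_gpus" ]; then
    echo "WARNING: expected $expected_gpus NVIDIA PCI devices, found $actual_gpus" >&2
fi
```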
The tool should be able to attempt to query this repo for any updates and apply them before running.
This is to avoid usage of out-of-date versions of the tool in scenarios where it has been installed onto a machine for a long time.
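A minimal self-update sketch against the main branch (URL taken from this repo's layout; no signature or integrity verification):
```bash
# Fetch the latest script and re-exec it if it differs from the copy
# currently running.
latest=$(mktemp)
url="https://raw.githubusercontent.com/Azure/azhpc-diagnostics/main/Linux/src/gather_azhpc_vm_diagnostics.sh"
if curl -fsSL "$url" -o "$latest" && ! cmp -s "$latest" "$0"; then
    echo "Newer version found; re-running updated script" >&2
    exec bash "$latest" "$@"
fi
rm -f "$latest"
```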
There is a known issue affecting VMs with ConnectX-5 cards that prevents IB devices from being found and needs to be fixed with a firmware update.
The tool should be able to detect that particular issue and alert its user to contact support for assistance.
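From inside the VM the detection can only be heuristic; a sketch assuming the symptom is "ConnectX-5 visible on the PCI bus but no devices in /sys/class/infiniband":
```bash
# Heuristic only: the exact symptom and affected firmware versions
# are assumptions.
if lspci | grep -qi 'ConnectX-5' && [ -z "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    echo "WARNING: ConnectX-5 present but no IB devices found;" \
         "possible known firmware issue -- please contact support." >&2
fi
```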
Please use "ibv_devinfo -v" to collect all the attributes of the IB device(s)
If not already collected, the output of the PKEYs on the VM:
cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/*
Is there a direct test to check whether Accelerated Networking is enabled? I'm not sure if this is in the VM metadata, though indirectly it is discernible through lspci, ibv_devinfo, etc.
Additionally (unrelated to this ask), how can we add information on whether the VM is publicly accessible (over SSH, public IP, port 22 open)? This may be relevant for knowing whether we can access the VM directly (with customer consent) or will need to coordinate access with the customer.
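On the first question, one indirect but scriptable test is to look for the Mellanox virtual function NIC that Accelerated Networking attaches (a heuristic, not an official check):
```bash
# An AN-enabled VM exposes a Mellanox Virtual Function NIC on the PCI
# bus; absence of this line suggests AN is disabled.
if lspci | grep -qi 'mellanox.*virtual function'; then
    echo "Accelerated Networking appears to be enabled"
fi
```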
Currently, the tool takes /var/log/syslog and dmesg, which captures the system logs we want on Ubuntu. However, RHEL-based distros tend to put logs into /var/log/messages instead of /var/log/syslog, and newer distros (including Ubuntu) are switching over to logging using journald. The tool should be intelligent enough to gather the logs we need in any of these cases.
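A sketch of that fallback chain ($outdir is a hypothetical output directory):
```bash
# Prefer journald when it is usable; otherwise fall back to whichever
# classic log file the distro keeps.
if command -v journalctl >/dev/null 2>&1 && journalctl --no-pager -n1 >/dev/null 2>&1; then
    journalctl --no-pager > "$outdir/journal.log"
elif [ -f /var/log/syslog ]; then
    cp /var/log/syslog "$outdir/"      # Debian/Ubuntu
elif [ -f /var/log/messages ]; then
    cp /var/log/messages "$outdir/"    # RHEL family
fi
```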
There should be a way to limit the output file size.
In some cases there are lots of system logs, leading to an overall output size of over 70MB. To avoid creating a huge payload to transfer around in cases like this and even more extreme ones, we need a method for controlling the overall output size.
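One possible mechanism is a per-file cap, keeping only the tail of oversized logs (the 10 MB value and the $outdir/logs layout are illustrative assumptions):
```bash
# Cap each collected log at a maximum size, keeping the tail, which
# usually holds the relevant recent entries.
MAX_BYTES=$((10 * 1024 * 1024))
for f in "$outdir"/logs/*; do
    if [ "$(stat -c %s "$f")" -gt "$MAX_BYTES" ]; then
        tail -c "$MAX_BYTES" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
done
```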
The ibstat command relies on the ib_umad kernel module to be loaded to give full output.
In the event that it's not loaded, the ibstatus command can be used instead to still get some of the information that ibstat provides, in particular, link info.
We'd like to be able to adapt to missing ib_umad, which is not necessarily loaded when not running on the official Azure HPC marketplace images, and still obtain useful info such as link info and firmware version.
Both of these commands are from the Mellanox InfiniBand Fabric Utilities
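A hedged sketch of that fallback; the fw_ver sysfs attribute covers the firmware version that ibstatus itself omits:
```bash
# Use ibstat when ib_umad is loaded; otherwise fall back to ibstatus,
# which reads sysfs directly, and pull the firmware version ourselves.
if lsmod | grep -qw ib_umad; then
    ibstat
else
    echo "ib_umad not loaded; falling back to ibstatus" >&2
    ibstatus
    for dev in /sys/class/infiniband/*; do
        echo "$(basename "$dev") firmware: $(cat "$dev/fw_ver")"
    done
fi
```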
The open source Nouveau GPU drivers are not supported on the compute GPU sizes, so it is worth flagging any time someone has them installed.
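A minimal check:
```bash
# Flag the unsupported open source driver whenever it is loaded.
if lsmod | grep -qw nouveau; then
    echo "WARNING: nouveau driver is loaded; it is not supported on Azure compute GPU sizes" >&2
fi
```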
As a user of N-series and H-series VMs in AKS, I want the ability to run the azhpc-diagnostics tool in a Kubernetes environment, so that I can help the Azure support team gather the diagnostic data they need to root-cause and mitigate platform issues encountered on these VMs.
Azure users are sometimes asked by Azure support to run the azhpc-diagnostics tool to gather data when an error condition arises that could be due to the platform. However, it's not clear how to run this tool within a Kubernetes environment. This issue is a request to support running the tool in Kubernetes (and thereby AKS).
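One possible approach, assuming kubectl 1.27+ and permission to debug nodes (`<node-name>` is a placeholder):
```bash
# Open a privileged debug pod on the affected node; the node's root
# filesystem is mounted at /host, so the script can run via chroot.
kubectl debug node/<node-name> -it --profile=sysadmin --image=ubuntu -- \
    chroot /host /bin/bash -c \
    "curl -sL https://raw.githubusercontent.com/Azure/azhpc-diagnostics/main/Linux/src/gather_azhpc_vm_diagnostics.sh | bash"
```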
When Accelerated Networking is enabled, ibstat, run with no parameters, can fail due to mistaking the Ethernet interface for an IB one. When this happens, we don't get any information from ibstat:
```
~$ ls /sys/class/infiniband/
mlx5_an0/  mlx5_ib0/  mlx5_ib1/  mlx5_ib2/  mlx5_ib3/  mlx5_ib4/  mlx5_ib5/  mlx5_ib6/  mlx5_ib7/
~$ ibstat
ibpanic: [8371] main: stat of IB device 'mlx5_an0' failed: No such file or directory
~$ ibstat mlx5_ib0
CA 'mlx5_ib0'
        CA type: MT4124
        Number of ports: 1
        Firmware version: 20.28.4000
        Hardware version: 0
        Node GUID: 0x00155dfffe34072b
        System image GUID: 0x0c42a10300a58684
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 200
                Base lid: 65535
                LMC: 0
                SM lid: 361
                Capability mask: 0x2651ec48
                Port GUID: 0x00155dfffd34072b
                Link layer: InfiniBand
```
This is fixable by using the ibstatus command, also from the Mellanox Fabric Utilities, which does not fail when an AN interface is present. We can even use it to determine which interfaces are actually InfiniBand ones, which would allow us to get the full output of ibstat by running it against one interface at a time.
```
~$ ibstatus
Infiniband device 'mlx5_an0' port 1 status:
default gid: unknown
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet
Infiniband device 'mlx5_ib0' port 1 status:
default gid: fe80:0000:0000:0000:0015:5dff:fd34:072b
base lid: 0xffff
sm lid: 0x169
state: 2: INIT
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand
...
```
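Following that approach, a hedged sketch that uses the sysfs link_layer attribute to pick out the real IB devices before invoking ibstat:
```bash
# Run ibstat one device at a time so the mlx5_an0 Ethernet interface
# cannot make it panic.
for dev in /sys/class/infiniband/*; do
    if grep -q InfiniBand "$dev"/ports/*/link_layer 2>/dev/null; then
        ibstat "$(basename "$dev")"
    fi
done
```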
The azhpc-diagnostics tool currently does not support GPU diagnostics for the latest N-series SKUs, such as ND A100 v4 and ND H100 v5. Is there a timeline for updating the scripts? This tool is crucial for collecting logs for support tickets and analysis.
With ibstat outputs like
...
Port 1:
State: Down
Physical state: Polling
...
we should highlight this issue in the output and include the vmbus ID of the affected port.
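A hedged sketch; the readlink traversal for the parent device ID is an assumption and may resolve to a PCI address rather than a vmbus GUID on SR-IOV SKUs:
```bash
# Flag ports stuck in Down and report the parent device ID.
for dev in /sys/class/infiniband/*; do
    for port in "$dev"/ports/*; do
        if grep -q DOWN "$port/state" 2>/dev/null; then
            parent=$(basename "$(readlink -f "$dev/device")")
            echo "WARNING: $(basename "$dev") port $(basename "$port") is Down (device: $parent)" >&2
        fi
    done
done
```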
The output of the tool should include all the necessary files to submit a bug report to Nvidia.
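Nvidia's standard intake artifact is the archive produced by nvidia-bug-report.sh, which ships with the driver; a sketch ($OUTPUT_DIR is a hypothetical stand-in for the {output} directory used above):
```bash
# nvidia-bug-report.sh writes nvidia-bug-report.log.gz to the current
# directory; generate it straight into the tool's Nvidia output dir.
if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
    (cd "$OUTPUT_DIR/Nvidia" && nvidia-bug-report.sh)
fi
```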
There are open compliance tasks that need to be reviewed for your azhpc-diagnostics repo.
To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Azure GitHub organization.
Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/Azure/repos/azhpc-diagnostics/compliance
You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.
If you no longer need this repository, it might be quickest to delete the repo, too.
More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]
FYI: current admins at Microsoft include @JonShelley, @lostern, @jithinjosepkl, @souvik-de, @AnnaDaly, @edwardsp, @rluanma, @vermagit, @tlcyr4, @sakshamgupta006, @anshuljainansja, @alfpark
When nvidia-smi shows ERR! as a GPU's power usage, like this:
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00008ED0:00:00.0 Off | 0 |
| N/A 29C P0 ERR! / 250W | 593MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
the tool should point it out and explain that it's just a transient issue.
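A sketch of the detection (a string match on the literal ERR! token):
```bash
# Surface ERR! sensor readings with a note that they are usually
# transient rather than a hardware fault.
if nvidia-smi | grep -q 'ERR!'; then
    echo "NOTE: nvidia-smi reported ERR! for a sensor reading; this is typically transient."
fi
```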