
nvbitfi's Introduction

NVBitFI: An Architecture-level Fault Injection Tool for GPU Application Resilience Evaluations

NVBitFI provides an automated framework to perform error injection campaigns for GPU application resilience evaluation. NVBitFI builds on top of the NVIDIA Binary Instrumentation Tool (NVBit), a research prototype of a dynamic binary instrumentation library for NVIDIA GPUs. NVBitFI offers functionality similar to a prior tool called SASSIFI.

Please refer to our NVBitFI paper for additional details about the tool and some experimental results.

Summary of NVBitFI's capabilities

NVBitFI injects errors into the destination register values of a dynamic thread-instruction by instrumenting instructions after they are executed. A dynamic instruction is selected at random from all dynamic kernels of a program for error injection. Only one error is injected per run. This mode was referred to as IOV in SASSIFI. As of now (4/1/2020), NVBitFI allows us to select the following instruction groups to study how errors in them can propagate to the application output.

  • Instructions that write to general purpose registers
  • Single precision floating point instructions
  • Double precision floating point instructions
  • Load instructions

NVBitFI can be extended to include custom instruction groups. See below for more details.

For a selected destination register, the following errors can be injected.

  • Single bit-flip: one bit-flip in one register in one thread
  • Double bit-flip: bit-flips in two adjacent bits in one register in one thread
  • Random value: random value in one register in one thread
  • Zero value: zero out the value of one register in one thread

New bit-flip models can be added by modifying common/arch.h, injector/inject_funcs.cu, and scripts/params.py.

Prerequisites

Getting started on a Linux x86_64 PC

The following commands are tested on an x86 system with Ubuntu 18.04 using CUDA-11.2 and NVBit version 1.5.5.

# NVBit-v1.5.5
wget https://github.com/NVlabs/NVBit/releases/download/1.5.5/nvbit-Linux-x86_64-1.5.5.tar.bz2
tar xvfj nvbit-Linux-x86_64-1.5.5.tar.bz2
cd nvbit_release/tools/

# NVBitFI 
git clone https://github.com/NVlabs/nvbitfi
cd nvbitfi
find . -name "*.sh" | xargs chmod +x
./test.sh

On an ARM-based device (e.g., Jetson Nano)

# NVBit-1.5.5
wget https://github.com/NVlabs/NVBit/releases/download/1.5.5/nvbit-Linux-aarch64-1.5.5.tar.bz2
tar xvfj nvbit-Linux-aarch64-1.5.5.tar.bz2
cd nvbit_release/tools/

# NVBitFI 
git clone https://github.com/NVlabs/nvbitfi
cd nvbitfi
find . -name "*.sh" | xargs chmod +x
./test.sh

If these commands complete without errors, you just completed your first error injection campaign using NVBitFI. The printed output should say where the results are stored. A summary of the campaign is stored in a tab-separated file, results_*NVbitFI_details.tsv, which can be opened using a spreadsheet program (e.g., Excel) for visualization and analysis.
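For a quick look from the shell, without a spreadsheet, something like the following works from the directory where the campaign ran (the file-name pattern is the one described above):

# Locate the summary TSV and pretty-print it in the terminal
f=$(find . -name 'results*NVbitFI_details.tsv' | head -n 1)
column -t -s $'\t' "$f" | less -S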

Detailed steps

There are three main steps to run NVBitFI. We provide a sample script (test.sh) that automates nearly all these steps.

Step 0: Setup

  • One-time only: Copy the NVBitFI package to the tools directory in the NVBit installation (see the commands above)
  • Every time we run an injection campaign: Set up the environment (see Step 0 (2) in test.sh)
  • One-time only: Build the injector and profiler tools (see Step 0 (3) in test.sh)
  • One-time only: Run and collect golden stdout and stderr files for each of the applications (see Step 0 (4) in test.sh).
    • Record fault-free outputs: Record golden output file (as golden.txt), stdout (as golden_stdout.txt), and stderr (as golden_stderr.txt) in the workload directory (e.g., nvbitfi/test-apps/simple_add).
    • Create application-specific scripts: Create run.sh and sdc_check.sh scripts in the workload directory (a minimal sketch of both scripts follows this list). Instead of using absolute paths, please use environment variables for paths such as BIN_DIR, APP_DIR, and DATASET_DIR. These variables are set in the set_env function in scripts/common_functions.py. See the scripts in the nvbitfi/test-apps/simple_add directory for examples.
    • Workloads will be run from the logs/workload-name/run-name directory, so make sure the workload can run from there. If the program requires input files to be in a specific location, either update the workload or provide soft links to the input files in appropriate locations.
    • The program output should be deterministic. Please exclude non-deterministic values (e.g., runtimes) from the output files if they are present (see test-apps/simple_add/sdc_check.sh for more details).
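Below is a minimal sketch of the two scripts for a hypothetical workload named my_app, modeled on the ones in test-apps/simple_add; the binary name, its argument, and the exact diff logic are assumptions, so adapt them to your workload.

#!/bin/bash
# run.sh: PRELOAD_FLAG preloads the injector/profiler and BIN_DIR comes
# from set_env in scripts/common_functions.py; my_app and input.txt are
# hypothetical placeholders
eval ${PRELOAD_FLAG} ${BIN_DIR}/my_app input.txt > stdout.txt 2> stderr.txt

#!/bin/bash
# sdc_check.sh: compare this run's outputs against the golden copies
# (assumed logic; see test-apps/simple_add/sdc_check.sh for the real script).
# Filter out non-deterministic fields (e.g., runtimes) before diffing.
diff stdout.txt ${APP_DIR}/golden_stdout.txt > stdout_diff.log
diff stderr.txt ${APP_DIR}/golden_stderr.txt > stderr_diff.log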

Step 1: Profile and generate injection list

  • Profile the application: Run the program once using profiler/profiler.so. We provide the scripts/run_profiler.py script for this step (the Step 1 commands are sketched after this list). A new file named nvbitfi-igprofile.txt will be generated in the logs/workload-name directory. This file contains the instruction counts for all the instruction groups and opcodes defined in common/arch.h. One line is created per dynamic kernel invocation. Profiling is often slow as it instruments every instruction in every dynamic kernel. Using an approximate profile can speed it up by orders of magnitude. There are many ways to approximate a profile and trade off accuracy for speed. In this release we implement a method that approximates the profiles of all dynamic invocations of a static kernel with the profile of the first invocation of that static kernel. It essentially profiles each static kernel just once, which can make profiling very fast if a program has few static kernels and many dynamic invocations per kernel. This approximation can be enabled by using the SKIP_PROFILED_KERNELS flag while building the profiler.
  • Generate injection sites:
    • Ensure that the parameters are set correctly in scripts/params.py. The following parameters need user attention:
      • Setting maximum number of error injections to perform per instruction group and bit-flip model combination. See NUM_INJECTION and THRESHOLD_JOBS in scripts/params.py.
      • Selecting instruction groups and bit-flip models (more details in scripts/params.py).
      • Listing the applications, benchmark suite name, application binary file name, and the expected runtime on the system where the injection job will be run. See the apps dictionary in scripts/params.py for an example. The expected runtime defined here is used later to determine when to timeout injection runs (based on the TIMEOUT_THRESHOLD defined in scripts/params.py).
    • Run scripts/generate_injection_list.py to generate a file that contains a list of errors to be injected during the injection campaign. Instructions are selected randomly from the instructions of the selected instruction group.
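With the environment set up as in Step 0 of test.sh, Step 1 reduces to two script invocations from the nvbitfi directory. This is a sketch of the flow in test.sh; check test.sh for the exact environment and any arguments the scripts expect.

# Step 1 (1): profile the workload once; writes logs/workload-name/nvbitfi-igprofile.txt
python scripts/run_profiler.py

# Step 1 (2): randomly pick injection sites according to scripts/params.py
python scripts/generate_injection_list.py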

Step 2: Run the error injection campaign

Run scripts/run_injections.py to launch the error injection campaign. This script will run one injection run at a time in the standalone mode. If you plan to run multiple injection runs in parallel, please take special care to ensure that the output file is not clobbered. As of now, we support running multiple jobs on a multi-GPU system. Please see scripts/run_one_injection.py for more details.
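A sketch of the corresponding invocation follows; the standalone argument is an assumption based on the standalone mode described above, so see test.sh and scripts/run_one_injection.py for the exact usage.

# Run the campaign, one injection run at a time on the local GPU
python scripts/run_injections.py standalone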

Tip: Perform a few dummy injections before proceeding with the full injection campaign (by setting the DUMMY flag in injector/Makefile). Setting this flag will allow you to go through most of the injection handler code but skip the actual error injection. This is to ensure that you are not seeing crashes/SDCs that you should not see.
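One hedged way to flip the flag, assuming injector/Makefile passes it to nvcc as -DDUMMY=0 (as the build logs in the issues below suggest) and provides a clean target:

# Turn on dummy injections and rebuild the injector (assumed Makefile layout)
sed -i 's/-DDUMMY=0/-DDUMMY=1/' injector/Makefile
make -C injector clean
make -C injector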

Step 3: Parse the results

Use the scripts/parse_results.py script to parse the results. This script generates three tab-separated values (tsv) files. The first file shows the fraction of executed instructions for different instruction groups and opcodes. The second file shows the outcomes of the error injections. Refer to CAT_STR in scripts/params.py for the list of error outcome categories. The third file shows the average runtime for the injection runs for different applications and selected error models. These files can be opened using a spreadsheet program (e.g., Excel) for plotting and analysis.
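As with the earlier steps, the invocation itself is a single script call (a sketch of the flow in test.sh):

# Step 3: summarize the campaign into the three TSV files described above
python scripts/parse_results.py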

NVBitFI vs. SASSIFI

NVBitFI benefits from the features offered by NVBit. It can run on newer GPUs (e.g., Turing and Volta GPUs), and, unlike SASSIFI, it also works with pre-compiled libraries. NVBitFI is expected to be faster than SASSIFI because it instruments just the single chosen dynamic kernel for the injection runs (SASSIFI, as it was implemented, instrumented all dynamic kernels). As of now (April 14, 2020), NVBitFI implements a subset of the error injection models, and we may expand this over time (users are more than welcome to contribute).

Contributing to NVBitFI

If you are interested in contributing to NVBitFI, please open a Pull Request and complete the Contributor License Agreement.

nvbitfi's People

Contributors

flyree, sivahari


nvbitfi's Issues

Error running test.sh

Hello, I'm an engineering student trying to run this benchmark on a Jetson Nano. I run Ubuntu 18.04 and CUDA 10.2; however, when I run test.sh (after successfully having completed the previous instructions in the README), I get these errors. Do you have any ideas how to fix it? I tried to Google it and didn't find anything concerning the -lnvbit module...
Thanks a lot in advance!

Step 0 (3): Build the nvbitfi injector and profiler tools
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -arch=sm_35 -O3 inject_funcs.o injector.o -L../../../core -lnvbit -L/usr/local/cuda/lib64 -lcuda -lcudart_static -shared -o injector.so
nvlink warning : Skipping incompatible '../../../core/libnvbit.a' when searching for -lnvbit
/usr/bin/ld: skipping incompatible ../../../core/libnvbit.a when searching for -lnvbit
/usr/bin/ld: cannot find -lnvbit
collect2: error: ld returned 1 exit status
Makefile:27: recipe for target 'injector.so' failed
make: *** [injector.so] Error 1

Can't detect any instruction

I wrote a CUDA implementation of convolution in order to do fault injection on a CNN, but the injector does not detect any instructions. I use PyTorch C++ as a framework.
When I open the file stdout.txt in the directory where I store the executable, I get the following output:
inspecting forward_cuda_kernel(at::GenericPackedTensorAccessor<float, 4ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, 4ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, 1ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, 4ul, at::RestrictPtrTraits, int>, int, int, int, int, int, int) - num instrs 456
and a very long list of instructions (more than 1300), terminating with
NVBit-igprofile; ERROR FAIL in kernel execution!!
I was wondering what could be the meaning of this and how to solve it.

NOP instructions in Matrix Multiplication

Hello!

I am trying to run a series of tests to compare the reliability of different versions of matrix multiplication. The kernels that I am using have a parameter that allows changing the thread block size. I performed tests with this parameter set to 32x32 and had no problems or unexpected results. However, when I tried to change that parameter to 16x16 or 8x8, I started getting these types of results:

inspecting: void matrixMulCUDA<8>(float*,float*,float*,int,int)
num_static_instrs: 90
maxregs: 30(30)
Injection data
index: 0
kernel_name: void matrixMulCUDA<8>(float*,float*,float*,int,int)
ctas: 256
instrs: 10452992
grp 0: 0 grp 1: 2097152 grp 2: 3145728 grp 3: 278528 grp 4: 1671168 grp 5: 3260416 grp 6: 8781824 grp 7: 8503296
mask: 0x0
beforeVal: 0x0;afterVal: 0x0
regNo: -1
opcode: NOP
pcOffset: 0x0
tid: -1
Error not injected

I checked the injection file in the logs and found lines like this one in all the injections that failed:
1;voidmatrixMulCUDA<8>(float*,float*,float*,int,int);0;28898422;0.947758577437;0.204871567272:0x0:NOP: -1:0x0:15.610934:19::value_before0x0:value_after0x0

As I said, these injections on NOP instructions never happened with the 32x32 thread block size, but they happen almost 80% of the time with other values.

Thank you in advance!

Expected runtime definition

Hi all,
I am trying to estimate the Expected_runtime to define the Timeout fault.
Usually the Expected_runtime measured during the normal execution of the application is much shorter than the runtime measured by the tool when injecting faults (I guess that the nvbitfi instrumentation is the cause of the delay).

E.g: Inj_count=1, App=mEle_Sz256_Blk32, Mode=inst_value, Group=7, EM=0, Time=83.747101, Outcome: Masked: other reasons

So using the normal-execution time in the list of apps in params.py produces a lot of Timeouts. The other option is to use the maximum Time obtained in a DUMMY campaign; however, in that case there are no Timeouts in the results.
What is the right way to estimate the Expected runtime?
Thank you in advance

Total instruction count = 0 while analyzing TensorRT binary

NVBitFI not able to find instructions

Setup

I followed this guide on how to build an image recognition program in C++ using TensorRT (included in the imageNet library in the code). I can correctly compile and run the program.
The apps dictionary in scripts/params.py has been modified like this:

apps = {
    'recognition': [
        NVBITFI_HOME + '/test-apps/recognition',  # workload directory
        'recognition',                            # binary name
        NVBITFI_HOME + '/test-apps/recognition/', # path to the binary file
        1,                                        # expected runtime
        ""                                        # additional parameters to the run.sh
    ],
}

Inside the test-apps folder I created a new folder called recognition that contains the binary called recognition, a Makefile with a golden target built like this:

golden:
	./recognition $(ARGS) > golden_stdout.txt 2> golden_stderr.txt

a file called run.sh with the following content:

#!/bin/bash
eval ${PRELOAD_FLAG} ${BIN_DIR}/recognition polar_bear.jpg > stdout.txt 2> stderr.txt

and a file called sdc_check.sh which I simply copied from the official repository.

Lastly, I modified the test.sh script in order to execute the right binary at the beginning, with the following code:

printf "\nStep 0 (4): Run and collect output without instrumentation\n"
cd test-apps/recognition/
make golden ARGS=polar_bear.jpg
cd $CWD

Problem

During step 1 (2) of the execution script I encounter the following error:
Creating list for recognition ... Something is not right. Total instruction count = 0
It seems like the tool is not able to find any instruction in the binary and the execution stops.

Some small version fixes for nvbitfi

Hi,
When I was trying to run the first experiment, test.sh in README.md, I found there are several errors caused by version differences:

  1. In version 1.5 of NVBit, the *_pred functions/variables were changed to *_guard_pred; see its changelog. So I think you may want to change the function nvbit_add_call_arg_pred_val(i) to nvbit_add_call_arg_guard_pred_val(i) in both injector.cu and profiler.cu.
  2. I noticed that you may be using Python 2, so I received errors like TypeError: can only concatenate list (not "map") to list. I wonder if you could surround map() with list() where you do the join in parse_results.py.

Sorry for these small suggestions; I think it's OK if you just keep the current version, since developers may find these themselves. Thanks!

Error not injected when threads/block is different from 1024

Hi all!
Similar to a previous issue, I am having problems when injecting faults in a very simple matrix multiplication kernel and the number of threads/block is different from 32x32 (1024). Any other value (e.g., 16x16) produces some "Error not injected" results.

kernelName=matrixMulCUDA(float*,float*,float*,int,int,int)
kernelCount=0
groupID=7
bitFlipModel=0
instID=21059427
opIDSeed=0.401131
bitIDSeed=0.326217
inspecting: matrixMulCUDA(float*,float*,float*,int,int,int)
num_static_instrs: 282
maxregs: 32(32)
Injection data
index: 0
kernel_name: matrixMulCUDA(float*,float*,float*,int,int,int)
ctas: 64
instrs: 14057472
grp 0: 0 grp 1: 2097152 grp 2: 4194304 grp 3: 262144 grp 4: 1130496 grp 5: 6373376 grp 6: 12926976 grp 7: 12664832
mask: 0x0
beforeVal: 0x0;afterVal: 0x0
regNo: -1
opcode: NOP
pcOffset: 0x0
tid: -1
Error not injected

All the versions compile and pass the test (work correctly). I also tried to inject faults with the DUMMY flag, with the same results (some dummy injections work, others don't). In all cases I've rerun the profiler to be sure all is OK.
I've activated the VERBOSE_TOOLS flags, but the info is difficult to interpret since some numbers haven't got any identifier.
I've checked with Jetson Nano and TX2 boards, with the same result, and using different matrix sizes.

The kernel is very simple:
__global__ void matrixMulCUDA(float *C, float *A, float *B, int ldA, int ldB, int ldC) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    float *ptrA = &A[i*ldA]; // Pointer to the first element of row i of A
    float tmp = 0.0f;
    for (int k = 0; k < ldA; k++) {
        tmp += (*ptrA++) * B[k*ldB+j];
    }
    C[i*ldC+j] = tmp;
}

main() {
....
int block_size = 16;
dim3 dimsA(128, 128, 1); dim3 dimsB(128, 128, 1); dim3 dimsC(128, 128, 1);
dim3 threads(block_size, block_size);
dim3 grid(dimsC.x / threads.x, dimsC.y / threads.y);
....
matrixMulCUDA<<< grid, threads >>>(d_C, d_A, d_B, dimsA.x, dimsB.x, dimsC.x);
....
}

Please could you give me some hints for debugging the problem?

Thank you in advance.

High execution time for PyTorch models

Hello

I'm trying to perform fault injection on a PyTorch model (fasterrcnn_resnet50_fpn). However, the execution time for a single fault injection is too high: 155s for a single inference.

Is this expected? Is there a way to speed up the fault injection for PyTorch?

Thanks for the help

question on RTX Titan, thanks

Hello,

As you said, the tool should support Turing. However, when I executed it on a Titan RTX, it showed the error "does not support Titan RTX". I wondered whether you could provide us with any suggestions. Thanks.

Error while compiling injector

Working on a Jetson TX2 with the following nvcc version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

I tried executing test.sh on the latest commit 2f558ff89d5ff83025d19a85ae4405ad34bf1974, but it failed while compiling injector.cu with the following message:

injector.cu(301): error: identifier "nvbit_add_call_arg_pred_val" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00000271_00000000-6_injector.cpp1.ii".
Makefile:30: recipe for target 'injector.o' failed
make: *** [injector.o] Error 1

I tried reverting commits and I found the same error at 111519cc4be4f0e0365e141681fa410bd24c9528, while it compiles correctly at 4726929de5ae023bf9853327bac5a8254e027e52. Is it a problem with my configuration or with the code itself?

Thanks for the help

How to modify a specific register location in the register file regardless of the executed thread or block?

dear authors,

I am trying to inject (permanent) faults into the registers of an SM in the GPU. However, I could not find a way to guarantee that the injected fault hits the same hardware location of the SM regardless of the executed thread block. Is there a way to access (read/write) any location of the register file in the GPU using the available functions of NVBit or nvbitfi, without depending on the executed application?

I appreciate any suggestion or help about it.

Juan David

jetson-nano Kernel failure

Hi there!
I am using NVBitFI on the Jetson Nano, and running the example I at times get "Outcome: Pot DUE: SDC but Kernel Error". Looking at stdout.txt, it says "ERROR FAIL in kernel execution (unspecified launch failure);" and I found it weird that the device is identified as a Tegra X1: "Device 0 (NVIDIA Tegra X1) is being used".
I have tried to change the architecture to sm_53, but the problem persists.
Is this an error due to the injection? I am not sure. It is not a DUE, yet not an SDC; how should we consider it?
Could this have something to do with the fact that the GPU is also used for running X in Linux? I am using it through ssh, though.

Thanks!

Error when trying to inject: "Something is not right. Total instruction count = 0" in matrixMul

Hello everyone!
I'm using NVBitFI to inject SINGLE_BIT_FLIP into G_GP using some of the sample applications that come with the CUDA toolkit, but for some applications, like matrixMul, the tool doesn't profile the application. The run.sh command creates an empty nvbitfi-igprofile.txt.
The tool then stops when trying to generate the injection list, due to the emptiness of nvbitfi-igprofile.txt, exiting with the above message.
Does anyone know how to solve this problem?

Thanks!!

Injection list kernel instruction value scales linearly with injection time

I've come to notice that when injecting faults there seems to be some correlation between the value of the kernel instruction and the injection run latency.

For example, if the injection list is listed per injection site as:
kernel name/parameters, kernel index, kernel instruction, op seed, bit seed.

The injection time of the first of the following two injections is about 2x that of the second:
matmul_kernel 2 500000 0.234123 0.843278
matmul_kernel 2 250000 0.234123 0.843278

I'm curious if there is a reason this happens. I've dug through a lot of the code and can't figure out why. It seems odd that this number makes a difference, since I assume the binary is only instrumented in one location in both cases.

Error whilst using test.sh

Hello everyone,
I am a graduating computer science student, and I am trying to learn a bit more about the mechanics of fault injection through the nvbitfi tool.
After setting up the tools (nvbit and nvbitfi), I tried to run the pre-loaded test test.sh to see if everything would work.
Unfortunately this was not the case, at least for me: during Step 1 (1): Profile the application, the run.sh script in nvbitfi/test-apps/simple_add wouldn't create the log file needed to proceed. I checked the script and edited it so that it would create the needed log file.
After that, during Step 1 (2): Generate injection list for instruction-level error injections, the only output returned is just "Something is not right. Total instruction count = 0".
From this point on, I couldn't manage to understand the problem, which still persists.
I ran the tools, on:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"

with a kernel version: Linux 5.8.0-55-generic x86_64
on an Acer laptop with an NVIDIA 920M graphics card and an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz.
I made sure to meet all the requirements for the correct use of the tool.

I wonder if someone could help me fix the problem, so I can continue to learn and study this tool.

Thank you in advance.
Best regards

How to run a hello_cuda ELF when using nvbitfi?

Hi dear developer,
It's an honor to write here. I have read the README, but I can't understand how to run another ELF, like hello_cuda; I have followed the steps and built the test simple_add.
What can I do, and what shall I change, to test hello_cuda?


thank you
best regards
William

Error 139 and multi-generation devices

Hi,

Today I tried to install nvbitfi, but I have some issues.

  1. Here is the output of running test.sh:

Step 0 (2): Setting environment variables

Step 0 (3): Build the nvbitfi injector and profiler tools
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -I../../../core -I../common -maxrregcount=16 -Xptxas -astoolspatch --keep-device-functions -arch=sm_35 -DDUMMY=0 -Xcompiler -Wall -Xcompiler -fPIC -c inject_funcs.cu -o inject_funcs.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -dc -c -std=c++11 -I../../../core -I../common -Xptxas -cloning=no -Xcompiler -Wall -arch=sm_35 -O3 -Xcompiler -fPIC injector.cu -o injector.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -arch=sm_35 -O3 inject_funcs.o injector.o -L../../../core -lnvbit -L/usr/local/cuda-11.4/lib64 -lcuda -lcudart_static -shared -o injector.so
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -I../../../core -I../common -maxrregcount=16 -Xptxas -astoolspatch --keep-device-functions -arch=sm_35 -Xcompiler -Wall -Xcompiler -fPIC -c inject_funcs.cu -o inject_funcs.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -dc -c -std=c++11 -I../../../core -I../common -Xptxas -cloning=no -Xcompiler -Wall -arch=sm_35 -O3 -Xcompiler -fPIC profiler.cu -o profiler.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -arch=sm_35 -O3 inject_funcs.o profiler.o -L../../../core -lnvbit -L /usr/local/cuda-11.4/lib64 -lcuda -lcudart_static -shared -o profiler.so
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

Step 0 (4): Run and collect output without instrumentation
rm -f *.o *~ simple_add
`which nvcc` -o simple_add -Xptxas -v -arch=sm_35 simple_add.cu
./simple_add >golden_stdout.txt 2>golden_stderr.txt
make: *** [Makefile:16: golden] Error 139

I do not understand what has happened. I have 2 GPUs installed, but I want to use nvbitfi on the second one; it seems that nvbitfi tries to use the first one.

The first GPU is Ampere and the second one is Volta. I think Ampere is not supported.

  2. I am seeing some -arch values, none of which are related to Volta. Should I do something here to change the arch?

  3. I want to use this library to work on bit flips in data (e.g., an error in an input matrix). Is nvbitfi useful for this?
