
pytorch-direct_dgl's Introduction

PyTorch-Direct

Introduction

PyTorch-Direct adds a zero-copy access capability for the GPU on top of the existing PyTorch DNN framework. Zero-copy access significantly increases data transfer efficiency over PCIe when the targeted data is scattered in host memory. This is especially useful when the input data cannot fit into GPU memory ahead of training and pieces of it need to be transferred during training. With PyTorch-Direct, zero-copy access is enabled by declaring a "Unified Tensor" on top of an existing CPU tensor. The current implementation of PyTorch-Direct is based on the nightly version of PyTorch 1.8.0.

The UnifiedTensor was once introduced into DGL in https://github.com/dmlc/dgl/commit/905c0aa578bca6f51ac2ff453c17e579d5a1b0fb. It was later replaced by the combination of the pin_memory_inplace and gather_pinned_tensor_rows functions under dgl.utils. See dgl/pin_memory.py for reference.

Installation

Env

Python >= 3.8
DGL >= 0.6.1

PyTorch

Since we modify the source code of PyTorch, our implementation cannot be installed through well-known tools like pip. To compile and install the modified version of our code, please follow this.
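
After installation, it can be useful to confirm that the interpreter is actually picking up the modified build rather than a stock PyTorch. The following is a minimal sketch, not part of this repository: the helper name supports_unified_device is hypothetical, and it assumes that a PyTorch-Direct build accepts "unified" in torch.device(...) the same way it accepts it in .to(device="unified"). A stock PyTorch raises a RuntimeError for an unknown device string.

```python
def supports_unified_device():
    """Return True if the installed torch build recognizes the "unified" device.

    Hypothetical helper for illustration; assumes a PyTorch-Direct build
    parses "unified" as a device string.
    """
    try:
        import torch
    except ImportError:
        # No PyTorch installed at all.
        return False
    try:
        torch.device("unified")
    except RuntimeError:
        # Stock PyTorch rejects unknown device type strings.
        return False
    return True


print("unified device available:", supports_unified_device())
```

If this prints False, the Python environment is most likely importing a PyTorch installed from pip or conda instead of the locally compiled one.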

DGL Installation

We do not modify the source of DGL, so the users can either install DGL using pip or by compiling from the source code.

We support DGL 0.6.1 and 0.7.1.

DGL can be installed with pip, for example:

pip install https://data.dgl.ai/wheels/dgl_cu113-0.7.1-cp38-cp38-manylinux1_x86_64.whl

Refer to https://data.dgl.ai/wheels/repo.html for the wheel that matches your environment.

DGL can also be built from source. First, update the submodule and build the library:

git submodule update --init --recursive
cd dgl/
sudo apt-get update
sudo apt-get install -y build-essential python3-dev make cmake

mkdir build
cd build
cmake -DUSE_CUDA=ON ..
make -j4

Note that pip automatically selects the latest SciPy, which requires Python >= 3.9. With Python 3.8, install an older version of SciPy first, for example: pip install scipy==1.7.0

After that, install the DGL Python package:

cd ../python
python setup.py install

For more details, please follow https://docs.dgl.ai/en/0.6.x/install/index.html

Use case

In the original PyTorch, scattered data in host memory is made available to the GPU as in the following example:

# input_tensor: A given input 2D tensor in CPU
# index: A tensor which has indices of targets
# output_tensor: An output tensor which should be located in GPU

output_tensor = input_tensor[index].to(device="cuda:0")

In PyTorch-Direct, the code can be rewritten as follows:

# input_tensor: A given input 2D tensor in CPU
# index: A tensor which has indices of targets
# output_tensor: An output tensor which should be located in GPU

torch.cuda.set_device("cuda:0")
unified_tensor = input_tensor.to(device="unified")

output_tensor = unified_tensor[index]

The unified tensor does not physically copy any data; it only creates a mapping for the GPU. Therefore, in the current implementation, if the original CPU tensor disappears, a unified tensor created from it can no longer be accessed.

For this reason, the following coding practice should be avoided for now:

output_tensor = torch.randn([100,100], device="cpu").to(device="unified")

The temporary tensor created by the randn function disappears because it is not assigned to any variable. The unified tensor created by the code above therefore has no physical data behind it. The code should be rewritten as follows:

temp_tensor = torch.randn([100,100], device="cpu")
output_tensor = temp_tensor.to(device="unified")

In this case the CPU tensor is bound to temp_tensor and stays alive, so the unified tensor created from it can be accessed safely.
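
The lifetime rule above can be illustrated without a GPU. The following is a toy model only, not PyTorch-Direct code: HostBuffer and MappedView are hypothetical stand-ins for the CPU tensor and the unified tensor, and the weak reference mimics a mapping that does not own the underlying data. The behavior shown relies on CPython's reference counting, which frees an unreferenced temporary immediately.

```python
import weakref


class HostBuffer:
    """Stand-in for a CPU tensor that owns its storage."""

    def __init__(self, data):
        self.data = list(data)


class MappedView:
    """Stand-in for a unified tensor: maps the buffer without owning it."""

    def __init__(self, buffer):
        # A weak reference does not keep the buffer alive.
        self._ref = weakref.ref(buffer)

    def is_valid(self):
        return self._ref() is not None

    def read(self, index):
        buffer = self._ref()
        if buffer is None:
            raise RuntimeError("backing host buffer was freed")
        return buffer.data[index]


# Unsafe pattern: the temporary buffer is released once the expression ends.
dangling = MappedView(HostBuffer([1.0, 2.0, 3.0]))
print(dangling.is_valid())

# Safe pattern: the named variable keeps the buffer alive.
host = HostBuffer([1.0, 2.0, 3.0])
view = MappedView(host)
print(view.is_valid(), view.read(1))
```

In real code the same idea applies: keep a named reference to the CPU tensor for as long as the unified tensor is in use.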

GNN Example

Basics

For a more practical example, we perform GNN training with the well-known Deep Graph Library (DGL). The example code is located in the dgl submodule of this repository, at <current_path>/dgl/examples/pytorch/graphsage/train_sampling_pytorch_direct.py. To compare with the original PyTorch approach, users can run the unmodified DGL implementation in <current_path>/dgl/examples/pytorch/graphsage/train_sampling.py. By default, the DGL example always tries to load the whole dataset into GPU memory. Therefore, to compare host memory access performance, the user needs to add the --data-cpu argument to the DGL example.

Using Multi-Processing Service (MPS)

To further increase the efficiency of PyTorch-Direct in GNN training, the CUDA Multi-Process Service (MPS) can be used. The purpose of MPS here is to allocate a small portion of the GPU resources to the zero-copy accesses while leaving the rest for the training process. MPS can be used in our example GNN code by passing the --mps x,y argument. Here, x is the GPU portion given to the zero-copy kernel and y is the GPU portion given to the training process. For the NVIDIA RTX 3090 GPU we used, we set --mps 10,90. Using MPS requires running an external utility called nvidia-cuda-mps-control. This utility should be available as long as CUDA is installed. Running nvidia-cuda-mps-control does not require root permission, as the restriction only applies to the users who are using it. In <current_path>/dgl/examples/pytorch/graphsage/utils.py, we added some helper code for running MPS. The functions declared in this file are used inside <current_path>/dgl/examples/pytorch/graphsage/train_sampling_pytorch_direct.py.
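
As an illustration of how an x,y value such as 10,90 could be turned into per-process GPU limits, here is a minimal sketch. It is not the code in utils.py: the names parse_mps and mps_env are hypothetical, and the sketch assumes the limits are applied through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, which CUDA MPS reads when a client process creates its CUDA context.

```python
import argparse
import os


def parse_mps(value):
    """Parse "x,y" into two integer GPU percentages (hypothetical helper)."""
    parts = value.split(",")
    if len(parts) != 2:
        raise argparse.ArgumentTypeError("expected two comma-separated values, e.g. 10,90")
    try:
        zero_copy, training = (int(p) for p in parts)
    except ValueError:
        raise argparse.ArgumentTypeError("both values must be integers") from None
    for pct in (zero_copy, training):
        if not 0 < pct <= 100:
            raise argparse.ArgumentTypeError("each value must be in the range 1..100")
    return zero_copy, training


def mps_env(percentage):
    """Build an environment for a child process limited to a share of the GPU."""
    env = dict(os.environ)
    # Must be set before the child process creates its CUDA context.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return env


parser = argparse.ArgumentParser()
parser.add_argument("--mps", type=parse_mps, default=None)
args = parser.parse_args(["--mps", "10,90"])

zero_copy_pct, training_pct = args.mps
print(zero_copy_pct, training_pct)
print(mps_env(zero_copy_pct)["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"])
```

The environment returned by mps_env could then be passed to, for example, subprocess.Popen(..., env=...) when launching the process that runs the zero-copy kernel, with a second environment for the training process.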

Quick Evaluation

Reddit
In this chart, we show a GraphSAGE training result for the Reddit dataset. Since the Reddit dataset is small and can be placed either in host memory or in GPU memory, we tested both cases. For the evaluation, we used an AMD Threadripper 3960X CPU and an NVIDIA RTX 3090 GPU. As we can observe, with a faster interconnect the benefit of PyTorch-Direct is greater, and it can nearly reach the performance of the all-in-GPU-memory case.

Citation

@article{min2021large,
  title={Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture},
  author={Min, Seung Won and Wu, Kun and Huang, Sitao and Hidayeto{\u{g}}lu, Mert and Xiong, Jinjun and Ebrahimi, Eiman and Chen, Deming and Hwu, Wen-mei},
  journal={arXiv preprint arXiv:2103.03330},
  year={2021}
}

pytorch-direct_dgl's People

Contributors

davidmin7, k-wu, lukelin-web


pytorch-direct_dgl's Issues

The role of zero-copy

Dear Kun Wu,

I hope this message finds you well.

I have a few questions regarding your paper, "Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture." When the GPU starts training the GNN, is all the data already copied onto the GPU? Is the use of zero-copy technology intended to disperse data reading from the CPU to the GPU to avoid the traditional DMA's need to perform gather operations? In other words, is zero-copy used to manage the transfer of data to the GPU more efficiently, rather than having the GNN operations directly read from the CPU DRAM during training?

Thank you for your time and clarification.

Best regards,
Changyuan

Generating wikipedia and Amazon datasets in DGL form.

Hello @K-Wu ,

I downloaded the wikipedia dataset from konect.cc and Amazon dataset from the following website:
https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews

I am having a hard time preparing the labels and node-embeddings files required as the input format, as one or the other is missing in the above links and it needs to be created manually.

Could you provide me links to the prepared datasets or a methodology that was used to generate all the above in your experiments?

Thanks in advance.

multi gpu

Can we use this code with multiple GPUs? If so, could you give some examples in the README? Thanks.

Say there are 1 billion nodes and 60 billion edges, so the matrix will be about 500 GB, while an A100 has 80 GB of memory.

Hello, why is it displayed that unified cannot be recognized

Using backend: pytorch
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/csarch/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/csarch/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/csarch/pytorch-direct/dgl/examples/pytorch/graphsage/train_sampling_pytorch_direct.py", line 124, in producer
    train_nfeat = train_nfeat.to(device="unified")
RuntimeError: Expected one of cpu, cuda, xpu, mkldnn, opengl, opencl, ideep, hip, msnpu, mlc, xla, vulkan, meta, hpu device type at start of device string: unified

error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}

I'm trying to build Pytorch from source, but during installation facing the below issue:

[1/336] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o                                                                                                    
FAILED: caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o                                                                                                                        
/usr/bin/c++ -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dcaffe2_pybind11_state_EXPORTS -I/mnt/utkrisht/pytorch-direct/build/aten/src -I/mnt/utkrisht/pytorch-direct/aten/src -I/mnt/utkrisht/pytorch-direct/build -I/mnt/utkrisht/pytorch-direct -I/mnt/utkrisht/pytorch-direct/cmake/../third_party/benchmark/include -I/mnt/utkrisht/pytorch-direct/build/caffe2/contrib/aten -I/mnt/utkrisht/pytorch-direct/third_party/onnx -I/mnt/utkrisht/pytorch-direct/build/third_party/onnx -I/mnt/utkrisht/pytorch-direct/third_party/foxi -I/mnt/utkrisht/pytorch-direct/build/third_party/foxi -I/mnt/utkrisht/pytorch-direct/build/caffe2/aten/src/TH -I/mnt/utkrisht/pytorch-direct/aten/src/TH -I/mnt/utkrisht/pytorch-direct/build/caffe2/aten/src -I/mnt/utkrisht/pytorch-direct/aten/../third_party/catch/single_include -I/mnt/utkrisht/pytorch-direct/aten/src/ATen/.. -I/mnt/utkrisht/pytorch-direct/build/caffe2/aten/src/ATen -I/mnt/utkrisht/pytorch-direct/third_party/miniz-2.0.8 -I/mnt/utkrisht/pytorch-direct/caffe2/core/nomnigraph/include -I/mnt/utkrisht/pytorch-direct/torch/csrc/api -I/mnt/utkrisht/pytorch-direct/torch/csrc/api/include -I/mnt/utkrisht/pytorch-direct/c10/.. -I/mnt/utkrisht/pytorch-direct/build/third_party/ideep/mkl-dnn/include -I/mnt/utkrisht/pytorch-direct/third_party/ideep/mkl-dnn/src/../include -I/mnt/utkrisht/pytorch-direct/c10/cuda/../.. -isystem /mnt/utkrisht/pytorch-direct/build/third_party/gloo -isystem /mnt/utkrisht/pytorch-direct/cmake/../third_party/gloo -isystem /mnt/utkrisht/pytorch-direct/cmake/../third_party/googletest/googlemock/include -isystem /mnt/utkrisht/pytorch-direct/cmake/../third_party/googletest/googletest/include -isystem /mnt/utkrisht/pytorch-direct/third_party/protobuf/src -isystem /mnt/utkrisht/miniconda3/envs/pytorch-direct/include -isystem /mnt/utkrisht/pytorch-direct/third_party/gemmlowp -isystem /mnt/utkrisht/pytorch-direct/third_party/neon2sse -isystem /mnt/utkrisht/pytorch-direct/third_party/XNNPACK/include -isystem /mnt/utkrisht/pytorch-direct/third_party -isystem /mnt/utkrisht/pytorch-direct/cmake/../third_party/eigen -isystem /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11 -isystem /mnt/utkrisht/miniconda3/envs/pytorch-direct/lib/python3.11/site-packages/numpy/core/include -isystem /mnt/utkrisht/pytorch-direct/cmake/../third_party/pybind11/include -isystem /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -isystem /usr/local/cuda-11.7/include -isystem /mnt/utkrisht/pytorch-direct/third_party/ideep/mkl-dnn/include -isystem /mnt/utkrisht/pytorch-direct/third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -fPIC -fvisibility=hidden -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -pthread -MD -MT caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o -MF caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o.d -o caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o -c /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc
In file included from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h: In function ‘std::string pybind11::detail::error_string()’:                                                                       
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:446:36: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                                
  446 |                 "  " + handle(frame->f_code->co_filename).cast<std::string>() +                                                                                                                      
      |                                    ^~                                                                                                                                                                
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,                                                                                                
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,                                                                                                       
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                       
   22 | typedef struct _frame PyFrameObject;                                                                                                                                                                 
      |                ^~~~~~                                                                                                                                                                                
In file included from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:446:75: error: expected primary-expression before ‘>’ token                                                                        
  446 |                 "  " + handle(frame->f_code->co_filename).cast<std::string>() +                                                                                                                      
      |                                                                           ^                                                                                                                          
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:446:77: error: expected primary-expression before ‘)’ token                                                                        
  446 |                 "  " + handle(frame->f_code->co_filename).cast<std::string>() +                                                                                                                      
      |                                                                             ^                                                                                                                        
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:448:29: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                                
  448 |                 handle(frame->f_code->co_name).cast<std::string>() + "\n";                                                                                                                           
      |                             ^~                                                                                                                                                                       
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,                                                                                                
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,                                                                                                       
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                       
   22 | typedef struct _frame PyFrameObject;             
In file included from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:448:64: error: expected primary-expression before ‘>’ token                                                                        
  448 |                 handle(frame->f_code->co_name).cast<std::string>() + "\n";                                                                                                                           
      |                                                                ^                                                                                                                                     
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:448:66: error: expected primary-expression before ‘)’ token                                                                        
  448 |                 handle(frame->f_code->co_name).cast<std::string>() + "\n";                                                                                                                           
      |                                                                  ^                                                                                                                                   
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:449:26: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                                
  449 |             frame = frame->f_back;                                                                                                                                                                   
      |                          ^~                                                                                                                                                                          
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,                                                                                                
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,                                                                                                       
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                       
   22 | typedef struct _frame PyFrameObject;                                                                                                                                                                 
      |                ^~~~~~                                                                                                                                                                                
In file included from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h: In function ‘pybind11::function pybind11::detail::get_type_override(const void*, const pybind11::detail::type_info*, const char*)’:
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:2232:49: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘frame’; did you mean ‘cframe’?                         
 2232 |     PyFrameObject *frame = PyThreadState_Get()->frame;                                                                                                                                               
      |                                                 ^~~~~                                                                                                                                                
      |                                                 cframe                                                                                                                                               
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:2233:41: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                           
 2233 |     if (frame && (std::string) str(frame->f_code->co_name) == name &&                                                                                                                                
      |                                         ^~                                                                                                                                                           
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,                                                                                                
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,                                                                                                       
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                       
   22 | typedef struct _frame PyFrameObject;                                                                                                                                                                 
      |                ^~~~~~                                                                                                                                                                                
In file included from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:2234:14: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                           
 2234 |         frame->f_code->co_argcount > 0) {      
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,                                                                                                
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,                                                                                                       
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                       
   22 | typedef struct _frame PyFrameObject;                                                                                                                                                                 
      |                ^~~~~~                                                                                                                                                                                
In file included from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:2237:18: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                           
 2237 |             frame->f_locals, PyTuple_GET_ITEM(frame->f_code->co_varnames, 0));                                                                                                                       
      |                  ^~                                                                                                                                                                                  
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,                                                                                                
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,                                                                                                       
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,                                                                                                          
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,                                                                                                      
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,                                                                                                                  
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:                                                                                                                 
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                       
   22 | typedef struct _frame PyFrameObject;                                                                                                                                                                 
      |                ^~~~~~                                                                                                                                                                                
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:38,                                                                                                           
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:
/mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:2237:30: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}
 2237 |             frame->f_locals, PyTuple_GET_ITEM(frame->f_code->co_varnames, 0));
      |                              ^~~~~~~~~~~~~~~~
In file included from /mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/Python.h:42,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/detail/common.h:122,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pytypes.h:12,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/cast.h:13,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/attr.h:13,
                 from /mnt/utkrisht/pytorch-direct/third_party/pybind11/include/pybind11/pybind11.h:45,
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.h:3,
                 from /mnt/utkrisht/pytorch-direct/caffe2/python/pybind_state_registry.cc:1:
/mnt/utkrisht/miniconda3/envs/pytorch-direct/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}
   22 | typedef struct _frame PyFrameObject;
      |                ^~~~~~

  • Cython is at the latest version in my conda environment
$ cython -V
Cython version 0.29.35

$ which cython
/mnt/utkrisht/miniconda3/envs/pytorch-direct/bin/cython

  • pybind11 is installed
pybind11                2.10.4

  • I have tried installing with Python 3.8 but still face the same issue.

(I'm trying to install a specific commit because I am setting up PyTorch-Direct, a functionality available only in that specific commit of PyTorch.)
Can someone please suggest how to work around this issue?
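
The log above shows headers being pulled from `include/python3.11`. Since CPython 3.11, the layout of `PyFrameObject` is no longer exposed in the public headers, which is consistent with the "invalid use of incomplete type" errors on `frame->f_code`. A minimal sketch for checking which interpreter and header directory a build started from the current environment would see is given below; the helper names `describe_build_python` and `frame_struct_is_opaque` are hypothetical and not part of PyTorch or pybind11.

```python
import sys
import sysconfig


def describe_build_python():
    """Report the interpreter and the header directory a C extension build would use."""
    return {
        "version": tuple(sys.version_info[:3]),
        "executable": sys.executable,
        "include_dir": sysconfig.get_paths()["include"],
    }


def frame_struct_is_opaque(version=None):
    """Return True for CPython versions where PyFrameObject is an incomplete type.

    Assumption: the cutoff is 3.11, the release that moved the struct layout
    out of the public headers.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor >= (3, 11)


info = describe_build_python()
print(info)
print("PyFrameObject opaque for this interpreter:", frame_struct_is_opaque())
```

If the reported `include_dir` still points at a Python 3.11 directory after switching to a Python 3.8 environment, the build is probably not using the interpreter that was intended.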

Process deadlocked with error 'CUDA driver initialization failed, you might not have a CUDA gpu.'


$ python train_sampling_pytorch_direct.py

/mnt/utkrisht/miniconda3/envs/zc/lib/python3.10/site-packages/dgl-0.9-py3.10-linux-x86_64.egg/dgl/base.py:45: DGLWarning: DEPRECATED: DGLGraph and DGLHeteroGraph have been merged in v0.5.
        dgl.as_heterograph will do nothing and can be removed safely in all cases.
  return warnings.warn(message, category=category, stacklevel=1)

Process Process-2:
Traceback (most recent call last):
  File "/mnt/utkrisht/miniconda3/envs/zc/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/mnt/utkrisht/miniconda3/envs/zc/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/utkrisht/pytorch-direct_dgl/dgl/examples/pytorch/graphsage/utils.py", line 38, in decorated_function
    raise exception.__class__(trace)
RuntimeError: Traceback (most recent call last):
  File "/mnt/utkrisht/pytorch-direct_dgl/dgl/examples/pytorch/graphsage/utils.py", line 26, in _queue_result
    res = func(*args, **kwargs)
  File "/mnt/utkrisht/pytorch-direct_dgl/dgl/examples/pytorch/graphsage/train_sampling_pytorch_direct.py", line 194, in run
    th.cuda.set_device(device)
  File "/mnt/utkrisht/miniconda3/envs/zc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 257, in set_device
    torch._C._cuda_setDevice(device)
  File "/mnt/utkrisht/miniconda3/envs/zc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 166, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

  • While running the example scripts, the program prints the above error and then hangs indefinitely.
  • I went through the related GitHub issue (pytorch/pytorch#17199), but I see that thread_wrap_func() has already been implemented to work around the deadlocking issue.
  • Can someone please look into this?
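
For readers unfamiliar with the wrapper mentioned above, the following is a minimal sketch of the pattern that the traceback suggests (`_queue_result` calling `func`, and `decorated_function` re-raising with the formatted trace). It is reconstructed from the frame names in the log and is an assumption, not a copy of the repository's `utils.py`; the two worker functions are hypothetical.

```python
import queue
import threading
import traceback
from functools import wraps


def thread_wrap_func(func):
    """Run func in a helper thread and re-raise any exception in the caller."""

    @wraps(func)
    def decorated_function(*args, **kwargs):
        results = queue.Queue()

        def _queue_result():
            exception, trace, res = None, None, None
            try:
                res = func(*args, **kwargs)
            except Exception as e:
                exception = e
                trace = traceback.format_exc()
            # Always hand something back so the caller never blocks forever.
            results.put((res, exception, trace))

        threading.Thread(target=_queue_result, daemon=True).start()
        res, exception, trace = results.get()
        if exception is None:
            return res
        # Re-raise the same exception type, using the formatted trace as message.
        raise exception.__class__(trace)

    return decorated_function


@thread_wrap_func
def ok_worker(x):
    # Hypothetical worker that succeeds.
    return x + 1


@thread_wrap_func
def failing_worker():
    # Hypothetical worker that simulates the CUDA initialization failure.
    raise RuntimeError("CUDA driver initialization failed (simulated)")


print(ok_worker(1))
try:
    failing_worker()
except RuntimeError as err:
    print("worker failed:", type(err).__name__)
```

A wrapper like this only surfaces the exception inside the process that ran the function. Whether the parent or sibling processes then exit depends on how they wait on each other.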

Environment details:

Tesla V100 (16 GB)
Python 3.10.11
dgl-0.9-py3.10

Does the input tensor have to be from CPU?

Hi,
Thank you for the interesting work.

In your README examples, you mention this:
input_tensor: A given input 2D tensor in CPU

But in the paper, you also mention that very large GNN models can run out of memory (OOM). In that case, what if we store the node features on NVMe? Do you have a simple example for that? Does the library then bypass the CPU and transfer directly to the GPU, like GPUDirect or DALI? Please let me know.
