rocketmlhq / rmldnn
RocketML Deep Neural Networks
Home Page: https://rocketmlhq.github.io/rmldnn/
License: Other
I'm trying to use rmldnn to classify images from a Kaggle dataset of bird species, but I'm running into trouble. For a dataset this large we would normally use transfer learning, and in TensorFlow it's straightforward: we create a base model with pre-trained weights, then append layers after the last layer as needed to get the desired number of output classes.
However, when I added a pre-trained model (say, VGG16 from the resources) and adjusted the input and output layers to match the dataset, it produced no results and returned no errors.
So a tutorial on transfer learning would be extremely helpful.
Kaggle dataset on which I was working: https://www.kaggle.com/datasets/gpiosenka/100-bird-species
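For reference, the TensorFlow workflow described above looks roughly like this. This is a minimal sketch, not rmldnn code; the input size and the class count of 400 are assumptions based on the linked Kaggle dataset and should be adjusted to the actual data:

```python
import tensorflow as tf  # assumes TensorFlow 2.x with Keras

NUM_CLASSES = 400  # number of bird species (assumption; adjust to your dataset)

# Load VGG16 with pre-trained ImageNet weights, dropping its classifier head.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # freeze the pre-trained layers

# Append a new head sized for the target dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

A tutorial showing the rmldnn equivalent of these steps (freezing a pre-trained backbone and replacing the final layers) is what is being requested here.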
Hello - I'm trying to run rmldnn on the following system. See command & error below.
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
$ nvidia-smi
Fri Apr 22 10:12:29 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 Off | 00000000:3B:00.0 Off | 0 |
| 0% 38C P0 73W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:5E:00.0 Off | 0 |
| 0% 38C P0 77W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 Off | 00000000:AF:00.0 Off | 0 |
| 0% 36C P0 77W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 Off | 00000000:D8:00.0 Off | 0 |
| 0% 36C P0 74W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Command: singularity exec --nv ../software/rmldnn.sif mpirun -np 2 -x CUDA_VISIBLE_DEVICES=0,1 rmldnn --config= ./config_inpaint_feature_extraction.json
Error: [2022-Apr-21 22:33:00.365008] *** CUDA available! Will train on GPU ***
[2022-Apr-21 22:33:00.365016] ---------------------------------------------
[2022-Apr-21 22:33:00.477420] ERROR: CUDA error: no kernel image is available for execution on the device
Exception raised from launch_vectorized_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:119 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f1eaff9a569 in /usr/local/libtorch/1.7.0/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::fill_kernel_cuda, 4u>, float (), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::fill_kernel_cuda, 4u>, float (), float> const&) + 0x862 (0x7f1e6127bf82 in /usr/local/libtorch/1.7.0/lib/libtorch_cuda.so)
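For context, "no kernel image is available for execution on the device" usually means the libtorch build inside the container was not compiled with kernels for the GPU's compute capability: the A40 is an Ampere part (sm_86), while the bundled libtorch 1.7.0 typically lacked sm_86 kernels. A diagnostic sketch for comparing the two, assuming a PyTorch install is available for inspection:

```python
import torch  # assumes a PyTorch install comparable to the container's libtorch

# Architectures compiled into this build; the device's arch must be covered
# (or be reachable from the highest PTX target) for kernels to launch.
print("CUDA toolkit:", torch.version.cuda)
print("Compiled for:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Device reports sm_{major}{minor}")  # an A40 reports sm_86
```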
Hello - I'm working through the tutorial for in-painting and I've successfully completed the feature extraction training.
However, even though the classification training finishes successfully, I see that the loss balloons to a large number. Do you have any sense of what this could mean?
Here is the output from training the classifier:
$ cat rmldnnClass.out | grep Accuracy
[2022-Apr-25 15:28:10.260352] Eval Epoch [ 1/100] : Batch [ 1/ 1] (Loss: 11.9174) | Accuracy: 0.618
[2022-Apr-25 15:28:26.116163] Eval Epoch [ 2/100] : Batch [ 1/ 1] (Loss: 71.4548) | Accuracy: 0.634
[2022-Apr-25 15:28:43.123124] Eval Epoch [ 3/100] : Batch [ 1/ 1] (Loss: 436.4486) | Accuracy: 0.605
[2022-Apr-25 15:28:58.762846] Eval Epoch [ 4/100] : Batch [ 1/ 1] (Loss: 4738.9883) | Accuracy: 0.601
[2022-Apr-25 15:29:14.939691] Eval Epoch [ 5/100] : Batch [ 1/ 1] (Loss: 22715.8613) | Accuracy: 0.622
[2022-Apr-25 15:29:30.472585] Eval Epoch [ 6/100] : Batch [ 1/ 1] (Loss: 117077.8047) | Accuracy: 0.626
[2022-Apr-25 15:29:45.980462] Eval Epoch [ 7/100] : Batch [ 1/ 1] (Loss: 1560777.3750) | Accuracy: 0.599
[2022-Apr-25 15:30:01.238265] Eval Epoch [ 8/100] : Batch [ 1/ 1] (Loss: 10272561.0000) | Accuracy: 0.628
[2022-Apr-25 15:30:16.822373] Eval Epoch [ 9/100] : Batch [ 1/ 1] (Loss: 106860592.0000) | Accuracy: 0.537
[2022-Apr-25 15:30:32.327156] Eval Epoch [ 10/100] : Batch [ 1/ 1] (Loss: 492095744.0000) | Accuracy: 0.598
[2022-Apr-25 15:30:38.241459] Eval Epoch [ 11/100] : Batch [ 1/ 1] (Loss: 1098736640.0000) | Accuracy: 0.593
[2022-Apr-25 15:30:42.281468] Eval Epoch [ 12/100] : Batch [ 1/ 1] (Loss: 1204668032.0000) | Accuracy: 0.619
[2022-Apr-25 15:30:46.724595] Eval Epoch [ 13/100] : Batch [ 1/ 1] (Loss: 1348660480.0000) | Accuracy: 0.639
[2022-Apr-25 15:30:51.073770] Eval Epoch [ 14/100] : Batch [ 1/ 1] (Loss: 1276815488.0000) | Accuracy: 0.639
[2022-Apr-25 15:30:55.363943] Eval Epoch [ 15/100] : Batch [ 1/ 1] (Loss: 1593674496.0000) | Accuracy: 0.639
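One possible reading of the numbers above: accuracy depends only on the argmax of the predictions, while NLL (the loss configured below) penalizes confidence on the wrong class without bound, so a model that stays correct on roughly 60% of samples but grows extremely confident on the wrong 40% will show an exploding loss at a flat accuracy. A minimal numeric illustration in pure Python (not rmldnn code):

```python
import math

def nll(prob_true_class):
    """Negative log-likelihood of the probability assigned to the true class."""
    return -math.log(prob_true_class)

# A correct, confident prediction contributes almost nothing to the loss...
print(nll(0.99))  # ~0.01

# ...while a wrong prediction that grows more confident blows the loss up,
# even though the argmax (and hence the accuracy) is unchanged:
for p in (1e-2, 1e-6, 1e-12):  # probability assigned to the true class shrinks
    print(nll(p))
```

Loss growing by orders of magnitude per epoch like this often points at diverging weights, e.g. an optimizer step that is too aggressive for the problem, or a loss applied to raw outputs rather than log-probabilities.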
Information on training:
[2022-Apr-25 15:27:20.244494] RocketML : dnn
[2022-Apr-25 15:27:20.244561] rocketml 1.0.0 (Linux-5.3.0-1031-azure ) (Apr 14 2022 22:57:24) (git:master rev:b822b0c)
[2022-Apr-25 15:27:20.244570] RocketML : 4 MPI processes
[2022-Apr-25 15:27:20.244591] ___ __
[2022-Apr-25 15:27:20.244599] /\_ \ /\ \
[2022-Apr-25 15:27:20.244605] _ __ ___ ___ \//\ \ \_\ \ ___ ___
[2022-Apr-25 15:27:20.244611] /\`'__\ /' __` __`\ \ \ \ /'__ \ /' _ `\ /' _ `\
[2022-Apr-25 15:27:20.244617] \ \ \/ /\ \/\ \/\ \ \_\ \_ /\ \_\ \ /\ \/\ \ /\ \/\ \
[2022-Apr-25 15:27:20.244623] \ \_\ \ \_\ \_\ \_\ /\____\\ \___,_\\ \_\ \_\\ \_\ \_\
[2022-Apr-25 15:27:20.244628] \/_/ \/_/\/_/\/_/ \/____/ \/__,_ / \/_/\/_/ \/_/\/_/
[2022-Apr-25 15:27:20.244634]
[2022-Apr-25 15:27:20.244639] (C) 2022 RocketML, Inc. All rights reserved.
[2022-Apr-25 15:27:20.244645]
[2022-Apr-25 15:27:20.244650] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244656] | RocketML Deep Neural Networks |
[2022-Apr-25 15:27:20.244661] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244667] | More info: https://github.com/rocketmlhq/rmldnn |
[2022-Apr-25 15:27:20.244673] | License : https://github.com/rocketmlhq/rmldnn/blob/main/LICENSE |
[2022-Apr-25 15:27:20.244678] | Contact : [email protected] |
[2022-Apr-25 15:27:20.244684] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244689]
[2022-Apr-25 15:27:20.410407] ----------------- Device(s) -----------------
[2022-Apr-25 15:27:20.410433] CUDA:0
[2022-Apr-25 15:27:20.410441] CUDA:1
[2022-Apr-25 15:27:20.410447] CUDA:2
[2022-Apr-25 15:27:20.410452] CUDA:3
[2022-Apr-25 15:27:20.410458] ---------------------------------------------
[2022-Apr-25 15:27:20.583516] -------------- Gradient reducer -------------
[2022-Apr-25 15:27:20.583552] Type: oneshot
[2022-Apr-25 15:27:20.584248] ---------------------------------------------
[2022-Apr-25 15:27:20.584259] --------------- Neural Network --------------
[2022-Apr-25 15:27:20.584264] Model name : model_2
[2022-Apr-25 15:27:20.584268] Total parameters : 23796744
[2022-Apr-25 15:27:20.584388] Trainable parameters: 262152
[2022-Apr-25 15:27:20.584492] Num of operations : 176
[2022-Apr-25 15:27:20.584494] ---------------------------------------------
[2022-Apr-25 15:27:20.586453] Loading model checkpoint from file: ./model_checkpoints/model_checkpoint_100.pt
[2022-Apr-25 15:27:22.166056] Skipping layer dense_1: not found in model
[2022-Apr-25 15:27:22.166088] Skipping parameter dense_1.weight: not found in model
[2022-Apr-25 15:27:22.166096] Skipping parameter dense_1.bias: not found in model
[2022-Apr-25 15:27:29.930647] ------------- TAO configuration -------------
[2022-Apr-25 15:27:29.930667] Optimization algo: bqnls
[2022-Apr-25 15:27:29.930671] Max iterations: 10
[2022-Apr-25 15:27:29.930676] Max func evals: 4000
[2022-Apr-25 15:27:29.930678] Absolute tolerance: 1e-08
[2022-Apr-25 15:27:29.930690] Relative tolerance: 1e-08
[2022-Apr-25 15:27:29.930693] Line search algo: more-thuente
[2022-Apr-25 15:27:29.930696] ---------------------------------------------
[2022-Apr-25 15:27:29.930735] -------------------- Loss -------------------
[2022-Apr-25 15:27:29.930744] Function : NLL (Negative Log-Likelihood)
[2022-Apr-25 15:27:29.930750] Weight : None
[2022-Apr-25 15:27:29.930759] Ignore index: None
[2022-Apr-25 15:27:29.930764] ---------------------------------------------
[2022-Apr-25 15:27:29.930802] Discovering training input images...
[2022-Apr-25 15:27:34.766006] Pre-loading training input images...
[2022-Apr-25 15:27:44.770813] 62% (ETA 6.1s)
[2022-Apr-25 15:27:50.942811] Discovering training labels...
[2022-Apr-25 15:27:50.956921] Number of class labels: 8
[2022-Apr-25 15:27:50.971304] Discovering test input images...
[2022-Apr-25 15:27:50.986674] Pre-loading test input images...
[2022-Apr-25 15:27:52.942285] Discovering test labels...
[2022-Apr-25 15:27:52.956547] Number of class labels: 8
Description
I'm trying to use rmldnn to perform semantic segmentation of images, but when I run it, it successfully discovers the input images and then stops abruptly, never reaching the training step.
To Reproduce
Steps to reproduce the behavior: