rmldnn's People

Contributors

rhythmbindal, ssbotelh, yashjain-99


rmldnn's Issues

Add a brief tutorial on transfer learning

I'm trying to use rmldnn to classify images from a Kaggle dataset of birds, but I'm having trouble. For datasets this large we normally use transfer learning, which is very simple in TensorFlow: create a base model with pre-trained weights, then add layers after its last layer to produce the desired number of output classes.
However, when I took a pre-trained model (say, VGG16 from the resources) and adjusted the input and output layers to match the dataset, it neither produced any results nor returned any errors.
A tutorial on transfer learning would therefore be extremely helpful.
Kaggle dataset on which I was working: https://www.kaggle.com/datasets/gpiosenka/100-bird-species
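
For reference, the "freeze the base, replace the head" idea described above can be sketched on a Keras-style JSON network description. This is only an illustration under assumptions: the layer names and the toy model are made up, the class count is a placeholder, and whether rmldnn honors the `trainable` flag in this way is not confirmed by this thread.

```python
import copy
import json


def freeze_base_and_resize_head(model, num_classes):
    """Return a copy of a Keras-style model description in which every layer
    is frozen except the last Dense layer, which is resized to num_classes."""
    model = copy.deepcopy(model)  # leave the caller's dict untouched
    layers = model["config"]["layers"]

    # Treat the last Dense layer as the classification head.
    head_index = max(i for i, layer in enumerate(layers)
                     if layer["class_name"] == "Dense")

    for i, layer in enumerate(layers):
        is_head = (i == head_index)
        layer["config"]["trainable"] = is_head
        if is_head:
            layer["config"]["units"] = num_classes
    return model


# Tiny stand-in for a real network file (hypothetical layer names).
example = {
    "class_name": "Functional",
    "config": {
        "name": "toy_model",
        "layers": [
            {"class_name": "Conv2D",
             "config": {"name": "block1_conv1", "trainable": True}},
            {"class_name": "Flatten",
             "config": {"name": "flatten", "trainable": True}},
            {"class_name": "Dense",
             "config": {"name": "predictions", "units": 1000, "trainable": True}},
        ],
    },
}

# 100 is a placeholder; use the real class count of your dataset.
adapted = freeze_base_and_resize_head(example, num_classes=100)
print(json.dumps(adapted, indent=2))
```

Only the head stays trainable, so training updates a small number of parameters while the pre-trained feature extractor is left as loaded.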

ERROR: CUDA error: no kernel image is available for execution on the device

Hello - I'm trying to run rmldnn on the following system. See command & error below.

NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
$ nvidia-smi
Fri Apr 22 10:12:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:3B:00.0 Off |                    0 |
|  0%   38C    P0    73W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          Off  | 00000000:5E:00.0 Off |                    0 |
|  0%   38C    P0    77W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          Off  | 00000000:AF:00.0 Off |                    0 |
|  0%   36C    P0    77W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          Off  | 00000000:D8:00.0 Off |                    0 |
|  0%   36C    P0    74W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Command:  singularity exec --nv ../software/rmldnn.sif mpirun -np 2 -x CUDA_VISIBLE_DEVICES=0,1 rmldnn --config= ./config_inpaint_feature_extraction.json

Error: [2022-Apr-21 22:33:00.365008] *** CUDA available! Will train on GPU ***
[2022-Apr-21 22:33:00.365016] ---------------------------------------------
[2022-Apr-21 22:33:00.477420] ERROR: CUDA error: no kernel image is available for execution on the device
Exception raised from launch_vectorized_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:119 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f1eaff9a569 in /usr/local/libtorch/1.7.0/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::fill_kernel_cuda, 4u>, float (), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::fill_kernel_cuda, 4u>, float (), float> const&) + 0x862 (0x7f1e6127bf82 in /usr/local/libtorch/1.7.0/lib/libtorch_cuda.so)
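
A note for readers hitting the same message: "no kernel image is available" generally means the binary contains no GPU code that the device can run. The sketch below encodes the usual CUDA compatibility rule in simplified form (it ignores architecture-specific targets). The A40 has compute capability 8.6; the architecture lists in the example are hypothetical and are not taken from the rmldnn container, so this shows how to reason about the error rather than diagnosing this particular report.

```python
def has_kernel_image(device_cc, compiled_archs):
    """Simplified check of whether a device can run a binary.

    device_cc: (major, minor) compute capability, e.g. (8, 6).
    compiled_archs: strings such as "sm_70" (binary code) or
                    "compute_70" (PTX that can be JIT-compiled).
    """
    major, minor = device_cc
    for arch in compiled_archs:
        kind, _, digits = arch.partition("_")
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm":
            # Binary code runs only on the same major version with an
            # equal or higher minor version.
            if a_major == major and a_minor <= minor:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any equal or newer device.
            if (a_major, a_minor) <= (major, minor):
                return True
    return False


a40 = (8, 6)  # compute capability of the NVIDIA A40

# Hypothetical build without any Ampere code: the A40 cannot run it.
print(has_kernel_image(a40, ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]))

# Hypothetical build that also ships sm_80 binaries: the A40 can run it.
print(has_kernel_image(a40, ["sm_60", "sm_70", "sm_75", "sm_80"]))
```

If PyTorch is installed for Python on the same machine, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` report the two inputs for that PyTorch build.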

Issue during in-painting classification

Hello - I'm working through the tutorial for in-painting and I've successfully completed the feature extraction training.

However, even though the classification training finishes successfully, I see that the loss balloons to a large number. Do you have any sense of what this could mean?

Here is the output from training the classifier:

$ cat rmldnnClass.out | grep Accuracy
[2022-Apr-25 15:28:10.260352] Eval  Epoch [  1/100] : Batch [    1/    1] (Loss: 11.9174) | Accuracy: 0.618
[2022-Apr-25 15:28:26.116163] Eval  Epoch [  2/100] : Batch [    1/    1] (Loss: 71.4548) | Accuracy: 0.634
[2022-Apr-25 15:28:43.123124] Eval  Epoch [  3/100] : Batch [    1/    1] (Loss: 436.4486) | Accuracy: 0.605
[2022-Apr-25 15:28:58.762846] Eval  Epoch [  4/100] : Batch [    1/    1] (Loss: 4738.9883) | Accuracy: 0.601
[2022-Apr-25 15:29:14.939691] Eval  Epoch [  5/100] : Batch [    1/    1] (Loss: 22715.8613) | Accuracy: 0.622
[2022-Apr-25 15:29:30.472585] Eval  Epoch [  6/100] : Batch [    1/    1] (Loss: 117077.8047) | Accuracy: 0.626
[2022-Apr-25 15:29:45.980462] Eval  Epoch [  7/100] : Batch [    1/    1] (Loss: 1560777.3750) | Accuracy: 0.599
[2022-Apr-25 15:30:01.238265] Eval  Epoch [  8/100] : Batch [    1/    1] (Loss: 10272561.0000) | Accuracy: 0.628
[2022-Apr-25 15:30:16.822373] Eval  Epoch [  9/100] : Batch [    1/    1] (Loss: 106860592.0000) | Accuracy: 0.537
[2022-Apr-25 15:30:32.327156] Eval  Epoch [ 10/100] : Batch [    1/    1] (Loss: 492095744.0000) | Accuracy: 0.598
[2022-Apr-25 15:30:38.241459] Eval  Epoch [ 11/100] : Batch [    1/    1] (Loss: 1098736640.0000) | Accuracy: 0.593
[2022-Apr-25 15:30:42.281468] Eval  Epoch [ 12/100] : Batch [    1/    1] (Loss: 1204668032.0000) | Accuracy: 0.619
[2022-Apr-25 15:30:46.724595] Eval  Epoch [ 13/100] : Batch [    1/    1] (Loss: 1348660480.0000) | Accuracy: 0.639
[2022-Apr-25 15:30:51.073770] Eval  Epoch [ 14/100] : Batch [    1/    1] (Loss: 1276815488.0000) | Accuracy: 0.639
[2022-Apr-25 15:30:55.363943] Eval  Epoch [ 15/100] : Batch [    1/    1] (Loss: 1593674496.0000) | Accuracy: 0.639
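
To make the trend easier to inspect, the evaluation lines above can be parsed into (epoch, loss, accuracy) tuples. The following is a minimal sketch: the regular expression is written against the line format shown in this output, and the divergence threshold (`factor=3.0`) is an arbitrary illustrative choice, not an rmldnn setting.

```python
import re

# Matches lines such as:
# "... Eval  Epoch [  1/100] : Batch [    1/    1] (Loss: 11.9174) | Accuracy: 0.618"
EVAL_LINE = re.compile(
    r"Eval\s+Epoch\s+\[\s*(\d+)/\s*(\d+)\].*?"
    r"\(Loss:\s*([0-9.eE+-]+)\)\s*\|\s*Accuracy:\s*([0-9.]+)"
)


def parse_eval_lines(text):
    """Return a list of (epoch, loss, accuracy) tuples found in the log text."""
    rows = []
    for line in text.splitlines():
        match = EVAL_LINE.search(line)
        if match:
            rows.append((int(match.group(1)),
                         float(match.group(3)),
                         float(match.group(4))))
    return rows


def first_divergent_epoch(rows, factor=3.0):
    """Return the first epoch whose loss exceeds `factor` times the loss of
    the previous epoch, or None if no such jump occurs."""
    for (_, prev_loss, _), (epoch, loss, _) in zip(rows, rows[1:]):
        if loss > factor * prev_loss:
            return epoch
    return None


sample = """\
[2022-Apr-25 15:28:10.260352] Eval  Epoch [  1/100] : Batch [    1/    1] (Loss: 11.9174) | Accuracy: 0.618
[2022-Apr-25 15:28:26.116163] Eval  Epoch [  2/100] : Batch [    1/    1] (Loss: 71.4548) | Accuracy: 0.634
[2022-Apr-25 15:28:43.123124] Eval  Epoch [  3/100] : Batch [    1/    1] (Loss: 436.4486) | Accuracy: 0.605
"""

rows = parse_eval_lines(sample)
print(rows)
print(first_divergent_epoch(rows))
```

On these three lines the loss already grows by more than a factor of three between epochs 1 and 2, while the accuracy barely moves.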

Information on training:

[2022-Apr-25 15:27:20.244494] RocketML : dnn
[2022-Apr-25 15:27:20.244561] rocketml 1.0.0 (Linux-5.3.0-1031-azure ) (Apr 14 2022 22:57:24) (git:master rev:b822b0c)
[2022-Apr-25 15:27:20.244570] RocketML : 4 MPI processes
[2022-Apr-25 15:27:20.244591]                     ___        __
[2022-Apr-25 15:27:20.244599]                    /\_ \      /\ \
[2022-Apr-25 15:27:20.244605]  _ __    ___ ___   \//\ \     \_\ \     ___      ___
[2022-Apr-25 15:27:20.244611] /\`'__\ /' __` __`\  \ \ \    /'__ \  /' _ `\  /' _ `\
[2022-Apr-25 15:27:20.244617] \ \ \/  /\ \/\ \/\ \  \_\ \_ /\ \_\ \ /\ \/\ \ /\ \/\ \
[2022-Apr-25 15:27:20.244623]  \ \_\  \ \_\ \_\ \_\ /\____\\ \___,_\\ \_\ \_\\ \_\ \_\
[2022-Apr-25 15:27:20.244628]   \/_/   \/_/\/_/\/_/ \/____/ \/__,_ / \/_/\/_/ \/_/\/_/
[2022-Apr-25 15:27:20.244634]
[2022-Apr-25 15:27:20.244639]             (C) 2022 RocketML, Inc. All rights reserved.
[2022-Apr-25 15:27:20.244645]
[2022-Apr-25 15:27:20.244650] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244656] |                   RocketML Deep Neural Networks                   |
[2022-Apr-25 15:27:20.244661] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244667] | More info: https://github.com/rocketmlhq/rmldnn                   |
[2022-Apr-25 15:27:20.244673] | License  : https://github.com/rocketmlhq/rmldnn/blob/main/LICENSE |
[2022-Apr-25 15:27:20.244678] | Contact  : [email protected]                                   |
[2022-Apr-25 15:27:20.244684] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244689]
[2022-Apr-25 15:27:20.410407] ----------------- Device(s) -----------------
[2022-Apr-25 15:27:20.410433]    CUDA:0
[2022-Apr-25 15:27:20.410441]    CUDA:1
[2022-Apr-25 15:27:20.410447]    CUDA:2
[2022-Apr-25 15:27:20.410452]    CUDA:3
[2022-Apr-25 15:27:20.410458] ---------------------------------------------
[2022-Apr-25 15:27:20.583516] -------------- Gradient reducer -------------
[2022-Apr-25 15:27:20.583552]    Type: oneshot
[2022-Apr-25 15:27:20.584248] ---------------------------------------------
[2022-Apr-25 15:27:20.584259] --------------- Neural Network --------------
[2022-Apr-25 15:27:20.584264]    Model name          : model_2
[2022-Apr-25 15:27:20.584268]    Total parameters    : 23796744
[2022-Apr-25 15:27:20.584388]    Trainable parameters: 262152
[2022-Apr-25 15:27:20.584492]    Num of operations   : 176
[2022-Apr-25 15:27:20.584494] ---------------------------------------------
[2022-Apr-25 15:27:20.586453] Loading model checkpoint from file: ./model_checkpoints/model_checkpoint_100.pt
[2022-Apr-25 15:27:22.166056]    Skipping layer dense_1: not found in model
[2022-Apr-25 15:27:22.166088]    Skipping parameter dense_1.weight: not found in model
[2022-Apr-25 15:27:22.166096]    Skipping parameter dense_1.bias: not found in model
[2022-Apr-25 15:27:29.930647] ------------- TAO configuration -------------
[2022-Apr-25 15:27:29.930667]    Optimization algo:  bqnls
[2022-Apr-25 15:27:29.930671]    Max iterations:     10
[2022-Apr-25 15:27:29.930676]    Max func evals:     4000
[2022-Apr-25 15:27:29.930678]    Absolute tolerance: 1e-08
[2022-Apr-25 15:27:29.930690]    Relative tolerance: 1e-08
[2022-Apr-25 15:27:29.930693]    Line search algo:   more-thuente
[2022-Apr-25 15:27:29.930696] ---------------------------------------------
[2022-Apr-25 15:27:29.930735] -------------------- Loss -------------------
[2022-Apr-25 15:27:29.930744]    Function    : NLL (Negative Log-Likelihood)
[2022-Apr-25 15:27:29.930750]    Weight      : None
[2022-Apr-25 15:27:29.930759]    Ignore index: None
[2022-Apr-25 15:27:29.930764] ---------------------------------------------
[2022-Apr-25 15:27:29.930802] Discovering training input images...
[2022-Apr-25 15:27:34.766006] Pre-loading training input images...
[2022-Apr-25 15:27:44.770813]    62% (ETA 6.1s)
[2022-Apr-25 15:27:50.942811] Discovering training labels...
[2022-Apr-25 15:27:50.956921] Number of class labels: 8
[2022-Apr-25 15:27:50.971304] Discovering test input images...
[2022-Apr-25 15:27:50.986674] Pre-loading test input images...
[2022-Apr-25 15:27:52.942285] Discovering test labels...
[2022-Apr-25 15:27:52.956547] Number of class labels: 8
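
The "Skipping layer dense_1: not found in model" lines in the log above suggest that checkpoint parameters are matched to the model by name. A minimal sketch of that idea, using plain dictionaries, is given below. It is not rmldnn's implementation; the parameter names other than `dense_1.weight` and `dense_1.bias` are hypothetical, as are all values.

```python
def load_matching_parameters(model_params, checkpoint_params):
    """Copy checkpoint values into model_params wherever the names match.

    Returns the list of checkpoint names that were skipped because the
    model has no parameter with that name.
    """
    skipped = []
    for name, value in checkpoint_params.items():
        if name in model_params:
            model_params[name] = value
        else:
            skipped.append(name)
    return skipped


# Checkpoint from the feature-extraction run (hypothetical values).
checkpoint = {
    "conv1.weight": [0.1, 0.2],
    "conv1.bias": [0.0],
    "dense_1.weight": [0.3, 0.4],
    "dense_1.bias": [0.5],
}

# Classifier model: same base, but a different head named "dense_2".
model = {
    "conv1.weight": None,
    "conv1.bias": None,
    "dense_2.weight": None,
    "dense_2.bias": None,
}

skipped = load_matching_parameters(model, checkpoint)
for name in skipped:
    print(f"Skipping parameter {name}: not found in model")
```

Parameters of the new head keep their initial values, which is why only that head needs to be trained afterwards.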

Image Semantics Segmentation training ends abruptly

Description
I'm trying to use rmldnn to perform semantic image segmentation. When I run it, it successfully discovers the input images but then stops abruptly without proceeding to training.

To Reproduce
Steps to reproduce the behavior:

  1. rmldnn docker image version: latest
  2. Configuration file: same as provided in the tutorial
  3. Sample input data file: provided in the tutorial (link: https://rmldnnstorage.blob.core.windows.net/rmldnn-datasets/oxford_pets.tar.gz)
  4. Command run to reproduce the error: sudo docker run -u $(id -u):$(id -g) -v ${PWD}:/home/ubuntu -w /home/ubuntu --rm rocketml/rmldnn:latest rmldnn --config=config_pets_segmentation.json

Expected behavior
It should have started training after discovering the input images, but it didn't.

Desktop:

  • OS: Ubuntu
  • Version: 22.04
  • Docker or Singularity: Docker
  • Version of Docker: 20.10.12
