rocketmlhq / rmldnn
RocketML Deep Neural Networks
Home Page: https://rocketmlhq.github.io/rmldnn/
License: Other
I'm trying to use rmldnn to classify images from a Kaggle dataset of bird species, but I'm running into trouble. For a dataset this large we would normally use transfer learning, and in TensorFlow it's straightforward: we create a base model with pre-trained weights, then append layers after the last layer as needed to get the desired number of output classes.
However, when I added a pre-trained model (say, VGG16 from the resources) and adjusted the input and output layers to match the dataset, it produced no results and returned no errors.
So a tutorial on transfer learning would be extremely helpful.
Kaggle dataset on which I was working: https://www.kaggle.com/datasets/gpiosenka/100-bird-species
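For reference, the TensorFlow workflow described above looks roughly like this. This is a minimal sketch, not rmldnn code; the input size and the class count of 400 are assumptions based on the linked Kaggle dataset and should be adjusted to the actual data:

```python
import tensorflow as tf  # assumes TensorFlow 2.x with Keras

NUM_CLASSES = 400  # number of bird species (assumption; adjust to your dataset)

# Load VGG16 with pre-trained ImageNet weights, dropping its classifier head.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # freeze the pre-trained layers

# Append a new head sized for the target dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

A tutorial showing the rmldnn equivalent of these steps (freezing a pre-trained backbone and replacing the final layers) is what is being requested here.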
Hello - I'm trying to run rmldnn on the following system. See command & error below.
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
$ nvidia-smi
Fri Apr 22 10:12:29 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 Off | 00000000:3B:00.0 Off | 0 |
| 0% 38C P0 73W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:5E:00.0 Off | 0 |
| 0% 38C P0 77W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 Off | 00000000:AF:00.0 Off | 0 |
| 0% 36C P0 77W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 Off | 00000000:D8:00.0 Off | 0 |
| 0% 36C P0 74W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Command: singularity exec --nv ../software/rmldnn.sif mpirun -np 2 -x CUDA_VISIBLE_DEVICES=0,1 rmldnn --config= ./config_inpaint_feature_extraction.json
Error: [2022-Apr-21 22:33:00.365008] *** CUDA available! Will train on GPU ***
[2022-Apr-21 22:33:00.365016] ---------------------------------------------
[2022-Apr-21 22:33:00.477420] ERROR: CUDA error: no kernel image is available for execution on the device
Exception raised from launch_vectorized_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:119 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f1eaff9a569 in /usr/local/libtorch/1.7.0/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::fill_kernel_cuda, 4u>, float (), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::fill_kernel_cuda, 4u>, float (), float> const&) + 0x862 (0x7f1e6127bf82 in /usr/local/libtorch/1.7.0/lib/libtorch_cuda.so)
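For context, "no kernel image is available for execution on the device" usually means the libtorch build inside the container was not compiled with kernels for the GPU's compute capability: the A40 is an Ampere part (sm_86), while the bundled libtorch 1.7.0 typically lacked sm_86 kernels. A diagnostic sketch for comparing the two, assuming a PyTorch install is available for inspection:

```python
import torch  # assumes a PyTorch install comparable to the container's libtorch

# Architectures compiled into this build; the device's arch must be covered
# (or be reachable from the highest PTX target) for kernels to launch.
print("CUDA toolkit:", torch.version.cuda)
print("Compiled for:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Device reports sm_{major}{minor}")  # an A40 reports sm_86
```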
Hello - I'm working through the tutorial for in-painting and I've successfully completed the feature extraction training.
However, even though the classification training finishes successfully, I see that the loss balloons to a large number. Do you have any sense of what this could mean?
Here is the output from training the classifier:
$ cat rmldnnClass.out | grep Accuracy
[2022-Apr-25 15:28:10.260352] Eval Epoch [ 1/100] : Batch [ 1/ 1] (Loss: 11.9174) | Accuracy: 0.618
[2022-Apr-25 15:28:26.116163] Eval Epoch [ 2/100] : Batch [ 1/ 1] (Loss: 71.4548) | Accuracy: 0.634
[2022-Apr-25 15:28:43.123124] Eval Epoch [ 3/100] : Batch [ 1/ 1] (Loss: 436.4486) | Accuracy: 0.605
[2022-Apr-25 15:28:58.762846] Eval Epoch [ 4/100] : Batch [ 1/ 1] (Loss: 4738.9883) | Accuracy: 0.601
[2022-Apr-25 15:29:14.939691] Eval Epoch [ 5/100] : Batch [ 1/ 1] (Loss: 22715.8613) | Accuracy: 0.622
[2022-Apr-25 15:29:30.472585] Eval Epoch [ 6/100] : Batch [ 1/ 1] (Loss: 117077.8047) | Accuracy: 0.626
[2022-Apr-25 15:29:45.980462] Eval Epoch [ 7/100] : Batch [ 1/ 1] (Loss: 1560777.3750) | Accuracy: 0.599
[2022-Apr-25 15:30:01.238265] Eval Epoch [ 8/100] : Batch [ 1/ 1] (Loss: 10272561.0000) | Accuracy: 0.628
[2022-Apr-25 15:30:16.822373] Eval Epoch [ 9/100] : Batch [ 1/ 1] (Loss: 106860592.0000) | Accuracy: 0.537
[2022-Apr-25 15:30:32.327156] Eval Epoch [ 10/100] : Batch [ 1/ 1] (Loss: 492095744.0000) | Accuracy: 0.598
[2022-Apr-25 15:30:38.241459] Eval Epoch [ 11/100] : Batch [ 1/ 1] (Loss: 1098736640.0000) | Accuracy: 0.593
[2022-Apr-25 15:30:42.281468] Eval Epoch [ 12/100] : Batch [ 1/ 1] (Loss: 1204668032.0000) | Accuracy: 0.619
[2022-Apr-25 15:30:46.724595] Eval Epoch [ 13/100] : Batch [ 1/ 1] (Loss: 1348660480.0000) | Accuracy: 0.639
[2022-Apr-25 15:30:51.073770] Eval Epoch [ 14/100] : Batch [ 1/ 1] (Loss: 1276815488.0000) | Accuracy: 0.639
[2022-Apr-25 15:30:55.363943] Eval Epoch [ 15/100] : Batch [ 1/ 1] (Loss: 1593674496.0000) | Accuracy: 0.639
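One possible reading of the numbers above: accuracy depends only on the argmax of the predictions, while NLL (the loss configured below) penalizes confidence on the wrong class without bound, so a model that stays correct on roughly 60% of samples but grows extremely confident on the wrong 40% will show an exploding loss at a flat accuracy. A minimal numeric illustration in pure Python (not rmldnn code):

```python
import math

def nll(prob_true_class):
    """Negative log-likelihood of the probability assigned to the true class."""
    return -math.log(prob_true_class)

# A correct, confident prediction contributes almost nothing to the loss...
print(nll(0.99))  # ~0.01

# ...while a wrong prediction that grows more confident blows the loss up,
# even though the argmax (and hence the accuracy) is unchanged:
for p in (1e-2, 1e-6, 1e-12):  # probability assigned to the true class shrinks
    print(nll(p))
```

Loss growing by orders of magnitude per epoch like this often points at diverging weights, e.g. an optimizer step that is too aggressive for the problem, or a loss applied to raw outputs rather than log-probabilities.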
Information on training:
[2022-Apr-25 15:27:20.244494] RocketML : dnn
[2022-Apr-25 15:27:20.244561] rocketml 1.0.0 (Linux-5.3.0-1031-azure ) (Apr 14 2022 22:57:24) (git:master rev:b822b0c)
[2022-Apr-25 15:27:20.244570] RocketML : 4 MPI processes
[2022-Apr-25 15:27:20.244591] ___ __
[2022-Apr-25 15:27:20.244599] /\_ \ /\ \
[2022-Apr-25 15:27:20.244605] _ __ ___ ___ \//\ \ \_\ \ ___ ___
[2022-Apr-25 15:27:20.244611] /\`'__\ /' __` __`\ \ \ \ /'__ \ /' _ `\ /' _ `\
[2022-Apr-25 15:27:20.244617] \ \ \/ /\ \/\ \/\ \ \_\ \_ /\ \_\ \ /\ \/\ \ /\ \/\ \
[2022-Apr-25 15:27:20.244623] \ \_\ \ \_\ \_\ \_\ /\____\\ \___,_\\ \_\ \_\\ \_\ \_\
[2022-Apr-25 15:27:20.244628] \/_/ \/_/\/_/\/_/ \/____/ \/__,_ / \/_/\/_/ \/_/\/_/
[2022-Apr-25 15:27:20.244634]
[2022-Apr-25 15:27:20.244639] (C) 2022 RocketML, Inc. All rights reserved.
[2022-Apr-25 15:27:20.244645]
[2022-Apr-25 15:27:20.244650] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244656] | RocketML Deep Neural Networks |
[2022-Apr-25 15:27:20.244661] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244667] | More info: https://github.com/rocketmlhq/rmldnn |
[2022-Apr-25 15:27:20.244673] | License : https://github.com/rocketmlhq/rmldnn/blob/main/LICENSE |
[2022-Apr-25 15:27:20.244678] | Contact : [email protected] |
[2022-Apr-25 15:27:20.244684] |-------------------------------------------------------------------|
[2022-Apr-25 15:27:20.244689]
[2022-Apr-25 15:27:20.410407] ----------------- Device(s) -----------------
[2022-Apr-25 15:27:20.410433] CUDA:0
[2022-Apr-25 15:27:20.410441] CUDA:1
[2022-Apr-25 15:27:20.410447] CUDA:2
[2022-Apr-25 15:27:20.410452] CUDA:3
[2022-Apr-25 15:27:20.410458] ---------------------------------------------
[2022-Apr-25 15:27:20.583516] -------------- Gradient reducer -------------
[2022-Apr-25 15:27:20.583552] Type: oneshot
[2022-Apr-25 15:27:20.584248] ---------------------------------------------
[2022-Apr-25 15:27:20.584259] --------------- Neural Network --------------
[2022-Apr-25 15:27:20.584264] Model name : model_2
[2022-Apr-25 15:27:20.584268] Total parameters : 23796744
[2022-Apr-25 15:27:20.584388] Trainable parameters: 262152
[2022-Apr-25 15:27:20.584492] Num of operations : 176
[2022-Apr-25 15:27:20.584494] ---------------------------------------------
[2022-Apr-25 15:27:20.586453] Loading model checkpoint from file: ./model_checkpoints/model_checkpoint_100.pt
[2022-Apr-25 15:27:22.166056] Skipping layer dense_1: not found in model
[2022-Apr-25 15:27:22.166088] Skipping parameter dense_1.weight: not found in model
[2022-Apr-25 15:27:22.166096] Skipping parameter dense_1.bias: not found in model
[2022-Apr-25 15:27:29.930647] ------------- TAO configuration -------------
[2022-Apr-25 15:27:29.930667] Optimization algo: bqnls
[2022-Apr-25 15:27:29.930671] Max iterations: 10
[2022-Apr-25 15:27:29.930676] Max func evals: 4000
[2022-Apr-25 15:27:29.930678] Absolute tolerance: 1e-08
[2022-Apr-25 15:27:29.930690] Relative tolerance: 1e-08
[2022-Apr-25 15:27:29.930693] Line search algo: more-thuente
[2022-Apr-25 15:27:29.930696] ---------------------------------------------
[2022-Apr-25 15:27:29.930735] -------------------- Loss -------------------
[2022-Apr-25 15:27:29.930744] Function : NLL (Negative Log-Likelihood)
[2022-Apr-25 15:27:29.930750] Weight : None
[2022-Apr-25 15:27:29.930759] Ignore index: None
[2022-Apr-25 15:27:29.930764] ---------------------------------------------
[2022-Apr-25 15:27:29.930802] Discovering training input images...
[2022-Apr-25 15:27:34.766006] Pre-loading training input images...
[2022-Apr-25 15:27:44.770813] 62% (ETA 6.1s)
[2022-Apr-25 15:27:50.942811] Discovering training labels...
[2022-Apr-25 15:27:50.956921] Number of class labels: 8
[2022-Apr-25 15:27:50.971304] Discovering test input images...
[2022-Apr-25 15:27:50.986674] Pre-loading test input images...
[2022-Apr-25 15:27:52.942285] Discovering test labels...
[2022-Apr-25 15:27:52.956547] Number of class labels: 8
Description
I'm trying to use rmldnn to perform semantic segmentation of images, but when I run it, it successfully discovers the input images and then stops abruptly, never reaching the training step.
To Reproduce
Steps to reproduce the behavior: