I am trying to train an STDC segmentation model using super-gradients.
- My dataset is in COCO 2017 format.
Here is my train.py code:
-----train.py----
from super_gradients.training.datasets.dataset_interfaces.dataset_interface import CoCoSegmentationDatasetInterface
from super_gradients.training.sg_model import SgModel
from super_gradients.training.metrics import BinaryIOU
from super_gradients.training.transforms.transforms import (ResizeSeg, RandomFlip, RandomRescale, CropImageAndMask,
PadShortToCropSize, ColorJitterSeg)
from super_gradients.training.utils.callbacks import BinarySegmentationVisualizationCallback, Phase
from torchvision import transforms
# DEFINE DATA TRANSFORMATIONS
dataset_params = {"dataset_dir": "/home/syed/work/vision_datasets/11apr22",
"batch_size": 8,
"val_batch_size":8,
"num_classes":2
}
dataset_interface = CoCoSegmentationDatasetInterface(dataset_params,
cache_labels = False, cache_images = False, dataset_classes_inclusion_tuples_list = [(0, 'background'), (1, 'drivable-area"')])
model = SgModel("stdc2_seg50_scratch_50_epochs")
# CONNECTING THE DATASET INTERFACE WILL SET SGMODEL'S CLASSES ATTRIBUTE ACCORDING TO SUPERVISELY
#model.connect_dataset_interface(dataset_interface)
# THIS IS WHERE THE MAGIC HAPPENS- SINCE SGMODEL'S CLASSES ATTRIBUTE WAS SET TO BE DIFFERENT FROM CITYSCAPES'S, AFTER
# LOADING THE PRETRAINED REGSET, IT WILL CALL ITS REPLACE_HEAD METHOD AND CHANGE ITS SEGMENTATION HEAD LAYER ACCORDING
# TO OUR BINARY SEGMENTATION DATASET
model.build_model(architecture = "stdc2_seg50", arch_params={"num_classes":1})
#model.build_model("stdc2_seg50")
model.connect_dataset_interface(dataset_interface)
# DEFINE TRAINING PARAMS. SEE DOCS FOR THE FULL LIST.
train_params = {"max_epochs": 50,
"lr_mode": "cosine",
"initial_lr": 0.0064, # for batch_size=16
"optimizer_params": {"momentum": 0.843,
"weight_decay": 0.00036,
"nesterov": True},
"criterion_params": {"num_classes": 1
},
"cosine_final_lr_ratio": 0.1,
"multiply_head_lr": 10,
"optimizer": "SGD",
"loss": "stdc_loss",
"ema": True,
"zero_weight_decay_on_bias_and_bn": True,
"average_best_models": True,
"mixed_precision": False,
"metric_to_watch": "mean_IOU",
"greater_metric_to_watch_is_better": True,
"train_metrics_list": [BinaryIOU()],
"valid_metrics_list": [BinaryIOU()],
"loss_logging_items_names": ["loss"],
"phase_callbacks": [BinarySegmentationVisualizationCallback(phase=Phase.VALIDATION_BATCH_END,
freq=1,
last_img_idx_in_batch=4)],
}
model.train(train_params)
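Note that the script above sets num_classes in three places and the values differ: 2 in dataset_params, 1 in arch_params and 1 in criterion_params. Whether these are required to match is an assumption, not something confirmed from the super-gradients docs, so the sketch below only makes the mismatch explicit. collect_num_classes and is_consistent are hypothetical helpers, not super-gradients functions.

```python
# Hypothetical helpers (not part of super-gradients): gather every
# num_classes value set in train.py and report whether they agree.
def collect_num_classes(dataset_params, arch_params, criterion_params):
    return {
        "dataset_params": dataset_params.get("num_classes"),
        "arch_params": arch_params.get("num_classes"),
        "criterion_params": criterion_params.get("num_classes"),
    }


def is_consistent(values):
    # True only when every entry holds the same number of classes.
    return len(set(values.values())) == 1


# Values copied from the script above.
found = collect_num_classes(
    dataset_params={"num_classes": 2},
    arch_params={"num_classes": 1},
    criterion_params={"num_classes": 1},
)
print(found, "consistent:", is_consistent(found))
```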
Training stops with the following error message:
(super-gradients) gridai@session:~/work/super-gradients β python train.py
You did not mention an AWS environment.You can set the environment variable ENVIRONMENT_NAME with one of the values: development,staging,production
/home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
callbacks -WARNING- Failed to import deci_lab_client
loading annotations into memory...
Done (t=0.12s)
creating index...
index created!
loading annotations into memory...
Done (t=0.04s)
creating index...
index created!
/home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/deprecate/deprecation.py:115: FutureWarning: The IoU was deprecated since v0.7 in favor of torchmetrics.classification.jaccard.JaccardIndex. It will be removed in v0.8.
stream(template_mgs % msg_args)
sg_model -INFO- Using EMA with params {}
"events.out.tfevents.1649674474.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-ftddj.6391.0" will not be deleted
"events.out.tfevents.1649674209.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-ftddj.6318.0" will not be deleted
"events.out.tfevents.1649761776.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-7xfbb.1654.0" will not be deleted
"events.out.tfevents.1649761893.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-7xfbb.1800.0" will not be deleted
"events.out.tfevents.1649761085.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-7xfbb.1571.0" will not be deleted
"events.out.tfevents.1649676959.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-ftddj.6464.0" will not be deleted
"events.out.tfevents.1649761829.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-7xfbb.1727.0" will not be deleted
"events.out.tfevents.1649674011.ixnode-cce75236-32dc-42d6-90e9-a713878cc921-758f988669-ftddj.6245.0" will not be deleted
sg_model -INFO- Started training for 50 epochs (0/49)
Train epoch 0: 0%| | 0/210 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [24,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [25,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [26,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [27,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [28,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [29,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [30,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [464,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [96,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [97,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [98,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [99,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [100,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [101,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [102,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [103,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [399,0,0], thread: [104,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [53,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [54,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [55,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [56,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [57,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [58,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [59,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [60,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [61,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [62,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [63,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [112,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [113,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [114,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [115,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [116,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [117,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [118,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [119,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [120,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [121,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [122,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [123,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [124,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [125,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [126,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [453,0,0], thread: [127,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [87,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [88,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [89,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [90,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [91,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [92,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [93,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [94,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [400,0,0], thread: [95,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
failed.
Train epoch 0: 0%| | 0/210 [00:02<?, ?it/s]
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f29ab3097d2 in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10cf22a (0x7f29ac8eb22a in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x2fff28 (0x7f29fe644f28 in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #3: c10::TensorImpl::release_resources() + 0x175 (0x7f29ab2f2005 in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1ede49 (0x7f29fe532e49 in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4da268 (0x7f29fe81f268 in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f29fe81f562 in /home/jovyan/conda/envs/super-gradients/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #27: __libc_start_main + 0xf3 (0x7f2a010290b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted
(super-gradients) gridai@session:~/work/super-gradients β
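The assertion in ScatterGatherKernel.cu checks that every index passed to a scatter/gather op lies in [0, index_size). One common way to trip it is a label mask containing class ids outside [0, num_classes); whether that is what happens here is an assumption. The sketch below is a minimal, framework-free check for such labels. find_out_of_range_labels is a hypothetical helper, and the toy mask is illustrative data, not taken from the dataset.

```python
def find_out_of_range_labels(mask, num_classes, ignore_index=None):
    """Return the sorted label values in `mask` outside [0, num_classes)."""
    bad = set()
    stack = [mask]
    while stack:
        item = stack.pop()
        if isinstance(item, (list, tuple)):
            # Nested rows: keep unpacking until we reach scalar labels.
            stack.extend(item)
        elif item != ignore_index and not (0 <= item < num_classes):
            bad.add(item)
    return sorted(bad)


# Toy 2x3 mask with labels 0 (background) and 1 (foreground).
toy_mask = [[0, 0, 1], [1, 1, 0]]
print(find_out_of_range_labels(toy_mask, num_classes=1))  # label 1 is out of range
print(find_out_of_range_labels(toy_mask, num_classes=2))  # nothing out of range
```

For a NumPy array or a torch tensor, mask.tolist() gives the nested lists this helper expects. Rerunning the script with the environment variable CUDA_LAUNCH_BLOCKING=1, or on CPU, usually yields a Python stack trace that points at the failing operation.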
Here are the system details:
(super-gradients) gridai@session:~/work/super-gradients β uname -srm
Linux 5.4.129-63.229.amzn2.x86_64 x86_64
(super-gradients) gridai@session:~/work/super-gradients β nvidia-smi
Tue Apr 12 11:30:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(super-gradients) gridai@session:~/work/super-gradients β nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
Also, would you please provide example code for training the different STDC variants on the coco128 dataset, with some documentation of the important parameters to tweak?
Thanks