Hi, I'm trying to run the model but in the training part, I can't find the file

Missing train_load_prop.sh to train about dmm_net HOT 8 OPEN

zengxh commented on June 15, 2024

Missing train_load_prop.sh to train

from dmm_net.

Comments (8)

ZENGXH commented on June 15, 2024

Hi, I have modified the README: the script name should be train_101.sh or train_50.sh. See if it works for you :)


walterpcasas commented on June 15, 2024

Thanks! :D But I still have the same problem, because it can't find srun.
I decided to run the training from the DMM_Net directory by calling train.py directly. When I run it, it first tells me that coco is not among the supported models, even though I have downloaded the cocoAPI. Here is the result:

2019-10-18 17:15:31,833-{train.py:384}-INFO-[model_name] model
2019-10-18 17:15:31,833-{train.py:385}-INFO-get number of gpu: 1
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    trainIters(args)
  File "train.py", line 205, in trainIters
    enc_opt, dec_opt, trainer = build_model(args)
  File "train.py", line 130, in build_model
    encoder = FeatureExtractor(args)
  File "/content/DMM_Net/dmm/modules/model_encoder.py", line 51, in __init__
    raise Exception("The base model you chose is not supported ! {}".format(args.base_model))
Exception: The base model you chose is not supported ! coco

So I passed the option -base_model 'resnet50' and that problem went away.
But when I ran it again, I got an error about a missing configs/default.yaml, so I copied it from 'dmm'. It then ran a little further, but I got another error that I can't resolve at the moment:

2019-10-18 15:14:57,300-{train.py:384}-INFO-[model_name] model
2019-10-18 15:14:57,301-{train.py:385}-INFO-get number of gpu: 1
2019-10-18 15:14:58,615-{utils.py:213}-INFO-[load_DMM_config] configs/default.yaml
2019-10-18 15:14:58,619-{utils.py:232}-INFO-
2019-10-18 15:14:58,642-{utils.py:213}-INFO-[load_DMM_config] configs/default.yaml
2019-10-18 15:14:58,647-{utils.py:232}-INFO-
2019-10-18 15:14:58,650-{train.py:152}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 400, 'relax_proj_iter': 50, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2019-10-18 15:15:01,311-{trainer.py:63}-INFO-load from json_data; num vid 3000
2019-10-18 15:15:01,311-{train.py:154}-INFO-init model 4.010
2019-10-18 15:15:01,312-{train.py:161}-INFO-optimizer 0.001
2019-10-18 15:15:01,313-{train.py:163}-INFO-[enc_opt] len: 2; len for each param group: [48, 161]
2019-10-18 15:15:01,313-{train.py:165}-INFO-[dec_opt] len: 1; len for each param group: [10]
2019-10-18 15:15:01,315-{train.py:212}-INFO-save args in experiments/model/10-18-15-14args.pkl
2019-10-18 15:15:01,315-{train.py:213}-INFO-Namespace(augment=False, base_model='resnet50', batch_size=10, best_val_loss=0, cache_data=1, config_train='configs/default.yaml', dataset='youtube', davis_eval_folder='', device=device(type='cuda', index=0), distributed=0, distributed_manully=0, distributed_manully_Nrep=0, distributed_manully_rank=0, dropout=0.0, epoch_resume=0, eval_flag='pred', eval_split='trainval', finetune_after=0, gpu_id=0, gt_maxseqlen=5, hidden_size=128, imsize=480, iou_weight=1.0, kernel_size=3, length_clip=3, load_proposals=0, load_proposals_dataset=0, local_rank=0, log_file='train.log', log_term=False, loss_weight_iouraw=18.0, loss_weight_match=1.0, lr=0.001, lr_cnn=0.0001, lr_decoder=0.001, mask_th=0.5, max_dets=100, max_epoch=100, max_eval_iter=800, maxseqlen=5, min_delta=0.0, min_size=0.001, model_dir='experiments/model', model_name='model', models_root='experiments/', momentum=0.9, my_augment=False, ngpus=1, num_classes=21, num_workers=0, only_spatial=False, only_temporal=False, optim='adam', optim_cnn='adam', overwrite_loadargs=1, pad_video=0, patience=15, patience_stop=60, pred_offline_meta='../data/ytb_vos/splits_813_3k_trainvaltest/meta_vid_frame_2_predid.json', pred_offline_path=None, pred_offline_path_eval=None, prev_mask_d=1, print_every=2, random_select_frames=0, resize=False, resume=False, resume_path='epoxx_iterxxxx', rotation=10, sample_inference_mask=0, save_every=3000, seed=123, shear=0.1, single_object=False, skip_empty_starting_frame=0, skip_mode='concat', test=0, test_image_h=256, test_image_w=448, test_model_path='', threshold_mask=0.4, train_h=255, train_split='train', train_w=448, translation=0.1, update_encoder=1, use_gpu=True, use_refmask=0, weight_decay=1e-06, weight_decay_cnn=1e-06, year='2017', youtube_dir='../../databases/YouTubeVOS/', zoom=0.7)
2019-10-18 15:15:01,315-{train.py:222}-INFO-init_dataloaders
2019-10-18 15:15:01,373-{youtubeVOS.py:84}-INFO-[dataset] phase read train; len of db seq 3000
2019-10-18 15:15:01,374-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-18 15:15:01,374-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl; it will take a while to cache the data 
2019-10-18 15:18:37,794-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl
tcmalloc: large alloc 1387642880 bytes == 0x19e71e000 @  0x7f00502942a4 0x5a1987 0x4bd187 0x4bf64a 0x4bec3e 0x4c0335 0x4bf091 0x4bf8b6 0x4bec3e 0x4c0335 0x4bf091 0x4bf56a 0x4c0335 0x4bf091 0x4bee6c 0x4bf4c6 0x527d26 0x42e6c9 0x4f86ba 0x4f98c7 0x4f6128 0x4f426e 0x5a1481 0x512a60 0x53ee21 0x57ec0c 0x4f88ba 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d
tcmalloc: large alloc 2082512896 bytes == 0x7eff43df6000 @  0x7f00502942a4 0x5a1987 0x4bd187 0x4bf64a 0x4bec3e 0x4c0335 0x4bf091 0x4bf8b6 0x4bec3e 0x4c0335 0x4bf091 0x4bf56a 0x4c0335 0x4bf091 0x4bedf4 0x4bf4c6 0x527d26 0x42e6c9 0x4f86ba 0x4f98c7 0x4f6128 0x4f426e 0x5a1481 0x512a60 0x53ee21 0x57ec0c 0x4f88ba 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d
tcmalloc: large alloc 3124248576 bytes == 0x7efe89a72000 @  0x7f00502942a4 0x5a1987 0x4bd187 0x4bf64a 0x4bec3e 0x4c0335 0x4bf091 0x4bf8b6 0x4bec3e 0x4c0335 0x4bf091 0x4bf56a 0x4c0335 0x4bf091 0x4bedf4 0x4bf4c6 0x527d26 0x42e6c9 0x4f86ba 0x4f98c7 0x4f6128 0x4f426e 0x5a1481 0x512a60 0x53ee21 0x57ec0c 0x4f88ba 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d
2019-10-18 15:18:58,298-{youtubeVOS.py:125}-INFO-load lmdb 236.95
2019-10-18 15:18:58,849-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.55; cliplen 3| annotation clip 26490(skip 0)| videos 3000
2019-10-18 15:18:58,903-{youtubeVOS.py:265}-INFO-load keys 0.05
2019-10-18 15:18:58,903-{train.py:104}-INFO-INPUT shape: 255 448
2019-10-18 15:18:58,910-{youtubeVOS.py:84}-INFO-[dataset] phase read trainval; len of db seq 200
2019-10-18 15:18:58,911-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-18 15:18:58,911-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl; it will take a while to cache the data 
2019-10-18 15:19:14,751-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl
2019-10-18 15:19:15,439-{youtubeVOS.py:125}-INFO-load lmdb 16.53
2019-10-18 15:19:15,460-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.02; cliplen 3| annotation clip 800(skip 0)| videos 200
2019-10-18 15:19:15,464-{youtubeVOS.py:265}-INFO-load keys 0.00
2019-10-18 15:19:15,464-{train.py:104}-INFO-INPUT shape: 255 448
2019-10-18 15:19:15,464-{train.py:227}-INFO-dataloader 254.149
2019-10-18 15:19:15,464-{train.py:233}-INFO-==========> start sample_inference_mask
2019-10-18 15:19:15,465-{train.py:248}-INFO-epoch 0 - trainval; 
2019-10-18 15:19:15,465-{train.py:250}-INFO--- loss weight loss_weight_match: 1.0 loss_weight_iouraw 18.0; 
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    trainIters(args)
  File "train.py", line 254, in trainIters
    for batch_idx, (inputs,imgs_names,targets,seq_name,starting_frame) in enumerate(loaders[split]):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 346, in __next__
    data = self.dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/DMM_Net/dmm/dataloader/dataset.py", line 216, in __getitem__
    CHECKEQ(img.shape[-1], self.inputRes[1])
  File "/content/DMM_Net/dmm/utils/checker.py", line 27, in CHECKEQ
    assert(a == b), 'get {} {}'.format(a, b)
AssertionError: get 3 448

Maybe you can help me with this error, please!


ZENGXH commented on June 15, 2024

Hi, I have updated the scripts and removed the srun part. Please see if it works for you.
I am sorry that some of the default argument values and default.yaml may not be compatible with the current code. I will find time to fix them. Let me know if the scripts do not work. :)


ZENGXH commented on June 15, 2024

By the way, you may also want to remove the cache files data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl and data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl before you run the scripts.
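For anyone following along, the cache cleanup suggested above could be scripted roughly as follows. This is only a sketch: the two paths are copied from the log output earlier in the thread, while the helper name `remove_stale_caches` and its `root` parameter are made up for illustration and are not part of DMM_Net. It assumes you run it from the same directory you launch train.py from.

```python
from pathlib import Path

# Cache paths as printed in the training log, relative to the launch directory.
CACHE_FILES = [
    "data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl",
    "data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl",
]


def remove_stale_caches(root=".", cache_files=CACHE_FILES):
    """Delete the cached pickles that exist; return the paths actually removed.

    Hypothetical helper, not part of the DMM_Net code base.
    """
    removed = []
    for rel in cache_files:
        path = Path(root) / rel
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed


if __name__ == "__main__":
    for removed_path in remove_stale_caches():
        print("removed", removed_path)
```

Files that are already absent are skipped silently, so it is safe to run this before every training attempt. The next run will rebuild the cache, which took a few minutes in the log above.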


walterpcasas commented on June 15, 2024

Thanks! It works better now, but I have a problem:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2019-10-19 05:38:45,489-{train.py:395}-INFO-to distributed; local_rank 2
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-19 05:38:45,586-{train.py:384}-INFO-[model_name] ytb_train_x101
2019-10-19 05:38:45,587-{train.py:385}-INFO-get number of gpu: 4
2019-10-19 05:38:45,587-{train.py:395}-INFO-to distributed; local_rank 0
2019-10-19 05:38:45,763-{train.py:395}-INFO-to distributed; local_rank 3
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-19 05:38:45,826-{train.py:395}-INFO-to distributed; local_rank 1
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-19 05:43:47,555-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-19 05:43:47,559-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    trainIters(args)
  File "train.py", line 205, in trainIters
    enc_opt, dec_opt, trainer = build_model(args)
  File "train.py", line 133, in build_model
    encoder_dict, decoder_dict,_,_,_ = load_checkpoint('../../experiments/models/one-shot-model-youtubevos/')
  File "/content/DMM_Net/dmm/utils/utils.py", line 13, in load_checkpoint
    encoder_dict = torch.load(os.path.join(model_name,'encoder.pt'))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 419, in load
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '../../experiments/models/one-shot-model-youtubevos/encoder.pt'
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '-pred_offline_path_eval', 'experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth', '-pred_offline_path', './experiments/proposals/coco81/inference/youtubevos_train3k_meta/asdict_50/videos/', '-load_proposals_dataset', '1', '-load_proposals', '1', '-distributed', '1', '-save_every', '3000', '-train_split', 'train', '-eval_split', 'trainval', '-loss_weight_match=1', '-loss_weight_iouraw=1', '-finetune_after', '3', '-skip_empty_starting_frame', '1', '-random_select_frames', '1', '-model_name', 'ytb_train_x101', '-train_h', '255', '-train_w', '448', '-num_workers', '4', '-lr', '0.0001', '-lr_cnn', '0.00001', '-config_train', 'dmm/configs/train.yaml', '-batch_size=4', '-length_clip=3', '-max_epoch=2', '--resize', '-base_model', 'resnet101', '-models_root=experiments/models/', '-max_eval_iter=800', '--augment', '-ngpus', '4']' returned non-zero exit status 1

Do you have any idea please?


ZENGXH commented on June 15, 2024

I forgot to include the pretrained model download link. It is fixed in my last commit:

DMM_Net/scripts/train/train_101.sh

Line 6 in a630868

    
           wget https://imatge.upc.edu/web/sites/default/files/projects/segmentation/public_html/rvos-pretrained-models/one-shot-model-youtubevos.zip

See if it works for you.
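After downloading and unzipping, it may help to confirm that the checkpoint is where train.py looks for it before launching a long run. The sketch below only checks for encoder.pt, because that is the one file named in the FileNotFoundError above; any other file names would be a guess. The helper `missing_checkpoint_files` is hypothetical and not part of DMM_Net.

```python
import os


def missing_checkpoint_files(model_dir, filenames=("encoder.pt",)):
    """Return the expected files that are absent from model_dir.

    Hypothetical helper; extend `filenames` only with names you have
    confirmed in the repository.
    """
    return [
        name
        for name in filenames
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    # Relative path copied from the traceback; it resolves against the
    # current working directory, so run this from where train.py is launched.
    model_dir = "../../experiments/models/one-shot-model-youtubevos/"
    missing = missing_checkpoint_files(model_dir)
    if missing:
        print("missing in", os.path.abspath(model_dir), ":", missing)
    else:
        print("checkpoint files found in", os.path.abspath(model_dir))
```

Printing the absolute path makes it easier to see when the relative `../../` prefix points somewhere other than the directory the zip was extracted into.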


walterpcasas commented on June 15, 2024

Thanks. I had already added what was missing and fixed the path locally. Now the whole model runs, but then it hangs for a long time. I have already left it for about 3 hours. Is this normal?

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2019-10-31 19:41:53,158-{train.py:395}-INFO-to distributed; local_rank 3
2019-10-31 19:41:53,158-{train.py:384}-INFO-[model_name] ytb_train_x101
2019-10-31 19:41:53,158-{train.py:385}-INFO-get number of gpu: 4
2019-10-31 19:41:53,159-{train.py:395}-INFO-to distributed; local_rank 0
2019-10-31 19:41:53,162-{train.py:395}-INFO-to distributed; local_rank 2
2019-10-31 19:41:53,164-{train.py:395}-INFO-to distributed; local_rank 1
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-31 19:46:55,205-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-31 19:46:55,212-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-31 19:47:00,949-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-31 19:47:00,953-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-31 19:47:01,006-{train.py:152}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 10, 'relax_proj_iter': 5, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2019-10-31 19:47:01,132-{trainer.py:63}-INFO-load from json_data; num vid 3000
2019-10-31 19:47:01,133-{train.py:154}-INFO-init model 6.824
2019-10-31 19:47:01,135-{train.py:161}-INFO-optimizer 0.002
2019-10-31 19:47:01,135-{train.py:163}-INFO-[enc_opt] len: 2; len for each param group: [48, 314]
2019-10-31 19:47:01,135-{train.py:165}-INFO-[dec_opt] len: 1; len for each param group: [10]
2019-10-31 19:47:01,135-{train.py:169}-INFO-init DistributedDataParallel rank 0


ZENGXH commented on June 15, 2024

It shouldn’t take that long. Could you try setting the training argument distributed to 0 and NGPUS to 1?
Besides, could you also try the commands mentioned here NVIDIA/vid2vid#17 (comment) to help us better localize the problem?
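As a quick way to check this, the "invalid device ordinal" errors in the logs above are what CUDA reports when a process asks for a GPU index that is not visible on the machine. The logs are consistent with a launch using -ngpus 4 on a machine exposing fewer GPUs: ranks 1, 2 and 3 fail, and rank 0 then waits for them. The sketch below compares the requested count with what PyTorch can see. The function names are illustrative and not part of DMM_Net; `torch.cuda.device_count()` is the only PyTorch call used.

```python
def visible_gpu_count():
    """Number of CUDA devices PyTorch can see; 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def invalid_ranks(ngpus, device_count):
    """Local ranks in 0..ngpus-1 that have no matching device index."""
    return [rank for rank in range(ngpus) if rank >= device_count]


if __name__ == "__main__":
    requested = 4  # value passed as -ngpus in the failing command above
    available = visible_gpu_count()
    bad = invalid_ranks(requested, available)
    if bad:
        print("visible GPUs:", available, "; ranks without a device:", bad)
        print("try a single-process run (distributed 0, one GPU)")
    else:
        print("all", requested, "ranks map to a visible GPU")
```

If the script reports ranks without a device, lowering the GPU count to the number actually visible, or running non-distributed as suggested above, should avoid the failing `torch.cuda.set_device(local_rank)` calls.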
