
Comments (8)

ZENGXH commented on June 15, 2024

Hi, I have updated the README: the script name should be train_101.sh or train_50.sh. See if it works for you :)

from dmm_net.

walterpcasas commented on June 15, 2024

Thanks! :D But I still have the same problem, because it can't find srun.
I decided to try training from the DMM_Net directory by running train.py directly. When I run it, it first tells me that coco is not among the supported base models, even though I have downloaded the cocoAPI. Here is the result:

2019-10-18 17:15:31,833-{train.py:384}-INFO-[model_name] model
2019-10-18 17:15:31,833-{train.py:385}-INFO-get number of gpu: 1
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    trainIters(args)
  File "train.py", line 205, in trainIters
    enc_opt, dec_opt, trainer = build_model(args)
  File "train.py", line 130, in build_model
    encoder = FeatureExtractor(args)
  File "/content/DMM_Net/dmm/modules/model_encoder.py", line 51, in __init__
    raise Exception("The base model you chose is not supported ! {}".format(args.base_model))
Exception: The base model you chose is not supported ! coco

So I passed the option -base_model 'resnet50' and that problem went away.
But when I ran it again, I got an error that configs/default.yaml was missing, so I copied it from 'dmm'. The run then got a little further, but I hit another error that I can't deal with at the moment:

2019-10-18 15:14:57,300-{train.py:384}-INFO-[model_name] model
2019-10-18 15:14:57,301-{train.py:385}-INFO-get number of gpu: 1
2019-10-18 15:14:58,615-{utils.py:213}-INFO-[load_DMM_config] configs/default.yaml
2019-10-18 15:14:58,619-{utils.py:232}-INFO-
2019-10-18 15:14:58,642-{utils.py:213}-INFO-[load_DMM_config] configs/default.yaml
2019-10-18 15:14:58,647-{utils.py:232}-INFO-
2019-10-18 15:14:58,650-{train.py:152}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 400, 'relax_proj_iter': 50, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2019-10-18 15:15:01,311-{trainer.py:63}-INFO-load from json_data; num vid 3000
2019-10-18 15:15:01,311-{train.py:154}-INFO-init model 4.010
2019-10-18 15:15:01,312-{train.py:161}-INFO-optimizer 0.001
2019-10-18 15:15:01,313-{train.py:163}-INFO-[enc_opt] len: 2; len for each param group: [48, 161]
2019-10-18 15:15:01,313-{train.py:165}-INFO-[dec_opt] len: 1; len for each param group: [10]
2019-10-18 15:15:01,315-{train.py:212}-INFO-save args in experiments/model/10-18-15-14args.pkl
2019-10-18 15:15:01,315-{train.py:213}-INFO-Namespace(augment=False, base_model='resnet50', batch_size=10, best_val_loss=0, cache_data=1, config_train='configs/default.yaml', dataset='youtube', davis_eval_folder='', device=device(type='cuda', index=0), distributed=0, distributed_manully=0, distributed_manully_Nrep=0, distributed_manully_rank=0, dropout=0.0, epoch_resume=0, eval_flag='pred', eval_split='trainval', finetune_after=0, gpu_id=0, gt_maxseqlen=5, hidden_size=128, imsize=480, iou_weight=1.0, kernel_size=3, length_clip=3, load_proposals=0, load_proposals_dataset=0, local_rank=0, log_file='train.log', log_term=False, loss_weight_iouraw=18.0, loss_weight_match=1.0, lr=0.001, lr_cnn=0.0001, lr_decoder=0.001, mask_th=0.5, max_dets=100, max_epoch=100, max_eval_iter=800, maxseqlen=5, min_delta=0.0, min_size=0.001, model_dir='experiments/model', model_name='model', models_root='experiments/', momentum=0.9, my_augment=False, ngpus=1, num_classes=21, num_workers=0, only_spatial=False, only_temporal=False, optim='adam', optim_cnn='adam', overwrite_loadargs=1, pad_video=0, patience=15, patience_stop=60, pred_offline_meta='../data/ytb_vos/splits_813_3k_trainvaltest/meta_vid_frame_2_predid.json', pred_offline_path=None, pred_offline_path_eval=None, prev_mask_d=1, print_every=2, random_select_frames=0, resize=False, resume=False, resume_path='epoxx_iterxxxx', rotation=10, sample_inference_mask=0, save_every=3000, seed=123, shear=0.1, single_object=False, skip_empty_starting_frame=0, skip_mode='concat', test=0, test_image_h=256, test_image_w=448, test_model_path='', threshold_mask=0.4, train_h=255, train_split='train', train_w=448, translation=0.1, update_encoder=1, use_gpu=True, use_refmask=0, weight_decay=1e-06, weight_decay_cnn=1e-06, year='2017', youtube_dir='../../databases/YouTubeVOS/', zoom=0.7)
2019-10-18 15:15:01,315-{train.py:222}-INFO-init_dataloaders
2019-10-18 15:15:01,373-{youtubeVOS.py:84}-INFO-[dataset] phase read train; len of db seq 3000
2019-10-18 15:15:01,374-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-18 15:15:01,374-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl; it will take a while to cache the data 
2019-10-18 15:18:37,794-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl
tcmalloc: large alloc 1387642880 bytes == 0x19e71e000 @  0x7f00502942a4 0x5a1987 0x4bd187 0x4bf64a 0x4bec3e 0x4c0335 0x4bf091 0x4bf8b6 0x4bec3e 0x4c0335 0x4bf091 0x4bf56a 0x4c0335 0x4bf091 0x4bee6c 0x4bf4c6 0x527d26 0x42e6c9 0x4f86ba 0x4f98c7 0x4f6128 0x4f426e 0x5a1481 0x512a60 0x53ee21 0x57ec0c 0x4f88ba 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d
tcmalloc: large alloc 2082512896 bytes == 0x7eff43df6000 @  0x7f00502942a4 0x5a1987 0x4bd187 0x4bf64a 0x4bec3e 0x4c0335 0x4bf091 0x4bf8b6 0x4bec3e 0x4c0335 0x4bf091 0x4bf56a 0x4c0335 0x4bf091 0x4bedf4 0x4bf4c6 0x527d26 0x42e6c9 0x4f86ba 0x4f98c7 0x4f6128 0x4f426e 0x5a1481 0x512a60 0x53ee21 0x57ec0c 0x4f88ba 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d
tcmalloc: large alloc 3124248576 bytes == 0x7efe89a72000 @  0x7f00502942a4 0x5a1987 0x4bd187 0x4bf64a 0x4bec3e 0x4c0335 0x4bf091 0x4bf8b6 0x4bec3e 0x4c0335 0x4bf091 0x4bf56a 0x4c0335 0x4bf091 0x4bedf4 0x4bf4c6 0x527d26 0x42e6c9 0x4f86ba 0x4f98c7 0x4f6128 0x4f426e 0x5a1481 0x512a60 0x53ee21 0x57ec0c 0x4f88ba 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d
2019-10-18 15:18:58,298-{youtubeVOS.py:125}-INFO-load lmdb 236.95
2019-10-18 15:18:58,849-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.55; cliplen 3| annotation clip 26490(skip 0)| videos 3000
2019-10-18 15:18:58,903-{youtubeVOS.py:265}-INFO-load keys 0.05
2019-10-18 15:18:58,903-{train.py:104}-INFO-INPUT shape: 255 448
2019-10-18 15:18:58,910-{youtubeVOS.py:84}-INFO-[dataset] phase read trainval; len of db seq 200
2019-10-18 15:18:58,911-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2019-10-18 15:18:58,911-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl; it will take a while to cache the data 
2019-10-18 15:19:14,751-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl
2019-10-18 15:19:15,439-{youtubeVOS.py:125}-INFO-load lmdb 16.53
2019-10-18 15:19:15,460-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.02; cliplen 3| annotation clip 800(skip 0)| videos 200
2019-10-18 15:19:15,464-{youtubeVOS.py:265}-INFO-load keys 0.00
2019-10-18 15:19:15,464-{train.py:104}-INFO-INPUT shape: 255 448
2019-10-18 15:19:15,464-{train.py:227}-INFO-dataloader 254.149
2019-10-18 15:19:15,464-{train.py:233}-INFO-==========> start sample_inference_mask
2019-10-18 15:19:15,465-{train.py:248}-INFO-epoch 0 - trainval; 
2019-10-18 15:19:15,465-{train.py:250}-INFO--- loss weight loss_weight_match: 1.0 loss_weight_iouraw 18.0; 
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    trainIters(args)
  File "train.py", line 254, in trainIters
    for batch_idx, (inputs,imgs_names,targets,seq_name,starting_frame) in enumerate(loaders[split]):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 346, in __next__
    data = self.dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/DMM_Net/dmm/dataloader/dataset.py", line 216, in __getitem__
    CHECKEQ(img.shape[-1], self.inputRes[1])
  File "/content/DMM_Net/dmm/utils/checker.py", line 27, in CHECKEQ
    assert(a == b), 'get {} {}'.format(a, b)
AssertionError: get 3 448

Maybe you can help me with this error, please!


ZENGXH commented on June 15, 2024

Hi, I have updated the scripts and removed the srun part. Please see if it works for you.
I am sorry that some of the default argument values and default.yaml may not be compatible with the current code. I will find time to fix them. Let me know if the scripts do not work. :)


ZENGXH commented on June 15, 2024

By the way, you may also want to remove the cache files data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl and data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl before you run the scripts.
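
A minimal sketch of that cleanup, assuming it is run from the directory where the relative data/ paths above resolve. The helper name remove_stale_caches is hypothetical and not part of DMM_Net:

```python
from pathlib import Path

# Cache files named in the comment above; paths are relative to the working directory.
CACHE_FILES = [
    "data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl",
    "data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl",
]


def remove_stale_caches(paths, root="."):
    """Delete each cache file that exists under root; return the paths actually removed."""
    removed = []
    for rel in paths:
        target = Path(root) / rel
        if target.is_file():
            target.unlink()
            removed.append(str(target))
    return removed


if __name__ == "__main__":
    for path in remove_stale_caches(CACHE_FILES):
        print("removed", path)
```

Files that do not exist are skipped silently, so the script can be run safely before every training attempt.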


walterpcasas commented on June 15, 2024

Thanks! It is working better, but I have a new problem:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2019-10-19 05:38:45,489-{train.py:395}-INFO-to distributed; local_rank 2
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-19 05:38:45,586-{train.py:384}-INFO-[model_name] ytb_train_x101
2019-10-19 05:38:45,587-{train.py:385}-INFO-get number of gpu: 4
2019-10-19 05:38:45,587-{train.py:395}-INFO-to distributed; local_rank 0
2019-10-19 05:38:45,763-{train.py:395}-INFO-to distributed; local_rank 3
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-19 05:38:45,826-{train.py:395}-INFO-to distributed; local_rank 1
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-19 05:43:47,555-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-19 05:43:47,559-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
Traceback (most recent call last):
  File "train.py", line 403, in <module>
    trainIters(args)
  File "train.py", line 205, in trainIters
    enc_opt, dec_opt, trainer = build_model(args)
  File "train.py", line 133, in build_model
    encoder_dict, decoder_dict,_,_,_ = load_checkpoint('../../experiments/models/one-shot-model-youtubevos/')
  File "/content/DMM_Net/dmm/utils/utils.py", line 13, in load_checkpoint
    encoder_dict = torch.load(os.path.join(model_name,'encoder.pt'))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 419, in load
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '../../experiments/models/one-shot-model-youtubevos/encoder.pt'
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '-pred_offline_path_eval', 'experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth', '-pred_offline_path', './experiments/proposals/coco81/inference/youtubevos_train3k_meta/asdict_50/videos/', '-load_proposals_dataset', '1', '-load_proposals', '1', '-distributed', '1', '-save_every', '3000', '-train_split', 'train', '-eval_split', 'trainval', '-loss_weight_match=1', '-loss_weight_iouraw=1', '-finetune_after', '3', '-skip_empty_starting_frame', '1', '-random_select_frames', '1', '-model_name', 'ytb_train_x101', '-train_h', '255', '-train_w', '448', '-num_workers', '4', '-lr', '0.0001', '-lr_cnn', '0.00001', '-config_train', 'dmm/configs/train.yaml', '-batch_size=4', '-length_clip=3', '-max_epoch=2', '--resize', '-base_model', 'resnet101', '-models_root=experiments/models/', '-max_eval_iter=800', '--augment', '-ngpus', '4']' returned non-zero exit status 1

Do you have any idea please?


ZENGXH commented on June 15, 2024

I forgot to include the pretrained model download link. It is fixed in my latest commit:

wget https://imatge.upc.edu/web/sites/default/files/projects/segmentation/public_html/rvos-pretrained-models/one-shot-model-youtubevos.zip
See if it works for you.
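
After unzipping, a quick check like the following may help confirm that the files landed where train.py looks for them. This is only a sketch: the directory is the one passed to load_checkpoint in the traceback above, encoder.pt is the only file name that traceback confirms, and missing_checkpoint_files is a hypothetical helper, not part of DMM_Net:

```python
import os

# Directory that train.py passed to load_checkpoint in the traceback above.
CHECKPOINT_DIR = "../../experiments/models/one-shot-model-youtubevos/"


def missing_checkpoint_files(model_dir, required=("encoder.pt",)):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    missing = missing_checkpoint_files(CHECKPOINT_DIR)
    if missing:
        print("missing in", CHECKPOINT_DIR, ":", ", ".join(missing))
    else:
        print("checkpoint files found")
```

The relative path resolves against the current working directory, so it should be run from the same directory as train.py.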


walterpcasas commented on June 15, 2024

Thanks. I had already added what was missing and fixed the path locally. Now the whole model runs, but then it hangs for a long time. I have already left it for about 3 hours. Is this normal?

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2019-10-31 19:41:53,158-{train.py:395}-INFO-to distributed; local_rank 3
2019-10-31 19:41:53,158-{train.py:384}-INFO-[model_name] ytb_train_x101
2019-10-31 19:41:53,158-{train.py:385}-INFO-get number of gpu: 4
2019-10-31 19:41:53,159-{train.py:395}-INFO-to distributed; local_rank 0
2019-10-31 19:41:53,162-{train.py:395}-INFO-to distributed; local_rank 2
2019-10-31 19:41:53,164-{train.py:395}-INFO-to distributed; local_rank 1
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 396, in <module>
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
2019-10-31 19:46:55,205-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-31 19:46:55,212-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-31 19:47:00,949-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2019-10-31 19:47:00,953-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2019-10-31 19:47:01,006-{train.py:152}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 10, 'relax_proj_iter': 5, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2019-10-31 19:47:01,132-{trainer.py:63}-INFO-load from json_data; num vid 3000
2019-10-31 19:47:01,133-{train.py:154}-INFO-init model 6.824
2019-10-31 19:47:01,135-{train.py:161}-INFO-optimizer 0.002
2019-10-31 19:47:01,135-{train.py:163}-INFO-[enc_opt] len: 2; len for each param group: [48, 314]
2019-10-31 19:47:01,135-{train.py:165}-INFO-[dec_opt] len: 1; len for each param group: [10]
2019-10-31 19:47:01,135-{train.py:169}-INFO-init DistributedDataParallel rank 0


ZENGXH commented on June 15, 2024

It shouldn’t take that long. Could you try setting the training argument distributed to 0 and NGPUS to 1?
Also, could you try the commands mentioned in NVIDIA/vid2vid#17 (comment) to help us better localize the problem?
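
In the logs above, torch.cuda.set_device(local_rank) fails with "invalid device ordinal" for ranks 1 to 3 while rank 0 continues. That would be consistent with fewer GPUs being visible than processes launched, although the thread does not confirm it. If so, rank 0 may simply be waiting for peers that have already exited. A minimal sketch of a pre-flight check, where check_local_rank is a hypothetical helper and the single-GPU count is an assumption:

```python
def check_local_rank(local_rank, visible_gpus):
    """Raise a readable error if local_rank does not name a visible GPU."""
    if visible_gpus < 1:
        raise RuntimeError("no CUDA device is visible to this process")
    if not 0 <= local_rank < visible_gpus:
        raise RuntimeError(
            "local_rank {} needs GPU index {}, but only {} GPU(s) are visible; "
            "launch at most {} process(es)".format(
                local_rank, local_rank, visible_gpus, visible_gpus))
    return local_rank


if __name__ == "__main__":
    # In a real run, visible_gpus would come from torch.cuda.device_count().
    # A machine with one visible GPU is assumed here for illustration only.
    for rank in range(4):
        try:
            check_local_rank(rank, visible_gpus=1)
            print("rank", rank, "ok")
        except RuntimeError as err:
            print("rank", rank, "->", err)
```

Calling such a check before torch.cuda.set_device would turn the CUDA runtime error into a message that names the mismatch directly.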

