yoxu515 / aot-benchmark

An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch

License: BSD 3-Clause "New" or "Revised" License

Python 99.66% Shell 0.34%

aot-benchmark's Issues

How to train with 2 GPUs?

Thanks for your excellent work!

I want to re-train AOT, but I only have two 3090s, which cannot support the large batch size used in the paper. I would need to halve the batch size, or even reduce it to 6, since a 3090 has less memory than a V100. Do you have any suggestions on the hyperparameters if I reduce the batch size? For example, should I also halve the training learning rate and double the number of iterations?

BTW, do you think reducing batch size will reduce the performance of AOT?
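
One common heuristic for this situation (not something the authors have confirmed for AOT) is the linear scaling rule: scale the learning rate in proportion to the batch size and stretch the schedule so the model sees the same number of samples. A minimal sketch, using the values TRAIN_BATCH_SIZE = 16, TRAIN_LR = 2e-4 and TRAIN_TOTAL_STEPS = 100000 from a config quoted in another issue on this page; the helper name is hypothetical:

```python
def scale_hyperparams(base_lr, base_batch, base_steps, new_batch):
    """Linear scaling heuristic: lr follows the batch size, and the step
    count grows so that the total number of samples seen stays constant."""
    ratio = new_batch / base_batch
    new_lr = base_lr * ratio
    new_steps = round(base_steps / ratio)
    return new_lr, new_steps


# Halving the batch size from 16 to 8 (assumed baseline values).
lr, steps = scale_hyperparams(base_lr=2e-4, base_batch=16,
                              base_steps=100000, new_batch=8)
print(lr, steps)
```

Whether this preserves the reported accuracy is an open question; it only gives a starting point for tuning.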

The size of the image input to Swin Transformer

I notice that the Swin Transformer uses pre-trained parameters. Is the size of the images fed into the Swin Transformer also 224*224?

Train speed very slow

I am using the original code from GitHub and training on four A100 GPUs with stage = 'ytb', but training takes far too long (for example, 200 steps in one hour). With 100000 steps in total, training cannot finish in the 0.6 days stated in the README. Is this caused by using the original code directly? Should I change some settings in the code?
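
To put the numbers from this report side by side, the arithmetic can be sketched as follows. The function names are hypothetical; the inputs (200 steps per hour, 100000 steps, 0.6 days) are the figures quoted in the issue:

```python
def eta_days(total_steps, steps_per_hour):
    """Days needed to finish training at the observed throughput."""
    hours = total_steps / steps_per_hour
    return hours / 24


def required_steps_per_hour(total_steps, target_days):
    """Throughput needed to finish within the target number of days."""
    return total_steps / (target_days * 24)


observed = eta_days(total_steps=100000, steps_per_hour=200)
needed = required_steps_per_hour(total_steps=100000, target_days=0.6)
print(f"{observed:.1f} days at the observed speed")
print(f"{needed:.0f} steps/hour needed for the README figure")
```

At the observed speed the run would take roughly three weeks, so the gap is about a factor of 35 rather than a small configuration difference, which suggests a bottleneck such as data loading rather than the model itself (an assumption, not confirmed in this thread).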

Cannot train model youtubeVOS. Bug in attention.py?

Hi, I'm trying to train the model (I have not done any pre-training) on only youtubeVOS with
python tools/train.py --amp --exp_name default --stage ytb --gpu_num 2 --model aott.

It loads the dataset

and it loads the pretrained backbone model from ./pretrain_models/mobilenet_v2-b0353104.pth successfully.
But then when it starts the training it crashes with the following output:

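How can I train stage 2 directly without training stage 1? Can the static model you provide (AOTT_PRE.pth) be used as the pre-trained model?

Is it possible to skip training the first stage, use the model provided by the project as the pre-trained model, and train the second stage directly?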

How to increase the efficiency of propagation modules?

Hey @z-x-yang,

awesome repository, thank you! :)

The DeAOTT model is performing really well, so I was wondering if we could further trim down the propagation/LSTT modules as they take most of the runtime and memory?

Do you think the network would require retraining, or could we just prune?

It seems the -T models reduce to

aot-benchmark/configs/models/default_deaot.py

Lines 14 to 15 in f9a62b6

    
    self.MODEL_SELF_HEADS = 1
    self.MODEL_ATT_HEADS = 1

aot-benchmark/networks/layers/transformer.py

Lines 138 to 154 in f9a62b6

    
    class DualBranchGPM(nn.Module):
        def __init__(self,
                     num_layers=2,
                     d_model=256,
                     self_nhead=8,
                     att_nhead=8,
                     dim_feedforward=1024,
                     emb_dropout=0.,
                     droppath=0.1,
                     lt_dropout=0.,
                     st_dropout=0.,
                     droppath_lst=False,
                     droppath_scaling=False,
                     activation="gelu",
                     return_intermediate=False,
                     intermediate_norm=True,
                     final_norm=True):

Which parameters would you tweak to make the modules even more efficient without losing too much performance?

Thanks a lot!

About Deployment

I'm very sorry to bother you. Do you have any plans to deploy this model, or do you have any suggestions for deployment?

Mask size compared to image size affects tracking

Hi,

I have images where the masks are quite small and "round-ish" compared to the total size of the image itself (approx 400 small masks per image and still a decent amount of background).
When I try to use demo.py on the original image and masks, the tracking doesn't work too well.

However, when I crop the original image so that the size of the masks is much larger compared to the total size of the image (approx 10 large masks and some background), the tracking works much better.

Is there a way to change that "mask scaling factor" so that it works on my original images?

Thanks,
Renaud

Demo failed

Hello , thank you for making this repo public. I am trying to run the demo, which raises the following error:

Object number: 44. Inference size: 577x1041. Output size: 1080x1920.
/opt/miniconda/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
./networks/layers/position.py:63: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = self.temperature**(2 * (dim_t // 2) / self.num_pos_feats)
Traceback (most recent call last):
  File "tools/demo.py", line 289, in <module>
    main()
  File "tools/demo.py", line 285, in main
    demo(cfg)
  File "tools/demo.py", line 204, in demo
    obj_nums=obj_nums)
  File "./networks/engines/aot_engine.py", line 586, in add_reference_frame
    img_embs=img_embs)
  File "./networks/engines/aot_engine.py", line 239, in add_reference_frame
    size_2d=self.enc_size_2d)
  File "./networks/models/aot.py", line 105, in LSTT_forward
    pos_emb, size_2d)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./networks/layers/transformer.py", line 100, in forward
    size_2d=size_2d)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./networks/layers/transformer.py", line 224, in forward
    tgt3 = self.short_term_attn(local_Q, local_K, local_V)[0]
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./networks/layers/attention.py", line 471, in forward
    output = agg_value + agg_bias
RuntimeError: The size of tensor a (32) must match the size of tensor b (256) at non-singleton dimension 3.
Can you help fix this error?
Thanks!
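
The RuntimeError above is a broadcasting failure: two tensors are added whose sizes differ in a dimension where neither size is 1. The rule can be sketched in pure Python as below. The helper name and the example shapes are hypothetical; only the sizes 32 and 256 at dimension 3 come from the message:

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError naming the
    lowest-index dimension where the two shapes clash."""
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s, as broadcasting does.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    out = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)
        else:
            raise ValueError(
                f"size of a ({x}) must match size of b ({y}) at dimension {dim}")
    return tuple(out)


try:
    # Hypothetical shapes; only the last dimension mirrors the report.
    broadcast_shape((1, 8, 900, 32), (1, 8, 900, 256))
except ValueError as err:
    print(err)
```

Note that 32 = 256 / 8, which would be consistent with one operand being split across 8 attention heads while the other is not, although nothing in this thread confirms the cause.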

No checkpoint in ./results/result/default_AOTT/PRE/ckpt

Hello, thanks a lot for your work!
I'm trying to evaluate on DAVIS17. After running the command python eval.py --exp_name default --stage pre --model aott --dataset davis2017 --split val --gpu_num 1, it returns No checkpoint in ./results/result/default_AOTT/PRE/ckpt. I've prepared everything as the README says; could you please tell me the reason? Thank you very much!
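
The directory in the message matches the way a config quoted in another issue on this page builds DIR_CKPT from the experiment name, model name, and stage name. A minimal sketch of that path logic follows; the function name is hypothetical, and the upper-case stage name "PRE" is taken from the error message rather than from the code:

```python
import os


def checkpoint_dir(exp_name, model_name, stage_name, root="./results"):
    """Mirror the quoted config: EXP_NAME -> DIR_RESULT -> DIR_CKPT."""
    exp = exp_name + "_" + model_name                            # EXP_NAME
    result_dir = os.path.join(root, "result", exp, stage_name)   # DIR_RESULT
    return os.path.join(result_dir, "ckpt")                      # DIR_CKPT


print(checkpoint_dir("default", "AOTT", "PRE"))
```

If this reading is right, evaluation looks for checkpoints produced by a local training run with the same experiment name and stage. A downloaded checkpoint would presumably have to be passed explicitly, for example through the --ckpt_path flag that appears in another issue on this page.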

AOT/DeAOT perfs on MOSE dataset

Just in case you are interested:
https://henghuiding.github.io/MOSE/

--ema in model_zoo eval

Is --ema required on evaluation of model zoo checkpoints?

Can I pass in more annotated masks for model to use?

Thanks for this wonderful work!
I put well-annotated masks for some key frames of the video in the masks directory, but I found that the model still only uses the first frame's mask to propagate.
Can the model make use of more masks (such as sparsely annotated masks for key frames) to propagate through the entire video?
Thanks a lot~

Tensorboard log on VAL

It would be nice to also have TensorBoard logging on the DAVIS and YouTube-VOS validation sets.

Evaluate with more sparse supervision

I was trying to eval some of these models with more sparse supervision e.g. 1 every N frames.

So I've simply tried to add more reference frames at:

https://github.com/yoxu515/aot-benchmark/blob/1c3a5ec51d81f3e17ff9092aa1e830206d766132/networks/managers/evaluator.py#L316:L324

It seems to work, but it has an impact on memory, with OOM events occurring earlier.

Have you experimented with this setup? Do I need to change anything else other than adding reference frames?
What could the relationship be with the long-term memory gap setting?
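
The "1 every N frames" schedule described above can be written down as a small planning helper. This is only a sketch with hypothetical function names; it does not call the evaluator and says nothing about how the engine stores the extra frames:

```python
def reference_frame_indices(num_frames, every_n):
    """Indices of frames that receive a ground-truth mask (frame 0 included)."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return list(range(0, num_frames, every_n))


def plan_steps(num_frames, every_n):
    """Label each frame as a reference frame or a propagated frame."""
    refs = set(reference_frame_indices(num_frames, every_n))
    return ["reference" if i in refs else "propagate"
            for i in range(num_frames)]


print(reference_frame_indices(10, 4))
print(plan_steps(5, 2))
```

Since every added reference frame appears to be written to the long-term memory, the number of stored frames grows with the number of annotated frames, which would be one plausible reason for the earlier OOM events.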

Shape mismatch when training

Steps taken:

Cloned main branch today.
Downloaded ECSSD and MSRA10K.
Set up repository with PyTorch 1.8.
Launched bash train_eval.sh.

Right on the first training step, the following error is raised:

 File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/aot-benchmark/tools/train.py", line 18, in main_worker
    trainer.sequential_training()
  File "./networks/managers/trainer.py", line 473, in sequential_training
    use_prev_prob=use_prev_prob)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./networks/engines/aot_engine.py", line 52, in forward
    self.add_reference_frame(frame_step=0, obj_nums=obj_nums)
  File "./networks/engines/aot_engine.py", line 239, in add_reference_frame
    size_2d=self.enc_size_2d)
  File "./networks/models/aot.py", line 105, in LSTT_forward
    pos_emb, size_2d)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./networks/layers/transformer.py", line 113, in forward
    size_2d=size_2d)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./networks/layers/transformer.py", line 352, in forward
    tgt3 = self.short_term_attn(local_Q, local_K, local_V)[0]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./networks/layers/attention.py", line 525, in forward
    output = agg_value + agg_bias
RuntimeError: The size of tensor a (32) must match the size of tensor b (256) at non-singleton dimension 3

I did not alter any of the configuration values. I noticed that #2 had the same issue, but no solution was provided there. Any help would be much appreciated!

Thanks in advance

cuda memory issue for multi-scale inference

Hi, thanks for your excellent work.
I tried to run multi-scale ([1.0, 1.3]) inference in tools/eval.py on a single A100 (40G) and ran out of memory.
Which scales do you use for multi-scale inference? Is there any way to reduce memory usage during inference?
Thanks!
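
A rough way to see why the 1.3 scale is expensive is to count feature-map positions. The sketch below assumes a 480x854 input and a feature stride of 16; both are assumptions for illustration, not values confirmed in this thread:

```python
import math


def tokens(height, width, scale=1.0, stride=16):
    """Number of feature-map positions for a resized frame."""
    h = round(height * scale)
    w = round(width * scale)
    return math.ceil(h / stride) * math.ceil(w / stride)


base = tokens(480, 854, scale=1.0)
large = tokens(480, 854, scale=1.3)

# Attention between a query frame and one memory frame scales with the
# product of their token counts, so the ratio is squared.
token_ratio = large / base
attention_ratio = token_ratio ** 2
print(base, large, round(token_ratio, 2), round(attention_ratio, 2))
```

Under these assumptions the larger scale has about 1.7 times as many tokens and close to 3 times the attention cost per memory frame, so lowering the largest scale or the maximum input size would be the first things to try.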

How to run only main training without pre-training

I would like to train a model purely on ytVOS and DAVIS, i.e. I would like a stage ytb_dav.
Is this possible?

Could you please release the Swin-AOT checkpoint that only trained on DAVIS&YTB or YTB?

Hi,

I want to use a Swin-AOT model that is trained only on DAVIS&YTB or YTB, i.e. without pre-training on static image datasets. However, I only have one GPU right now, and a single 3090 supports fewer than 4 samples per batch, which affects the convergence of the model. Would it be convenient for you to release this checkpoint (Swin-AOT trained only on DAVIS&YTB or YTB, without static-image pre-training) as a reference for later work?

Thanks.

What's the difference between the result of evaluation on Davis17 in the paper and ModelZoo?

Hello! I find the J&F mean of AOTT(Y) on Davis17 validation set in Table1(b) in the paper is 78.2, while the J&F mean of AOTT in ModelZoo is 79.2, could you please tell me the difference?

The modules that work on tiny objects

Hello !

Thanks for sharing your great work!

In section 6.1, the paper mentioned the R50-DeAOT-L performs better than R50-AOT-L on tiny or scale-changing objects. I would like to know which module is beneficial to tiny or scale-changing objects.

Looking forward to your kind reply !

cfg.DIST_ENABLE false fail

With cfg.DIST_ENABLE set to False, the run fails: the distributed-specific parts are not wrapped in a condition that checks whether distributed mode is enabled.

Demo is not available

How to use demo.py in this version?

About the training time.

Thanks for your great job!
As stated on the main page, each of the two training stages takes about 0.6 days. However, I trained the AOTT model on DAVIS only with the default config, and it took almost 1.5 days. I wonder whether I made a mistake in the training process.
My environment:
pytorch: 1.7.1
cuda: 10.2
GPU: 4 Tesla V100

The config:
import os
import importlib

class DefaultEngineConfig():
  def __init__(self, exp_name='default', model='AOTT'):
    model_cfg = importlib.import_module('configs.models.' +
                                        model).ModelConfig()
    self.__dict__.update(model_cfg.__dict__)  # add model config

    self.EXP_NAME = exp_name + '_' + self.MODEL_NAME

    self.STAGE_NAME = 'default'

    self.DATASETS = ['youtubevos']
    self.DATA_WORKERS = 12
    self.DATA_RANDOMCROP = (465,
                            465) if self.MODEL_ALIGN_CORNERS else (464,
                                                                   464)
    self.DATA_RANDOMFLIP = 0.5
    self.DATA_MAX_CROP_STEPS = 10
    self.DATA_SHORT_EDGE_LEN = 480
    self.DATA_MIN_SCALE_FACTOR = 0.7
    self.DATA_MAX_SCALE_FACTOR = 1.3
    self.DATA_RANDOM_REVERSE_SEQ = True
    self.DATA_SEQ_LEN = 5
    self.DATA_DAVIS_REPEAT = 5
    self.DATA_RANDOM_GAP_DAVIS = 12  # max frame interval between two sampled frames for DAVIS (24fps)
    self.DATA_RANDOM_GAP_YTB = 3  # max frame interval between two sampled frames for YouTube-VOS (6fps)
    self.DATA_DYNAMIC_MERGE_PROB = 0.3

    self.PRETRAIN = True
    self.PRETRAIN_FULL = False  # if False, load encoder only
    self.PRETRAIN_MODEL = ''

    self.TRAIN_TOTAL_STEPS = 100000
    self.TRAIN_START_STEP = 0
    self.TRAIN_WEIGHT_DECAY = 0.07
    self.TRAIN_WEIGHT_DECAY_EXCLUSIVE = {
        # 'encoder.': 0.01
    }
    self.TRAIN_WEIGHT_DECAY_EXEMPTION = [
        'absolute_pos_embed', 'relative_position_bias_table',
        'relative_emb_v', 'conv_out'
    ]
    self.TRAIN_LR = 2e-4
    self.TRAIN_LR_MIN = 2e-5 if 'mobilenetv2' in self.MODEL_ENCODER else 1e-5
    self.TRAIN_LR_POWER = 0.9
    self.TRAIN_LR_ENCODER_RATIO = 0.1
    self.TRAIN_LR_WARM_UP_RATIO = 0.05
    self.TRAIN_LR_COSINE_DECAY = False
    self.TRAIN_LR_RESTART = 1
    self.TRAIN_LR_UPDATE_STEP = 1
    self.TRAIN_AUX_LOSS_WEIGHT = 1.0
    self.TRAIN_AUX_LOSS_RATIO = 1.0
    self.TRAIN_OPT = 'adamw'
    self.TRAIN_SGD_MOMENTUM = 0.9
    self.TRAIN_GPUS = 4
    self.TRAIN_BATCH_SIZE = 16
    self.TRAIN_TBLOG = False
    self.TRAIN_TBLOG_STEP = 50
    self.TRAIN_LOG_STEP = 20
    self.TRAIN_IMG_LOG = True
    self.TRAIN_TOP_K_PERCENT_PIXELS = 0.15
    self.TRAIN_SEQ_TRAINING_FREEZE_PARAMS = ['patch_wise_id_bank']
    self.TRAIN_SEQ_TRAINING_START_RATIO = 0.5
    self.TRAIN_HARD_MINING_RATIO = 0.5
    self.TRAIN_EMA_RATIO = 0.1
    self.TRAIN_CLIP_GRAD_NORM = 5.
    self.TRAIN_SAVE_STEP = 1000
    self.TRAIN_MAX_KEEP_CKPT = 8
    self.TRAIN_RESUME = False
    self.TRAIN_RESUME_CKPT = None
    self.TRAIN_RESUME_STEP = 0
    self.TRAIN_AUTO_RESUME = True
    self.TRAIN_DATASET_FULL_RESOLUTION = False
    self.TRAIN_ENABLE_PREV_FRAME = False
    self.TRAIN_ENCODER_FREEZE_AT = 2
    self.TRAIN_LSTT_EMB_DROPOUT = 0.
    self.TRAIN_LSTT_ID_DROPOUT = 0.
    self.TRAIN_LSTT_DROPPATH = 0.1
    self.TRAIN_LSTT_DROPPATH_SCALING = False
    self.TRAIN_LSTT_DROPPATH_LST = False
    self.TRAIN_LSTT_LT_DROPOUT = 0.
    self.TRAIN_LSTT_ST_DROPOUT = 0.

    self.TEST_GPU_ID = 0
    self.TEST_GPU_NUM = 1
    self.TEST_FRAME_LOG = False
    self.TEST_DATASET = 'youtubevos'
    self.TEST_DATASET_FULL_RESOLUTION = False
    self.TEST_DATASET_SPLIT = 'val'
    self.TEST_CKPT_PATH = None
    # if "None", evaluate the latest checkpoint.
    self.TEST_CKPT_STEP = None
    self.TEST_FLIP = False
    self.TEST_MULTISCALE = [1]
    self.TEST_MIN_SIZE = None
    self.TEST_MAX_SIZE = 800 * 1.3
    self.TEST_WORKERS = 4

    # GPU distribution
    self.DIST_ENABLE = True
    self.DIST_BACKEND = "nccl"  # "gloo"
    self.DIST_URL = "tcp://127.0.0.1:13241"
    self.DIST_START_GPU = 0

  def init_dir(self):
    self.DIR_DATA = './datasets'
    self.DIR_DAVIS = os.path.join(self.DIR_DATA, 'DAVIS')
    self.DIR_YTB = os.path.join(self.DIR_DATA, 'YTB')
    self.DIR_STATIC = os.path.join(self.DIR_DATA, 'Static')

    self.DIR_ROOT = './results'

    self.DIR_RESULT = os.path.join(self.DIR_ROOT, 'result', self.EXP_NAME,
                                   self.STAGE_NAME)
    self.DIR_CKPT = os.path.join(self.DIR_RESULT, 'ckpt')
    self.DIR_EMA_CKPT = os.path.join(self.DIR_RESULT, 'ema_ckpt')
    self.DIR_LOG = os.path.join(self.DIR_RESULT, 'log')
    self.DIR_TB_LOG = os.path.join(self.DIR_RESULT, 'log', 'tensorboard')
    self.DIR_IMG_LOG = os.path.join(self.DIR_RESULT, 'log', 'img')
    self.DIR_EVALUATION = os.path.join(self.DIR_RESULT, 'eval')

    for path in [
            self.DIR_RESULT, self.DIR_CKPT, self.DIR_EMA_CKPT,
            self.DIR_LOG, self.DIR_EVALUATION, self.DIR_IMG_LOG,
            self.DIR_TB_LOG
    ]:
        if not os.path.isdir(path):
            try:
                os.makedirs(path)
            except Exception as inst:
                print(inst)
                print('Failed to make dir: {}.'.format(path))

Request of pre-computed results

Great work and great code! Could you provide some pre-computed results on YouTube and DAVIS? Thanks!

About the LSTT module

When I read the source code of networks/engines/aot_engine.py, I found that only the value was updated when the memory was updated in the update_short_term_memory method. Is there any consideration for not updating the key here?

Does AOTL use multiple frames in long term memory during the training phase?

Hello! Does AOTL use multiple frames in long term memory during the training phase? Or AOTL just uses the first frame in long term memory to train and uses multiple frames for evaluation?

Problems about main-train ytb

PyTorch 1.8
torchvision 0.9.0
CUDA 10.1

When training the ytb stage, it reports this error.

Here is the full trace:

(torch18) cwc@imc-Z9PE-D8-WS:~/aot-benchmark-main/tools$ python train.py
Exp _AOTT:
{
"DATASETS": [
"youtubevos"
],
"DATA_DAVIS_REPEAT": 5,
"DATA_DYNAMIC_MERGE_PROB": 0.3,
"DATA_MAX_CROP_STEPS": 10,
"DATA_MAX_SCALE_FACTOR": 1.3,
"DATA_MIN_SCALE_FACTOR": 0.7,
"DATA_RANDOMCROP": [
465,
465
],
"DATA_RANDOMFLIP": 0.5,
"DATA_RANDOM_GAP_DAVIS": 12,
"DATA_RANDOM_GAP_YTB": 3,
"DATA_RANDOM_REVERSE_SEQ": true,
"DATA_SEQ_LEN": 5,
"DATA_SHORT_EDGE_LEN": 480,
"DATA_WORKERS": 8,
"DIR_CKPT": "./results/result/_AOTT/YTB/ckpt",
"DIR_DAVIS": "/DATACENTER/1/ysl/Datasets/DAVIS/2017",
"DIR_EMA_CKPT": "./results/result/_AOTT/YTB/ema_ckpt",
"DIR_EVALUATION": "./results/result/_AOTT/YTB/eval",
"DIR_IMG_LOG": "./results/result/_AOTT/YTB/log/img",
"DIR_LOG": "./results/result/_AOTT/YTB/log",
"DIR_RESULT": "./results/result/_AOTT/YTB",
"DIR_ROOT": "./results",
"DIR_STATIC": "/DATACENTER/1/Datasets/static",
"DIR_TB_LOG": "./results/result/_AOTT/YTB/log/tensorboard",
"DIR_YTB": "/DATACENTER/1/ysl/Datasets/YoutubeVOS",
"DIST_BACKEND": "nccl",
"DIST_ENABLE": true,
"DIST_START_GPU": 1,
"DIST_URL": "tcp://127.0.0.1:12311",
"EXP_NAME": "_AOTT",
"MODEL_ALIGN_CORNERS": true,
"MODEL_ATT_HEADS": 8,
"MODEL_DECODER_INTERMEDIATE_LSTT": true,
"MODEL_ENCODER": "mobilenetv2",
"MODEL_ENCODER_DIM": [
24,
32,
96,
1280
],
"MODEL_ENCODER_EMBEDDING_DIM": 256,
"MODEL_ENCODER_PRETRAIN": "/home/cwc/aot-benchmark-main/pretrain_models/mobilenet_v2-b0353104.pth",
"MODEL_ENGINE": "aotengine",
"MODEL_EPSILON": 1e-05,
"MODEL_FREEZE_BACKBONE": false,
"MODEL_FREEZE_BN": true,
"MODEL_LSTT_NUM": 1,
"MODEL_MAX_OBJ_NUM": 10,
"MODEL_NAME": "AOTT",
"MODEL_SELF_HEADS": 8,
"MODEL_USE_PREV_PROB": false,
"MODEL_VOS": "aot",
"PRETRAIN": true,
"PRETRAIN_FULL": false,
"PRETRAIN_MODEL": "",
"STAGE_NAME": "YTB",
"TEST_CKPT_PATH": null,
"TEST_CKPT_STEP": null,
"TEST_DATASET": "youtubevos",
"TEST_DATASET_FULL_RESOLUTION": false,
"TEST_DATASET_SPLIT": "val",
"TEST_FLIP": false,
"TEST_FRAME_LOG": false,
"TEST_GPU_ID": 1,
"TEST_GPU_NUM": 1,
"TEST_LONG_TERM_MEM_GAP": 9999,
"TEST_MAX_SIZE": 1040.0,
"TEST_MIN_SIZE": null,
"TEST_MULTISCALE": [
1
],
"TEST_WORKERS": 4,
"TRAIN_AUTO_RESUME": true,
"TRAIN_AUX_LOSS_RATIO": 1.0,
"TRAIN_AUX_LOSS_WEIGHT": 1.0,
"TRAIN_BATCH_SIZE": 4,
"TRAIN_CLIP_GRAD_NORM": 5.0,
"TRAIN_DATASET_FULL_RESOLUTION": false,
"TRAIN_EMA_RATIO": 0.1,
"TRAIN_ENABLE_PREV_FRAME": false,
"TRAIN_ENCODER_FREEZE_AT": 2,
"TRAIN_GPUS": 2,
"TRAIN_HARD_MINING_RATIO": 0.5,
"TRAIN_IMG_LOG": true,
"TRAIN_LOG_STEP": 20,
"TRAIN_LONG_TERM_MEM_GAP": 9999,
"TRAIN_LR": 0.0002,
"TRAIN_LR_COSINE_DECAY": false,
"TRAIN_LR_ENCODER_RATIO": 0.1,
"TRAIN_LR_MIN": 2e-05,
"TRAIN_LR_POWER": 0.9,
"TRAIN_LR_RESTART": 1,
"TRAIN_LR_UPDATE_STEP": 1,
"TRAIN_LR_WARM_UP_RATIO": 0.05,
"TRAIN_LSTT_DROPPATH": 0.1,
"TRAIN_LSTT_DROPPATH_LST": false,
"TRAIN_LSTT_DROPPATH_SCALING": false,
"TRAIN_LSTT_EMB_DROPOUT": 0.0,
"TRAIN_LSTT_ID_DROPOUT": 0.0,
"TRAIN_LSTT_LT_DROPOUT": 0.0,
"TRAIN_LSTT_ST_DROPOUT": 0.0,
"TRAIN_MAX_KEEP_CKPT": 8,
"TRAIN_OPT": "adamw",
"TRAIN_RESUME": false,
"TRAIN_RESUME_CKPT": null,
"TRAIN_RESUME_STEP": 0,
"TRAIN_SAVE_STEP": 1000,
"TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [
"patch_wise_id_bank"
],
"TRAIN_SEQ_TRAINING_START_RATIO": 0.5,
"TRAIN_SGD_MOMENTUM": 0.9,
"TRAIN_START_STEP": 0,
"TRAIN_TBLOG": false,
"TRAIN_TBLOG_STEP": 50,
"TRAIN_TOP_K_PERCENT_PIXELS": 0.15,
"TRAIN_TOTAL_STEPS": 100000,
"TRAIN_WEIGHT_DECAY": 0.07,
"TRAIN_WEIGHT_DECAY_EXCLUSIVE": {},
"TRAIN_WEIGHT_DECAY_EXEMPTION": [
"absolute_pos_embed",
"relative_position_bias_table",
"relative_emb_v",
"conv_out"
]
}
Use GPU 1 for training VOS.
Build VOS model.
Use GPU 2 for training VOS.
Use Frozen BN in Encoder!
Total Param: 5.73M
Build optimizer.
Total Param: 5.73M
Process dataset...
Short object: 721bb6f2cb-3
Short object: 721bb6f2cb-3
Short object: d177e9878a-2
Short object: d177e9878a-3
Short object: d177e9878a-2
Short object: d177e9878a-3
Short object: f36483c824-2
Short object: f9bd1fabf5-4
Short object: f36483c824-2
Video Num: 3471 X 1
Done!
Short object: f9bd1fabf5-4
Video Num: 3471 X 1
Remove ['features.0.1.num_batches_tracked', 'features.1.conv.0.1.num_batches_tracked', 'features.1.conv.2.num_batches_tracked', 'features.2.conv.0.1.num_batches_tracked', 'features.2.conv.1.1.num_batches_tracked', 'features.2.conv.3.num_batches_tracked', 'features.3.conv.0.1.num_batches_tracked', 'features.3.conv.1.1.num_batches_tracked', 'features.3.conv.3.num_batches_tracked', 'features.4.conv.0.1.num_batches_tracked', 'features.4.conv.1.1.num_batches_tracked', 'features.4.conv.3.num_batches_tracked', 'features.5.conv.0.1.num_batches_tracked', 'features.5.conv.1.1.num_batches_tracked', 'features.5.conv.3.num_batches_tracked', 'features.6.conv.0.1.num_batches_tracked', 'features.6.conv.1.1.num_batches_tracked', 'features.6.conv.3.num_batches_tracked', 'features.7.conv.0.1.num_batches_tracked', 'features.7.conv.1.1.num_batches_tracked', 'features.7.conv.3.num_batches_tracked', 'features.8.conv.0.1.num_batches_tracked', 'features.8.conv.1.1.num_batches_tracked', 'features.8.conv.3.num_batches_tracked', 'features.9.conv.0.1.num_batches_tracked', 'features.9.conv.1.1.num_batches_tracked', 'features.9.conv.3.num_batches_tracked', 'features.10.conv.0.1.num_batches_tracked', 'features.10.conv.1.1.num_batches_tracked', 'features.10.conv.3.num_batches_tracked', 'features.11.conv.0.1.num_batches_tracked', 'features.11.conv.1.1.num_batches_tracked', 'features.11.conv.3.num_batches_tracked', 'features.12.conv.0.1.num_batches_tracked', 'features.12.conv.1.1.num_batches_tracked', 'features.12.conv.3.num_batches_tracked', 'features.13.conv.0.1.num_batches_tracked', 'features.13.conv.1.1.num_batches_tracked', 'features.13.conv.3.num_batches_tracked', 'features.14.conv.0.1.num_batches_tracked', 'features.14.conv.1.1.num_batches_tracked', 'features.14.conv.3.num_batches_tracked', 'features.15.conv.0.1.num_batches_tracked', 'features.15.conv.1.1.num_batches_tracked', 'features.15.conv.3.num_batches_tracked', 'features.16.conv.0.1.num_batches_tracked', 
'features.16.conv.1.1.num_batches_tracked', 'features.16.conv.3.num_batches_tracked', 'features.17.conv.0.1.num_batches_tracked', 'features.17.conv.1.1.num_batches_tracked', 'features.17.conv.3.num_batches_tracked', 'features.18.1.num_batches_tracked', 'classifier.1.weight', 'classifier.1.bias'] from pretrained model.
Load pretrained backbone model from .
Start training:
step------------------------------ 0
step------------------------------ 0
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "train.py", line 80, in <module>
    main()
  File "train.py", line 76, in main
    mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp))
  File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/cwc/aot-benchmark-main/tools/train.py", line 18, in main_worker
    trainer.sequential_training()
  File "/home/cwc/aot-benchmark-main/tools/../networks/managers/trainer.py", line 456, in sequential_training
    loss.backward()
  File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [900, 2, 256]], which is output 0 of AddBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

No module named spatial_correlation_Sampler

Hi @z-x-yang ,
The README mentions that the demo script will run even without the spatial correlation sampler. However, it is imported in attention.py, and the error is as follows:

Build AOT model.
Traceback (most recent call last):
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/tools/demo.py", line 286, in <module>
    main()
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/tools/demo.py", line 282, in main
    demo(cfg)
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/tools/demo.py", line 107, in demo
    model = build_vos_model(cfg.MODEL_VOS, cfg).cuda(gpu_id)
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/models/__init__.py", line 7, in build_vos_model
    return AOT(cfg, encoder=cfg.MODEL_ENCODER, **kwargs)
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/models/aot.py", line 23, in __init__
    self.LSTT = LongShortTermTransformer(
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/layers/transformer.py", line 75, in __init__
    block(d_model, self_nhead, att_nhead, dim_feedforward,
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/layers/transformer.py", line 279, in __init__
    self.short_term_attn = MultiheadLocalAttention(d_model,
  File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/layers/attention.py", line 282, in __init__
    from spatial_correlation_sampler import SpatialCorrelationSampler
ModuleNotFoundError: No module named 'spatial_correlation_sampler'
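
A general pattern for making a dependency like this optional is to guard the import and fall back when it is missing. The sketch below shows the guard only; the helper name and the HAS_SAMPLER flag are hypothetical, and it is not the repository's actual fix:

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


sampler_module = optional_import("spatial_correlation_sampler")
HAS_SAMPLER = sampler_module is not None

if HAS_SAMPLER:
    print("Using the compiled correlation sampler.")
else:
    print("Sampler not installed; a pure-PyTorch fallback would be needed here.")
```

ModuleNotFoundError is a subclass of ImportError, so the guard also covers the exact error shown in the traceback.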

Disentangle long term memory update and reference frame

Can we disentangle long term memory update and reference frame?

aot-benchmark/networks/engines/aot_engine.py

Lines 243 to 246 in f9a62b6

    
    if self.long_term_memories is None:
        self.long_term_memories = lstt_long_memories
    else:
        self.update_long_term_memory(lstt_long_memories)

I think we could add an extra parameter to the function signature that controls whether the long-term memory is updated.

Currently it is hard to make multiple evaluation passes with different semi-supervision, because the same function is also required for preparing curr_enc_embs.
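
The proposal can be illustrated with a toy stand-in for the engine. Everything here is hypothetical (the class, the update_long_term parameter name, and the list-based memory); it only models the bookkeeping from the snippet above, not the real tensors:

```python
class ToyEngine:
    """Stand-in for the real engine; only the memory bookkeeping is modelled."""

    def __init__(self):
        self.long_term_memories = None

    def update_long_term_memory(self, new_memories):
        # The real method concatenates tensors; lists are used here.
        self.long_term_memories = self.long_term_memories + new_memories

    def add_reference_frame(self, lstt_long_memories, update_long_term=True):
        # Hypothetical extra parameter: skip the memory write when False.
        if not update_long_term:
            return
        if self.long_term_memories is None:
            self.long_term_memories = lstt_long_memories
        else:
            self.update_long_term_memory(lstt_long_memories)


engine = ToyEngine()
engine.add_reference_frame(["frame0"])
engine.add_reference_frame(["frame5"], update_long_term=False)
engine.add_reference_frame(["frame9"])
print(engine.long_term_memories)
```

Defaulting the parameter to True keeps the current behaviour for existing callers.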

Environment file

Hi, @yoxu515,

Thank you so much for making this code repository publicly available. Would you mind also exporting the environment and sharing it in this repository?

CUDA out of memory

Hello!
I used python eval.py --stage pre_dav --model r50_aotl --ckpt_path ./pretrain_models/R50_AOTL_PRE_YTB_DAV.pth to evaluate my own dataset (in DAVIS format, 480p), and it ran into a CUDA out-of-memory error.

Could you give me some suggestions for how to reduce the memory?

Thank you!

VOT2020 Performance

Hi authors, I notice that you have mentioned the results on VOT2020 benchmark. Is the model the same as the model in your paper and which model from the model zoo do you use to evaluate it?

How can read_label convert an RGB mask image to a one-channel ID mask?

Hi, thanks for this wonderful work!
I would like to ask about this function in demo.py:

    def read_label(self, label_name, squeeze_idx=None):
        label_path = os.path.join(self.label_root, self.seq_name, label_name)
        label = Image.open(label_path)
        label = np.array(label, dtype=np.uint8)
        if self.single_obj:
            label = (label > 0).astype(np.uint8)
        elif squeeze_idx is not None:
            squeezed_label = label * 0
            for idx in range(len(squeeze_idx)):
                obj_id = squeeze_idx[idx]
                if obj_id == 0:
                    continue
                mask = label == obj_id
                squeezed_label += (mask * idx).astype(np.uint8)
            label = squeezed_label
        return label

Why can the RGB mask, after going through the code below,

label = Image.open(label_path)
label = np.array(label, dtype=np.uint8)

be converted to a one-channel ID mask? How should I encode a mask image to achieve this effect? Thanks!
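
One likely explanation, assuming the mask is stored as a palette-mode ("P") PNG as DAVIS annotations are: the file holds one index per pixel plus a colour table, so it only looks RGB in an image viewer. The sketch below builds such a file in memory and reads it back the same way `read_label` does; the array contents and the palette colours are made up for illustration.

```python
import io

import numpy as np
from PIL import Image

# A tiny ID mask: 0 = background, 1 and 2 = two objects.
ids = np.zeros((4, 6), dtype=np.uint8)
ids[1:3, 1:3] = 1
ids[2:4, 4:6] = 2

# Store the IDs as palette indices; size is (width, height).
img = Image.frombytes("P", (6, 4), ids.tobytes())
palette = [0, 0, 0, 128, 0, 0, 0, 128, 0]  # colours for IDs 0, 1, 2
img.putpalette(palette + [0] * (768 - len(palette)))

buf = io.BytesIO()
img.save(buf, format="PNG")
buf.seek(0)

loaded = Image.open(buf)
label = np.array(loaded, dtype=np.uint8)   # palette indices, shape (H, W)
rgb = np.array(loaded.convert("RGB"))      # displayed colours, shape (H, W, 3)
print(loaded.mode, label.shape, rgb.shape)
```

A mask saved as a true three-channel RGB image would instead load as an (H, W, 3) array, so it would first need its colours mapped to IDs and re-saved in palette mode.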

Question about Identity Assignment

Hi, Dr. Yang:
Identity assignment seems very useful.

Although I have read the paper and the code, I still can't understand how it works.
It seems AOT uses only a very simple convolution to produce an id_emb.

aot-benchmark/networks/models/aot.py

Line 76 in 5a7665f

def get_id_emb(self, x):

Then this id_emb is added in

aot-benchmark/networks/layers/transformer.py

Line 216 in 5a7665f

global_V = self.linear_V(curr_V + curr_id_emb)

This directly fixes the relationship between channels and objects.

This is very interesting. Could you please offer more details and proofs about it?

Thank you.

The mask of first frame

I would like to ask: if the target to be segmented does not appear in the mask of the first frame, can it still be segmented correctly in subsequent frames?

SAM encoder/decoder

As you are working on your new project, it could be interesting to add the SAM encoder/decoder here so that it can be fine-tuned like the other encoders/decoders in the repo.

What's the meaning of "squeeze_idx" in eval_datasets.py?

Hello!
While reading the code in eval_datasets.py, I have a question about the function "read_label" of class VOSTest.
Could you please tell me the meaning of "squeeze_idx"? When debugging the code, I found that "getitem" passes "obj_idx" as "squeeze_idx" to "read_label", but there seems to be no difference between the input label and "squeezed_label". So what is the effect of "squeeze_idx", and under what circumstances is it used?
Thank you!
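
A small worked example may help here. It copies the `squeeze_idx` branch of the `read_label` quoted earlier on this page from demo.py, on the assumption that the version in eval_datasets.py behaves the same way; the helper name `squeeze_label` and the sample arrays are made up for illustration.

```python
import numpy as np


def squeeze_label(label, squeeze_idx):
    # Remap each object ID to its position in squeeze_idx.
    squeezed_label = label * 0
    for idx in range(len(squeeze_idx)):
        obj_id = squeeze_idx[idx]
        if obj_id == 0:
            continue
        mask = label == obj_id
        squeezed_label += (mask * idx).astype(np.uint8)
    return squeezed_label


# Non-contiguous IDs 3 and 7 become 1 and 2.
sparse = np.array([[0, 3], [7, 3]], dtype=np.uint8)
print(squeeze_label(sparse, [0, 3, 7]))  # [[0 1] [2 1]]

# IDs that are already 0..N map to themselves, so nothing changes.
dense = np.array([[0, 1], [2, 1]], dtype=np.uint8)
print(squeeze_label(dense, [0, 1, 2]))   # [[0 1] [2 1]]
```

The second case matches the observation in the question: when the object IDs are already contiguous, the input and the squeezed label are identical.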

How to use the MS-AOT to test custom dataset

Hello !

Thanks for sharing your work! How do I use MS-AOT to evaluate custom tracking datasets and output the prediction masks?
Could you give me some suggestions ?

Thank you !

F score over the boundaries

It would be nice to have the F score for the boundaries, in addition to the IoU, in utils/metrics.

prev_labels of trainer.process_log() may be wrong?

Hi, thanks for your excellent work.

But I am confused about why the prev_labels of trainer.process_log() come from all_pred[-2].
Shouldn't they be the ground truth (GT)?

aot-benchmark/networks/managers/trainer.py

Line 474 in 5a7665f

self.process_log(ref_imgs, all_f[-2], all_f[-1],

Thanks!

Local attention formula

I notice that you use a conv2d (applied to q) for relative_emb_k. Should it be a fixed embedding, like relative_emb_v?

Questions about inference FPS

Thanks for making code available!
I have a question from testing the pretrained model. I only get about 29 FPS when testing the AOTS PRE_YTB_DAV pretrained model on DAVIS2017, whereas the paper reports 40 FPS. The test J & F-mean, however, matches the result posted in model_zoo, which is 0.820575.

I did not modify the default test config in aots.py, apart from directories such as the dataset paths. Do I need to modify something in train_eval.sh?

My device:
2 x Tesla V100 SXM2 32GB Driver Version: 450.51.06 CUDA Version: 11.0
pytorch==1.7.0
torchvision==0.8.1
spatial-correlation-sampler == 0.3.0

Exp alldataset_AOTS:
{
    "DATASETS": [
        "youtubevos",
        "davis2017"
    ],
    "DATA_DAVIS_REPEAT": 5,
    "DATA_DYNAMIC_MERGE_PROB": 0.3,
    "DATA_MAX_CROP_STEPS": 10,
    "DATA_MAX_SCALE_FACTOR": 1.3,
    "DATA_MIN_SCALE_FACTOR": 0.7,
    "DATA_RANDOMCROP": [
        465,
        465
    ],
    "DATA_RANDOMFLIP": 0.5,
    "DATA_RANDOM_GAP_DAVIS": 12,
    "DATA_RANDOM_GAP_YTB": 3,
    "DATA_RANDOM_REVERSE_SEQ": true,
    "DATA_SEQ_LEN": 5,
    "DATA_SHORT_EDGE_LEN": 480,
    "DATA_WORKERS": 8,
    "DIR_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ckpt",
    "DIR_DATA": "./datasets",
    "DIR_DAVIS": "/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval",
    "DIR_EMA_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ema_ckpt",
    "DIR_EVALUATION": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/eval",
    "DIR_IMG_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/img",
    "DIR_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log",
    "DIR_RESULT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV",
    "DIR_ROOT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022",
    "DIR_STATIC": "/yexin/vos_related_source/vos_exper_dataset/unify_pretrain_dataset",
    "DIR_TB_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/tensorboard",
    "DIR_YTB": "/yexin/vos_related_source/vos_exper_dataset/dataset/Youtube",
    "DIST_BACKEND": "nccl",
    "DIST_ENABLE": true,
    "DIST_START_GPU": 0,
    "DIST_URL": "tcp://127.0.0.1:13241",
    "EXP_NAME": "alldataset_AOTS",
    "MODEL_ALIGN_CORNERS": true,
    "MODEL_ATT_HEADS": 8,
    "MODEL_DECODER_INTERMEDIATE_LSTT": true,
    "MODEL_ENCODER": "mobilenetv2",
    "MODEL_ENCODER_DIM": [
        24,
        32,
        96,
        1280
    ],
    "MODEL_ENCODER_EMBEDDING_DIM": 256,
    "MODEL_ENCODER_PRETRAIN": "./pretrain_models/mobilenet_v2-b0353104.pth",
    "MODEL_ENGINE": "aotengine",
    "MODEL_EPSILON": 1e-05,
    "MODEL_FREEZE_BACKBONE": false,
    "MODEL_FREEZE_BN": true,
    "MODEL_LSTT_NUM": 2,
    "MODEL_MAX_OBJ_NUM": 10,
    "MODEL_NAME": "AOTS",
    "MODEL_SELF_HEADS": 8,
    "MODEL_USE_PREV_PROB": false,
    "MODEL_VOS": "aot",
    "PRETRAIN": true,
    "PRETRAIN_FULL": true,
    "PRETRAIN_MODEL": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE/ema_ckpt/save_step_100000.pth",
    "STAGE_NAME": "PRE_YTB_DAV",
    "TEST_CKPT_PATH": "./AOTS_PRE_YTB_DAV.pth",
    "TEST_CKPT_STEP": null,
    "TEST_DATASET": "davis2017",
    "TEST_DATASET_FULL_RESOLUTION": false,
    "TEST_DATASET_SPLIT": "val",
    "TEST_EMA": true,
    "TEST_FLIP": false,
    "TEST_FRAME_LOG": false,
    "TEST_GPU_ID": 0,
    "TEST_GPU_NUM": 2,
    "TEST_LONG_TERM_MEM_GAP": 9999,
    "TEST_MAX_SIZE": 1040.0,
    "TEST_MIN_SIZE": null,
    "TEST_MULTISCALE": [
        1.0
    ],
    "TEST_WORKERS": 4,
    "TRAIN_AUTO_RESUME": true,
    "TRAIN_AUX_LOSS_RATIO": 1.0,
    "TRAIN_AUX_LOSS_WEIGHT": 1.0,
    "TRAIN_BATCH_SIZE": 16,
    "TRAIN_CLIP_GRAD_NORM": 5.0,
    "TRAIN_DATASET_FULL_RESOLUTION": false,
    "TRAIN_EMA_RATIO": 0.1,
    "TRAIN_ENABLE_PREV_FRAME": false,
    "TRAIN_ENCODER_FREEZE_AT": 2,
    "TRAIN_GPUS": 4,
    "TRAIN_HARD_MINING_RATIO": 0.5,
    "TRAIN_IMG_LOG": true,
    "TRAIN_LOG_STEP": 50,
    "TRAIN_LONG_TERM_MEM_GAP": 9999,
    "TRAIN_LR": 0.0002,
    "TRAIN_LR_COSINE_DECAY": false,
    "TRAIN_LR_ENCODER_RATIO": 0.1,
    "TRAIN_LR_MIN": 2e-05,
    "TRAIN_LR_POWER": 0.9,
    "TRAIN_LR_RESTART": 1,
    "TRAIN_LR_UPDATE_STEP": 1,
    "TRAIN_LR_WARM_UP_RATIO": 0.05,
    "TRAIN_LSTT_DROPPATH": 0.1,
    "TRAIN_LSTT_DROPPATH_LST": false,
    "TRAIN_LSTT_DROPPATH_SCALING": false,
    "TRAIN_LSTT_EMB_DROPOUT": 0.0,
    "TRAIN_LSTT_ID_DROPOUT": 0.0,
    "TRAIN_LSTT_LT_DROPOUT": 0.0,
    "TRAIN_LSTT_ST_DROPOUT": 0.0,
    "TRAIN_MAX_KEEP_CKPT": 8,
    "TRAIN_OPT": "adamw",
    "TRAIN_RESUME": false,
    "TRAIN_RESUME_CKPT": null,
    "TRAIN_RESUME_STEP": 0,
    "TRAIN_SAVE_STEP": 1000,
    "TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [
        "patch_wise_id_bank"
    ],
    "TRAIN_SEQ_TRAINING_START_RATIO": 0.5,
    "TRAIN_SGD_MOMENTUM": 0.9,
    "TRAIN_START_STEP": 0,
    "TRAIN_TBLOG": true,
    "TRAIN_TBLOG_STEP": 50,
    "TRAIN_TOP_K_PERCENT_PIXELS": 0.15,
    "TRAIN_TOTAL_STEPS": 100000,
    "TRAIN_WEIGHT_DECAY": 0.07,
    "TRAIN_WEIGHT_DECAY_EXCLUSIVE": {},
    "TRAIN_WEIGHT_DECAY_EXEMPTION": [
        "absolute_pos_embed",
        "relative_position_bias_table",
        "relative_emb_v",
        "conv_out"
    ]
}
Use GPU 0 for evaluating.
Use GPU 1 for evaluating.
Build VOS model.
Load checkpoint from ./AOTS_PRE_YTB_DAV.pth
Process dataset...
/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
Eval alldataset_AOTS on davis2017 val:
Done!
/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
GPU 0 - Processing Seq bike-packing [1/30]:
GPU 1 - Processing Seq blackswan [2/30]:
GPU 1 - Seq blackswan - FPS: 29.29. All-Frame FPS: 29.29, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 1 - Processing Seq breakdance [4/30]:
GPU 0 - Seq bike-packing - FPS: 28.99. All-Frame FPS: 28.99, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 0 - Processing Seq bmx-trees [3/30]:
GPU 1 - Seq breakdance - FPS: 29.57. All-Frame FPS: 29.46, All-Seq FPS: 29.43, Max Mem: 0.53G
GPU 1 - Processing Seq camel [5/30]:
GPU 0 - Seq bmx-trees - FPS: 29.16. All-Frame FPS: 29.08, All-Seq FPS: 29.08, Max Mem: 0.58G
GPU 0 - Processing Seq car-roundabout [6/30]:
GPU 1 - Seq camel - FPS: 30.71. All-Frame FPS: 29.95, All-Seq FPS: 29.84, Max Mem: 0.53G
GPU 1 - Processing Seq car-shadow [7/30]:
GPU 0 - Seq car-roundabout - FPS: 29.29. All-Frame FPS: 29.15, All-Seq FPS: 29.15, Max Mem: 0.58G
GPU 0 - Processing Seq cows [8/30]:
GPU 1 - Seq car-shadow - FPS: 30.62. All-Frame FPS: 30.05, All-Seq FPS: 30.03, Max Mem: 0.53G
GPU 1 - Processing Seq dance-twirl [9/30]:
GPU 0 - Seq cows - FPS: 27.67. All-Frame FPS: 28.67, All-Seq FPS: 28.76, Max Mem: 0.58G
GPU 0 - Processing Seq dog [10/30]:
GPU 1 - Seq dance-twirl - FPS: 25.66. All-Frame FPS: 28.80, All-Seq FPS: 29.04, Max Mem: 0.53G
GPU 1 - Processing Seq dogs-jump [11/30]:
GPU 0 - Seq dog - FPS: 28.28. All-Frame FPS: 28.60, All-Seq FPS: 28.67, Max Mem: 0.58G
GPU 0 - Processing Seq drift-chicane [12/30]:
GPU 1 - Seq dogs-jump - FPS: 27.15. All-Frame FPS: 28.52, All-Seq FPS: 28.71, Max Mem: 0.53G
GPU 1 - Processing Seq drift-straight [13/30]:
GPU 0 - Seq drift-chicane - FPS: 28.43. All-Frame FPS: 28.58, All-Seq FPS: 28.63, Max Mem: 0.58G
GPU 0 - Processing Seq goat [14/30]:
GPU 1 - Seq drift-straight - FPS: 29.75. All-Frame FPS: 28.65, All-Seq FPS: 28.85, Max Mem: 0.53G
GPU 1 - Processing Seq gold-fish [15/30]:
GPU 0 - Seq goat - FPS: 27.72. All-Frame FPS: 28.43, All-Seq FPS: 28.49, Max Mem: 0.58G
GPU 0 - Processing Seq horsejump-high [16/30]:
GPU 1 - Seq gold-fish - FPS: 28.61. All-Frame FPS: 28.64, All-Seq FPS: 28.82, Max Mem: 0.53G
GPU 1 - Processing Seq india [17/30]:
GPU 0 - Seq horsejump-high - FPS: 28.93. All-Frame FPS: 28.47, All-Seq FPS: 28.55, Max Mem: 0.58G
GPU 0 - Processing Seq judo [18/30]:
GPU 0 - Seq judo - FPS: 31.24. All-Frame FPS: 28.61, All-Seq FPS: 28.82, Max Mem: 0.58G
GPU 0 - Processing Seq lab-coat [20/30]:
GPU 1 - Seq india - FPS: 28.42. All-Frame FPS: 28.61, All-Seq FPS: 28.78, Max Mem: 0.53G
GPU 1 - Processing Seq kite-surf [19/30]:
GPU 0 - Seq lab-coat - FPS: 29.81. All-Frame FPS: 28.69, All-Seq FPS: 28.92, Max Mem: 0.58G
GPU 0 - Processing Seq libby [21/30]:
GPU 1 - Seq kite-surf - FPS: 30.69. All-Frame FPS: 28.76, All-Seq FPS: 28.96, Max Mem: 0.53G
GPU 1 - Processing Seq loading [22/30]:
GPU 0 - Seq libby - FPS: 31.08. All-Frame FPS: 28.85, All-Seq FPS: 29.10, Max Mem: 0.58G
GPU 0 - Processing Seq mbike-trick [23/30]:
GPU 1 - Seq loading - FPS: 31.06. All-Frame FPS: 28.90, All-Seq FPS: 29.14, Max Mem: 0.53G
GPU 1 - Processing Seq motocross-jump [24/30]:
GPU 1 - Seq motocross-jump - FPS: 31.09. All-Frame FPS: 29.01, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 1 - Processing Seq parkour [26/30]:
GPU 0 - Seq mbike-trick - FPS: 27.85. All-Frame FPS: 28.74, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 0 - Processing Seq paragliding-launch [25/30]:
GPU 1 - Seq parkour - FPS: 28.09. All-Frame FPS: 28.90, All-Seq FPS: 29.20, Max Mem: 0.53G
GPU 1 - Processing Seq pigs [27/30]:
GPU 0 - Seq paragliding-launch - FPS: 29.78. All-Frame FPS: 28.84, All-Seq FPS: 29.05, Max Mem: 0.58G
GPU 0 - Processing Seq scooter-black [28/30]:
GPU 0 - Seq scooter-black - FPS: 30.13. All-Frame FPS: 28.89, All-Seq FPS: 29.13, Max Mem: 0.58G
GPU 0 - Processing Seq soapbox [30/30]:
GPU 1 - Seq pigs - FPS: 30.28. All-Frame FPS: 29.01, All-Seq FPS: 29.27, Max Mem: 0.53G
GPU 1 - Processing Seq shooting [29/30]:
GPU 1 - Seq shooting - FPS: 28.25. All-Frame FPS: 28.98, All-Seq FPS: 29.20, Max Mem: 0.65G
Finished the evaluation on GPU 1.
GPU 0 - Seq soapbox - FPS: 29.63. All-Frame FPS: 28.96, All-Seq FPS: 29.16, Max Mem: 0.58G
Finished the evaluation on GPU 0.
GPU [0, 1] - All-Frame FPS: 28.97, All-Seq FPS: 29.18, Max Mem: 0.65G
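
To compare runs like the one above sequence by sequence, the per-sequence lines can be parsed with a short script. This is only a sketch: the regular expression is written against the log format shown here, the helper name `per_sequence_fps` is made up, and the sample text is three lines copied from the log.

```python
import re

LOG = """\
GPU 1 - Seq blackswan - FPS: 29.29. All-Frame FPS: 29.29, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 0 - Seq bike-packing - FPS: 28.99. All-Frame FPS: 28.99, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 1 - Processing Seq breakdance [4/30]:
"""

# Matches result lines only; "Processing Seq" lines do not fit the pattern.
PATTERN = re.compile(r"GPU (\d+) - Seq (\S+) - FPS: (\d+\.\d+)\.")


def per_sequence_fps(log_text):
    """Return {sequence name: FPS} for every result line in the log."""
    return {m.group(2): float(m.group(3)) for m in PATTERN.finditer(log_text)}


fps = per_sequence_fps(LOG)
mean_fps = sum(fps.values()) / len(fps)
print(fps)
print(round(mean_fps, 2))
```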

Are you training with COCO?

Hello!
When I use COCO as one of the static datasets for the pre-training stage, I find that it reduces the accuracy of the pre-trained model on DAVIS17. The pre-trained model trained for 20000 iterations on MSRA10K, PASCAL-S, PASCAL-VOC, and ECSSD achieves nearly 59% IoU on DAVIS17-val, but after adding COCO, even with 100000 iterations of training, its IoU is only 48%.
What do you think might have caused this? Are COCO's annotations themselves not particularly accurate?
And are you training with COCO?
Thank you very much!

How to eval DAVIS 2016

Thanks for your work!
But I am confused about how to evaluate on DAVIS 2016.
When I use the DAVIS toolkit (for Val) mentioned in the README, it returns an error saying that I don't have db_info.yaml, which is used by db_read_info() at line 81 of python/lib/davis/misc/config.py.

Thanks for your reply!

A question about α in the loss formula of AOST

Hello!
In your new model AOST, you use "α" in the loss formula. Are all α_l equal to each other? The paper says: "To balance the gradient contribution of sub-networks, we have to increase the weight of deeper subnetworks." I take this to mean that when L=3, α_1 < α_2 < α_3. But the default setting is α=2, and I am very confused about which α_l is set to 2.
Thank you!

DeAOTL predictions

Where are the DeAOTL predictions?

0.5px/1px E/SE misalignment

Have you noticed in your experiment runs a 0.5px/1px misalignment bias in the right/bottom-right direction?
I have observed this with both the aligned-corners and non-aligned-corners models that you have used (e.g. R50/Swin DeAOTL).
As this kind of error is very hard to debug, I would like to know whether you have experienced something like this on your side.

Thanks.

Custom datasets

I see an empty section in the README on training/fine-tuning on a custom dataset.

Do you have any news to share on this topic?

About the implementation of multi-head attention in DeAOT

Hello, I have a question after reading your great work DeAOT.
In the ablation study on the number of heads, you compare multi-head and single-head attention in DeAOT. The common implementation of multi-head attention reshapes the Query (shape HW×batch_size×C, taking the Query as an example) by splitting its channel dimension C into num_head groups of C/num_head, so the Query becomes HW×batch_size×num_head×(C/num_head). This implementation keeps the computational complexity the same as in the single-head case.
However, the ablation study shows that multi-head attention significantly reduces speed. So I would like to know how multi-head attention is implemented in DeAOT. Is it the approach I described above?
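
The reshape described above can be sketched in NumPy as follows. This illustrates the common scheme only and is not taken from DeAOT's code; `attention_scores` and the sizes are made up. It shows that the multiply-add count of the Q·K product does not depend on `num_head`, while the number of attention-map entries (and therefore the softmax and memory traffic) grows linearly with it.

```python
import numpy as np


def attention_scores(query, key, num_head):
    """query, key: (HW, batch, C) -> scaled scores: (batch, num_head, HW, HW)."""
    hw, batch, c = query.shape
    head_dim = c // num_head
    # Split the channel dimension into num_head groups of size head_dim.
    q = query.reshape(hw, batch, num_head, head_dim).transpose(1, 2, 0, 3)
    k = key.reshape(hw, batch, num_head, head_dim).transpose(1, 2, 3, 0)
    return (q @ k) / np.sqrt(head_dim)


rng = np.random.default_rng(0)
hw, batch, c = 16, 2, 32
query = rng.standard_normal((hw, batch, c))
key = rng.standard_normal((hw, batch, c))

for num_head in (1, 8):
    scores = attention_scores(query, key, num_head)
    # Multiply-adds for Q.K^T: batch * num_head * HW * HW * (C / num_head).
    macs = batch * num_head * hw * hw * (c // num_head)
    print(num_head, scores.shape, macs, scores.size)
```

Whether this attention-map overhead accounts for the slowdown reported in the ablation is not something the sketch can confirm.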

    class DualBranchGPM(nn.Module):
        def __init__(self,
                     num_layers=2,
                     d_model=256,
                     self_nhead=8,
                     att_nhead=8,
                     dim_feedforward=1024,
                     emb_dropout=0.,
                     droppath=0.1,
                     lt_dropout=0.,
                     st_dropout=0.,
                     droppath_lst=False,
                     droppath_scaling=False,
                     activation="gelu",
                     return_intermediate=False,
                     intermediate_norm=True,
                     final_norm=True):

    if self.long_term_memories is None:
        self.long_term_memories = lstt_long_memories
    else:
        self.update_long_term_memory(lstt_long_memories)