yoxu515 / aot-benchmark
An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch
License: BSD 3-Clause "New" or "Revised" License
Thanks for your excellent work!
I want to re-train AOT. However, I only have two 3090s, which cannot support the large batch size used in the paper. I would need to halve the batch size or even reduce it to 6, since the 3090 has less memory than the V100. Do you have any suggestions on the hyperparameters if I reduce the batch size, e.g., should I also halve the training lr and double the iterations?
BTW, do you think reducing batch size will reduce the performance of AOT?
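One common heuristic for this situation is the linear scaling rule: scale the learning rate in proportion to the batch size and lengthen the schedule so the total number of training samples stays the same. Whether this preserves AOT's accuracy is not confirmed anywhere in this thread. The sketch below is only an illustration; the reference values (lr 2e-4, batch size 16, 100000 steps) are taken from the default config posted further down this page, and `scale_schedule` is a hypothetical helper, not part of the repository.

```python
def scale_schedule(base_lr, base_batch, base_steps, new_batch):
    """Linear scaling heuristic: lr follows the batch size, steps grow inversely."""
    ratio = new_batch / base_batch
    new_lr = base_lr * ratio               # smaller batch -> smaller lr
    new_steps = round(base_steps / ratio)  # keep the number of seen samples constant
    return new_lr, new_steps


# Reference point assumed from the default config: lr=2e-4, batch=16, steps=100000.
lr, steps = scale_schedule(2e-4, 16, 100000, new_batch=8)
print(lr, steps)
```

With a batch size of 8 this gives half the learning rate and twice the steps, which matches the adjustment proposed in the question.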
I notice that Swin Transformer uses pre-training parameters, and I would like to ask you whether the image size input into Swin Transformer is also 224*224?
I use the original code from GitHub and train on four A100 GPUs with stage 'ytb', but training takes far too long (for example, 200 steps take one hour). With 100000 steps in total, training cannot finish in the 0.6 days stated in the README. Is this caused by using the original code directly? Should I change some settings in the code?
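To make the gap concrete, the arithmetic behind this report can be written out. The sketch uses only the numbers quoted above (200 steps per hour, 100000 steps, 0.6 days); `estimated_days` is a hypothetical helper.

```python
def estimated_days(total_steps, steps_per_hour):
    """Wall-clock days needed at a constant throughput."""
    return total_steps / steps_per_hour / 24


# Observed throughput from the report: 200 steps per hour.
observed_days = estimated_days(100000, 200)

# Throughput that would be needed to finish 100000 steps in 0.6 days.
required_steps_per_hour = 100000 / (0.6 * 24)

print(round(observed_days, 1), round(required_steps_per_hour))
```

At the observed rate the run would take roughly three weeks, so the slowdown is about a factor of 35 relative to the stated 0.6 days, which suggests an environment or data-loading bottleneck rather than a small config difference.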
Hi, I'm trying to train the model (I have not done any pre-training) on YouTube-VOS only, with
python tools/train.py --amp --exp_name default --stage ytb --gpu_num 2 --model aott
It loads the dataset, and it loads the pretrained backbone model from ./pretrain_models/mobilenet_v2-b0353104.pth successfully.
But when training starts, it crashes with the following output:
Hey @z-x-yang,
awesome repository, thank you! :)
The DeAOTT model is performing really well, so I was wondering if we could further trim down the propagation/LSTT modules as they take most of the runtime and memory?
Do you think the network would require retraining, or could we just prune?
It seems the -T models reduce to
aot-benchmark/configs/models/default_deaot.py
Lines 14 to 15 in f9a62b6
aot-benchmark/networks/layers/transformer.py
Lines 138 to 154 in f9a62b6
Which parameters would you tweak to make the modules even more efficient without losing too much performance?
Thanks a lot!
I'm very sorry to bother you. Do you have any plans to deploy this model, or do you have any suggestions for deployment?
Hi,
I have images where the masks are quite small and "round-ish" compared to the total size of the image itself (approx 400 small masks per image and still a decent amount of background).
When I try to use demo.py on the original image and masks, the tracking doesn't work too well.
However, when I crop the original image so that the size of the masks is much larger compared to the total size of the image (approx 10 large masks and some background), the tracking works much better.
Is there a way to change that "mask scaling factor" so that it works on my original images?
Thanks,
Renaud
Hello , thank you for making this repo public. I am trying to run the demo, which raises the following error:
Object number: 44. Inference size: 577x1041. Output size: 1080x1920.
/opt/miniconda/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
./networks/layers/position.py:63: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
dim_t = self.temperature**(2 * (dim_t // 2) / self.num_pos_feats)
Traceback (most recent call last):
File "tools/demo.py", line 289, in <module>
main()
File "tools/demo.py", line 285, in main
demo(cfg)
File "tools/demo.py", line 204, in demo
obj_nums=obj_nums)
File "./networks/engines/aot_engine.py", line 586, in add_reference_frame
img_embs=img_embs)
File "./networks/engines/aot_engine.py", line 239, in add_reference_frame
size_2d=self.enc_size_2d)
File "./networks/models/aot.py", line 105, in LSTT_forward
pos_emb, size_2d)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "./networks/layers/transformer.py", line 100, in forward
size_2d=size_2d)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "./networks/layers/transformer.py", line 224, in forward
tgt3 = self.short_term_attn(local_Q, local_K, local_V)[0]
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "./networks/layers/attention.py", line 471, in forward
output = agg_value + agg_bias
RuntimeError: The size of tensor a (32) must match the size of tensor b (256) at non-singleton dimension 3.
Can you help fix this error?
Thanks!
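For readers unfamiliar with this message: it comes from the broadcasting check that runs when two tensors are added. Shapes are compared from the last dimension backwards, and two sizes are compatible only if they are equal or one of them is 1. The pure-Python sketch below re-states that rule; `broadcast_shape` is a hypothetical helper, and the example shapes are invented for illustration because the real shapes of agg_value and agg_bias do not appear in the log.

```python
def broadcast_shape(a, b):
    """Return the broadcast result of two shapes, or raise ValueError on a mismatch."""
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    result = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"size {x} must match size {y} at non-singleton dimension {dim}")
    return tuple(result)


# Invented shapes: the last dimension differs (32 vs 256), as in the report.
try:
    broadcast_shape((1, 8, 900, 32), (1, 8, 900, 256))
except ValueError as err:
    print(err)
```

Dimension 3 is zero-based, so in a 4-D tensor it is the last axis; printing the shapes of both operands just before the failing line is usually the quickest way to see which one is unexpected.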
Hello, thanks a lot for your work!
I'm trying to evaluate on DAVIS17. After I run the command
python eval.py --exp_name default --stage pre --model aott --dataset davis2017 --split val --gpu_num 1
it returns No checkpoint in ./results/result/default_AOTT/PRE/ckpt. I've prepared everything as the README says. Could you please tell me the reason? Thank you very much!
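The path in that message can be reproduced from the directory layout in the default config posted in another issue on this page, where DIR_RESULT is built from DIR_ROOT, 'result', EXP_NAME and STAGE_NAME, and EXP_NAME is exp_name + '_' + MODEL_NAME. The sketch below is only an illustration under that assumption; `ckpt_dir` is a hypothetical helper, not a function in the repository.

```python
import os


def ckpt_dir(exp_name, model_name, stage_name, root='./results'):
    """Rebuild the checkpoint folder the way the posted default config does."""
    full_exp_name = exp_name + '_' + model_name
    return os.path.join(root, 'result', full_exp_name, stage_name, 'ckpt')


path = ckpt_dir('default', 'AOTT', 'PRE')
# Show the folder and whether any checkpoint files are actually in it.
print(path, os.listdir(path) if os.path.isdir(path) else 'missing')
```

If the folder is missing or empty, no checkpoint was trained or copied for that experiment and stage. The same config also exposes TEST_CKPT_PATH, with a comment saying the latest checkpoint is evaluated when no specific one is given, so pointing that setting at a downloaded checkpoint file may be the intended route for model zoo weights.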
Just in the case you are interested:
https://henghuiding.github.io/MOSE/
Is --ema required for evaluating the model zoo checkpoints?
Thanks for this wonderful work!
I put some well-annotated masks for several key frames of the video in the masks directory, but I found that the model still uses only the first frame's mask for propagation.
Can the model make use of more masks (such as sparsely annotated masks for key frames) to propagate through the entire video?
Thanks a lot~
It would be nice to also have TensorBoard and log output for the DAVIS and YouTube validation sets.
I was trying to evaluate some of these models with sparser supervision, e.g. one annotated frame every N frames.
So I simply tried to add more reference frames at:
It seems to work, but it has an impact on memory, with OOM events occurring earlier.
Have you experimented with this setup? Do I need to change anything else other than adding reference frames?
What could the relationship be with the long-term memory gap setting?
Steps taken:
ECSSD and MSRA10K
bash train_eval.sh
Right on the first training step, the following error is raised:
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/aot-benchmark/tools/train.py", line 18, in main_worker
trainer.sequential_training()
File "./networks/managers/trainer.py", line 473, in sequential_training
use_prev_prob=use_prev_prob)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "./networks/engines/aot_engine.py", line 52, in forward
self.add_reference_frame(frame_step=0, obj_nums=obj_nums)
File "./networks/engines/aot_engine.py", line 239, in add_reference_frame
size_2d=self.enc_size_2d)
File "./networks/models/aot.py", line 105, in LSTT_forward
pos_emb, size_2d)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "./networks/layers/transformer.py", line 113, in forward
size_2d=size_2d)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "./networks/layers/transformer.py", line 352, in forward
tgt3 = self.short_term_attn(local_Q, local_K, local_V)[0]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "./networks/layers/attention.py", line 525, in forward
output = agg_value + agg_bias
RuntimeError: The size of tensor a (32) must match the size of tensor b (256) at non-singleton dimension 3
I did not alter any of the configuration values. I noticed that #2 had the same issue, but no solution was provided there. Any help would be much appreciated!
Thanks in advance
Hi, thanks for your excellent work.
I tried to run multi-scale ([1.0, 1.3]) inference with tools/eval.py on a single A100 (40G) and ran out of memory.
Which scales do you use for multi-scale inference? And is there any way to reduce memory usage during inference?
Thanks!
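A rough way to see why the 1.3 scale is so much more expensive: the number of feature-map positions grows with the image area, and any dense attention map grows with the square of that. The sketch below is back-of-the-envelope arithmetic only. It assumes memory is dominated by per-position activations or by a dense attention map, which was not measured on this repository, and `relative_cost` is a hypothetical helper.

```python
def relative_cost(scale):
    """Cost relative to scale 1.0 under two simple growth assumptions."""
    positions = scale ** 2              # feature-map positions grow with image area
    dense_attention = positions ** 2    # a dense attention map grows with positions squared
    return positions, dense_attention


for s in (1.0, 1.3):
    positions, attention = relative_cost(s)
    print(f"scale {s}: positions x{positions:.2f}, dense attention x{attention:.2f}")
```

Under these assumptions, scale 1.3 costs about 1.7 times the activations and close to 2.9 times the dense attention memory of scale 1.0. The config posted on this page also lists TEST_MAX_SIZE, which caps the inference resolution and may be worth lowering.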
I would like to train a model purely on YouTube-VOS and DAVIS, i.e., I would like a stage ytb_dav.
Is this possible?
Hi,
I want to use the Swin-AOT model that is trained only on DAVIS&YTB or YTB, i.e. w/o pre-training on static image datasets. However, I found that one 3090 supports fewer than 4 samples per batch, which hurts the convergence of the model, and I only have one GPU right now. Would it be convenient for you to release this checkpoint (Swin-AOT trained only on DAVIS&YTB or YTB, i.e. w/o pre-training on static image datasets) as a reference for later work?
Thanks.
Hello! The J&F mean of AOTT(Y) on the DAVIS17 validation set in Table 1(b) of the paper is 78.2, while the J&F mean of AOTT in the Model Zoo is 79.2. Could you please explain the difference?
Hello !
Thanks for sharing your great work!
In section 6.1, the paper mentioned the R50-DeAOT-L performs better than R50-AOT-L on tiny or scale-changing objects. I would like to know which module is beneficial to tiny or scale-changing objects.
Looking forward to your kind reply !
With cfg.DIST_ENABLE set to false, the distributed-specific parts still run, because they are not wrapped in a condition that checks whether distributed mode is enabled.
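A minimal sketch of the kind of guard being asked for, assuming the config fields DIST_ENABLE, DIST_BACKEND, DIST_URL and TRAIN_GPUS from the default config posted on this page. `maybe_init_distributed` is a hypothetical helper and not code from the repository; torch is imported lazily so the non-distributed path does not touch torch.distributed at all.

```python
from types import SimpleNamespace


def maybe_init_distributed(cfg, rank):
    """Initialise the process group only when distributed mode is enabled."""
    if not cfg.DIST_ENABLE:
        return False  # single-process run: skip all distributed setup
    import torch.distributed as dist  # imported only on the distributed path
    dist.init_process_group(backend=cfg.DIST_BACKEND,
                            init_method=cfg.DIST_URL,
                            world_size=cfg.TRAIN_GPUS,
                            rank=rank)
    return True


# Stand-in config object; the real config class has many more fields.
cfg = SimpleNamespace(DIST_ENABLE=False, DIST_BACKEND='nccl',
                      DIST_URL='tcp://127.0.0.1:13241', TRAIN_GPUS=1)
print(maybe_init_distributed(cfg, rank=0))
```

The same flag would then also need to gate the other distributed-only pieces, such as wrapping the model in DistributedDataParallel and using a distributed sampler.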
How to use demo.py in this version?
Thanks for your great job!
As stated on the main page, each of the two training stages takes about 0.6 days. However, when I train the AOTT model on DAVIS only with the default config, it takes almost 1.5 days. I wonder whether I made a mistake in the training process.
My environment:
pytorch: 1.7.1
cuda: 10.2
GPU: 4 Tesla V100
The config:
import os
import importlib


class DefaultEngineConfig():
    def __init__(self, exp_name='default', model='AOTT'):
        model_cfg = importlib.import_module('configs.models.' +
                                            model).ModelConfig()
        self.__dict__.update(model_cfg.__dict__)  # add model config

        self.EXP_NAME = exp_name + '_' + self.MODEL_NAME

        self.STAGE_NAME = 'default'

        self.DATASETS = ['youtubevos']
        self.DATA_WORKERS = 12
        self.DATA_RANDOMCROP = (465, 465) if self.MODEL_ALIGN_CORNERS else (464, 464)
        self.DATA_RANDOMFLIP = 0.5
        self.DATA_MAX_CROP_STEPS = 10
        self.DATA_SHORT_EDGE_LEN = 480
        self.DATA_MIN_SCALE_FACTOR = 0.7
        self.DATA_MAX_SCALE_FACTOR = 1.3
        self.DATA_RANDOM_REVERSE_SEQ = True
        self.DATA_SEQ_LEN = 5
        self.DATA_DAVIS_REPEAT = 5
        self.DATA_RANDOM_GAP_DAVIS = 12  # max frame interval between two sampled frames for DAVIS (24fps)
        self.DATA_RANDOM_GAP_YTB = 3  # max frame interval between two sampled frames for YouTube-VOS (6fps)
        self.DATA_DYNAMIC_MERGE_PROB = 0.3

        self.PRETRAIN = True
        self.PRETRAIN_FULL = False  # if False, load encoder only
        self.PRETRAIN_MODEL = ''

        self.TRAIN_TOTAL_STEPS = 100000
        self.TRAIN_START_STEP = 0
        self.TRAIN_WEIGHT_DECAY = 0.07
        self.TRAIN_WEIGHT_DECAY_EXCLUSIVE = {
            # 'encoder.': 0.01
        }
        self.TRAIN_WEIGHT_DECAY_EXEMPTION = [
            'absolute_pos_embed', 'relative_position_bias_table',
            'relative_emb_v', 'conv_out'
        ]
        self.TRAIN_LR = 2e-4
        self.TRAIN_LR_MIN = 2e-5 if 'mobilenetv2' in self.MODEL_ENCODER else 1e-5
        self.TRAIN_LR_POWER = 0.9
        self.TRAIN_LR_ENCODER_RATIO = 0.1
        self.TRAIN_LR_WARM_UP_RATIO = 0.05
        self.TRAIN_LR_COSINE_DECAY = False
        self.TRAIN_LR_RESTART = 1
        self.TRAIN_LR_UPDATE_STEP = 1
        self.TRAIN_AUX_LOSS_WEIGHT = 1.0
        self.TRAIN_AUX_LOSS_RATIO = 1.0
        self.TRAIN_OPT = 'adamw'
        self.TRAIN_SGD_MOMENTUM = 0.9
        self.TRAIN_GPUS = 4
        self.TRAIN_BATCH_SIZE = 16
        self.TRAIN_TBLOG = False
        self.TRAIN_TBLOG_STEP = 50
        self.TRAIN_LOG_STEP = 20
        self.TRAIN_IMG_LOG = True
        self.TRAIN_TOP_K_PERCENT_PIXELS = 0.15
        self.TRAIN_SEQ_TRAINING_FREEZE_PARAMS = ['patch_wise_id_bank']
        self.TRAIN_SEQ_TRAINING_START_RATIO = 0.5
        self.TRAIN_HARD_MINING_RATIO = 0.5
        self.TRAIN_EMA_RATIO = 0.1
        self.TRAIN_CLIP_GRAD_NORM = 5.
        self.TRAIN_SAVE_STEP = 1000
        self.TRAIN_MAX_KEEP_CKPT = 8
        self.TRAIN_RESUME = False
        self.TRAIN_RESUME_CKPT = None
        self.TRAIN_RESUME_STEP = 0
        self.TRAIN_AUTO_RESUME = True
        self.TRAIN_DATASET_FULL_RESOLUTION = False
        self.TRAIN_ENABLE_PREV_FRAME = False
        self.TRAIN_ENCODER_FREEZE_AT = 2
        self.TRAIN_LSTT_EMB_DROPOUT = 0.
        self.TRAIN_LSTT_ID_DROPOUT = 0.
        self.TRAIN_LSTT_DROPPATH = 0.1
        self.TRAIN_LSTT_DROPPATH_SCALING = False
        self.TRAIN_LSTT_DROPPATH_LST = False
        self.TRAIN_LSTT_LT_DROPOUT = 0.
        self.TRAIN_LSTT_ST_DROPOUT = 0.

        self.TEST_GPU_ID = 0
        self.TEST_GPU_NUM = 1
        self.TEST_FRAME_LOG = False
        self.TEST_DATASET = 'youtubevos'
        self.TEST_DATASET_FULL_RESOLUTION = False
        self.TEST_DATASET_SPLIT = 'val'
        self.TEST_CKPT_PATH = None
        # if "None", evaluate the latest checkpoint.
        self.TEST_CKPT_STEP = None
        self.TEST_FLIP = False
        self.TEST_MULTISCALE = [1]
        self.TEST_MIN_SIZE = None
        self.TEST_MAX_SIZE = 800 * 1.3
        self.TEST_WORKERS = 4

        # GPU distribution
        self.DIST_ENABLE = True
        self.DIST_BACKEND = "nccl"  # "gloo"
        self.DIST_URL = "tcp://127.0.0.1:13241"
        self.DIST_START_GPU = 0

    def init_dir(self):
        self.DIR_DATA = './datasets'
        self.DIR_DAVIS = os.path.join(self.DIR_DATA, 'DAVIS')
        self.DIR_YTB = os.path.join(self.DIR_DATA, 'YTB')
        self.DIR_STATIC = os.path.join(self.DIR_DATA, 'Static')

        self.DIR_ROOT = './results'

        self.DIR_RESULT = os.path.join(self.DIR_ROOT, 'result', self.EXP_NAME,
                                       self.STAGE_NAME)
        self.DIR_CKPT = os.path.join(self.DIR_RESULT, 'ckpt')
        self.DIR_EMA_CKPT = os.path.join(self.DIR_RESULT, 'ema_ckpt')
        self.DIR_LOG = os.path.join(self.DIR_RESULT, 'log')
        self.DIR_TB_LOG = os.path.join(self.DIR_RESULT, 'log', 'tensorboard')
        self.DIR_IMG_LOG = os.path.join(self.DIR_RESULT, 'log', 'img')
        self.DIR_EVALUATION = os.path.join(self.DIR_RESULT, 'eval')

        for path in [
                self.DIR_RESULT, self.DIR_CKPT, self.DIR_EMA_CKPT,
                self.DIR_LOG, self.DIR_EVALUATION, self.DIR_IMG_LOG,
                self.DIR_TB_LOG
        ]:
            if not os.path.isdir(path):
                try:
                    os.makedirs(path)
                except Exception as inst:
                    print(inst)
                    print('Failed to make dir: {}.'.format(path))
Great work and great code! Could you provide some pre-computed results on YouTube and DAVIS? Thanks!
When I read the source code of networks/engines/aot_engine.py, I found that only the value was updated when the memory was updated in the update_short_term_memory method. Is there any consideration for not updating the key here?
Hello! Does AOTL use multiple frames in long term memory during the training phase? Or AOTL just uses the first frame in long term memory to train and uses multiple frames for evaluation?
Pytorch 1.8
torchvision 0.9.0
CUDA 10.1
When training on ytb, it reports this error.
Here is the full traceback:
(torch18) cwc@imc-Z9PE-D8-WS:~/aot-benchmark-main/tools$ python train.py
Exp _AOTT:
{
"DATASETS": [
"youtubevos"
],
"DATA_DAVIS_REPEAT": 5,
"DATA_DYNAMIC_MERGE_PROB": 0.3,
"DATA_MAX_CROP_STEPS": 10,
"DATA_MAX_SCALE_FACTOR": 1.3,
"DATA_MIN_SCALE_FACTOR": 0.7,
"DATA_RANDOMCROP": [
465,
465
],
"DATA_RANDOMFLIP": 0.5,
"DATA_RANDOM_GAP_DAVIS": 12,
"DATA_RANDOM_GAP_YTB": 3,
"DATA_RANDOM_REVERSE_SEQ": true,
"DATA_SEQ_LEN": 5,
"DATA_SHORT_EDGE_LEN": 480,
"DATA_WORKERS": 8,
"DIR_CKPT": "./results/result/_AOTT/YTB/ckpt",
"DIR_DAVIS": "/DATACENTER/1/ysl/Datasets/DAVIS/2017",
"DIR_EMA_CKPT": "./results/result/_AOTT/YTB/ema_ckpt",
"DIR_EVALUATION": "./results/result/_AOTT/YTB/eval",
"DIR_IMG_LOG": "./results/result/_AOTT/YTB/log/img",
"DIR_LOG": "./results/result/_AOTT/YTB/log",
"DIR_RESULT": "./results/result/_AOTT/YTB",
"DIR_ROOT": "./results",
"DIR_STATIC": "/DATACENTER/1/Datasets/static",
"DIR_TB_LOG": "./results/result/_AOTT/YTB/log/tensorboard",
"DIR_YTB": "/DATACENTER/1/ysl/Datasets/YoutubeVOS",
"DIST_BACKEND": "nccl",
"DIST_ENABLE": true,
"DIST_START_GPU": 1,
"DIST_URL": "tcp://127.0.0.1:12311",
"EXP_NAME": "_AOTT",
"MODEL_ALIGN_CORNERS": true,
"MODEL_ATT_HEADS": 8,
"MODEL_DECODER_INTERMEDIATE_LSTT": true,
"MODEL_ENCODER": "mobilenetv2",
"MODEL_ENCODER_DIM": [
24,
32,
96,
1280
],
"MODEL_ENCODER_EMBEDDING_DIM": 256,
"MODEL_ENCODER_PRETRAIN": "/home/cwc/aot-benchmark-main/pretrain_models/mobilenet_v2-b0353104.pth",
"MODEL_ENGINE": "aotengine",
"MODEL_EPSILON": 1e-05,
"MODEL_FREEZE_BACKBONE": false,
"MODEL_FREEZE_BN": true,
"MODEL_LSTT_NUM": 1,
"MODEL_MAX_OBJ_NUM": 10,
"MODEL_NAME": "AOTT",
"MODEL_SELF_HEADS": 8,
"MODEL_USE_PREV_PROB": false,
"MODEL_VOS": "aot",
"PRETRAIN": true,
"PRETRAIN_FULL": false,
"PRETRAIN_MODEL": "",
"STAGE_NAME": "YTB",
"TEST_CKPT_PATH": null,
"TEST_CKPT_STEP": null,
"TEST_DATASET": "youtubevos",
"TEST_DATASET_FULL_RESOLUTION": false,
"TEST_DATASET_SPLIT": "val",
"TEST_FLIP": false,
"TEST_FRAME_LOG": false,
"TEST_GPU_ID": 1,
"TEST_GPU_NUM": 1,
"TEST_LONG_TERM_MEM_GAP": 9999,
"TEST_MAX_SIZE": 1040.0,
"TEST_MIN_SIZE": null,
"TEST_MULTISCALE": [
1
],
"TEST_WORKERS": 4,
"TRAIN_AUTO_RESUME": true,
"TRAIN_AUX_LOSS_RATIO": 1.0,
"TRAIN_AUX_LOSS_WEIGHT": 1.0,
"TRAIN_BATCH_SIZE": 4,
"TRAIN_CLIP_GRAD_NORM": 5.0,
"TRAIN_DATASET_FULL_RESOLUTION": false,
"TRAIN_EMA_RATIO": 0.1,
"TRAIN_ENABLE_PREV_FRAME": false,
"TRAIN_ENCODER_FREEZE_AT": 2,
"TRAIN_GPUS": 2,
"TRAIN_HARD_MINING_RATIO": 0.5,
"TRAIN_IMG_LOG": true,
"TRAIN_LOG_STEP": 20,
"TRAIN_LONG_TERM_MEM_GAP": 9999,
"TRAIN_LR": 0.0002,
"TRAIN_LR_COSINE_DECAY": false,
"TRAIN_LR_ENCODER_RATIO": 0.1,
"TRAIN_LR_MIN": 2e-05,
"TRAIN_LR_POWER": 0.9,
"TRAIN_LR_RESTART": 1,
"TRAIN_LR_UPDATE_STEP": 1,
"TRAIN_LR_WARM_UP_RATIO": 0.05,
"TRAIN_LSTT_DROPPATH": 0.1,
"TRAIN_LSTT_DROPPATH_LST": false,
"TRAIN_LSTT_DROPPATH_SCALING": false,
"TRAIN_LSTT_EMB_DROPOUT": 0.0,
"TRAIN_LSTT_ID_DROPOUT": 0.0,
"TRAIN_LSTT_LT_DROPOUT": 0.0,
"TRAIN_LSTT_ST_DROPOUT": 0.0,
"TRAIN_MAX_KEEP_CKPT": 8,
"TRAIN_OPT": "adamw",
"TRAIN_RESUME": false,
"TRAIN_RESUME_CKPT": null,
"TRAIN_RESUME_STEP": 0,
"TRAIN_SAVE_STEP": 1000,
"TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [
"patch_wise_id_bank"
],
"TRAIN_SEQ_TRAINING_START_RATIO": 0.5,
"TRAIN_SGD_MOMENTUM": 0.9,
"TRAIN_START_STEP": 0,
"TRAIN_TBLOG": false,
"TRAIN_TBLOG_STEP": 50,
"TRAIN_TOP_K_PERCENT_PIXELS": 0.15,
"TRAIN_TOTAL_STEPS": 100000,
"TRAIN_WEIGHT_DECAY": 0.07,
"TRAIN_WEIGHT_DECAY_EXCLUSIVE": {},
"TRAIN_WEIGHT_DECAY_EXEMPTION": [
"absolute_pos_embed",
"relative_position_bias_table",
"relative_emb_v",
"conv_out"
]
}
Use GPU 1 for training VOS.
Build VOS model.
Use GPU 2 for training VOS.
Use Frozen BN in Encoder!
Total Param: 5.73M
Build optimizer.
Total Param: 5.73M
Process dataset...
Short object: 721bb6f2cb-3
Short object: 721bb6f2cb-3
Short object: d177e9878a-2
Short object: d177e9878a-3
Short object: d177e9878a-2
Short object: d177e9878a-3
Short object: f36483c824-2
Short object: f9bd1fabf5-4
Short object: f36483c824-2
Video Num: 3471 X 1
Done!
Short object: f9bd1fabf5-4
Video Num: 3471 X 1
Remove ['features.0.1.num_batches_tracked', 'features.1.conv.0.1.num_batches_tracked', 'features.1.conv.2.num_batches_tracked', 'features.2.conv.0.1.num_batches_tracked', 'features.2.conv.1.1.num_batches_tracked', 'features.2.conv.3.num_batches_tracked', 'features.3.conv.0.1.num_batches_tracked', 'features.3.conv.1.1.num_batches_tracked', 'features.3.conv.3.num_batches_tracked', 'features.4.conv.0.1.num_batches_tracked', 'features.4.conv.1.1.num_batches_tracked', 'features.4.conv.3.num_batches_tracked', 'features.5.conv.0.1.num_batches_tracked', 'features.5.conv.1.1.num_batches_tracked', 'features.5.conv.3.num_batches_tracked', 'features.6.conv.0.1.num_batches_tracked', 'features.6.conv.1.1.num_batches_tracked', 'features.6.conv.3.num_batches_tracked', 'features.7.conv.0.1.num_batches_tracked', 'features.7.conv.1.1.num_batches_tracked', 'features.7.conv.3.num_batches_tracked', 'features.8.conv.0.1.num_batches_tracked', 'features.8.conv.1.1.num_batches_tracked', 'features.8.conv.3.num_batches_tracked', 'features.9.conv.0.1.num_batches_tracked', 'features.9.conv.1.1.num_batches_tracked', 'features.9.conv.3.num_batches_tracked', 'features.10.conv.0.1.num_batches_tracked', 'features.10.conv.1.1.num_batches_tracked', 'features.10.conv.3.num_batches_tracked', 'features.11.conv.0.1.num_batches_tracked', 'features.11.conv.1.1.num_batches_tracked', 'features.11.conv.3.num_batches_tracked', 'features.12.conv.0.1.num_batches_tracked', 'features.12.conv.1.1.num_batches_tracked', 'features.12.conv.3.num_batches_tracked', 'features.13.conv.0.1.num_batches_tracked', 'features.13.conv.1.1.num_batches_tracked', 'features.13.conv.3.num_batches_tracked', 'features.14.conv.0.1.num_batches_tracked', 'features.14.conv.1.1.num_batches_tracked', 'features.14.conv.3.num_batches_tracked', 'features.15.conv.0.1.num_batches_tracked', 'features.15.conv.1.1.num_batches_tracked', 'features.15.conv.3.num_batches_tracked', 'features.16.conv.0.1.num_batches_tracked', 
'features.16.conv.1.1.num_batches_tracked', 'features.16.conv.3.num_batches_tracked', 'features.17.conv.0.1.num_batches_tracked', 'features.17.conv.1.1.num_batches_tracked', 'features.17.conv.3.num_batches_tracked', 'features.18.1.num_batches_tracked', 'classifier.1.weight', 'classifier.1.bias'] from pretrained model.
Load pretrained backbone model from .
Start training:
step------------------------------ 0
step------------------------------ 0
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "train.py", line 80, in <module>
main()
File "train.py", line 76, in main
mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp))
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/cwc/aot-benchmark-main/tools/train.py", line 18, in main_worker
trainer.sequential_training()
File "/home/cwc/aot-benchmark-main/tools/../networks/managers/trainer.py", line 456, in sequential_training
loss.backward()
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [900, 2, 256]], which is output 0 of AddBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Hi @z-x-yang ,
The README mentions that the demo script will run even without the spatial correlation sampler. However, it is imported in attention.py, and the error is as follows:
Build AOT model.
Traceback (most recent call last):
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/tools/demo.py", line 286, in <module>
main()
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/tools/demo.py", line 282, in main
demo(cfg)
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/tools/demo.py", line 107, in demo
model = build_vos_model(cfg.MODEL_VOS, cfg).cuda(gpu_id)
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/models/__init__.py", line 7, in build_vos_model
return AOT(cfg, encoder=cfg.MODEL_ENCODER, **kwargs)
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/models/aot.py", line 23, in __init__
self.LSTT = LongShortTermTransformer(
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/layers/transformer.py", line 75, in __init__
block(d_model, self_nhead, att_nhead, dim_feedforward,
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/layers/transformer.py", line 279, in __init__
self.short_term_attn = MultiheadLocalAttention(d_model,
File "/home/ubuntu/sam_track/DeAOT/aot-benchmark/./networks/layers/attention.py", line 282, in __init__
from spatial_correlation_sampler import SpatialCorrelationSampler
ModuleNotFoundError: No module named 'spatial_correlation_sampler'
Can we disentangle long term memory update and reference frame?
aot-benchmark/networks/engines/aot_engine.py
Lines 243 to 246 in f9a62b6
I think we could add an extra parameter to the function signature that controls whether the long-term memory is updated.
Currently it is hard to make multiple evaluation passes with different semi-supervision, because the same function is also required for preparing curr_enc_embs.
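A toy sketch of the proposed signature change, to make the idea concrete. `ToyEngine`, its attributes and the `update_long_term_memory` flag are all hypothetical stand-ins; the real engine works on tensors and has a different, larger interface.

```python
class ToyEngine:
    """Stand-in for the real engine; stores strings instead of tensors."""

    def __init__(self):
        self.curr_enc_embs = None
        self.long_term_memories = []

    def add_reference_frame(self, frame, update_long_term_memory=True):
        # The current-frame embedding is always prepared ...
        self.curr_enc_embs = f"emb({frame})"
        # ... but the long-term memory only grows when the caller asks for it.
        if update_long_term_memory:
            self.long_term_memories.append(self.curr_enc_embs)


engine = ToyEngine()
engine.add_reference_frame("frame0")
engine.add_reference_frame("frame5", update_long_term_memory=False)
print(engine.curr_enc_embs, engine.long_term_memories)
```

Keeping the default at True would leave existing callers unchanged, while evaluation scripts could pass False to prepare embeddings without touching the memory.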
Hi, @yoxu515,
Thank you so much for making this code repository publicly available. Would you mind also exporting the environment and sharing it in this repository?
Hi authors, I notice that you have mentioned the results on VOT2020 benchmark. Is the model the same as the model in your paper and which model from the model zoo do you use to evaluate it?
Hi, thanks for this wonderful work!
I would like to ask about this function in demo.py:
def read_label(self, label_name, squeeze_idx=None):
    label_path = os.path.join(self.label_root, self.seq_name, label_name)
    label = Image.open(label_path)
    label = np.array(label, dtype=np.uint8)
    if self.single_obj:
        label = (label > 0).astype(np.uint8)
    elif squeeze_idx is not None:
        squeezed_label = label * 0
        for idx in range(len(squeeze_idx)):
            obj_id = squeeze_idx[idx]
            if obj_id == 0:
                continue
            mask = label == obj_id
            squeezed_label += (mask * idx).astype(np.uint8)
        label = squeezed_label
    return label
Why does the RGB mask become a one-channel id mask after passing through the two lines below?
label = Image.open(label_path)
label = np.array(label, dtype=np.uint8)
How should I encode a mask image to achieve this effect? Thanks!
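A likely explanation is that the annotation PNGs are stored in palette (indexed, PIL mode "P") format rather than as true RGB, as the DAVIS annotations are. Each pixel then holds a small integer id, and the colors seen in an image viewer come from a lookup table, so converting the opened image to an array yields the ids directly. The pure-Python sketch below only illustrates this relationship; the palette values and helper names are illustrative assumptions, not taken from the repository.

```python
# id -> display color; illustrative values, not read from any dataset file.
PALETTE = {0: (0, 0, 0), 1: (128, 0, 0), 2: (0, 128, 0)}


def ids_to_rgb(id_mask, palette):
    """What an image viewer does: look up each stored id in the color table."""
    return [[palette[i] for i in row] for row in id_mask]


def rgb_to_ids(rgb_mask, palette):
    """Recover ids from a true-RGB mask by inverting the color table."""
    color_to_id = {color: i for i, color in palette.items()}
    return [[color_to_id[c] for c in row] for row in rgb_mask]


id_mask = [[0, 1, 1],
           [0, 2, 2]]
rgb_mask = ids_to_rgb(id_mask, PALETTE)
print(rgb_to_ids(rgb_mask, PALETTE) == id_mask)
```

So a mask that is already true RGB would first need to be mapped back to ids, as in `rgb_to_ids`, and then saved as an indexed image. With Pillow this is typically done by creating a mode "P" image from the id array and attaching a color table with `putpalette`.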
Hi, Dr. Yang:
Identity assignment seems very useful.
Although I have read the paper and the code, I still can't understand how it works.
It seems AOT only uses a very simple convolution to produce id_emb:
aot-benchmark/networks/models/aot.py
Line 76 in 5a7665f
This id_emb is then added in:
aot-benchmark/networks/layers/transformer.py
Line 216 in 5a7665f
It is very interesting. Could you please offer more details and proofs about it?
Thank you.
I would like to ask: if the target to be segmented does not appear in the mask of the first frame, can it still be segmented properly in subsequent frames?
As you are working with your new project it could be interesting to add SAM encoder/decoder here so it could be finetunable like the other encoders/decoders in the repo.
Hello!
When I read the code in eval_datasets.py, I have a question about the function "read_label" of class VOSTest.
Could you please tell me the meaning of "squeeze_idx"? When I debug the code, I find that "__getitem__" passes "obj_idx" as "squeeze_idx" to "read_label", but there seems to be no difference between the input label and "squeezed_label". What is the effect of "squeeze_idx", and under what circumstances is it used?
Thank you!
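Judging from the read_label code quoted in another issue on this page, the loop replaces every object id in the label with that id's position in squeeze_idx, which makes the ids consecutive. The sketch below re-states that loop in pure Python on nested lists instead of NumPy arrays; `squeeze_labels` is a hypothetical name and the sketch assumes the ids in squeeze_idx are unique.

```python
def squeeze_labels(label, squeeze_idx):
    """Remap each object id in `label` to its position in `squeeze_idx`.

    Id 0 (background) and ids missing from `squeeze_idx` map to 0.
    """
    position = {obj_id: idx for idx, obj_id in enumerate(squeeze_idx) if obj_id != 0}
    return [[position.get(value, 0) for value in row] for row in label]


sparse = [[0, 3],
          [7, 3]]
print(squeeze_labels(sparse, [0, 3, 7]))   # non-consecutive ids 3 and 7 are remapped

dense = [[0, 1],
         [2, 1]]
print(squeeze_labels(dense, [0, 1, 2]))    # already consecutive ids stay as they are
```

This would explain the observation in the question: when a sequence's ids are already 0, 1, 2, ... the remapping is the identity, and a difference only shows up when ids are skipped.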
Hello !
Thanks for sharing your work! How do I use MS-AOT to evaluate custom tracking datasets and output the prediction masks?
Could you give me some suggestions ?
Thank you !
It would be nice to have the F-score for boundaries, in addition to the IoU, in utils/metrics.
Hi, thanks for your excellent work.
However, I am confused about why the prev_labels passed to trainer.process_log() come from all_pred[-2].
Shouldn't they be the ground truth?
aot-benchmark/networks/managers/trainer.py
Line 474 in 5a7665f
Thanks!
I notice that you use a conv2d (applied on q) for the relative_emb_k. Should it be a fixed embedding as relative_emb_v?
Thanks for making code available!
I ran into an issue while testing the pretrained model. I only get a speed of about 29 FPS when testing the PRE_YTB_DAV pretrained AOTS model on DAVIS 2017, whereas the paper reports 40 FPS. The test J&F mean, however, matches the result posted in the model zoo, which is 0.820575.
I did not modify the default test config in aots.py except for directories such as the dataset paths. Do I need to modify something in train_eval.sh?
My device:
2 x Tesla V100 SXM2 32GB Driver Version: 450.51.06 CUDA Version: 11.0
pytorch==1.7.0
torchvision==0.8.1
spatial-correlation-sampler == 0.3.0
Exp alldataset_AOTS:
{
"DATASETS": [
"youtubevos",
"davis2017"
],
"DATA_DAVIS_REPEAT": 5,
"DATA_DYNAMIC_MERGE_PROB": 0.3,
"DATA_MAX_CROP_STEPS": 10,
"DATA_MAX_SCALE_FACTOR": 1.3,
"DATA_MIN_SCALE_FACTOR": 0.7,
"DATA_RANDOMCROP": [
465,
465
],
"DATA_RANDOMFLIP": 0.5,
"DATA_RANDOM_GAP_DAVIS": 12,
"DATA_RANDOM_GAP_YTB": 3,
"DATA_RANDOM_REVERSE_SEQ": true,
"DATA_SEQ_LEN": 5,
"DATA_SHORT_EDGE_LEN": 480,
"DATA_WORKERS": 8,
"DIR_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ckpt",
"DIR_DATA": "./datasets",
"DIR_DAVIS": "/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval",
"DIR_EMA_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ema_ckpt",
"DIR_EVALUATION": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/eval",
"DIR_IMG_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/img",
"DIR_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log",
"DIR_RESULT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV",
"DIR_ROOT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022",
"DIR_STATIC": "/yexin/vos_related_source/vos_exper_dataset/unify_pretrain_dataset",
"DIR_TB_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/tensorboard",
"DIR_YTB": "/yexin/vos_related_source/vos_exper_dataset/dataset/Youtube",
"DIST_BACKEND": "nccl",
"DIST_ENABLE": true,
"DIST_START_GPU": 0,
"DIST_URL": "tcp://127.0.0.1:13241",
"EXP_NAME": "alldataset_AOTS",
"MODEL_ALIGN_CORNERS": true,
"MODEL_ATT_HEADS": 8,
"MODEL_DECODER_INTERMEDIATE_LSTT": true,
"MODEL_ENCODER": "mobilenetv2",
"MODEL_ENCODER_DIM": [
24,
32,
96,
1280
],
"MODEL_ENCODER_EMBEDDING_DIM": 256,
"MODEL_ENCODER_PRETRAIN": "./pretrain_models/mobilenet_v2-b0353104.pth",
"MODEL_ENGINE": "aotengine",
"MODEL_EPSILON": 1e-05,
"MODEL_FREEZE_BACKBONE": false,
"MODEL_FREEZE_BN": true,
"MODEL_LSTT_NUM": 2,
"MODEL_MAX_OBJ_NUM": 10,
"MODEL_NAME": "AOTS",
"MODEL_SELF_HEADS": 8,
"MODEL_USE_PREV_PROB": false,
"MODEL_VOS": "aot",
"PRETRAIN": true,
"PRETRAIN_FULL": true,
"PRETRAIN_MODEL": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE/ema_ckpt/save_step_100000.pth",
"STAGE_NAME": "PRE_YTB_DAV",
"TEST_CKPT_PATH": "./AOTS_PRE_YTB_DAV.pth",
"TEST_CKPT_STEP": null,
"TEST_DATASET": "davis2017",
"TEST_DATASET_FULL_RESOLUTION": false,
"TEST_DATASET_SPLIT": "val",
"TEST_EMA": true,
"TEST_FLIP": false,
"TEST_FRAME_LOG": false,
"TEST_GPU_ID": 0,
"TEST_GPU_NUM": 2,
"TEST_LONG_TERM_MEM_GAP": 9999,
"TEST_MAX_SIZE": 1040.0,
"TEST_MIN_SIZE": null,
"TEST_MULTISCALE": [
1.0
],
"TEST_WORKERS": 4,
"TRAIN_AUTO_RESUME": true,
"TRAIN_AUX_LOSS_RATIO": 1.0,
"TRAIN_AUX_LOSS_WEIGHT": 1.0,
"TRAIN_BATCH_SIZE": 16,
"TRAIN_CLIP_GRAD_NORM": 5.0,
"TRAIN_DATASET_FULL_RESOLUTION": false,
"TRAIN_EMA_RATIO": 0.1,
"TRAIN_ENABLE_PREV_FRAME": false,
"TRAIN_ENCODER_FREEZE_AT": 2,
"TRAIN_GPUS": 4,
"TRAIN_HARD_MINING_RATIO": 0.5,
"TRAIN_IMG_LOG": true,
"TRAIN_LOG_STEP": 50,
"TRAIN_LONG_TERM_MEM_GAP": 9999,
"TRAIN_LR": 0.0002,
"TRAIN_LR_COSINE_DECAY": false,
"TRAIN_LR_ENCODER_RATIO": 0.1,
"TRAIN_LR_MIN": 2e-05,
"TRAIN_LR_POWER": 0.9,
"TRAIN_LR_RESTART": 1,
"TRAIN_LR_UPDATE_STEP": 1,
"TRAIN_LR_WARM_UP_RATIO": 0.05,
"TRAIN_LSTT_DROPPATH": 0.1,
"TRAIN_LSTT_DROPPATH_LST": false,
"TRAIN_LSTT_DROPPATH_SCALING": false,
"TRAIN_LSTT_EMB_DROPOUT": 0.0,
"TRAIN_LSTT_ID_DROPOUT": 0.0,
"TRAIN_LSTT_LT_DROPOUT": 0.0,
"TRAIN_LSTT_ST_DROPOUT": 0.0,
"TRAIN_MAX_KEEP_CKPT": 8,
"TRAIN_OPT": "adamw",
"TRAIN_RESUME": false,
"TRAIN_RESUME_CKPT": null,
"TRAIN_RESUME_STEP": 0,
"TRAIN_SAVE_STEP": 1000,
"TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [
"patch_wise_id_bank"
],
"TRAIN_SEQ_TRAINING_START_RATIO": 0.5,
"TRAIN_SGD_MOMENTUM": 0.9,
"TRAIN_START_STEP": 0,
"TRAIN_TBLOG": true,
"TRAIN_TBLOG_STEP": 50,
"TRAIN_TOP_K_PERCENT_PIXELS": 0.15,
"TRAIN_TOTAL_STEPS": 100000,
"TRAIN_WEIGHT_DECAY": 0.07,
"TRAIN_WEIGHT_DECAY_EXCLUSIVE": {},
"TRAIN_WEIGHT_DECAY_EXEMPTION": [
"absolute_pos_embed",
"relative_position_bias_table",
"relative_emb_v",
"conv_out"
]
}
Use GPU 0 for evaluating.
Use GPU 1 for evaluating.
Build VOS model.
Load checkpoint from ./AOTS_PRE_YTB_DAV.pth
Process dataset...
/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
Eval alldataset_AOTS on davis2017 val:
Done!
/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
GPU 0 - Processing Seq bike-packing [1/30]:
GPU 1 - Processing Seq blackswan [2/30]:
GPU 1 - Seq blackswan - FPS: 29.29. All-Frame FPS: 29.29, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 1 - Processing Seq breakdance [4/30]:
GPU 0 - Seq bike-packing - FPS: 28.99. All-Frame FPS: 28.99, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 0 - Processing Seq bmx-trees [3/30]:
GPU 1 - Seq breakdance - FPS: 29.57. All-Frame FPS: 29.46, All-Seq FPS: 29.43, Max Mem: 0.53G
GPU 1 - Processing Seq camel [5/30]:
GPU 0 - Seq bmx-trees - FPS: 29.16. All-Frame FPS: 29.08, All-Seq FPS: 29.08, Max Mem: 0.58G
GPU 0 - Processing Seq car-roundabout [6/30]:
GPU 1 - Seq camel - FPS: 30.71. All-Frame FPS: 29.95, All-Seq FPS: 29.84, Max Mem: 0.53G
GPU 1 - Processing Seq car-shadow [7/30]:
GPU 0 - Seq car-roundabout - FPS: 29.29. All-Frame FPS: 29.15, All-Seq FPS: 29.15, Max Mem: 0.58G
GPU 0 - Processing Seq cows [8/30]:
GPU 1 - Seq car-shadow - FPS: 30.62. All-Frame FPS: 30.05, All-Seq FPS: 30.03, Max Mem: 0.53G
GPU 1 - Processing Seq dance-twirl [9/30]:
GPU 0 - Seq cows - FPS: 27.67. All-Frame FPS: 28.67, All-Seq FPS: 28.76, Max Mem: 0.58G
GPU 0 - Processing Seq dog [10/30]:
GPU 1 - Seq dance-twirl - FPS: 25.66. All-Frame FPS: 28.80, All-Seq FPS: 29.04, Max Mem: 0.53G
GPU 1 - Processing Seq dogs-jump [11/30]:
GPU 0 - Seq dog - FPS: 28.28. All-Frame FPS: 28.60, All-Seq FPS: 28.67, Max Mem: 0.58G
GPU 0 - Processing Seq drift-chicane [12/30]:
GPU 1 - Seq dogs-jump - FPS: 27.15. All-Frame FPS: 28.52, All-Seq FPS: 28.71, Max Mem: 0.53G
GPU 1 - Processing Seq drift-straight [13/30]:
GPU 0 - Seq drift-chicane - FPS: 28.43. All-Frame FPS: 28.58, All-Seq FPS: 28.63, Max Mem: 0.58G
GPU 0 - Processing Seq goat [14/30]:
GPU 1 - Seq drift-straight - FPS: 29.75. All-Frame FPS: 28.65, All-Seq FPS: 28.85, Max Mem: 0.53G
GPU 1 - Processing Seq gold-fish [15/30]:
GPU 0 - Seq goat - FPS: 27.72. All-Frame FPS: 28.43, All-Seq FPS: 28.49, Max Mem: 0.58G
GPU 0 - Processing Seq horsejump-high [16/30]:
GPU 1 - Seq gold-fish - FPS: 28.61. All-Frame FPS: 28.64, All-Seq FPS: 28.82, Max Mem: 0.53G
GPU 1 - Processing Seq india [17/30]:
GPU 0 - Seq horsejump-high - FPS: 28.93. All-Frame FPS: 28.47, All-Seq FPS: 28.55, Max Mem: 0.58G
GPU 0 - Processing Seq judo [18/30]:
GPU 0 - Seq judo - FPS: 31.24. All-Frame FPS: 28.61, All-Seq FPS: 28.82, Max Mem: 0.58G
GPU 0 - Processing Seq lab-coat [20/30]:
GPU 1 - Seq india - FPS: 28.42. All-Frame FPS: 28.61, All-Seq FPS: 28.78, Max Mem: 0.53G
GPU 1 - Processing Seq kite-surf [19/30]:
GPU 0 - Seq lab-coat - FPS: 29.81. All-Frame FPS: 28.69, All-Seq FPS: 28.92, Max Mem: 0.58G
GPU 0 - Processing Seq libby [21/30]:
GPU 1 - Seq kite-surf - FPS: 30.69. All-Frame FPS: 28.76, All-Seq FPS: 28.96, Max Mem: 0.53G
GPU 1 - Processing Seq loading [22/30]:
GPU 0 - Seq libby - FPS: 31.08. All-Frame FPS: 28.85, All-Seq FPS: 29.10, Max Mem: 0.58G
GPU 0 - Processing Seq mbike-trick [23/30]:
GPU 1 - Seq loading - FPS: 31.06. All-Frame FPS: 28.90, All-Seq FPS: 29.14, Max Mem: 0.53G
GPU 1 - Processing Seq motocross-jump [24/30]:
GPU 1 - Seq motocross-jump - FPS: 31.09. All-Frame FPS: 29.01, All-Seq FPS: 29.29, Max Mem: 0.53G
GPU 1 - Processing Seq parkour [26/30]:
GPU 0 - Seq mbike-trick - FPS: 27.85. All-Frame FPS: 28.74, All-Seq FPS: 28.99, Max Mem: 0.58G
GPU 0 - Processing Seq paragliding-launch [25/30]:
GPU 1 - Seq parkour - FPS: 28.09. All-Frame FPS: 28.90, All-Seq FPS: 29.20, Max Mem: 0.53G
GPU 1 - Processing Seq pigs [27/30]:
GPU 0 - Seq paragliding-launch - FPS: 29.78. All-Frame FPS: 28.84, All-Seq FPS: 29.05, Max Mem: 0.58G
GPU 0 - Processing Seq scooter-black [28/30]:
GPU 0 - Seq scooter-black - FPS: 30.13. All-Frame FPS: 28.89, All-Seq FPS: 29.13, Max Mem: 0.58G
GPU 0 - Processing Seq soapbox [30/30]:
GPU 1 - Seq pigs - FPS: 30.28. All-Frame FPS: 29.01, All-Seq FPS: 29.27, Max Mem: 0.53G
GPU 1 - Processing Seq shooting [29/30]:
GPU 1 - Seq shooting - FPS: 28.25. All-Frame FPS: 28.98, All-Seq FPS: 29.20, Max Mem: 0.65G
Finished the evaluation on GPU 1.
GPU 0 - Seq soapbox - FPS: 29.63. All-Frame FPS: 28.96, All-Seq FPS: 29.16, Max Mem: 0.58G
Finished the evaluation on GPU 0.
GPU [0, 1] - All-Frame FPS: 28.97, All-Seq FPS: 29.18, Max Mem: 0.65G
Hello!
When I use COCO as one of the static datasets to train the pre stage, I find that it reduces the accuracy of the pre-trained model when tested on DAVIS17. The model pre-trained for 20,000 iterations on MSRA10K, PASCAL-S, PASCAL-VOC, and ECSSD achieves nearly 59% IoU on DAVIS17-val, but after adding COCO, even with 100,000 training iterations, its IoU is only 48%.
What do you think might have caused this? Are COCO's annotations themselves not particularly accurate?
Also, do you train with COCO?
Thank you very much!
Thanks for your work!
But I am confused about how to evaluate on DAVIS 2016.
When I use the DAVIS toolkit (for val) mentioned in the README, it raises an error saying that I don't have db_info.yaml, which is used by db_read_info() at line 81 of python/lib/davis/misc/config.py.
Thanks for your reply!
Hello!
In your new model AOST, you use "α" in the loss formula. Are all α_l equal to each other? The paper says: "To balance the gradient contribution of sub-networks, we have to increase the weight of deeper subnetworks." I take this to mean that when L=3, α_1 < α_2 < α_3. But the default setting is α=2, and I am confused about which α_l is set to 2.
Thank you!
Where are the DeAOTL predictions?
Have you found in your experiment runs a 0.5px/1px misalignment bias toward the right/bottom-right?
I have noticed this with both the aligned-corners and non-aligned-corners models that you have used (e.g. R50/Swin DeAOTL).
As this kind of error is very hard to debug, I want to know if you have experienced something like this on your side.
Thanks.
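For readers unfamiliar with the two conventions mentioned above, here is a minimal sketch of how an output pixel index maps back to an input coordinate during resizing with and without aligned corners. It is a generic illustration of the standard formulas, not code from this repository and not a diagnosis of the reported bias; the function name `source_coord` and the 4-to-8 pixel sizes are made up for the example.

```python
def source_coord(dst_index, in_size, out_size, align_corners):
    """Map an output pixel index to a fractional input coordinate
    (before any clamping at the image border)."""
    if align_corners:
        # The centers of the first and last pixels are pinned to each other.
        if out_size == 1:
            return 0.0
        return dst_index * (in_size - 1) / (out_size - 1)
    # Pixels are treated as unit squares whose centers sit at index + 0.5.
    return (dst_index + 0.5) * in_size / out_size - 0.5


# Hypothetical sizes: upsample a 4-pixel row to 8 pixels under each convention.
aligned = [source_coord(i, 4, 8, True) for i in range(8)]
half_pixel = [source_coord(i, 4, 8, False) for i in range(8)]

# Sub-pixel disagreement between the two conventions at each output index.
offsets = [a - h for a, h in zip(aligned, half_pixel)]
print(offsets[0], offsets[-1])
```

In this toy case the two conventions disagree by a quarter of an input pixel at each end of the row, with opposite signs. Mixing conventions between training, resizing of masks, and evaluation is one general way sub-pixel offsets can appear, though whether that applies here is an open question.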
I see an empty section in the README about training/fine-tuning on a custom dataset.
Do you have any news to share on this topic?
Hello, I have a question after reading your great work DeAOT.
In the ablation study on head number, you compare multi-head and single-head attention in DeAOT. As we all know, the common implementation of multi-head attention reshapes the Query (taking Query as an example, with shape HW×batch_size×C) by splitting its channel dimension C into num_head groups of size C/num_head, so that Query becomes HW×batch_size×num_head×(C/num_head). This kind of implementation keeps the computational complexity the same as single-head attention.
But the ablation study on head number shows that multi-head attention significantly reduces speed. So I would like to know how multi-head attention is implemented in DeAOT. Is it what I describe above?
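The complexity argument above can be checked with a small counting sketch. It assumes the standard channel-splitting implementation described in the question, not DeAOT's actual code; the function name `attention_macs` and the sizes used (900 tokens, 256 channels, 8 heads) are illustrative only.

```python
def attention_macs(num_tokens, channels, num_heads):
    """Count multiply-accumulates for the two attention matmuls,
    assuming channels are split evenly across heads."""
    assert channels % num_heads == 0
    head_dim = channels // num_heads
    # Q @ K^T: per head, an (N x d) by (d x N) product.
    qk = num_heads * num_tokens * num_tokens * head_dim
    # attn @ V: per head, an (N x N) by (N x d) product.
    av = num_heads * num_tokens * num_tokens * head_dim
    # Attention-map entries that softmax must normalize and memory must hold.
    attn_entries = num_heads * num_tokens * num_tokens
    return {"matmul_macs": qk + av, "attn_entries": attn_entries}


# Illustrative sizes only: a 30x30 feature map with 256 channels.
single = attention_macs(num_tokens=30 * 30, channels=256, num_heads=1)
multi = attention_macs(num_tokens=30 * 30, channels=256, num_heads=8)

print(single["matmul_macs"] == multi["matmul_macs"])
print(multi["attn_entries"] // single["attn_entries"])
```

Under this assumption the matmul cost is identical for one head and eight heads, which matches the claim in the question. The attention map itself, however, has num_head times as many entries, so softmax work and memory traffic grow with the number of heads. Whether that accounts for the slowdown reported in the ablation is for the authors to confirm.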