facebookresearch / mask2former

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"

License: MIT License

Python 89.34% Shell 0.11% C++ 1.05% Cuda 9.50%

mask2former's Introduction

Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022)

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

[arXiv] [Project] [BibTeX]

Features

A single architecture for panoptic, instance and semantic segmentation.
Supports major segmentation datasets: ADE20K, Cityscapes, COCO, Mapillary Vistas.

Updates

Add Google Colab demo.
Video instance segmentation is now supported! Please check our tech report for more details.

Installation

See installation instructions.

Getting Started

See Preparing Datasets for Mask2Former.

See Getting Started with Mask2Former.

Run our demo using Colab.

Integrated into Huggingface Spaces 🤗 using Gradio; a web demo is available there.

A Replicate web demo and Docker image are also available.

Advanced usage

See Advanced Usage of Mask2Former.

Model Zoo and Baselines

We provide a large set of baseline results and trained models available for download in the Mask2Former Model Zoo.

License

The majority of Mask2Former is licensed under the MIT License.

However, portions of the project are available under separate license terms: Swin-Transformer-Semantic-Segmentation is licensed under the MIT License, and Deformable-DETR is licensed under the Apache-2.0 License.

Citing Mask2Former

If you use Mask2Former in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.

@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}

If you find the code useful, please also consider the following BibTeX entry.

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}

Acknowledgement

Code is largely based on MaskFormer (https://github.com/facebookresearch/MaskFormer).

mask2former's Issues

error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device

using /data
Preparation done. Between equal marks is user's output:
/root/conda/bin/python
running build
running build_py
running build_ext
building 'MultiScaleDeformableAttention' extension
Emitting ninja build file /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -shared -B /root/conda/compiler_compat -L/root/conda/lib -Wl,-rpath=/root/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/vision.o /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.o /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.o -L/root/conda/lib/python3.7/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.7/MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so
running install
running bdist_egg
running egg_info
writing MultiScaleDeformableAttention.egg-info/PKG-INFO
writing dependency_links to MultiScaleDeformableAttention.egg-info/dependency_links.txt
writing top-level names to MultiScaleDeformableAttention.egg-info/top_level.txt
reading manifest file 'MultiScaleDeformableAttention.egg-info/SOURCES.txt'
writing manifest file 'MultiScaleDeformableAttention.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/functions
copying build/lib.linux-x86_64-3.7/functions/__init__.py -> build/bdist.linux-x86_64/egg/functions
copying build/lib.linux-x86_64-3.7/functions/ms_deform_attn_func.py -> build/bdist.linux-x86_64/egg/functions
copying build/lib.linux-x86_64-3.7/MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/modules
copying build/lib.linux-x86_64-3.7/modules/ms_deform_attn.py -> build/bdist.linux-x86_64/egg/modules
copying build/lib.linux-x86_64-3.7/modules/__init__.py -> build/bdist.linux-x86_64/egg/modules
byte-compiling build/bdist.linux-x86_64/egg/functions/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/functions/ms_deform_attn_func.py to ms_deform_attn_func.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/modules/ms_deform_attn.py to ms_deform_attn.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/modules/__init__.py to __init__.cpython-37.pyc
creating stub loader for MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/MultiScaleDeformableAttention.py to MultiScaleDeformableAttention.cpython-37.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.MultiScaleDeformableAttention.cpython-37: module references __file__
creating 'dist/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg
removing '/root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg' (and everything under it)
creating /root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg
Extracting MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg to /root/conda/lib/python3.7/site-packages
MultiScaleDeformableAttention 1.0 is already the active version in easy-install.pth

Installed /root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg
Processing dependencies for MultiScaleDeformableAttention==1.0
Finished processing dependencies for MultiScaleDeformableAttention==1.0
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
Command Line Args: Namespace(config_file='configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
[02/22 03:41:38 detectron2]: Rank of current process: 0. World size: 8
[02/22 03:41:40 detectron2]: Environment info:

sys.platform linux
Python 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
numpy 1.19.2
detectron2 0.6 @/root/conda/lib/python3.7/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler CUDA 11.1
detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.9.0 @/root/conda/lib/python3.7/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0,1,2,3,4,5,6,7 GeForce RTX 3090 (arch=8.6)
Driver version 460.73.01
CUDA_HOME /usr/local/cuda
TORCH_CUDA_ARCH_LIST 6.0;6.1;6.2;7.0;7.5
Pillow 8.0.1
torchvision 0.10.0 @/root/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20220212
iopath 0.1.9
cv2 4.1.2
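
The environment table above reports GeForce RTX 3090 GPUs (arch=8.6) next to TORCH_CUDA_ARCH_LIST 6.0;6.1;6.2;7.0;7.5. "no kernel image is available for execution on the device" is the error CUDA typically raises when a compiled extension contains no code for the running GPU, so this mismatch is a plausible cause. The following is a minimal sketch of that check, not part of Mask2Former or PyTorch: the function names `parse_arch_list` and `device_is_covered` are hypothetical, the compatibility rule is simplified (a binary built for X.y is assumed to run on X.z with z >= y, and a `+PTX` entry is assumed to run on any equal or newer device), and named entries such as "Ampere" are not handled.

```python
from typing import List, Tuple

# (major, minor, has_ptx) for one entry of an arch list string.
Arch = Tuple[int, int, bool]


def parse_arch_list(arch_list: str) -> List[Arch]:
    """Parse a TORCH_CUDA_ARCH_LIST-style string such as '7.0;7.5+PTX'.

    Only numeric entries are handled; this is a simplification.
    """
    archs = []
    # Entries may be separated by semicolons or spaces.
    for item in arch_list.replace(" ", ";").split(";"):
        item = item.strip()
        if not item:
            continue
        has_ptx = item.endswith("+PTX")
        version = item[:-4] if has_ptx else item
        major, minor = version.split(".")
        archs.append((int(major), int(minor), has_ptx))
    return archs


def device_is_covered(capability: Tuple[int, int], arch_list: str) -> bool:
    """Return True if the arch list is expected to contain code for the device."""
    dev_major, dev_minor = capability
    for major, minor, has_ptx in parse_arch_list(arch_list):
        # Binary code: same major version, minor not newer than the device.
        if major == dev_major and minor <= dev_minor:
            return True
        # PTX can be JIT-compiled for any equal or newer device.
        if has_ptx and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


# Values copied from the environment table in this log.
log_arch_list = "6.0;6.1;6.2;7.0;7.5"
rtx_3090 = (8, 6)

print(device_is_covered(rtx_3090, log_arch_list))  # False
print(device_is_covered(rtx_3090, "8.6"))          # True
print(device_is_covered(rtx_3090, "7.5+PTX"))      # True
```

On a real machine the capability could be read with `torch.cuda.get_device_capability()` and the list from the `TORCH_CUDA_ARCH_LIST` environment variable. If the check fails, rebuilding the MultiScaleDeformableAttention op with 8.6 in that list is the usual thing to try. Since the build output above says "ninja: no work to do", the old build directory may need to be removed first so the objects are actually recompiled.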

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

[02/22 03:41:40 detectron2]: Command line arguments: Namespace(config_file='configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
[02/22 03:41:40 detectron2]: Contents of args.config_file=configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml:
_BASE_: Base-YouTubeVIS-VideoInstanceSegmentation.yaml
MODEL:
  WEIGHTS: "model_final_3c8ec9.pkl"
  META_ARCHITECTURE: "VideoMaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 40
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "VideoMultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: True
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8

[02/22 03:41:40 detectron2]: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
ASPECT_RATIO_GROUPING: true
FILTER_EMPTY_ANNOTATIONS: false
NUM_WORKERS: 4
REPEAT_THRESHOLD: 0.0
SAMPLER_TRAIN: TrainingSampler
DATASETS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
PROPOSAL_FILES_TEST: []
PROPOSAL_FILES_TRAIN: []
TEST:
- ytvis_2019_val
TRAIN:
- ytvis_2019_train
GLOBAL:
HACK: 1.0
INPUT:
AUGMENTATIONS: []
COLOR_AUG_SSD: false
CROP:
ENABLED: false
SINGLE_CATEGORY_MAX_AREA: 1.0
SIZE:
- 600
- 720
  TYPE: absolute_range
  DATASET_MAPPER_NAME: mask_former_semantic
  FORMAT: RGB
  IMAGE_SIZE: 1024
  MASK_FORMAT: polygon
  MAX_SCALE: 2.0
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SCALE: 0.1
  MIN_SIZE_TEST: 360
  MIN_SIZE_TRAIN:
- 360
- 480
MIN_SIZE_TRAIN_SAMPLING: choice_by_clip
RANDOM_FLIP: flip_by_clip
SAMPLING_FRAME_NUM: 2
SAMPLING_FRAME_RANGE: 20
SAMPLING_FRAME_SHUFFLE: false
SIZE_DIVISIBILITY: -1
MODEL:
ANCHOR_GENERATOR:
ANGLES:
- - -90
  - 0
  - 90
    ASPECT_RATIOS:
- - 0.5
  - 1.0
  - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
- - 32
  - 64
  - 128
  - 256
  - 512
    BACKBONE:
    FREEZE_AT: 0
    NAME: build_resnet_backbone
    DEVICE: cuda
    FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM: ''
    OUT_CHANNELS: 256
    KEYPOINT_ON: false
    LOAD_PROPOSALS: false
    MASK_FORMER:
    CLASS_WEIGHT: 2.0
    DEC_LAYERS: 10
    DEEP_SUPERVISION: true
    DICE_WEIGHT: 5.0
    DIM_FEEDFORWARD: 2048
    DROPOUT: 0.0
    ENC_LAYERS: 0
    ENFORCE_INPUT_PROJ: false
    HIDDEN_DIM: 256
    IMPORTANCE_SAMPLE_RATIO: 0.75
    MASK_WEIGHT: 5.0
    NHEADS: 8
    NO_OBJECT_WEIGHT: 0.1
    NUM_OBJECT_QUERIES: 100
    OVERSAMPLE_RATIO: 3.0
    PRE_NORM: false
    SIZE_DIVISIBILITY: 32
    TEST:
    INSTANCE_ON: true
    OBJECT_MASK_THRESHOLD: 0.8
    OVERLAP_THRESHOLD: 0.8
    PANOPTIC_ON: false
    SEMANTIC_ON: false
    SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE: false
    TRAIN_NUM_POINTS: 12544
    TRANSFORMER_DECODER_NAME: VideoMultiScaleMaskedTransformerDecoder
    TRANSFORMER_IN_FEATURE: multi_scale_pixel_decoder
    MASK_ON: true
    META_ARCHITECTURE: VideoMaskFormer
    PANOPTIC_FPN:
    COMBINE:
    ENABLED: true
    INSTANCES_CONFIDENCE_THRESH: 0.5
    OVERLAP_THRESH: 0.5
    STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
    PIXEL_MEAN:
- 123.675
- 116.28
- 103.53
PIXEL_STD:
- 58.395
- 57.12
- 57.375
PROPOSAL_GENERATOR:
MIN_SIZE: 0
NAME: RPN
RESNETS:
DEFORM_MODULATED: false
DEFORM_NUM_GROUPS: 1
DEFORM_ON_PER_STAGE:
- false
- false
- false
- false
  DEPTH: 50
  NORM: FrozenBN
  NUM_GROUPS: 1
  OUT_FEATURES:
- res2
- res3
- res4
- res5
  RES2_OUT_CHANNELS: 256
  RES4_DILATION: 1
  RES5_DILATION: 1
  RES5_MULTI_GRID:
- 1
- 1
- 1
  STEM_OUT_CHANNELS: 64
  STEM_TYPE: basic
  STRIDE_IN_1X1: false
  WIDTH_PER_GROUP: 64
  RETINANET:
  BBOX_REG_LOSS_TYPE: smooth_l1
  BBOX_REG_WEIGHTS: &id001
- 1.0
- 1.0
- 1.0
- 1.0
  FOCAL_LOSS_ALPHA: 0.25
  FOCAL_LOSS_GAMMA: 2.0
  IN_FEATURES:
- p3
- p4
- p5
- p6
- p7
  IOU_LABELS:
- 0
- -1
- 1
  IOU_THRESHOLDS:
- 0.4
- 0.5
  NMS_THRESH_TEST: 0.5
  NORM: ''
  NUM_CLASSES: 80
  NUM_CONVS: 4
  PRIOR_PROB: 0.01
  SCORE_THRESH_TEST: 0.05
  SMOOTH_L1_LOSS_BETA: 0.1
  TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
  BBOX_REG_WEIGHTS:
- - 10.0
  - 10.0
  - 5.0
  - 5.0
- - 20.0
  - 20.0
  - 10.0
  - 10.0
- - 30.0
  - 30.0
  - 15.0
  - 15.0
    IOUS:
- 0.5
- 0.6
- 0.7
  ROI_BOX_HEAD:
  BBOX_REG_LOSS_TYPE: smooth_l1
  BBOX_REG_LOSS_WEIGHT: 1.0
  BBOX_REG_WEIGHTS:
- 10.0
- 10.0
- 5.0
- 5.0
  CLS_AGNOSTIC_BBOX_REG: false
  CONV_DIM: 256
  FC_DIM: 1024
  NAME: ''
  NORM: ''
  NUM_CONV: 0
  NUM_FC: 0
  POOLER_RESOLUTION: 14
  POOLER_SAMPLING_RATIO: 0
  POOLER_TYPE: ROIAlignV2
  SMOOTH_L1_BETA: 0.0
  TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
  BATCH_SIZE_PER_IMAGE: 512
  IN_FEATURES:
- res4
  IOU_LABELS:
- 0
- 1
  IOU_THRESHOLDS:
- 0.5
  NAME: Res5ROIHeads
  NMS_THRESH_TEST: 0.5
  NUM_CLASSES: 80
  POSITIVE_FRACTION: 0.25
  PROPOSAL_APPEND_GT: true
  SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
  CONV_DIMS:
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512
  LOSS_WEIGHT: 1.0
  MIN_KEYPOINTS_PER_IMAGE: 1
  NAME: KRCNNConvDeconvUpsampleHead
  NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
  NUM_KEYPOINTS: 17
  POOLER_RESOLUTION: 14
  POOLER_SAMPLING_RATIO: 0
  POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
  CLS_AGNOSTIC_MASK: false
  CONV_DIM: 256
  NAME: MaskRCNNConvUpsampleHead
  NORM: ''
  NUM_CONV: 0
  POOLER_RESOLUTION: 14
  POOLER_SAMPLING_RATIO: 0
  POOLER_TYPE: ROIAlignV2
  RPN:
  BATCH_SIZE_PER_IMAGE: 256
  BBOX_REG_LOSS_TYPE: smooth_l1
  BBOX_REG_LOSS_WEIGHT: 1.0
  BBOX_REG_WEIGHTS: *id001
  BOUNDARY_THRESH: -1
  CONV_DIMS:
- -1
  HEAD_NAME: StandardRPNHead
  IN_FEATURES:
- res4
  IOU_LABELS:
- 0
- -1
- 1
  IOU_THRESHOLDS:
- 0.3
- 0.7
  LOSS_WEIGHT: 1.0
  NMS_THRESH: 0.7
  POSITIVE_FRACTION: 0.5
  POST_NMS_TOPK_TEST: 1000
  POST_NMS_TOPK_TRAIN: 2000
  PRE_NMS_TOPK_TEST: 6000
  PRE_NMS_TOPK_TRAIN: 12000
  SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
  ASPP_CHANNELS: 256
  ASPP_DILATIONS:
- 6
- 12
- 18
  ASPP_DROPOUT: 0.1
  COMMON_STRIDE: 4
  CONVS_DIM: 256
  DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES:
- res3
- res4
- res5
  DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS: 8
  DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS: 4
  IGNORE_VALUE: 255
  IN_FEATURES:
- res2
- res3
- res4
- res5
  LOSS_TYPE: hard_pixel_mining
  LOSS_WEIGHT: 1.0
  MASK_DIM: 256
  NAME: MaskFormerHead
  NORM: GN
  NUM_CLASSES: 40
  PIXEL_DECODER_NAME: MSDeformAttnPixelDecoder
  PROJECT_CHANNELS:
- 48
  PROJECT_FEATURES:
- res2
  TRANSFORMER_ENC_LAYERS: 6
  USE_DEPTHWISE_SEPARABLE_CONV: false
  SWIN:
  APE: false
  ATTN_DROP_RATE: 0.0
  DEPTHS:
- 2
- 2
- 6
- 2
  DROP_PATH_RATE: 0.3
  DROP_RATE: 0.0
  EMBED_DIM: 96
  MLP_RATIO: 4.0
  NUM_HEADS:
- 3
- 6
- 12
- 24
  OUT_FEATURES:
- res2
- res3
- res4
- res5
  PATCH_NORM: true
  PATCH_SIZE: 4
  PRETRAIN_IMG_SIZE: 224
  QKV_BIAS: true
  QK_SCALE: null
  USE_CHECKPOINT: false
  WINDOW_SIZE: 7
  WEIGHTS: /data/bolu.ldz/PRETRAINED_WEIGHTS/mask2former/model_final_3c8ec9.pkl
  OUTPUT_DIR: /summary
  SEED: -1
  SOLVER:
  AMP:
  ENABLED: true
  BACKBONE_MULTIPLIER: 0.1
  BASE_LR: 0.0001
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
  CLIP_TYPE: full_model
  CLIP_VALUE: 0.01
  ENABLED: true
  NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 16
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 6000
  MOMENTUM: 0.9
  NESTEROV: false
  OPTIMIZER: ADAMW
  POLY_LR_CONSTANT_ENDING: 0.0
  POLY_LR_POWER: 0.9
  REFERENCE_WORLD_SIZE: 0
  STEPS:
- 4000
WARMUP_FACTOR: 1.0
WARMUP_ITERS: 10
WARMUP_METHOD: linear
WEIGHT_DECAY: 0.05
WEIGHT_DECAY_BIAS: null
WEIGHT_DECAY_EMBED: 0.0
WEIGHT_DECAY_NORM: 0.0
TEST:
AUG:
ENABLED: false
FLIP: true
MAX_SIZE: 4000
MIN_SIZES:
- 400
- 500
- 600
- 700
- 800
- 900
- 1000
- 1100
- 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
  ENABLED: false
  NUM_ITER: 200
  VERSION: 2
  VIS_PERIOD: 0

[02/22 03:41:40 detectron2]: Full config saved to /summary/config.yaml
[02/22 03:41:40 d2.utils.env]: Using a generated random seed 40230477
[02/22 03:41:45 d2.engine.defaults]: Model:
VideoMaskFormer(
(backbone): ResNet(
(stem): BasicStem(
(conv1): Conv2d(
3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
)
(res2): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv1): Conv2d(
64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
)
(res3): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv1): Conv2d(
256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
)
(res4): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
(conv1): Conv2d(
512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(4): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(5): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
)
(res5): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
(conv1): Conv2d(
1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
)
)
(sem_seg_head): MaskFormerHead(
(pixel_decoder): MSDeformAttnPixelDecoder(
(input_proj): ModuleList(
(0): Sequential(
(0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(1): Sequential(
(0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(2): Sequential(
(0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
(transformer): MSDeformAttnTransformerEncoderOnly(
(encoder): MSDeformAttnTransformerEncoder(
(layers): ModuleList(
(0): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
)
(pe_layer): Positional encoding PositionEmbeddingSine
num_pos_feats: 128
temperature: 10000
normalize: True
scale: 6.283185307179586
(mask_features): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(adapter_1): Conv2d(
256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(layer_1): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
(predictor): VideoMultiScaleMaskedTransformerDecoder(
(pe_layer): PositionEmbeddingSine3D()
(transformer_self_attention_layers): ModuleList(
(0): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(1): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(2): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(3): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(4): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(5): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(6): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(7): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(8): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
(transformer_cross_attention_layers): ModuleList(
(0): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(1): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(2): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(3): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(4): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(5): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(6): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(7): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(8): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
(transformer_ffn_layers): ModuleList(
(0): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(6): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(7): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(8): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(decoder_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(query_feat): Embedding(100, 256)
(query_embed): Embedding(100, 256)
(level_embed): Embedding(3, 256)
(input_proj): ModuleList(
(0): Sequential()
(1): Sequential()
(2): Sequential()
)
(class_embed): Linear(in_features=256, out_features=41, bias=True)
(mask_embed): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=256, bias=True)
)
)
)
)
(criterion): Criterion VideoSetCriterion
matcher: Matcher VideoHungarianMatcher
cost_class: 2.0
cost_mask: 5.0
cost_dice: 5.0
losses: ['labels', 'masks']
weight_dict: {'loss_ce': 2.0, 'loss_mask': 5.0, 'loss_dice': 5.0, 'loss_ce_0': 2.0, 'loss_mask_0': 5.0, 'loss_dice_0': 5.0, 'loss_ce_1': 2.0, 'loss_mask_1': 5.0, 'loss_dice_1': 5.0, 'loss_ce_2': 2.0, 'loss_mask_2': 5.0, 'loss_dice_2': 5.0, 'loss_ce_3': 2.0, 'loss_mask_3': 5.0, 'loss_dice_3': 5.0, 'loss_ce_4': 2.0, 'loss_mask_4': 5.0, 'loss_dice_4': 5.0, 'loss_ce_5': 2.0, 'loss_mask_5': 5.0, 'loss_dice_5': 5.0, 'loss_ce_6': 2.0, 'loss_mask_6': 5.0, 'loss_dice_6': 5.0, 'loss_ce_7': 2.0, 'loss_mask_7': 5.0, 'loss_dice_7': 5.0, 'loss_ce_8': 2.0, 'loss_mask_8': 5.0, 'loss_dice_8': 5.0}
num_classes: 40
eos_coef: 0.1
num_points: 12544
oversample_ratio: 3.0
importance_sample_ratio: 0.75
)
[02/22 03:41:45 mask2former_video.data_video.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(360, 480), max_size=1333, sample_style='choice_by_clip', clip_frame_cnt=2), RandomFlip(clip_frame_cnt=2)]
[02/22 03:41:57 mask2former_video.data_video.datasets.ytvis]: Loading /data/bolu.ldz/DATASET/YoutubeVOS2019/train.json takes 12.59 seconds.
[02/22 03:41:57 mask2former_video.data_video.datasets.ytvis]: Loaded 2238 videos in YTVIS format from /data/bolu.ldz/DATASET/YoutubeVOS2019/train.json
[02/22 03:42:05 mask2former_video.data_video.build]: Using training sampler TrainingSampler
[02/22 03:42:19 d2.data.common]: Serializing 2238 elements to byte tensors and concatenating them all ...
[02/22 03:42:19 d2.data.common]: Serialized dataset takes 151.32 MiB
[02/22 03:42:20 fvcore.common.checkpoint]: [Checkpointer] Loading from /data/bolu.ldz/PRETRAINED_WEIGHTS/mask2former/model_final_3c8ec9.pkl ...
[02/22 03:42:22 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo'
WARNING [02/22 03:42:22 mask2former_video.modeling.transformer_decoder.video_mask2former_transformer_decoder]: Weight format of VideoMultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ...
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'sem_seg_head.predictor.class_embed.weight' to the model due to incompatible shapes: (81, 256) in the checkpoint but (41, 256) in the model! You might want to double check if this is expected.
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'sem_seg_head.predictor.class_embed.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (41,) in the model! You might want to double check if this is expected.
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'criterion.empty_weight' to the model due to incompatible shapes: (81,) in the checkpoint but (41,) in the model! You might want to double check if this is expected.
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
criterion.empty_weight
sem_seg_head.predictor.class_embed.{bias, weight}
[02/22 03:42:22 d2.engine.train_loop]: Starting training from iteration 0
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device

Train on custom dataset for panoptic segmentation

I've been trying to use it for a nuclei panoptic segmentation task.
The dataset is prepared the same way as ADE20K panoptic.
However, during evaluation, it doesn't propose any instances, even after training for a while.

File "/home/---/anaconda3/envs/mask2former/lib/python3.8/site-packages/panopticapi/evaluation.py", line 224, in pq_compute
results[name], per_class_results = pq_stat.pq_average(categories, isthing=isthing)
File "/home/---/anaconda3/envs/mask2former/lib/python3.8/site-packages/panopticapi/evaluation.py", line 73, in pq_average
return {'pq': pq / n, 'sq': sq / n, 'rq': rq / n, 'n': n}, per_class_results
ZeroDivisionError: division by zero

I assume there are several possible reasons for this:

Dataset not well prepared: Are the semantic and instance image label folders a must for panoptic? My labeled data is not in Detectron2 format, but I referred to prepare_ade20k_sem_seg, prepare_ade20k_ins_seg and prepare_ade20k_pan_seg, converted the labeled data to panoptic images (in a folder) and a label json file, and commented out the line "sem_seg_file_name": sem_label_file, in dataset_dict.
Config file not well modified: Another reason may be that the model does not converge. Is there any configuration like Mask R-CNN's anchor size or ratio in panoptic segmentation? The nuclei in whole slide images (cropped into multiple patches of size 256*256, with one nucleus around (8~16)*(8~16) pixels) are rather small compared to common things in a natural image captured by a camera.
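
One possible reading of the ZeroDivisionError above, offered as a guess: the PQ average divides by n, the number of categories in the requested group (all, things, or stuff) that have at least one true positive, false positive, or false negative. If a group is empty, for example a dataset with only thing categories being averaged over stuff, n stays 0. The sketch below is a simplified stand-in for that averaging step, not panopticapi's actual code; the function name `pq_average_safe` and the input layout are hypothetical.

```python
def pq_average_safe(per_category, isthing=None):
    """Average PQ/SQ/RQ over categories, returning zeros for an empty group.

    per_category maps a category name to a dict with the keys
    'isthing' (bool), 'iou' (summed IoU of matched segments), 'tp', 'fp', 'fn'.
    This layout is hypothetical and only serves the illustration.
    """
    pq = sq = rq = 0.0
    n = 0
    for stats in per_category.values():
        # Restrict the average to things or stuff when requested.
        if isthing is not None and stats["isthing"] != isthing:
            continue
        tp, fp, fn = stats["tp"], stats["fp"], stats["fn"]
        # Categories that never occur in predictions or ground truth are skipped.
        if tp + fp + fn == 0:
            continue
        n += 1
        denom = tp + 0.5 * fp + 0.5 * fn
        pq += stats["iou"] / denom
        sq += stats["iou"] / tp if tp else 0.0
        rq += tp / denom
    if n == 0:
        # Guard: no category contributed, so there is nothing to divide by.
        return {"pq": 0.0, "sq": 0.0, "rq": 0.0, "n": 0}
    return {"pq": pq / n, "sq": sq / n, "rq": rq / n, "n": n}


# A thing-only dataset with 12 annotated nuclei and no predictions at all.
nuclei_only = {"nucleus": {"isthing": True, "iou": 0.0, "tp": 0, "fp": 0, "fn": 12}}
things = pq_average_safe(nuclei_only, isthing=True)
stuff = pq_average_safe(nuclei_only, isthing=False)
print(things)
print(stuff)
```

If this is the cause, checking that the "isthing" flags in the category list match what the evaluator expects may be a quicker fix than patching the evaluation code.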

Question about the masked attention.

Where exactly is formula (2) (page 4 of the paper) implemented in the code?
$\mathbf{X}_l = \operatorname{softmax}(\mathcal{M}_{l-1} + \mathbf{Q}_l \mathbf{K}_l^{\mathrm{T}}) \mathbf{V}_l + \mathbf{X}_{l-1}$
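
For orientation, formula (2) is ordinary cross-attention with an additive mask on the logits plus a residual connection. Below is a minimal single-head sketch in plain Python, not the repository's implementation; the function name and the list-of-lists layout are assumptions made for illustration. In PyTorch the same effect is usually obtained by passing the mask as `attn_mask` to `nn.MultiheadAttention`, which is presumably what the CrossAttentionLayer modules in the decoder do; that location is worth verifying in the source.

```python
import math

NEG_INF = float("-inf")


def masked_attention(x_prev, q, k, v, mask):
    """Compute softmax(M + Q K^T) V + X_prev for a single attention head.

    q, x_prev: N x C lists (one row per query).
    k, v:      HW x C lists (one row per image location).
    mask:      N x HW list with 0.0 where attention is allowed, -inf elsewhere.
    """
    n_channels = len(v[0])
    out = []
    for i, q_i in enumerate(q):
        row_mask = mask[i]
        # If a query is masked everywhere, attend everywhere instead so the
        # softmax stays well defined.
        if all(m == NEG_INF for m in row_mask):
            row_mask = [0.0] * len(k)
        logits = [m + sum(a * b for a, b in zip(q_i, k_j))
                  for m, k_j in zip(row_mask, k)]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]  # exp(-inf) is 0.0
        total = sum(weights)
        attended = [sum(w * v_j[c] for w, v_j in zip(weights, v)) / total
                    for c in range(n_channels)]
        # Residual connection: add the query features from the previous layer.
        out.append([a + r for a, r in zip(attended, x_prev[i])])
    return out


# One query, two locations; the mask only allows the first location.
result = masked_attention(
    x_prev=[[0.0, 0.0]],
    q=[[1.0, 0.0]],
    k=[[1.0, 0.0], [0.0, 1.0]],
    v=[[1.0, 2.0], [3.0, 4.0]],
    mask=[[0.0, NEG_INF]],
)
print(result)
```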

Train on custom dataset for instance segmentation

The custom dataset only has one class, so I set both MODEL.ROI_HEADS.NUM_CLASSES and MODEL.RETINANET.NUM_CLASSES to 1. However, when I evaluate the trained model, I get an error:

File "train_net.py", line 411, in main
res = Trainer.test(cfg, model)
File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/engine/defaults.py", line 617, in test
results_i = inference_on_dataset(model, data_loader, evaluator)
File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/evaluation/evaluator.py", line 205, in inference_on_dataset
results = evaluator.evaluate()
File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/evaluation/coco_evaluation.py", line 206, in evaluate
self._eval_predictions(predictions, img_ids=img_ids)
File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/evaluation/coco_evaluation.py", line 241, in _eval_predictions
f"A prediction has class={category_id}, "
AssertionError: A prediction has class=24, but the dataset only has 1 classes and predicted class id should be in [0, 0].

error in ms_deformable_im2col_cuda: invalid device function

Hi,

I successfully followed the installation instructions in INSTALL.md, namely:

conda create --name mask2former python=3.8 -y
conda activate mask2former
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python

# under your working directory
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/mcordts/cityscapesScripts.git

cd ..
git clone git@github.com:facebookresearch/Mask2Former.git
cd Mask2Former
pip install -r requirements.txt
cd mask2former/modeling/pixel_decoder/ops
sh make.sh

However, when running the demo I get the following:

[02/23 09:54:12 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='../configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml', input=['/home/weber/Pictures/man.png'], opts=['MODEL.WEIGHTS', '/media/weber/Ubuntu2/ubuntu2/Human_Pose/code-from-source/Mask2Former/model_final_94dc52.pkl'], output=None, video_input=None, webcam=False)
[02/23 09:54:14 fvcore.common.checkpoint]: [Checkpointer] Loading from /media/weber/Ubuntu2/ubuntu2/Human_Pose/code-from-source/Mask2Former/model_final_94dc52.pkl ...
[02/23 09:54:16 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo'
Weight format of MultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ...
/mnt/c7dd8318-a1d3-4622-a5fb-3fc2d8819579/CORSMAL/envs/detectron2/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448278899/work/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/mnt/c7dd8318-a1d3-4622-a5fb-3fc2d8819579/CORSMAL/envs/detectron2/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
[02/23 09:54:17 detectron2]: /home/weber/Pictures/man.png: detected 56 instances in 1.09s

From a web search, it seems that this error occurs when the wrong CUDA is installed. However, I correctly installed cudatoolkit 11.1 from the installation procedure above. What else could be the issue?

FYI - the demo runs fine if I run it on my CPU (using the MODEL.DEVICE cpu)
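
A common cause of "invalid device function" and "no kernel image is available for execution on the device" is that the compiled extension contains no code for the GPU's compute capability, for example when the op was built by a system `nvcc` that differs from the conda `cudatoolkit`, or for a different architecture list. The sketch below prints the values worth comparing; it is only a diagnostic aid under that assumption, and the helper `arch_list_entry` is a hypothetical name. `TORCH_CUDA_ARCH_LIST` is the environment variable that PyTorch's extension builder reads.

```python
def arch_list_entry(major, minor):
    """Format a compute capability such as (8, 6) as "8.6"."""
    return f"{major}.{minor}"


try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    entry = arch_list_entry(major, minor)
    print("GPU compute capability:", entry)
    print("CUDA version PyTorch was built with:", torch.version.cuda)
    print("Architectures in this PyTorch build:", torch.cuda.get_arch_list())
    # Rebuilding the op for exactly this architecture is one thing to try.
    print(f'Possible rebuild command: TORCH_CUDA_ARCH_LIST="{entry}" sh make.sh')
else:
    print("PyTorch with CUDA is not available in this environment.")
```

Comparing the output of `nvcc --version` with the CUDA version reported here, and deleting the old build directory before rebuilding, may also help.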

Training of Swin-L for video instance segmentation

Hi Bowen,
I am working on an 8*V100 (32G) cluster.
When I use this config for training, it still runs out of memory.

python scripts/train_net_video.py \
  --num-gpus 8 \
  --config-file configs/youtubevis_2021/swin/video_maskformer2_swin_large_IN21k_384_bs16_8ep.yaml

RuntimeError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 4; 31.75 GiB total capacity; 28.95 GiB already allocated; 11.75 MiB free; 30.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I would appreciate it if you could provide more information about the GPUs used for training.

Why do I always get "detected 10 instances per frame"?

When I fine-tune on my own ytv2021 dataset, I always get "detected 10 instances per frame". Why? Is it because of the trained model?

Question about the length of training time

Hello, thank you for sharing the source code so promptly.
However, I found that training was very slow on my server.
Could you please tell me how long training took on 8 V100s for the following configs?

configs/coco/instance-segmentation/maskformer2_R50_bs16_50ep.yaml
configs/coco/instance-segmentation/maskformer2_R101_bs16_50ep.yaml

Implementation of mask2former for vis

Hi,

Thank you for sharing such good work! I have a simple question about the implementation of Mask2Former for VIS. You mentioned in the report that you use T=2 during training. Did you keep the same setting at inference? Is an IoU tracker used to keep the instance IDs consistent, as in ViP-DeepLab?

Got the answer; I hadn't read the report carefully.

GPU information

Hi,
Do you have any information on GPU usage during inference?

Thank you

The result of swin-small backbone on ADE

Hi,

I ran Mask2Former on ADE (maskformer2_swin_small_bs16_160k.yaml) with four 16GB V100 GPUs. However, I can only achieve 49.6%, which is much worse than the reported result (51.3%). Could you provide the log so that I can analyze the result?

Thanks

Is it possible to compare detection results with models based on bounding box AP?

Hi, first, thank you for your work. It works really well on my custom dataset, and the robustness to occlusion is impressive!

I have questions regarding object detection precision. Mask R-CNN reports both AP bb and AP segm. On the same dataset, the AP segm obtained with Mask2Former is better than that of Mask R-CNN. However, the AP bb of Mask R-CNN is higher than the AP segm of Mask2Former.

My questions are:

Does it mean detection precision is better with Mask R-CNN?
Did you guys try to get the bounding box from the mask segmentation, compute AP bb and compare it with other models?
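
For the second question, a box can be derived from each predicted mask as the tight rectangle around its foreground pixels, and those boxes can then be scored with the usual box AP. The sketch below shows only that derivation in plain Python; `mask_to_box` is a hypothetical helper name, and the nested-list mask stands in for a real tensor. Detectron2's `BitMasks.get_bounding_boxes()` should provide equivalent functionality for tensors.

```python
def mask_to_box(mask):
    """Return the tight box (x0, y0, x1, y1) around the nonzero pixels of a 2D mask.

    x1 and y1 are exclusive. Returns None if the mask has no foreground pixel.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
box = mask_to_box(mask)
print(box)
```

Note that a box derived this way is sensitive to stray foreground pixels far from the object, which can lower AP bb even when AP segm is good.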

About ytvis_2021 dataset

Hi, I am following up on your work on video instance segmentation and trying to run experiments on the ytvis_2021 dataset. The original data downloaded from the link is organized as:

{train/valid/test}/
    JPEGImages/
    instance.json

How should I convert it to the structure you used here? I just copied the instance.json as train/valid/test.json, evaluation could run correctly, but there were some file-not-found errors during training. Looks like some videos listed in train/instance.json are not included in train/JPEGImages/. What should I do?

Thanks a lot!
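
If some videos listed in the annotation file really are missing on disk, one workaround is to drop them from the json before training. The sketch below assumes the YTVIS layout in which "videos" entries carry "id" and "file_names" and "annotations" entries carry "video_id"; the function name `drop_missing_videos` and the `frame_exists` callback are hypothetical.

```python
def drop_missing_videos(ytvis, frame_exists):
    """Return a copy of a YTVIS-style dict without videos that lack frames.

    frame_exists is a callable that takes a relative frame path and returns
    True if that frame is present.
    """
    kept = [v for v in ytvis["videos"]
            if all(frame_exists(name) for name in v["file_names"])]
    kept_ids = {v["id"] for v in kept}
    result = dict(ytvis)  # shallow copy; the input dict is left untouched
    result["videos"] = kept
    if ytvis.get("annotations") is not None:
        result["annotations"] = [a for a in ytvis["annotations"]
                                 if a["video_id"] in kept_ids]
    return result


# Toy example: video 2 is listed in the json, but its frame is not on disk.
annotations = {
    "videos": [
        {"id": 1, "file_names": ["a/00000.jpg", "a/00005.jpg"]},
        {"id": 2, "file_names": ["b/00000.jpg"]},
    ],
    "annotations": [{"id": 10, "video_id": 1}, {"id": 11, "video_id": 2}],
}
on_disk = {"a/00000.jpg", "a/00005.jpg"}
cleaned = drop_missing_videos(annotations, lambda name: name in on_disk)
print([v["id"] for v in cleaned["videos"]])
```

On a real dataset the callback would typically wrap `os.path.exists` on the frame path joined with the JPEGImages directory.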

A training problem about Global alloc not supported yet

I created a new running environment for mask2former according to the steps. When I train the COCO dataset, I can train normally, but when I train my dataset, I encounter the following problems.

I've been looking for a solution on Google for a long time, so I'd like to ask if you have any similar problems. Thank you very much for your reply.

[Inference] AssertionError: Non-existent key: --output

Hello! Thank you for sharing.
I found an error in the inference code (demo.py):
it reports a non-existent key, --output.

I modified the code myself so that the output gets saved; the saving code needs to be modified.

(detectron) hello96min@rvi-node001:~/minseok/Mask2Former/demo$ python3 demo.py --config-file /home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml --input ~/minseok/image-inpainting/datasets_dgm/sample_data/images/images_0.png --opts MODEL.WEIGHTS /home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/model_final_f07440.pkl --output ./1.png
[12/14 22:37:14 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='/home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml', input=['/home/hello96min/minseok/image-inpainting/datasets_dgm/sample_data/images/images_0.png'], opts=['MODEL.WEIGHTS', '/home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/model_final_f07440.pkl', '--output', './1.png'], output=None, video_input=None, webcam=False)
Traceback (most recent call last):
  File "demo.py", line 106, in <module>
    cfg = setup_cfg(args)
  File "demo.py", line 40, in setup_cfg
    cfg.merge_from_list(args.opts)
  File "/home/hello96min/yes/envs/detectron/lib/python3.8/site-packages/fvcore/common/config.py", line 143, in merge_from_list
    return super().merge_from_list(cfg_list)
  File "/home/hello96min/yes/envs/detectron/lib/python3.8/site-packages/yacs/config.py", line 243, in merge_from_list
    _assert_with_logging(subkey in d, "Non-existent key: {}".format(full_key))
  File "/home/hello96min/yes/envs/detectron/lib/python3.8/site-packages/yacs/config.py", line 545, in _assert_with_logging
    assert cond, msg
AssertionError: Non-existent key: --output
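
The Namespace printed above shows `output=None` and `'--output', './1.png'` inside `opts`, which suggests that `--opts` swallowed the later flag. That is the documented behaviour of `nargs=argparse.REMAINDER`. The parser below is a reduced stand-in for the demo's argument parser, assuming it declares `--opts` this way; it is a sketch to reproduce the effect, not the demo's actual code.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", nargs="+")
    parser.add_argument("--output")
    # Everything after --opts is collected verbatim, including later flags.
    parser.add_argument("--opts", default=[], nargs=argparse.REMAINDER)
    return parser


parser = build_parser()

# --output placed after --opts: it ends up inside opts and output stays None.
bad = parser.parse_args(
    ["--input", "a.png", "--opts", "MODEL.WEIGHTS", "w.pkl", "--output", "1.png"]
)
# --output placed before --opts: parsed as intended.
good = parser.parse_args(
    ["--input", "a.png", "--output", "1.png", "--opts", "MODEL.WEIGHTS", "w.pkl"]
)
print(bad.output, bad.opts)
print(good.output, good.opts)
```

If that is what happens here, moving `--output ./1.png` in front of `--opts` should avoid the assertion without any code change.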

Scale the lr when using a different sampling num

Hi,

Thank you for sharing such good work! I have a question regarding the loss. I found that if I use more frames during training, the loss becomes very high. Do I need to scale the lr down linearly according to the number of sampled frames? Thank you.

Question about Bounding Box

Thanks for your great work!

For a more general model, I think it could infer bounding boxes too.
So, I have two questions about your work.

Is there any reason why you didn't add a module for inferring bounding boxes?
I think just adding a box_embed module after the decoder, like class_embed, would enable bounding box inference. Have you tried it? Or did you try it and it didn't work?

Thanks a lot!

Mapillary instance annotation?

Hi,

Thanks for your wonderful repo. I followed the steps in preparing datasets, but it seems that datasets/prepare_mapillary_vistas_ins_seg.py is not provided. Could you please check?

Question about Bounding Box

Thanks for your great work!

I added a bounding box head to the Mask2Former model, like DETR
(in parallel with the mask label).

But the performance is not good.
Do you think the Mask2Former architecture is not good at detecting bounding boxes?
If you have any ideas or intuition, please tell me.

Thanks a lot!

Get output masks only on desired classes.

Hi,
Is there a way to output masks for only 1 or 2 specified classes from ADE20K or COCO?
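
One simple option is to filter the predictions after inference. The sketch below assumes a Detectron2-style panoptic output, that is, a 2D map of segment ids plus a list of `segments_info` dicts with "id" and "category_id", and it assumes that segment id 0 means "no segment". The function name `keep_classes` and the nested-list map are hypothetical stand-ins for the real tensors.

```python
def keep_classes(panoptic_seg, segments_info, wanted_category_ids):
    """Keep only the segments whose category is wanted; set the rest to 0."""
    wanted = set(wanted_category_ids)
    kept_info = [s for s in segments_info if s["category_id"] in wanted]
    kept_ids = {s["id"] for s in kept_info}
    filtered = [[sid if sid in kept_ids else 0 for sid in row]
                for row in panoptic_seg]
    return filtered, kept_info


panoptic_seg = [
    [1, 1, 2],
    [3, 3, 2],
]
segments_info = [
    {"id": 1, "category_id": 0},
    {"id": 2, "category_id": 17},
    {"id": 3, "category_id": 0},
]
filtered, kept_info = keep_classes(panoptic_seg, segments_info, [0])
print(filtered)
```

For instance segmentation outputs the same idea applies: select the predictions whose predicted class is in the wanted set.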

Why is the speed so slow compared with the first version of MaskFormer?

Is that because of the use of multi-scale features and masked attention? The speed does not seem very satisfying for some practical scenarios.

Ade20k Panoptic Segmentation demo problem

Hi,

I have a problem trying to use the demo with Ade20k Panoptic Segmentation. The command used is:

python demo.py --config-file ../configs/ade20k/panoptic-segmentation/maskformer2_R50_bs16_160k.yaml \
  --video-input ... \
  --output ... \
  --opts MODEL.WEIGHTS ../models/model_final_5c90d4.pkl

And the stack trace is:

  File "demo.py", line 182, in <module>
    for vis_frame in tqdm.tqdm(demo.run_on_video(video), total=num_frames):
  File "/home/master/Develop/Mask2Former/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/master/Develop/Mask2Former/demo/predictor.py", line 130, in run_on_video
    yield process_predictions(frame, self.predictor(frame))
  File "/home/master/Develop/Mask2Former/demo/predictor.py", line 94, in process_predictions
    vis_frame = video_visualizer.draw_panoptic_seg_predictions(
  File "/home/master/Develop/Mask2Former/lib/python3.8/site-packages/detectron2/utils/video_visualizer.py", line 172, in draw_panoptic_seg_predictions
    labels = [self.metadata.thing_classes[k] for k in category_ids]
  File "/home/master/Develop/Mask2Former/lib/python3.8/site-packages/detectron2/utils/video_visualizer.py", line 172, in <listcomp>
    labels = [self.metadata.thing_classes[k] for k in category_ids]
IndexError: list index out of range

I think I have found the source of the problem in the lines

    thing_classes = [k["name"] for k in ADE20K_150_CATEGORIES if k["isthing"] == 1]
    thing_colors = [k["color"] for k in ADE20K_150_CATEGORIES if k["isthing"] == 1]

of mask2former/data/datasets/register_ade20k_panoptic.py

And from my understanding it happens because Detectron2 seems to use the id as the index, but these lines remove some items and the index to id mapping is lost.

Changing the lines to

    thing_classes = [k["name"] for k in ADE20K_150_CATEGORIES]
    thing_colors = [k["color"] for k in ADE20K_150_CATEGORIES]

seems to work, but I don't know if there are any undesired consequences.

I have installed Detectron2 with pip, but the line where the error happens appears to be also in the git version.
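
The mismatch described above can be reproduced with a toy category list. The sketch assumes, as the post suggests, that the predicted category_id is an index into the full category list; the four categories and the helper `label_for` are made up for illustration and are not the real ADE20K entries.

```python
# Hypothetical categories with things and stuff interleaved, as in ADE20K.
CATEGORIES = [
    {"id": 0, "name": "wall", "isthing": 0},
    {"id": 1, "name": "bed", "isthing": 1},
    {"id": 2, "name": "sky", "isthing": 0},
    {"id": 3, "name": "chair", "isthing": 1},
]

thing_only = [k["name"] for k in CATEGORIES if k["isthing"] == 1]
all_names = [k["name"] for k in CATEGORIES]


def label_for(names, category_id):
    """Look up a label by index; return None when the index is out of range."""
    try:
        return names[category_id]
    except IndexError:
        return None


# Category 3 ("chair") is out of range in the filtered list.
print(label_for(thing_only, 3), label_for(all_names, 3))
# Category 1 ("bed") is in range but resolves to the wrong name.
print(label_for(thing_only, 1), label_for(all_names, 1))
```

The second lookup shows that the filtered list can also produce a wrong label silently, not only an IndexError.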

Does MaskFormer support multi-scale testing for the VIS task?

I appreciate your excellent work! I wonder whether you have tried testing tricks such as multi-scale testing, which may boost the final performance. Does the implementation of Mask2Former in this repo support multi-scale testing on ytvis 2019/2021?

Question for input data

Hi, I really enjoyed reading the Mask2Former paper.

Could 3D images be used for training if the code is modified appropriately?

Regards,
Tae

Wrong link in model zoo

The link for Mask2former_r101 for COCO panoptic in the model zoo is wrong.

Issue reproducing COCO training

Hello all, I am quite confused by the entry " panoptic_{train,val}2017/ # png annotations " in the COCO folder structure. When I downloaded the COCO dataset, I couldn't find this folder. Could you please tell me how I can get or generate it? I know there are panoptic annotations, but how exactly do I generate the folder? Thank you

Validation accuracy always 100?

Hello,

Thank you for this project and code. I'm running a custom semantic segmentation training job (based on this config) with one class (custom class 'AT') and for some reason my validation accuracy after an epoch is always 100:

[12/21 21:20:26 d2.evaluation.evaluator]: Total inference time: 0:00:44.703675 (0.065935 s / iter per device, on 1 devices)
[12/21 21:20:26 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:37 (0.055430 s / iter per device, on 1 devices)
[12/21 21:20:26 d2.evaluation.sem_seg_evaluation]: OrderedDict([('sem_seg', {'mIoU': 100.0, 'fwIoU': 100.0, 'IoU-AT': 100.0, 'mACC': 100.0, 'pACC': 100.0, 'ACC-AT': 100.0})])
[12/21 21:20:26 d2.engine.defaults]: Evaluation results for ade20k_full_sem_seg_val in csv format:
[12/21 21:20:26 d2.evaluation.testing]: copypaste: Task: sem_seg
[12/21 21:20:26 d2.evaluation.testing]: copypaste: mIoU,fwIoU,mACC,pACC
[12/21 21:20:26 d2.evaluation.testing]: copypaste: 100.0000,100.0000,100.0000,100.0000

When I view the loss curves in tensorboard it seems like the model is learning so I'm not sure what's going on:

Here's the full config:

config.zip

Any ideas?

Thank you
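
One possible explanation, offered as a guess rather than a diagnosis: with a single class, the per-pixel argmax over class scores can only ever return class 0, and if every pixel that is not 'AT' carries the ignore label, those pixels are excluded from evaluation. Every evaluated pixel then matches by construction. The sketch below shows the effect with a plain-Python IoU; the function name and the ignore value 255 are assumptions.

```python
def iou_per_class(pred, gt, num_classes, ignore_label=255):
    """Per-class IoU in percent, skipping ground-truth pixels marked as ignored."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            if g == ignore_label:
                continue  # ignored pixels never count, whatever was predicted
            for c in range(num_classes):
                if p == c and g == c:
                    inter[c] += 1
                if p == c or g == c:
                    union[c] += 1
    return [100.0 * i / u if u else float("nan") for i, u in zip(inter, union)]


# Ground truth: class 0 where the object is, ignore label everywhere else.
gt = [[0, 255], [255, 0]]
# With one class the model can only output class 0, even on background pixels.
pred = [[0, 0], [0, 0]]
scores = iou_per_class(pred, gt, num_classes=1)
print(scores)
```

If this matches your setup, adding an explicit background class, so that there are two classes and background pixels are evaluated, would make the metric informative.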

Why do we need to compile a CUDA kernel for MSDeformAttn?

Can you please explain why we need to compile a CUDA kernel for MSDeformAttn?
I see there is a Python file for it, so I don't understand why the compilation is needed.
Sorry, I'm not very familiar with why it would not work if we used the Python functions without compiling.
In what scenarios should I generally compile a CUDA kernel? I have never paid attention to this :/

I would really appreciate it if you could explain the reason behind it.
Thanks a ton!

How to inspect the model, e.g. visualize the attention?

Has anyone visualized the attention? Where can I get the features and visualize them?

I fine-tuned on a custom dataset; now it's time to inspect the model.

How to map category_id to object label in panoptic_seg

I am running inference using COCO config. How can I get the object label of each class in the way that it is displayed on a visualized output image? Basically looking for some mapping from category_id to label.
During inference, I got category_ids greater than 91 so I thought the standard COCO mapping won't work.

For instance segmentation, how to deal with the 100 predicted instances for evaluation?

There are 100 instances per image. How are these instances post-processed? Is there any threshold or NMS? Where is the code for this part? Thanks.
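
As far as I can tell from reading the repository's instance inference (please verify against the code), no NMS is applied: the per-query class probabilities, without the no-object column, are flattened over all (query, class) pairs and the top-k pairs are kept, with k presumably taken from TEST.DETECTIONS_PER_IMAGE. Each kept score is then multiplied by the average mask probability inside the predicted mask. The sketch below shows only the top-k step in plain Python; the function name is hypothetical.

```python
def topk_instances(class_probs, k):
    """Select the k best (query, class) pairs.

    class_probs is an N x C list of class probabilities per query, with the
    no-object column already removed. Returns (score, query_index, class_index)
    tuples sorted by descending score.
    """
    num_classes = len(class_probs[0])
    flat = [p for row in class_probs for p in row]
    order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]
    # A flat index encodes both the query and the class.
    return [(flat[i], i // num_classes, i % num_classes) for i in order]


class_probs = [
    [0.9, 0.05],  # query 0
    [0.2, 0.7],   # query 1
]
selected = topk_instances(class_probs, k=3)
print(selected)
```

Note that the same query can appear more than once with different classes, as query 1 does here.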

ImportError: .../MultiScaleDeformableAttention.cpython-38-X86_64-linux-gnu.so: undefined symbol: _ZNK2at10TensorBase8dataptrIdEEPT_v

Hello! Thank you for sharing.
I followed your example conda environment setup:

But when I run the code:

it does not run successfully and raises an ImportError.

How to visualize the VIS results?

Hi,

Thanks for your wonderful work and repo.

Could you please provide the instructions on how to visualize the video instance segmentation results on images or videos? Thanks!

Request for training logs on COCO

Hi,

Could you please share the training logs for the models as well? It's now common to share them (e.g. DeiT does), and they would help with debugging and reproducing results. Even logs for just one model, say R50 on COCO, would help.

Best,
Kartik

Colab demo doesn't work

I would love to try out this model but I am struggling with installation. The Colab demo does not work either and gives the error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/content/Mask2Former/mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py in <module>()
     21 try:
---> 22     import MultiScaleDeformableAttention as MSDA
     23 except ModuleNotFoundError as e:

ModuleNotFoundError: No module named 'MultiScaleDeformableAttention'

During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)
7 frames
/content/Mask2Former/mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py in <module>()
     27         "\t`sh make.sh`\n"
     28     )
---> 29     raise ModuleNotFoundError(info_string)
     30 
     31 

ModuleNotFoundError: 

Please compile MultiScaleDeformableAttention CUDA op with the following commands:
	`cd mask2former/modeling/pixel_decoder/ops`
	`sh make.sh`


---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
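
The traceback shows the op being imported inside a `try` block, with a second error raised when the compiled module is missing. A minimal sketch of checking for the compiled extension up front, using only the standard library; `is_module_available` is a hypothetical helper name, and the message simply repeats the commands from the error above.

```python
import importlib.util

def is_module_available(name):
    """Return True if `name` can be located on the current import path."""
    return importlib.util.find_spec(name) is not None

if not is_module_available("MultiScaleDeformableAttention"):
    print(
        "MultiScaleDeformableAttention is not built; run `sh make.sh` in "
        "mask2former/modeling/pixel_decoder/ops first."
    )
```

Running this in a notebook cell before the demo makes it clear whether the build step was skipped or failed.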

Train coco dataset error

I followed the instructions at
https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md and prepared the COCO datasets.

I have already run the demo successfully, but an error occurs when I run the training script:
python train_net.py --num-gpus 8 --config-file configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml

The output is as follows:
`
[02/01 14:43:34 mask2former.data.dataset_mappers.coco_panoptic_new_baseline_dataset_mapper]: [COCOPanopticNewBaselineDatasetMapper] Full TransformGens used in training: [RandomFlip(), ResizeScale(min_scale=0.1, max_scale=2.0, target_height=1024, target_width=1024), FixedSizeCrop(crop_size=(1024, 1024))]
[02/01 14:43:41 d2.data.build]: Using training sampler TrainingSampler
[02/01 14:43:41 d2.data.common]: Serializing 118287 elements to byte tensors and concatenating them all ...
[02/01 14:43:42 d2.data.common]: Serialized dataset takes 78.29 MiB
[02/01 14:43:51 fvcore.common.checkpoint]: [Checkpointer] Loading from model_final_94dc52.pkl ...
[02/01 14:43:51 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo'
WARNING [02/01 14:43:51 mask2former.modeling.transformer_decoder.mask2former_transformer_decoder]: Weight format of MultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ...
[02/01 14:43:51 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "/home/xt.xie/workspace/code/Mask2Former-main/train_net.py", line 321, in
launch(
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/launch.py", line 67, in launch
mp.spawn(
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
main_func(*args)
File "/home/xt.xie/workspace/code/Mask2Former-main/train_net.py", line 315, in main
return trainer.train()
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/maskformer_model.py", line 209, in forward
losses = self.criterion(outputs, targets)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/modeling/criterion.py", line 222, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/modeling/matcher.py", line 179, in forward
return self.memory_efficient_forward(outputs, targets)
File "/opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/modeling/matcher.py", line 122, in memory_efficient_forward
tgt_mask = point_sample(
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/projects/point_rend/point_features.py", line 39, in point_sample
output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 3836, in grid_sample
return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: grid_sampler(): expected input and grid to have same dtype, but input has c10::Half and grid has float
`

Converting a PyTorch model to an ONNX model

Hi, thanks for your great work.
I'm currently using the COCO maskformer2_swin_large_IN21k_384_bs16_100ep.yaml configuration with the pretrained model.
I'm trying to convert this model to ONNX format, but it fails with a segmentation fault.

Could you please share the converted model or explain how to do the conversion?

I trained instance segmentation, but I don't get bboxes

Hi, thanks for your code.
I trained instance segmentation on custom data, but the output has no bboxes or bbox scores.
Running the demo on my test picture gives no bboxes either.
Why do I only get masks and mask scores?
Thank you!
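
If boxes are needed downstream, one option is to derive them from the predicted masks. A minimal sketch, assuming each mask is a binary 2D array (plain nested lists here); `mask_to_box` is an illustrative helper and not part of the repository, and it says nothing about how the repository fills its own box field.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if empty.

    `mask` is a list of rows of 0/1 values; coordinates are inclusive pixel indices.
    """
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
print(mask_to_box(mask))
```

An empty mask returns `None` so callers can decide whether to skip the instance or emit a zero-size box.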

About evaluating Mask2Former on YouTubeVIS-2021

The original data downloaded from the link is organized as:

ytvis_2021/
  {train/valid/test}/
    JPEGImages/
    instance.json

There is no Annotations directory as in the layout you describe below:

ytvis_2021/
  {train,valid,test}.json
  {train,valid,test}/
    Annotations/
    JPEGImages/

How do you evaluate Mask2Former on YouTubeVIS-2021?
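
To see exactly which entries differ between the two layouts, a small check like the following can help. It is a sketch that assumes the second layout above is the one the code expects and that the dataset root is `ytvis_2021` in the current directory; `missing_entries` is a hypothetical helper.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")

def missing_entries(root):
    """List expected files/directories that are absent under `root`."""
    root = Path(root)
    expected = []
    for split in SPLITS:
        expected.append(root / f"{split}.json")
        expected.append(root / split / "Annotations")
        expected.append(root / split / "JPEGImages")
    return [str(p.relative_to(root)) for p in expected if not p.exists()]

# Adjust the path to wherever the dataset was extracted.
print(missing_entries("ytvis_2021"))
```

An empty list means the directory matches the expected layout; anything printed is what still needs to be created, moved, or renamed.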

How to fine-tune on a custom dataset?

How can I fine-tune on a custom dataset? Is there a config file that supports fine-tuning?

Question about the need of bounding boxes in the labels

Hi, thank you for sharing your work; it works great on my dataset.

I don't understand whether the model uses the ground-truth bounding boxes.
For example, when working with the COCO dataset, would anything change in the training phase if we set every annotations[n].segments_info[m].bbox to [0,0,0,0]?

How to install with pip?

I tried to run pip install git+https://github.com/facebookresearch/Mask2Former, but the terminal throws a bunch of errors (probably because the repo lacks a setup.py).
I really don't like using conda, so is there any way to install it with pip, or should I build it from source?

Modifying MODEL.MASK_FORMER.NUM_OBJECT_QUERIES prevents loading the pretrained model when training

Thanks for your code. I found that I can save GPU memory by modifying MODEL.MASK_FORMER.NUM_OBJECT_QUERIES when running demo.py. But after modifying it, I can't load the pretrained model when training. Can you give me any suggestions?
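
A plausible cause, stated here as an assumption, is that some parameters have one row per query, so their shapes no longer match the checkpoint once the query count changes. One generic workaround is to load only shape-compatible tensors and leave the rest freshly initialized. The sketch below shows the filtering step on plain dictionaries of shapes; the parameter names and shapes are illustrative, not taken from the actual checkpoint.

```python
def drop_mismatched(checkpoint_shapes, model_shapes):
    """Split checkpoint keys into loadable ones and ones to skip."""
    keep, skipped = [], []
    for name, shape in checkpoint_shapes.items():
        if model_shapes.get(name) == shape:
            keep.append(name)
        else:
            skipped.append(name)
    return keep, skipped

# Illustrative names and shapes only: a checkpoint trained with 100 queries
# versus a model built with 50 queries.
checkpoint_shapes = {
    "query_embed.weight": (100, 256),
    "backbone.conv.weight": (64, 3, 7, 7),
}
model_shapes = {
    "query_embed.weight": (50, 256),
    "backbone.conv.weight": (64, 3, 7, 7),
}

keep, skipped = drop_mismatched(checkpoint_shapes, model_shapes)
print("load:", keep)
print("re-initialize:", skipped)
```

Skipped parameters start from random initialization, so some fine-tuning is usually needed before results are comparable.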

Is there anything wrong with the table in the paper?

PQ column: 50.7 51.2(-0.7)? I guess the PQ in the first row should be 51.9?
Another question: when trained for the same number of epochs as HTC++ or K-Net, is the result still better?

Question about the model training

Hi, thank you for your excellent work. I ran into a problem when re-running your experiments.

When I retrain the instance segmentation model with R-50 on the COCO dataset, the results are:
43.5, 23.0, 47.0, 65.1
43.2, 22.7, 46.4, 64.8
which are a bit lower than your reported numbers:
43.7, 23.4, 47.2, 64.8

I used the standard configuration file without any modification and ran on 4/8 V100 cards. Is this normal variance, or did you see the same problem during training?
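
To make the size of the gap concrete, the per-metric differences can be computed directly from the numbers above. This is a small arithmetic sketch; the four columns are left unnamed because the post does not label them.

```python
reported = [43.7, 23.4, 47.2, 64.8]
runs = [
    [43.5, 23.0, 47.0, 65.1],
    [43.2, 22.7, 46.4, 64.8],
]

def deltas(run, reference):
    """Per-metric difference (run minus reference), rounded to one decimal."""
    return [round(r - ref, 1) for r, ref in zip(run, reference)]

for run in runs:
    print(deltas(run, reported))
```

The first run is within 0.4 points on every column, while the second is up to 0.8 points lower on the third column.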

CUDA out of memory

When I run demo_video/demo.py on my video, it fails with "CUDA out of memory". I tried reducing the input size, but it didn't help. Can you tell me how to solve this problem? Thanks!

Issues training the ytvis_2019 model

Hi Bowen,

Thanks for your excellent work and code! I am retraining the video instance segmentation model on the YouTube-VIS 2019 dataset. I managed to train the model, but the quantitative result on CodaLab turns out to be very low (only about 40).

The command I used for training is:

python3 train_net_video.py \
    --config-file configs/youtubevis_2019/swin/video_maskformer2_swin_base_IN21k_384_bs16_8ep.yaml \
    --num-gpus 8 \
    MODEL.WEIGHTS swin_base_patch4_window12_384_22k.pkl

The backbone weights were obtained with:

wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22k.pth
python tools/convert-pretrained-swin-model-to-d2.py swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl

Then I submitted output/inference/results.json to CodaLab after training, but only got an accuracy of 40. I also tried rerunning the evaluation using output/model_final.pth, and the results are almost the same.

The config files are untouched. I am able to reproduce the correct result using the pretrained model, so I assume that my environment and dataset setup are OK. I also checked the TensorBoard output, and the loss curve looks good. Could you help check whether anything is wrong with my training process? Thanks!

Using CE loss instead of focal loss

I notice that you use the standard cross-entropy (CE) loss instead of focal loss. Does this affect the results?

Can the model be converted to TorchScript?

Hi, can the model be converted to TorchScript?
I tried, but I got this error:

RuntimeError: 
Could not export Python function call 'MSDeformAttnFunction'. Remove calls to Python functions before export. Did you forget to add @script or @script_method annotation? If this is a nn.ModuleList, add it to __constants__:
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/ops/modules/ms_deform_attn.py(117): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(124): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(159): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(87): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(324): forward_features
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/autocast_mode.py(198): decorate_autocast
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/meta_arch/mask_former_head.py(119): layers
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/meta_arch/mask_former_head.py(116): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/maskformer_model.py(198): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/detectron2/detectron2/export/flatten.py(259): <lambda>
/home/ubuntu/PycharmProjects/mask2former/venv/detectron2/detectron2/export/flatten.py(294): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/toTorchScript.py(44): export_tracing
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/toTorchScript.py(115): <module>

A question regarding the pixel decoder with FaPN

I am confused about the pixel decoder. One of the experiments in the paper briefly mentions that "Swin-L-FaPN uses FaPN as pixel decoder". How exactly is FaPN used as the pixel decoder, given that FaPN is itself a complete model? Did you incorporate the FaPN components (FeatureAlign and FeatureSelectionModule) into the pixel decoder of Mask2Former for that experiment?
