raoyongming / denseclip Goto Github PK

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Python 99.05% Shell 0.95%

denseclip's Issues

Training details

Hello!

Very interesting work, great job!
I could not find any information regarding the GPUs used for training. I see you have performed extensive experiments, I am specifically interested in knowing the GPUs used for experiments with ViT/Swin Transformer backbone!

Thanks

downloading the pretrained weights

where do I need to download RN50 and RN101 from (the ones pretrained with clip)?

thanks!

Questions about text input

Hi, I want to change the text input from a word to a paragraph, is there a way to do this?
For example, I would like to add an explanation of the category name after each category name.

What are the different context_length settings based on? What is the meaning of the separation?

In denseclip.py, there is the following line:
“context_length = self.text_encoder.context_length - self.context_length”
Can you explain the role of the different contexts, and why "-" is used to calculate the context_length of self.contexts?
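
One possible reading of this line, offered as an assumption rather than the authors' explanation: the text encoder has a fixed budget of token positions, part of it is reserved for the tokenized class name, and whatever remains is filled with learnable context vectors. A minimal sketch of that bookkeeping follows; the function name is hypothetical and the values 13 and 5 are illustrative assumptions, not taken from the repository.

```python
def learnable_context_length(encoder_context_length: int, class_token_length: int) -> int:
    """Positions left for learnable context vectors once class-name tokens are reserved."""
    remaining = encoder_context_length - class_token_length
    if remaining < 0:
        raise ValueError("class-name tokens exceed the encoder's context length")
    return remaining


# Assumed values, for illustration only.
total_slots = 13   # plays the role of self.text_encoder.context_length
class_slots = 5    # plays the role of self.context_length
print(learnable_context_length(total_slots, class_slots))
```

Under this reading, the subtraction simply keeps the two parts from exceeding the encoder's positional-embedding length when they are concatenated.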

vit-b-denseclip for semantic segmentation is lost

Hi and thanks for your great work! The pth file of the vit-b-denseclip for semantic segmentation is lost, can you upload it again?

code about Pre-model prompting

Hi, the paper proposes two context prompting methods: pre-model and post-model prompting. But I can only find the post-model method in the source code. Could you please provide the code for pre-model prompting? That would help a lot in understanding it. Thanks!

super(SingleStageDetector, self).__init__(init_cfg) TypeError: __init__() takes 1 positional argument but 2 were given

Hi. Thank you for your great work.

I tried to run the train code with command

"bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 1", but got the following error.

super(SingleStageDetector, self).__init__(init_cfg) TypeError: __init__() takes 1 positional argument but 2 were given

I did not change the code, but I don't know why building the model raises this error.
Could I get some help?!

What does "We fix the text encoder during training" mean? Does it mean that the parameters are not updated during training?

Thanks for your great work! Does "We fix the text encoder during training" mean that the parameters of text encoder are not updated during training? If so, I can't find any "requires_grad=False" settings in the source code.

ADE20K batchsize

For the ADE20K dataset, is the batch size 32 (4 per GPU x 8 GPUs)?

dim unsigned

Hi, I used the pre-trained model ViT-B-16.pt and the config retinanet_clip_r101_fpn_1x_coco.py. However, the embed_dim of CLIPTextContextEncoder is 1024 while the embed_dim of the pretrained model is 512.

about single-scale and multi-scale settings

The default setting is multi-scale model, which is provided in the github. What about the single-scale model? Could you please provide the config files for it.

New MMCV and MMSegmentation version

OpenMMLab recently released new versions of MMCV (2.0.0) and MMSegmentation (1.0.0). Will you migrate your code to the latest versions?

Code

I am very interested in the paper, but the pre-model prompting mentioned in the paper seems to be missing from the code. Could you add this part of the code? I am very interested in this part.

An error while saving the model in ONNX format

Have you ever tried saving the model as ONNX for visualization? Why does the trained model report an error missing img_meta when saved as ONNX?

Where can I download this （RN102.pt）?

DenseCLIP/segmentation/configs/denseclip_fpn_res101_512x512_80k.py

Line 9 in a5bfe25

pretrained='pretrained/RN102.pt',

Did you download it from https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt? Is it a typo?

Thanks!

Additional Package Dependencies

Nice work!
But when creating a new conda environment for DenseCLIP, I found the following packages are required:

regex
ftfy

After using pip to install these packages, the code works fine.

(Moreover, it seems that it's not necessary to install fvcore explicitly.)

Can not reproduce the result of DenseCLIP-R50.

Hi,

I followed your instructions for training DenseCLIP-R50, with a batch size of "4 x 8 GPU".

However, I cannot reproduce the result in your paper (43.5); I only get 42.8 mIoU.

Can you provide the training log file, or more details (e.g., the seed) needed to reproduce the paper's results? Thanks!

Question about DenseCLIP for Any Visual Backbone

About prompt text

Thanks for your sharing work

I want to know how the prompt text is created (by a script or manually).

Can you provide some examples？

Questions about the architecture

You present in the paper only the results related to Semantic FPN.

Have you conducted any relevant experiments based on dilated backbone methods (e.g. DeepLabV3+)?

Is the reason for not using dilated backbone-based methods that you are looking for fewer FLOPs or have you found that the results are not good?

question about eos_indx in model.py

Nice work! For class of CLIPTextContextEncoder, why did the line eos_indx = text.argmax(dim=-1) + N2 add N2 ?
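
A possible explanation, sketched under assumptions rather than confirmed by the authors: in CLIP's tokenizer the end-of-text token has the largest id in the vocabulary, so an argmax over the token ids returns its position. If N2 learnable context vectors are inserted after the start token, every later position, including the end-of-text one, shifts right by N2. The ids other than 49406 and 49407 are made up, and the helper names are hypothetical.

```python
SOT, EOT = 49406, 49407  # CLIP start/end-of-text ids; EOT is the largest id


def argmax(values):
    """Index of the first maximum, mirroring tensor.argmax on a single row."""
    return max(range(len(values)), key=values.__getitem__)


def insert_context(tokens, n_ctx, placeholder=-1):
    """Insert n_ctx placeholder slots right after the start token."""
    return tokens[:1] + [placeholder] * n_ctx + tokens[1:]


tokens = [SOT, 320, 1929, EOT, 0]   # made-up ids for a short class name, zero-padded
n_ctx = 8                           # assumed number of learnable context vectors (N2)

eos_before = argmax(tokens)                 # position of EOT in the raw token row
extended = insert_context(tokens, n_ctx)    # sequence layout after adding contexts
print(eos_before, eos_before + n_ctx, extended[eos_before + n_ctx] == EOT)
```

If the contexts were placed somewhere other than right after the start token, the offset would have to change accordingly.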

[critical bug] The text encoder is also updated.

I found out that the text encoder is also updated.
The positional embedding of the provided "denseclip_fpn_res50.pth" is
tensor([[-0.0013, 0.0003, 0.0007, ..., -0.0027, -0.0091, -0.0024],
[-0.0039, -0.0008, -0.0016, ..., -0.0006, -0.0049, -0.0044],
[-0.0044, 0.0011, -0.0007, ..., -0.0026, -0.0094, -0.0008],
...,
[-0.0002, -0.0002, -0.0012, ..., 0.0007, 0.0013, -0.0002],
[-0.0016, -0.0015, -0.0001, ..., -0.0010, -0.0025, -0.0004],
[-0.0030, -0.0013, -0.0004, ..., -0.0028, -0.0052, -0.0016]])

And the first 13 positional embedding of the pretrained RN50 model is
tensor([[-0.0012, 0.0003, 0.0008, ..., -0.0027, -0.0090, -0.0024],
[-0.0040, -0.0008, -0.0015, ..., -0.0006, -0.0049, -0.0045],
[-0.0044, 0.0011, -0.0006, ..., -0.0025, -0.0093, -0.0007],
...,
[-0.0002, -0.0002, -0.0011, ..., 0.0006, 0.0011, -0.0003],
[-0.0018, -0.0016, -0.0002, ..., -0.0009, -0.0025, -0.0004],
[-0.0031, -0.0014, -0.0006, ..., -0.0026, -0.0053, -0.0015]],
device='cuda:0', grad_fn=)

, which is slightly different.

I guess the reason is that "lr_mult" does not guarantee zero LR. The learning rate of the text encoder may get bigger than 0 due to the internal behavior of the LR scheduler. I think this is quite a critical bug since it may affect the result of the ablation study (Table 2 in the paper).
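
The guess above can be illustrated with a small sketch. It assumes the schedule has the form `(base_lr - min_lr) * coeff + min_lr`, which is how I understand the poly policy in mmcv 1.x, and that `lr_mult` scales the base learning rate before the schedule is applied. The function name and the values (power 0.9, min_lr 1e-6, 80000 iterations) are illustrative assumptions; warmup is ignored.

```python
def poly_lr(base_lr, cur_iter, max_iters, power=0.9, min_lr=1e-6):
    """Poly decay of the assumed form (base_lr - min_lr) * coeff + min_lr."""
    coeff = (1 - cur_iter / max_iters) ** power
    return (base_lr - min_lr) * coeff + min_lr


# A parameter group with lr_mult = 0 starts from a base learning rate of zero.
frozen_base_lr = 1e-4 * 0.0
for it in (0, 40000, 80000):
    print(it, poly_lr(frozen_base_lr, it, 80000))
```

Under these assumptions the rate is zero only at the first iteration and then rises towards `min_lr`, which would be consistent with the small drift in the positional embeddings shown above.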

Also, I have one more question: Why do you set lr_mult as 0 for 'norm'? As far as I know, the mmcv library tries to set learning_rate as 0 for every module which includes the key "norm". If it is right, every 'normalization layer' in the transformer layer (especially the context decoder) will be 0.
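
For the second question, a rough sketch of how key matching could work, assuming (as I understand mmcv 1.x's default optimizer constructor) that each custom key is matched as a substring of the full parameter name, with longer keys tried first. The parameter names are hypothetical, and the `custom_keys` dict mirrors a config quoted in another issue on this page rather than the DenseCLIP config itself.

```python
def match_custom_key(param_name, custom_keys):
    """Return the first custom key contained in param_name, longest keys first."""
    ordered = sorted(sorted(custom_keys), key=len, reverse=True)
    for key in ordered:
        if key in param_name:
            return key
    return None


custom_keys = {"backbone": {"lr_mult": 0.1}, "norm": {"decay_mult": 0.0}}
names = (
    "backbone.layer1.0.bn1.weight",            # hypothetical parameter names
    "context_decoder.memory_proj.0.weight",
    "context_decoder.decoder.0.norm1.weight",
)
for name in names:
    print(name, "->", match_custom_key(name, custom_keys))
```

Note that a key carrying only `decay_mult=0.0` would switch off weight decay for the matched parameters while leaving their learning rate untouched; only `lr_mult` affects the learning rate.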

A stupid question about auxiliary loss for Object detection & instance segmentation.

The paper said, "we do not have ground truth segmentation label.". I can understand there is no segmentation mask for detection, but why is there no segmentation mask for instance segmentation task?

Question about little FPN in ViT-B

Greetings, I like your great work and try to follow it to do something!

I noticed that your ViT-B-16 contains a little FPN, I wonder why you prefer to add this, and what will happen if we remove this?

dimension error when load ViT-B weight

for config denseclip_fpn_vit-b_640x640_80k.py:
in text_encoder:
embed_dim=512, while ViT-B-16.pt has embed_dim=1024,
when loading weight, it turns out that: "RuntimeError: Error(s) in loading state_dict for CLIPTextContextEncoder:
size mismatch for text_projection: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512])"
How do you deal with this problem?
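
A small diagnostic along these lines can list every mismatching entry before loading, instead of stopping at the first error. This is a sketch with plain shape tuples: the `text_projection` shapes come from the error message above, the `positional_embedding` shape is made up, and the function name is hypothetical. With real tensors, each shape would come from `tuple(tensor.shape)` for the entries of the two state dicts.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for keys present in both but differing."""
    return [
        (name, shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    ]


checkpoint_shapes = {"text_projection": (512, 1024), "positional_embedding": (77, 512)}
model_shapes = {"text_projection": (512, 512), "positional_embedding": (77, 512)}

for name, ckpt, model in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

In CLIP, `text_projection` has shape (transformer width, embedding dimension), so a second dimension of 1024 suggests that the file actually being loaded has an embedding dimension of 1024.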

Some questions of ViT-B-DenseCLIP

1. I would like to know the performance of ViT-B-DenseCLIP (vs. RN101-DenseCLIP). Can you share the specific numbers, and how to train ViT-B-DenseCLIP on COCO or ADE20K?
2. Is ViT-B-DenseCLIP based on ViT-B-16.pt, not ViT-B-32.pt?

Misaligned params

Hi,
when I apply DenseCLIP to torch.nn.DataParallel for multi-GPUs training,
the pretrained model can not load correctly,
how can I fix this? (Single GPU training work well)

question about how to use vit backbone in detection

Hi thank you for your good work.

I tried to use the ViT structure in mmdetection. I noticed that the ViT part uses the DenseCLIP_MaskRCNN class in detection/denseclip/denseclip.py, but when I add a ViT config I cannot run it correctly: there are dimension mismatches or other incompatibility problems, so it seems that the ViT version for detection is not complete. Can you give me some suggestions? If I want to use ViT as the backbone with DenseCLIP on COCO, can I use denseclip.py from segmentation? Or can you provide a ViT detection configuration? Or am I using it the wrong way?

Thanks!

question on the device

Which and how many gpus do you use？How long does it take to complete a training session？

Question about ADE20K dataset

Hi, I want to reproduce your experiment on ADE20K, but when I downloaded the dataset I noticed that the ADE website has updated it, so the file structure and the number of training images are different from the older ADEChallengeData2016. Which version did you use in your work?
The older version of ADE20K contains 20210 training images.
The newer version contains 25K training images, as shown on their website.

Query on Inference Setting

Hi,

Thanks for making the code public !

I had a general query on the inference setting chosen for this paper: why does the paper not target the zero-shot setting and instead focus on the fully supervised setting? Is there any reason? As the power of CLIP lies in zero-shot task transfer, I was wondering why no experiments were done for this, and why the problem was instead posed as a multi-modal, fully supervised dense detection task.

Thanks in advance

Question about CLIPVisionTransformer

Hi, Thanks for your work DenseCLIP.

I have some question about the CLIPVisionTransformer.


x = self.conv1(x)  # shape = [*, width, grid, grid]
B, C, H, W = x.shape
x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]


pos = self.positional_embedding.to(x.dtype)
cls_pos = pos[0,:] + self.class_embedding.to(x.dtype)
spatial_pos = F.interpolate(pos[1:,].reshape(1, self.spatial_size, self.spatial_size, C).permute(0, 3, 1, 2), size=(H, W), mode='bilinear')
spatial_pos = spatial_pos.reshape(1, C, H*W).permute(0, 2, 1)
pos = torch.cat([cls_pos.reshape(1, 1, C), spatial_pos], dim=1)
x = x + pos
x = self.ln_pre(x)
x = x.permute(1, 0, 2)  # NLD -> LND

Here x already contains both the image features and the class embedding. However, cls_pos also has class_embedding added to it. I think this conflicts with the original CLIP code; could you tell me the reason for this operation?
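
The observation can be restated as plain arithmetic. In the sketch below scalars stand in for the embedding vectors and both function names are hypothetical; it only traces what the quoted snippet computes for the class token and says nothing about whether the difference is intentional.

```python
def cls_token_original(class_embedding, pos0):
    """Class token as in the original CLIP forward: embedding plus position 0."""
    return class_embedding + pos0


def cls_token_quoted(class_embedding, pos0):
    """Class token as computed by the snippet quoted above."""
    token = class_embedding             # concatenated in front of the patch tokens
    cls_pos = pos0 + class_embedding    # cls_pos as built in the snippet
    return token + cls_pos              # x = x + pos


class_embedding, pos0 = 0.5, 0.25       # toy scalar values
print(cls_token_original(class_embedding, pos0))
print(cls_token_quoted(class_embedding, pos0))
```

In other words, the quoted code feeds `2 * class_embedding + pos[0]` into `ln_pre` for the class token, where the original CLIP forward would feed `class_embedding + pos[0]`.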

Single GPU error

Hi，I've modified the settings for single or multiple GPUs

norm_cfg = dict(type='BN', requires_grad=True)

But there were still such errors.

Does that mean I need to modify the training section in the mmseg source?

Question about inference setting

Hi.
Thanks for sharing your work!

Does DenseCLIP use the pre-trained CLIP encoder at inference time?

I think the pre-trained CLIP encoder is needed to compute the pixel-text score maps at inference time.
So the model needs the pre-trained CLIP encoder.

I wonder whether the CLIP encoder is not used at inference time.

Thanks.

Prompt learning via CoOp

Hi, Thanks for your work. In the paper, it is mentioned that you use CoOp to generate prompts where <p1, p2,...,pn> are learnable. But from the code it seems that only the class labels are used for generating the prompt text ? Can you please confirm this, if I am missing something ?

Code for any backbone(ImageNet) experiments on ADE20K segmentation

I want to reproduce your ImageNet backbone experiments on ADE20K segmentation.

Can you provide the code for ImageNet backbone experiment?

any backbone experiments on ADE20K segmentation
is not enough to fully implement the experiments.

Thanks a lot !

fair comparison

Question about training process

I run following for training on ade dataset,

bash dist_train.sh configs/denseclip_fpn_res50_512x512_80k.py 8

command lines output as follows, but training process goes on....

[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in text encoder

It seems the language-image alignment failed. How can I fix this problem?
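
A hedged reading of the message: assuming it prints the two lists returned by a non-strict `load_state_dict` call (missing keys and unexpected keys), two empty lists would mean that every parameter name matched, not that alignment failed. The sketch below mimics that bookkeeping with plain key sets; the function name and the key names are hypothetical.

```python
def misaligned_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, like a non-strict state-dict load."""
    missing = sorted(set(model_keys) - set(checkpoint_keys))
    unexpected = sorted(set(checkpoint_keys) - set(model_keys))
    return missing, unexpected


model_keys = ["conv1.weight", "layer1.0.conv1.weight"]

# Every key matches: both lists are empty, as in the log above.
print(*misaligned_keys(model_keys, model_keys), "are misaligned params")

# One key is absent from the checkpoint and one extra key is present.
print(*misaligned_keys(model_keys, ["conv1.weight", "attnpool.k_proj.weight"]),
      "are misaligned params")
```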

multi-gpu error

Hello, I want to know whether the code can be trained with multiple GPUs.

The given command uses multiple GPUs, like
"bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8"
but when I run it, it fails, showing the following errors

[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in text encoder
[] [] are misaligned params in text encoder

and I find in the code that writes the note that

the results of DenseCLIP on Cityscapes

Hi,

this is a great work. Have you tried DenseCLIP on other seg datasets? e.g., Cityscapes.

Because I tried with the code, but the result is not good enough.

thanks.

question about any backbone experiments on ADE20K segmentation

Hi, @raoyongming,
thanks very much for your great work. I just have some questions about the any-backbone experiments on ADE20K segmentation in Table 5. For models without CLIP pre-training, e.g. ResNet18 and Swin Transformer-T/S, I see that the improvement on ADE20K is less significant than with RN50. Do you directly compute the visual-text feature interaction? Do you use any other tricks? Thanks!

Questions about the details of configuration of RN50-CLIP

I cannot reach the mIoU of RN50-CLIP shown in the paper, even though I used the configuration mentioned in the README. Could you please tell me what batch size and how many GPUs were used? More implementation details would be very helpful. I've tried a batch size of 16, but only got 38.85 mIoU. Here is my configuration; the log file is attached.

'''
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained='pretrained/RN50.pt',
backbone=dict(
type='CLIPResNet',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
dilations=(1, 1, 1, 1),
strides=(1, 2, 2, 2),
norm_cfg=dict(type='SyncBN', requires_grad=True),
norm_eval=False,
style='pytorch',
contract_dilation=True,
layers=[3, 4, 6, 3]),
neck=dict(
type='FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
num_outs=4),
decode_head=dict(
type='FPNHead',
in_channels=[256, 256, 256, 256],
in_index=[0, 1, 2, 3],
feature_strides=[4, 8, 16, 32],
channels=256,
dropout_ratio=0.1,
num_classes=150,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
train_cfg=dict(),
test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
dataset_type = 'ADE20KDataset'
data_root = 'data/ade/ADEChallengeData2016'
IMG_MEAN = [122.7709383, 116.7460125, 104.09373615000001]
IMG_VAR = [68.5005327, 66.6321579, 70.32316304999999]
img_norm_cfg = dict(
mean=[122.7709383, 116.7460125, 104.09373615000001],
std=[68.5005327, 66.6321579, 70.32316304999999],
to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', reduce_zero_label=True),
dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[122.7709383, 116.7460125, 104.09373615000001],
std=[68.5005327, 66.6321579, 70.32316304999999],
to_rgb=True),
dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[122.7709383, 116.7460125, 104.09373615000001],
std=[68.5005327, 66.6321579, 70.32316304999999],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=4,
workers_per_gpu=4,
train=dict(
type='ADE20KDataset',
data_root='data/ade/ADEChallengeData2016',
img_dir='images/training',
ann_dir='annotations/training',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', reduce_zero_label=True),
dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[122.7709383, 116.7460125, 104.09373615000001],
std=[68.5005327, 66.6321579, 70.32316304999999],
to_rgb=True),
dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]),
val=dict(
type='ADE20KDataset',
data_root='data/ade/ADEChallengeData2016',
img_dir='images/validation',
ann_dir='annotations/validation',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[122.7709383, 116.7460125, 104.09373615000001],
std=[68.5005327, 66.6321579, 70.32316304999999],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='ADE20KDataset',
data_root='data/ade/ADEChallengeData2016',
img_dir='images/validation',
ann_dir='annotations/validation',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[122.7709383, 116.7460125, 104.09373615000001],
std=[68.5005327, 66.6321579, 70.32316304999999],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
log_config = dict(
interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
find_unused_parameters = True
optimizer = dict(
type='AdamW',
lr=0.0001,
weight_decay=0.0001,
paramwise_cfg=dict(
custom_keys=dict(
backbone=dict(lr_mult=0.1), norm=dict(decay_mult=0.0))))
optimizer_config = dict()
lr_config = dict(
policy='poly',
power=0.9,
min_lr=1e-06,
by_epoch=False,
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-06)
runner = dict(type='IterBasedRunner', max_iters=80000)
checkpoint_config = dict(by_epoch=False, interval=8000)
evaluation = dict(interval=8000, metric='mIoU')
work_dir = './work_dirs/fpn_clipres50_test4k'
gpu_ids = range(0, 1)
'''

20220320_015954.log

what is the value of gamma?

Hi, I found the value of self.gamma is different between your paper (gamma==1e-3) and your code (gamma==1e-4), which one should I use to reproduce your results? Thanks.

Loading pretrained CLIP parameters but truncating context_length in positional_embedding?

I find that you load the pretrained CLIP parameters but truncate positional_embedding to context_length, like:

DenseCLIP/detection/denseclip/models.py

Lines 652 to 655 in 3b72447

    
    if k == 'positional_embedding' or k == 'text_projection' or k.startswith('token_embedding') or k.startswith('ln_final'):
        if k == 'positional_embedding' and checkpoint[k].size(0) > self.context_length:
            checkpoint[k] = checkpoint[k][:self.context_length]
            print('positional_embedding is tuncated from 77 to', self.context_length)

Does this affect pretrained model performance or in other words, does this change the pretrained model text encoder original output ?
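
Part of the answer can be checked mechanically. Positional embeddings are looked up per position, so slicing the table leaves the rows that remain unchanged, as the toy sketch below shows (the function name, the 2-column table and the length 13 are illustrative assumptions). Whether the encoder output is also unchanged depends on the attention: CLIP's text transformer uses a causal mask, so, under that assumption, the feature read at the end-of-text position should not depend on the dropped rows as long as the whole prompt fits within the shortened length.

```python
def truncate_positional_embedding(table, context_length):
    """Keep only the first context_length rows of a per-position table."""
    return table[:context_length] if len(table) > context_length else table


# Toy 77 x 2 table standing in for the pretrained positional embedding.
full_table = [[float(pos), float(pos) * 0.5] for pos in range(77)]
short_table = truncate_positional_embedding(full_table, 13)   # 13 is an assumed length

print(len(short_table), short_table == full_table[:13])
```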

Question about implementation of CLIPResNet

Hi,
The modified ResNet in CLIP uses attention pooling, and the comments in the CLIPResNet class also note that.
However, I didn't see any related operations in CLIPResNet.
I know there is another CLIPResNetWithAttention class, but according to the configs, I think it is for DenseCLIP, not CLIP?

Request about the pre-trained model of Swin Transformer+DenseCLIP

Dear author,
Thanks for sharing the code. I am greatly interested in your work. In my latest work, I hope to borrow ideas from your work.
I have a request and would like to hear back from you.
In your experiments, I found that applying DenseCLIP to Swin Transformer achieved a good performance gain. I would like to ask if I can obtain the pre-trained models of Swin Transformer + DenseCLIP?

Is default training iterations enough to reach the paper performance?

I try to train a DenseCLIP model based on CLIP ResNet-50 using the default configuration, where the number of iterations is 80,000. But I find that after training, the model has mIoU 39.46 in the testing set, which is smaller than 43.5 shown in the paper. The following images are the testing result and training history.

Open set inference without training?

Thanks for the great work. It seems that the released model is trained on the ADE dataset. If we want to test on other text descriptions (classes), must we re-train the model? Is there any other way to do open-set inference without training? For example, can I directly use the pre-trained CLIP model to calculate the pixel-wise dot product? Do you provide code for this?
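
For reference, the pixel-wise dot product mentioned here can be sketched in a few lines. This is a toy, dependency-free illustration with nested lists, not the repository's code: the function names are hypothetical, and it assumes per-pixel image features and text embeddings that already live in the same embedding space. It does not show whether such training-free scores give usable segmentations.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def score_maps(pixel_features, text_embeddings):
    """pixel_features: H x W x C, text_embeddings: K x C -> K x H x W cosine scores."""
    texts = [l2_normalize(t) for t in text_embeddings]
    return [
        [[sum(p * q for p, q in zip(l2_normalize(px), t)) for px in row]
         for row in pixel_features]
        for t in texts
    ]


def argmax_labels(maps):
    """Per-pixel index of the best-scoring text embedding."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[max(range(len(maps)), key=lambda k: maps[k][i][j]) for j in range(width)]
            for i in range(height)]


pixels = [[[1.0, 0.0], [0.0, 2.0]]]    # toy 1 x 2 feature map with C = 2
texts = [[3.0, 0.0], [0.0, 1.0]]       # toy embeddings for K = 2 class prompts
print(argmax_labels(score_maps(pixels, texts)))
```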

Any plans in Anchor-free detection?

Thx for your interesting and nice work!

I would like to know whether you have conducted experiments on anchor-free detectors, e.g., a) adopting the score maps s as the centre/corner point heatmaps to obtain bounding-box predictions, or b) employing pre-model prompting for a transformer detector directly (using text embeddings to initialize the classifier), like DETR.

CUDA out of memory

Hi, I'm faced with the following error even though I change the samples_per_gpu to 1 (the smallest batch size I think):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 690.00 MiB (GPU 0; 39.59 GiB total capacity; 973.79 MiB already allocated; 474.62 MiB free; 1.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Does it work only on GPU with 80GB memory or is there any other things I need to take care of? Thanks in advance!
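
The numbers in the error message can be checked against each other. The sketch below is only arithmetic on the reported values (the function name is hypothetical): it subtracts the memory reserved by this PyTorch process and the free memory from the total capacity, which leaves the amount that is occupied but not by this process.

```python
MIB_PER_GIB = 1024


def memory_outside_process_gib(total_gib, reserved_gib, free_mib):
    """Memory that is neither free nor reserved by this PyTorch process, in GiB."""
    return total_gib - reserved_gib - free_mib / MIB_PER_GIB


# Values copied from the error message above.
outside = memory_outside_process_gib(total_gib=39.59, reserved_gib=1.72, free_mib=474.62)
print(f"{outside:.2f} GiB not accounted for by this process")
```

With roughly 37 GiB unaccounted for, one plausible cause is another process holding memory on GPU 0, which `nvidia-smi` would show; the arithmetic alone cannot confirm that.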
