raoyongming / denseclip
[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Hello!
Very interesting work, great job!
I could not find any information regarding the GPUs used for training. I see you have performed extensive experiments; I am specifically interested in the GPUs used for the experiments with the ViT/Swin Transformer backbones.
Thanks
Where can I download RN50 and RN101 (the ones pretrained with CLIP)?
thanks!
Hi, I want to change the text input from a word to a paragraph, is there a way to do this?
For example, I would like to add an explanation of the category name after each category name.
In denseclip.py, there is the following line:
context_length = self.text_encoder.context_length - self.context_length
Can you explain the roles of the two different context lengths? And why is subtraction used to compute the context_length of self.contexts?
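One plausible reading of that line, sketched here as an assumption rather than a confirmed answer: the text encoder has a fixed number of token positions, some of them are reserved for the tokenized class name, and the remainder is filled with learnable context vectors. The function name and the sizes 13 and 5 are hypothetical and chosen only for illustration.

```python
def learnable_context_slots(encoder_context_length: int, class_token_length: int) -> int:
    """Positions left for learnable context vectors after reserving class-name tokens."""
    if class_token_length > encoder_context_length:
        raise ValueError("class-name tokens do not fit in the text encoder")
    return encoder_context_length - class_token_length


# Hypothetical sizes, not taken from this thread.
total_positions = 13      # positions the text encoder can attend over
class_name_tokens = 5     # positions reserved for the tokenized class name
n_ctx = learnable_context_slots(total_positions, class_name_tokens)
print(n_ctx)
```

Under this reading, the subtraction keeps the concatenated sequence (context vectors plus class-name tokens) exactly as long as the text encoder expects.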
Hi and thanks for your great work! The pth file of the vit-b-denseclip for semantic segmentation is missing, can you upload it again?
Hi, the paper proposes two context prompting methods, pre-model and post-model prompting. But I can only find the post-model method in the source code. Could you please provide the code for the pre-model method? That would help a lot in understanding it. Thanks!
Hi. Thank you for your great work.
I tried to run the training code with the command
"bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 1", but got the following error:
super(SingleStageDetector, self).__init__(init_cfg) TypeError: __init__() takes 1 positional argument but 2 were given
I did not change the code, so I don't know why building the model fails.
Could I get some help?
Thanks for your great work! Does "We fix the text encoder during training" mean that the parameters of text encoder are not updated during training? If so, I can't find any "requires_grad=False" settings in the source code.
For the ADE20K dataset, is the batch size 32 (4 x 8 GPUs)?
Hi, I used the pre-trained model ViT-B-16.pt and the config retinanet_clip_r101_fpn_1x_coco.py. However, the embed_dim of CLIPTextContextEncoder is 1024 while the embed_dim of the pretrained model is 512.
The default setting provided on GitHub is the multi-scale model. What about the single-scale model? Could you please provide the config files for it?
OpenMMLab recently released MMCV 2.0.0 and MMSegmentation 1.0.0. Will you migrate your code to the latest versions?
I am very interested in the paper, but the pre-model prompting it describes seems to be missing from the code. Could you add this part of the code? I am very interested in it.
Have you ever tried saving the model as ONNX for visualization? Why does the trained model report an error missing img_meta when saved as ONNX?
Did you download it from https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt? Is it a typo?
Thanks!
Nice work!
But when creating a new conda environment for DenseCLIP, I found the following packages are required:
regex
ftfy
After using pip to install these packages, the code works fine.
(Moreover, it seems that it's not necessary to install fvcore explicitly.)
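A quick way to check a fresh environment for the two packages reported above is sketched below. It uses only the standard library; the helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# regex and ftfy are the two packages mentioned in the report above.
missing = missing_modules(["regex", "ftfy"])
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("Both packages are importable.")
```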
Hi,
I followed your instructions for training DenseCLIP-R50, with a batch size of "4 x 8 GPUs".
However, I cannot reproduce the result in your paper (43.5); I only get 42.8 mIoU.
Can you provide the training log file, or more details (e.g., the seed) needed to reproduce the paper results? Thanks!
Thanks for sharing your work.
I want to know how the prompt text is created (by script or manually).
Can you provide some examples?
You present in the paper only the results related to Semantic FPN.
Have you conducted any relevant experiments based on dilated backbone methods (e.g. DeepLabV3+)?
Is the reason for not using dilated backbone-based methods that you are looking for fewer FLOPs or have you found that the results are not good?
Nice work! In the CLIPTextContextEncoder class, why does the line eos_indx = text.argmax(dim=-1) + N2 add N2?
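A possible explanation, sketched with plain lists: in CLIP's vocabulary the end-of-text token has the largest id, so argmax over the token ids locates it. If N2 extra context slots are inserted into the sequence before that token, its position moves right by N2. The insertion point (right after the start token), the value of n_ctx, and the id 1929 are assumptions made for illustration.

```python
SOT, EOT, PAD = 49406, 49407, 0  # CLIP start/end token ids; 0 is padding


def eot_index(tokens):
    """Position of the largest token id, which is the EOT token in CLIP's vocabulary."""
    return max(range(len(tokens)), key=lambda i: tokens[i])


def insert_context(tokens, n_ctx, ctx_marker=-1):
    """Insert n_ctx placeholder slots right after the start token (assumed layout)."""
    return tokens[:1] + [ctx_marker] * n_ctx + tokens[1:]


tokens = [SOT, 1929, EOT, PAD, PAD]  # 1929 is a made-up id standing in for a class name
n_ctx = 8                            # hypothetical number of context vectors (the N2 above)
extended = insert_context(tokens, n_ctx)

old_position = eot_index(tokens)
# After insertion, the EOT token sits n_ctx positions further to the right.
assert extended[old_position + n_ctx] == EOT
```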
I found out that the text encoder is also updated.
The positional embedding of the provided "denseclip_fpn_res50.pth" is
tensor([[-0.0013, 0.0003, 0.0007, ..., -0.0027, -0.0091, -0.0024],
[-0.0039, -0.0008, -0.0016, ..., -0.0006, -0.0049, -0.0044],
[-0.0044, 0.0011, -0.0007, ..., -0.0026, -0.0094, -0.0008],
...,
[-0.0002, -0.0002, -0.0012, ..., 0.0007, 0.0013, -0.0002],
[-0.0016, -0.0015, -0.0001, ..., -0.0010, -0.0025, -0.0004],
[-0.0030, -0.0013, -0.0004, ..., -0.0028, -0.0052, -0.0016]])
And the first 13 positional embeddings of the pretrained RN50 model are
tensor([[-0.0012, 0.0003, 0.0008, ..., -0.0027, -0.0090, -0.0024],
[-0.0040, -0.0008, -0.0015, ..., -0.0006, -0.0049, -0.0045],
[-0.0044, 0.0011, -0.0006, ..., -0.0025, -0.0093, -0.0007],
...,
[-0.0002, -0.0002, -0.0011, ..., 0.0006, 0.0011, -0.0003],
[-0.0018, -0.0016, -0.0002, ..., -0.0009, -0.0025, -0.0004],
[-0.0031, -0.0014, -0.0006, ..., -0.0026, -0.0053, -0.0015]],
device='cuda:0', grad_fn=)
which are slightly different.
I guess the reason is that "lr_mult" does not guarantee zero LR. The learning rate of the text encoder may get bigger than 0 due to the internal behavior of the LR scheduler. I think this is quite a critical bug since it may affect the result of the ablation study (Table 2 in the paper).
Also, I have one more question: why do you set lr_mult to 0 for 'norm'? As far as I know, the mmcv library sets the learning rate to 0 for every module whose name includes the key "norm". If that is right, the learning rate of every normalization layer in the transformer layers (especially in the context decoder) will be 0.
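The suspicion about lr_mult can be checked with a little arithmetic. The sketch below assumes a polynomial schedule of the form (base_lr - min_lr) * coeff + min_lr. The function name and the values power=0.9, min_lr=1e-6 and 80000 iterations are illustrative assumptions, and warmup is ignored. Under that assumption, a base learning rate of exactly zero still rises toward min_lr as training progresses.

```python
def poly_lr(base_lr, step, max_steps, power=0.9, min_lr=1e-6):
    """Polynomial decay from base_lr toward min_lr (assumed form of the schedule)."""
    coeff = (1 - step / max_steps) ** power
    return (base_lr - min_lr) * coeff + min_lr


# A parameter group with lr_mult=0 starts from a base learning rate of zero.
frozen_base_lr = 1e-4 * 0.0
for step in (0, 40000, 80000):
    print(step, poly_lr(frozen_base_lr, step, max_steps=80000))
```

If this matches the scheduler actually used, setting requires_grad=False on the text encoder parameters would be a more robust way to freeze them than relying on a zero multiplier.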
The paper said, "we do not have ground truth segmentation label". I can understand that there is no segmentation mask for detection, but why is there no segmentation mask for the instance segmentation task?
Greetings, I like your great work and am trying to build on it!
I noticed that your ViT-B-16 includes a small FPN. Why did you add it, and what happens if it is removed?
For the config denseclip_fpn_vit-b_640x640_80k.py:
in text_encoder:
embed_dim=512, while ViT-B-16.pt has embed_dim=1024.
When loading the weights, I get: "RuntimeError: Error(s) in loading state_dict for CLIPTextContextEncoder:
size mismatch for text_projection: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512])"
How do you deal with this problem?
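One way to narrow this down is to read the dimensions straight off the shapes in the error message. The sketch below assumes text_projection is laid out as [transformer_width, embed_dim]; the helper name is hypothetical. In the original CLIP release, RN50 uses an embedding dimension of 1024 and ViT-B/16 uses 512, so it may be worth confirming that the file at the pretrained path really is the ViT-B/16 checkpoint.

```python
def text_dims_from_projection(shape):
    """Split a text_projection shape into (transformer_width, embed_dim) (assumed layout)."""
    if len(shape) != 2:
        raise ValueError("text_projection is expected to be 2-D")
    width, embed_dim = shape
    return width, embed_dim


checkpoint_shape = (512, 1024)  # shape reported for the checkpoint in the error
model_shape = (512, 512)        # shape built from the config

ckpt_width, ckpt_embed = text_dims_from_projection(checkpoint_shape)
_, cfg_embed = text_dims_from_projection(model_shape)
if ckpt_embed != cfg_embed:
    print(f"checkpoint embed_dim={ckpt_embed}, config embed_dim={cfg_embed}")
```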
Hi, thank you for your good work.
I tried to use the ViT structure in mmdetection. I noticed that the ViT part uses the DenseCLIP_MaskRCNN class in detection/denseclip/denseclip.py, but when I add a ViT config I cannot run it correctly: there are dimension mismatches and other incompatibilities, so the ViT version for detection seems incomplete. Can you give me some suggestions? If I want to use ViT as the backbone with DenseCLIP on COCO, can I use denseclip.py from segmentation? Or can you provide a ViT detection configuration? Or am I using it the wrong way?
Thanks!
Which GPUs do you use, and how many? How long does a training run take?
Hi, I want to reproduce your experiment on ADE20K, but when I downloaded the dataset I noticed that the ADE website had updated it, so the file structure and the number of training images are different from the older ADEChallengeData2016. Which version did you use in your work?
The older version of ADE20K contains 20210 training images.
The newer version contains 25K training images, as shown on their website.
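To tell which release is on disk, counting the training images is usually enough. The sketch below is a minimal check, assuming the images sit directly inside one folder; the path is an assumed layout and the helper name is hypothetical.

```python
from pathlib import Path


def count_images(folder, suffixes=(".jpg", ".jpeg", ".png")):
    """Count image files directly inside `folder` (non-recursive); 0 if it does not exist."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(1 for p in folder.iterdir() if p.suffix.lower() in suffixes)


train_dir = "data/ade/ADEChallengeData2016/images/training"  # assumed layout
n = count_images(train_dir)
print(f"{n} training images found; the older release is reported to contain 20210")
```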
Hi,
Thanks for making the code public!
I have a general question about the setting chosen for this paper: why does it not target the zero-shot setting and instead focus on the fully supervised setting? Since the power of CLIP lies in zero-shot task transfer, I was wondering why no such experiments were done, and why the problem was instead posed as a multi-modal, fully supervised dense detection task.
Thanks in advance
Hi, Thanks for your work DenseCLIP.
I have some questions about the CLIPVisionTransformer.
x = self.conv1(x) # shape = [*, width, grid, grid]
B, C, H, W = x.shape
x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1) # shape = [*, grid ** 2 + 1, width]
pos = self.positional_embedding.to(x.dtype)
cls_pos = pos[0,:] + self.class_embedding.to(x.dtype)
spatial_pos = F.interpolate(pos[1:,].reshape(1, self.spatial_size, self.spatial_size, C).permute(0, 3, 1, 2), size=(H, W), mode='bilinear')
spatial_pos = spatial_pos.reshape(1, C, H*W).permute(0, 2, 1)
pos = torch.cat([cls_pos.reshape(1, 1, C), spatial_pos], dim=1)
x = x + pos
x = self.ln_pre(x)
x = x.permute(1, 0, 2) # NLD -> LND
Here x contains both the image features and the class embedding. However, class_embedding is also added to cls_pos. I think this conflicts with the original CLIP code; could you explain the reason for this operation?
Hi.
Thanks for sharing your work!
Does DenseCLIP use the pre-trained CLIP encoder at inference time?
I think the pre-trained CLIP encoder is needed to compute the pixel-text score maps at inference.
So the model needs the pre-trained CLIP encoder.
I wonder whether the CLIP encoder is not used at inference.
Thanks.
Hi, thanks for your work. In the paper, it is mentioned that you use CoOp to generate prompts where <p1, p2,...,pn> are learnable. But from the code it seems that only the class labels are used for generating the prompt text. Can you please confirm this, or tell me if I am missing something?
I want to reproduce your ImageNet backbone experiments on ADE20K segmentation.
Can you provide the code for the ImageNet backbone experiments?
The "any backbone experiments on ADE20K segmentation" section
is not enough to fully implement the experiments.
Thanks a lot !
I ran the following to train on the ADE dataset:
bash dist_train.sh configs/denseclip_fpn_res50_512x512_80k.py 8
The command-line output is as follows, but the training process continues:
[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in text encoder
It seems the language-image alignment failed. How can I fix this problem?
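If this message prints the lists of missing and unexpected keys from a non-strict weight load (an assumption about the repository code, not something confirmed in this thread), then two empty lists would mean that every parameter matched and nothing is wrong. The sketch below mimics that comparison with plain key lists; the parameter names are hypothetical.

```python
def misaligned_params(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring a non-strict state-dict load."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # in the model, not in the checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)  # in the checkpoint, not in the model
    return missing, unexpected


model = ["conv1.weight", "layer1.0.conv1.weight"]  # hypothetical parameter names
ckpt = ["conv1.weight", "layer1.0.conv1.weight"]
missing, unexpected = misaligned_params(model, ckpt)
print(missing, unexpected, "are misaligned params")
```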
Hello, I want to know whether the code can be trained on multiple GPUs.
The given command uses multiple GPUs, like
"bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8"
but when I run it, it fails with the following errors:
[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in text encoder
[] [] are misaligned params in text encoder
Hi,
This is great work. Have you tried DenseCLIP on other segmentation datasets, e.g., Cityscapes?
I tried with the code, but the results are not good enough.
thanks.
Hi, @raoyongming,
thanks very much for your great work. I just have some questions about the "any backbone" experiments on ADE20K segmentation in Table 5. For models without CLIP pre-training, such as ResNet18 and Swin Transformer-T/S, I see that the improvement on ADE20K is less significant than for RN50. Did you compute the visual-text feature interaction directly? Did you use any other tricks? Thanks!
I cannot reach the mIoU of RN50-CLIP shown in the paper, though I used the configuration mentioned in the README. Could you please tell me what batch size and how many GPUs were used? More implementation details would be very helpful. I tried a batch size of 16, but only got 38.85 mIoU. Here is my configuration; the log file is in the attachment.
'''
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='pretrained/RN50.pt',
    backbone=dict(
        type='CLIPResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 1, 1),
        strides=(1, 2, 2, 2),
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True,
        layers=[3, 4, 6, 3]),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=4),
    decode_head=dict(
        type='FPNHead',
        in_channels=[256, 256, 256, 256],
        in_index=[0, 1, 2, 3],
        feature_strides=[4, 8, 16, 32],
        channels=256,
        dropout_ratio=0.1,
        num_classes=150,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    train_cfg=dict(),
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
dataset_type = 'ADE20KDataset'
data_root = 'data/ade/ADEChallengeData2016'
IMG_MEAN = [122.7709383, 116.7460125, 104.09373615000001]
IMG_VAR = [68.5005327, 66.6321579, 70.32316304999999]
img_norm_cfg = dict(
    mean=[122.7709383, 116.7460125, 104.09373615000001],
    std=[68.5005327, 66.6321579, 70.32316304999999],
    to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[122.7709383, 116.7460125, 104.09373615000001],
        std=[68.5005327, 66.6321579, 70.32316304999999],
        to_rgb=True),
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[122.7709383, 116.7460125, 104.09373615000001],
                std=[68.5005327, 66.6321579, 70.32316304999999],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', reduce_zero_label=True),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[122.7709383, 116.7460125, 104.09373615000001],
                std=[68.5005327, 66.6321579, 70.32316304999999],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]),
    val=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[122.7709383, 116.7460125, 104.09373615000001],
                        std=[68.5005327, 66.6321579, 70.32316304999999],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[122.7709383, 116.7460125, 104.09373615000001],
                        std=[68.5005327, 66.6321579, 70.32316304999999],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
find_unused_parameters = True
optimizer = dict(
    type='AdamW',
    lr=0.0001,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        custom_keys=dict(
            backbone=dict(lr_mult=0.1), norm=dict(decay_mult=0.0))))
optimizer_config = dict()
lr_config = dict(
    policy='poly',
    power=0.9,
    min_lr=1e-06,
    by_epoch=False,
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=1e-06)
runner = dict(type='IterBasedRunner', max_iters=80000)
checkpoint_config = dict(by_epoch=False, interval=8000)
evaluation = dict(interval=8000, metric='mIoU')
work_dir = './work_dirs/fpn_clipres50_test4k'
gpu_ids = range(0, 1)
'''
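The config above sets samples_per_gpu=4 and lists gpu_ids = range(0, 1). Assuming the total batch size is simply the samples per GPU times the number of GPUs (and that gpu_ids reflects the processes actually used, which a dumped config may not), the arithmetic can be sketched as follows. The helper name is hypothetical.

```python
def effective_batch_size(samples_per_gpu: int, num_gpus: int) -> int:
    """Total number of samples per optimizer step across all GPUs."""
    return samples_per_gpu * num_gpus


single_gpu = effective_batch_size(4, 1)  # what the dumped config would give on one GPU
eight_gpus = effective_batch_size(4, 8)  # the "4 x 8 GPUs" setting asked about elsewhere
print(single_gpu, eight_gpus)
```

A smaller total batch than the one used for the reported numbers is one possible source of an mIoU gap, so it may be worth confirming how many processes the run actually used.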
Hi, I found the value of self.gamma is different between your paper (gamma==1e-3) and your code (gamma==1e-4), which one should I use to reproduce your results? Thanks.
I found that when you load the pretrained CLIP parameters, you truncate positional_embedding to context_length, as in:
DenseCLIP/detection/denseclip/models.py
Lines 652 to 655 in 3b72447
Does this affect the performance of the pretrained model? In other words, does it change the original output of the pretrained text encoder?
Hi,
The modified ResNet in CLIP uses attention pooling, and the comments in the CLIPResNet class also note that.
However, I didn't see any related operations in CLIPResNet.
I know there is another class, CLIPResNetWithAttention, but according to the configs, I think it is for DenseCLIP, not CLIP?
Dear author,
Thanks for sharing the code. I am greatly interested in your work. In my latest work, I hope to borrow ideas from your work.
I have a request and would like to hear back from you.
In your experiments, I found that applying DenseCLIP to Swin Transformer achieved a good performance gain. I would like to ask if I can obtain the pre-trained models of Swin Transformer + DenseCLIP?
I tried to train a DenseCLIP model based on CLIP ResNet-50 using the default configuration, where the number of iterations is 80,000. But after training, the model reaches 39.46 mIoU on the test set, which is lower than the 43.5 reported in the paper. The following images are the testing result and training history.
Thanks for the great work. It seems that the released model is trained on the ADE dataset. If we want to test on other text descriptions (classes), must we re-train the model? Is there any other way to do open-set inference without training? For example, can I directly use the pre-trained CLIP model to calculate the pixel-wise dot product? Do you support this in the code?
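The pixel-wise dot product mentioned above can be sketched without any framework: normalize each pixel feature and each text embedding, take the dot product, and assign every pixel to the highest-scoring class. The 2-D vectors below are made-up stand-ins for CLIP features and the function names are hypothetical. Whether raw CLIP features give usable masks without any training is not answered in this thread.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pixel_text_labels(pixel_features, text_embeddings):
    """For each pixel feature, return the index of the most similar text embedding."""
    texts = [l2_normalize(t) for t in text_embeddings]
    labels = []
    for feat in pixel_features:
        f = l2_normalize(feat)
        scores = [sum(a * b for a, b in zip(f, t)) for t in texts]  # cosine similarity
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy 2-D embeddings standing in for CLIP features (made-up numbers).
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # e.g. one vector per class prompt
pixel_features = [[0.9, 0.1], [0.2, 0.8], [3.0, 1.0]]
print(pixel_text_labels(pixel_features, text_embeddings))
```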
Thanks for your interesting and nice work!
I would like to know whether you have conducted experiments on anchor-free detectors, e.g., (a) adopting the score maps s as the centre/corner point heatmaps to obtain bounding-box predictions, or (b) employing pre-model prompting directly for a transformer detector like DETR (using text embeddings to initialize the classifier).
Hi, I'm faced with the following error even though I change the samples_per_gpu to 1 (the smallest batch size I think):
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 690.00 MiB (GPU 0; 39.59 GiB total capacity; 973.79 MiB already allocated; 474.62 MiB free; 1.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Does it only work on GPUs with 80GB of memory, or is there anything else I need to take care of? Thanks in advance!