rmokady / clip_prefix_caption
Simple image captioning model
License: MIT License
Hello,
Would you happen to have any tips for training the model? I keep running into an out-of-memory error despite training on a P100 with 16 GB of memory:
CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.90 GiB total capacity; 14.75 GiB already allocated; 23.75 MiB free; 14.83 GiB reserved in total by PyTorch)
I've tried decreasing the batch size from 40 to 10 with no improvement; in fact, it then fails on a larger allocation:
RuntimeError: CUDA out of memory. Tried to allocate 892.00 MiB (GPU 0; 15.90 GiB total capacity; 14.13 GiB already allocated; 193.75 MiB free; 14.66 GiB reserved in total by PyTorch)
Setting batch_size to 2 gets things running, albeit very slowly. What hardware did you train on?
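One common way to keep a small per-step batch without shrinking the effective batch size is gradient accumulation. The sketch below is not taken from the repository: it uses a hypothetical one-parameter linear model in plain Python (the names `grad_mse` and `accumulated_grad` are made up for illustration) to show that summing the gradients of `loss / k` over `k` equally sized micro-batches reproduces the full-batch gradient.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        total += grad_mse(w, chunk) / len(chunks)
    return total


# Toy data: 8 samples, split into 4 micro-batches of 2.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=2)
print(full, accum)
```

In a PyTorch training loop the same idea usually means calling `backward()` on `loss / k` for every micro-batch and calling `optimizer.step()` and `optimizer.zero_grad()` only once every `k` micro-batches.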
Hi,
First of all, I want to say thanks a lot for this repo!
My question is: how can I get n captions for an input image that are all of similar quality to the first one?
For example, if n=3, I'd like to get 3 different captions that describe the same image as well as possible.
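One generic way to get several different captions is to sample from the next-token distribution instead of always taking the most likely token. The sketch below illustrates nucleus (top-p) sampling in plain Python. It is an illustration under stated assumptions, not the repository's generation code: `top_p_filter`, `sample_caption`, and the toy `step_probs` table are hypothetical, and the toy uses a fixed distribution per step, whereas a real language model conditions each step on the tokens generated so far.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample_caption(step_probs, top_p, rng):
    """Draw one word per step from the filtered distribution."""
    words = []
    for probs in step_probs:
        filtered = top_p_filter(probs, top_p)
        tokens = list(filtered)
        weights = [filtered[t] for t in tokens]
        words.append(rng.choices(tokens, weights=weights, k=1)[0])
    return " ".join(words)


# Hypothetical per-step distributions standing in for model outputs.
step_probs = [
    {"a": 0.9, "the": 0.1},
    {"dog": 0.5, "puppy": 0.3, "cat": 0.2},
    {"runs": 0.6, "sits": 0.4},
]
rng = random.Random(0)
captions = [sample_caption(step_probs, top_p=0.75, rng=rng) for _ in range(3)]
print(captions)
```

Lower `top_p` values keep the samples closer to the greedy caption; higher values trade quality for variety. Repeated draws can still coincide, so duplicates may need to be filtered out.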
Hello. Thanks for sharing your code!
I'm curious about the training data behind the "nocaps" evaluation.
Was the ClipCap model trained on the train split of the MS COCO Karpathy splits and evaluated on the validation split of nocaps?
Has anyone more experienced run into this problem? I hope you can help me.
Hello,
Thanks for the great work! I was interested in reproducing the transformer network with frozen GPT-2, and achieved slightly lower performance on COCO so far:
| Metric | Reported | Reproduced |
|---|---|---|
| BLEU@4 | 33.53 | 31.0 |
| METEOR | 27.45 | 27.1 |
| CIDEr | 113.08 | 105.7 |
| SPICE | 21.05 | 20.4 |
I was wondering if the provided code should be able to reproduce the validation scores or if I am missing something?
Hi, thanks a lot for making the code available, it's a great resource to use!
I was wondering why the index is shifted one position to the left when the loss is computed from the GPT-2 output:
https://github.com/rmokady/CLIP_prefix_caption/blob/main/train.py#L315
Shouldn't it be `logits = outputs.logits[:, dataset.prefix_length:]`?
Thanks!
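A possible way to reason about the shift, assuming the standard causal language-model convention in which the output at position i is the prediction for the input at position i + 1: with a prefix of length P followed by T caption tokens, the predictions for the caption tokens sit at positions P - 1 through P + T - 2. The helper names below are hypothetical and only index arithmetic is shown.

```python
def shifted_positions(prefix_length, num_tokens):
    """Positions selected by the slice [prefix_length - 1 : -1]."""
    positions = list(range(prefix_length + num_tokens))
    return positions[prefix_length - 1:-1]


def unshifted_positions(prefix_length, num_tokens):
    """Positions selected by the slice [prefix_length :]."""
    positions = list(range(prefix_length + num_tokens))
    return positions[prefix_length:]


prefix_length, num_tokens = 3, 4
# Caption tokens occupy input positions 3, 4, 5, 6.
caption_positions = list(range(prefix_length, prefix_length + num_tokens))

shifted = shifted_positions(prefix_length, num_tokens)
unshifted = unshifted_positions(prefix_length, num_tokens)

# Each output position i predicts input position i + 1.
predicted_by_shifted = [i + 1 for i in shifted]
predicted_by_unshifted = [i + 1 for i in unshifted]
print(shifted, predicted_by_shifted)
print(unshifted, predicted_by_unshifted)
```

Under this convention the shifted slice lines up one prediction with every caption token, while the unshifted slice would compare each caption token with the prediction for its successor and leave the last position without a target.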
Hi @rmokady,
Thank you for your nice work, I learned a lot from it. Since the default CLIP model you are using seems to be the ViT-B/32 version, I am wondering whether you have tried other visual features, e.g. from ViT-L or the ResNet models. I can't find this mentioned in the paper. I'm training a similar model at the moment and assume that features extracted from bigger vision encoders would contain more information.
Best, David
Hi,
Congrats on the amazing project. I have a video dataset with captions and want to fine-tune this model on it to generate custom captions. I am planning to extract frame-wise features and concatenate them during training. Please let me know how I can fine-tune on a custom dataset.
Thanks in advance.
Hi, thanks for the great work!
Out of curiosity, did you experiment with CLIP's text encoder in place of the pretrained language model?
I assume the CLIP text encoder only learns alignment with visual features, but I'm still curious whether it has any generation ability.
Thanks
Thank you for your work. Do you have any plans to add code that supports multiple GPUs?
I am trying to train on roughly 3000 images on a Google Colab GPU and get the GPU error below. When I give it only 50 images, it works fine. I don't believe this is a GPU capacity issue, since COCO training handles more than 10000 images. I also checked the data format and the images, and they look fine with respect to the COCO data format. Any leads are appreciated.
Downloading: 100% 0.99M/0.99M [00:00<00:00, 7.93MB/s]
Downloading: 100% 446k/446k [00:00<00:00, 5.78MB/s]
Downloading: 100% 665/665 [00:00<00:00, 1.13MB/s]
Data size is 3060
Token indices sequence length is longer than the specified maximum sequence length for this model (1140 > 1024). Running this sequence through the model will result in indexing errors
Downloading: 100% 523M/523M [00:07<00:00, 69.9MB/s]
Train both prefix and GPT
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True
to disable this warning
FutureWarning,
Training epoch 0
coco_prefix: 0% 0/76 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 370, in
main()
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 366, in main
train(dataset, model, args, output_dir=args.out_dir, output_prefix=args.prefix)
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 314, in train
outputs = model(tokens, prefix, mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 233, in forward
out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 1061, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 397, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 332, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 189, in _attn
attn_weights = torch.matmul(query, key.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 430.00 MiB (GPU 0; 14.76 GiB total capacity; 13.54 GiB already allocated; 145.75 MiB free; 13.69 GiB reserved in total by PyTorch)
coco_prefix: 0% 0/76 [00:00<?, ?it/s]
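The log above warns that one caption tokenizes to 1140 tokens, beyond GPT-2's 1024 positions. If the dataset pads every sample to a length derived from its longest captions (an assumption about the dataset class, not something confirmed in this thread), a single outlier would inflate every batch, and attention memory grows quadratically with sequence length. A minimal sketch of one mitigation, capping token lengths at a percentile, follows; `percentile_cap` and `truncate_captions` are hypothetical helpers and the toy token lists stand in for real tokenized captions.

```python
def percentile_cap(lengths, percent=95, hard_limit=1024):
    """Nearest-rank percentile of the lengths, never above hard_limit."""
    ranked = sorted(lengths)
    # Integer arithmetic avoids float rounding in the rank computation.
    index = max(0, (percent * len(ranked) + 99) // 100 - 1)
    return min(ranked[index], hard_limit)


def truncate_captions(token_lists, cap):
    """Cut every token list down to at most cap tokens."""
    return [tokens[:cap] for tokens in token_lists]


# Toy data: 19 ordinary captions (10 to 28 tokens) plus one 1140-token outlier.
token_lists = [[0] * n for n in range(10, 29)] + [[0] * 1140]
lengths = [len(t) for t in token_lists]

cap = percentile_cap(lengths)
trimmed = truncate_captions(token_lists, cap)
too_long = [n for n in lengths if n > cap]
print(cap, too_long)
```

Printing the captions whose length exceeds the cap is often worthwhile on its own, since a 1140-token "caption" usually points to a data problem such as several annotations merged into one string.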
In parse_coco.py (https://github.com/rmokady/CLIP_prefix_caption/blob/main/parse_coco.py#L26), you read from the validation set when collecting the training samples. Also, your file https://drive.google.com/file/d/1D3EzUK1d1lNhD2hAvRiKPThidiVbP2K_/view?usp=sharing contains validation captions (labels) in the training dataset, which causes obvious label leakage, since the model is then trained on validation samples. Indeed, if we train the model for more than 10 epochs (e.g. 100 epochs), the results on the validation set reach unreasonably high scores. Can you explain this?
I am following the Replicate docs to run this model on my own machine: https://replicate.com/rmokady/clip_prefix_caption
I can start the container fine with docker run -d -p 5000:5000 --gpus=all r8.im/rmokady/clip_prefix_caption@sha256:d703881e7b50eb009779b7e1e79d394639df0af71be28c4100f88c03fdccdbf0
However, when I run curl, every request returns {"detail":"Not Found"}
E.g.:
$ curl http://localhost:5000/predict -X POST \
-F image=@red_panda.jpg \
-F model=conceptual-captions \
-F use_beam_search=False
{"detail":"Not Found"}
$ curl http://localhost:5000/predict -X POST \
-F image=@red_panda.jpg \
-F model=coco \
-F use_beam_search=False
{"detail":"Not Found"}
I'm inferring that coco and conceptual-captions are the correct model names to use in this context; however, I get the same output ({"detail":"Not Found"}) when passing other strings (e.g., -F model=example). Is there something I'm missing, or could the docs be made clearer?
The pretrained model uses prefix_dim = 640. How should CLIP's 512-dimensional embedding be mapped to 640 dimensions before it is passed to the network?
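For context, the released CLIP image encoders do not all produce embeddings of the same width, so a 512 versus 640 mismatch usually indicates that the features and the checkpoint come from different encoders rather than that a conversion step is missing. The sketch below is a hypothetical sanity check; `CLIP_EMBED_DIM` and `check_prefix_size` are illustrative names, not part of the repository.

```python
# Output embedding widths of some released CLIP image encoders.
CLIP_EMBED_DIM = {
    "ViT-B/32": 512,
    "ViT-B/16": 512,
    "RN50x4": 640,
    "ViT-L/14": 768,
}


def check_prefix_size(clip_name, prefix_size):
    """Raise if prefix_size does not match the encoder's embedding width."""
    expected = CLIP_EMBED_DIM[clip_name]
    if expected != prefix_size:
        raise ValueError(
            f"{clip_name} yields {expected}-dim features, "
            f"but the caption model expects prefix_size={prefix_size}"
        )
    return expected


ok = check_prefix_size("RN50x4", 640)
try:
    check_prefix_size("ViT-B/32", 640)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

If a checkpoint was trained with 640-dimensional prefixes, the straightforward option is to extract features with an encoder that outputs 640 dimensions; which encoder a given pretrained file used is not stated in this thread.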
Hello, great project :)
I wanted to try some things with your code and evaluate the results with the same metrics you used on COCO or Conceptual Captions.
Is it possible for you to share the evaluation code for these metrics? And which split of the dataset did you evaluate on?
Thanks!
Dear @rmokady ,
Thank you for your great work, it is very interesting.
Is there a way to generate different captions for the same image?
Thank you for your help.
Best Wishes,
Alex
Hey, super small minor thing: In your README, you swapped the links for training data and validation data:
Download [training images](http://images.cocodataset.org/zips/val2014.zip) and [validation images](http://images.cocodataset.org/zips/train2014.zip) and unzip (We use Karpathy et el. split).
Obviously a small nit, but also an easy fix in the future.
Hi,
As far as I understand the model and its use of GPT-2, shouldn't the `get_dummy_token` function return `torch.ones() * -100` instead of `torch.zeros()`? We should be ignoring the outputs of GPT-2 for these prefix inputs. Currently, it forces the model to predict token 0, which is the exclamation mark ("!").
Reference lines: https://github.com/rmokady/CLIP_prefix_caption/blob/main/train.py#L222-L223
Thanks!
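To illustrate the concern: -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labelled -100 do not contribute to the loss, whereas positions labelled 0 are scored against token 0. The plain-Python sketch below mimics that behaviour on made-up logits; it is a simplified stand-in (no label shifting, tiny vocabulary), not the library implementation. Note also that the traceback in another issue on this page shows train.py calling `model(tokens, prefix, mask)` without labels, in which case the dummy tokens would not be used at all.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue
        log_norm = math.log(sum(math.exp(v) for v in row))
        total += log_norm - row[label]
        count += 1
    return total / count


# Made-up logits: 2 prefix positions followed by 2 caption positions, vocabulary of 3.
logits = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [0.1, 2.0, 0.3],
    [0.2, 0.1, 1.8],
]
loss_zero_labels = cross_entropy(logits, [0, 0, 1, 2])        # prefix scored against token 0
loss_ignored = cross_entropy(logits, [-100, -100, 1, 2])      # prefix skipped
loss_caption_only = cross_entropy(logits[2:], [1, 2])
print(loss_zero_labels, loss_ignored)
```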
I have attempted to train with the gpt2-xl model from Hugging Face, together with a custom dataloader that can preprocess webdataset archives, but am having issues when it comes to inference.
I have created a small script to test inference (inference_test.py) on my fork, but it repeatedly generates the same unrelated caption for any input (below are some samples from the RedCaps dataset):
I'm not sure what has gone wrong, but I believe it's either my dataloader not preprocessing the dataset correctly, or the model not supporting GPT2-xl.
The relevant files on my fork are parse_webdatasets.py and train.py. parse_webdatasets.py is modified from rom1504/clip-retrieval's inference script; it does more of the transformations usually seen in train.py beforehand to optimise the training speed, and generates many .npy files as a result.
Do you know what might be causing this?
Thanks in advance
Can you share the evaluation procedure? Thanks!
Hi @rmokady, what a clever approach!
I'm trying this approach on my custom dataset and have managed to get training started. I'm working out how to add evaluation code to better monitor training, but in the meantime, what was your loss value when you stopped training in each mode: "Train only prefix" and "Train both prefix and GPT"?
Hi, here are my issues:
a) I'm getting very little variety in the outputs for (very) different custom images, e.g. "...sitting on a cellphone", "...with a cellphone", "...a cellphone and a surfboard".
b) The captions don't change across subsequent runs. Is this an issue with the model or with the weights?
c) Although I altered 'entry_length' and 'stop_token' (and also temperature and p), it has no effect on the caption whatsoever.
Does anyone have an idea what I'm doing wrong, or could you point me in the right direction for research?
Thank you in advance and all the best,
Hidéo
I have downloaded the pretrained weights from Google Drive, but something seems wrong with them: loading fails because of a mismatch.
https://drive.google.com/file/d/14pXWwB4Zm82rsDdvbGguLfx9F8aM7ovT/view?usp=sharing
https://drive.google.com/file/d/1IdaBtMSvtyzF0ByVaBHtvM0JYSXRExRX/view?usp=sharing
How can I fix this?
Can you provide the dataset API?
I am trying to implement a custom CLIP:

    model = clip_model.eval()
    device = CUDA(0) if is_gpu else "cpu"
    model = clip_model.to(device)
    use_beam_search = False
    import pandas as pd
    prefix_length=10
    df = pd.DataFrame([],columns=['image','desc'])
    for fname in [i for i in os.listdir('image') if i.endswith('.png')]:
        image = io.imread('image/'+fname)
        pil_image = PIL.Image.fromarray(image)
        image = preprocess(pil_image).unsqueeze(0).to(device)
        with torch.no_grad():
            prefix = model.encode_image(image).to(device, dtype=torch.float32)
            prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)
        if use_beam_search:
            generated_text_prefix = generate_beam(model, tokenizer, embed=prefix_embed)[0]
        else:
            generated_text_prefix = generate2(model, tokenizer, embed=prefix_embed)
        df.loc[len(df)] = [fname,generated_text_prefix]
        print(generated_text_prefix)

I get an error on the following line:

    #prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)
    ModuleAttributeError: 'CLIP' object has no attribute 'clip_project'
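The error message is consistent with `model` being bound to the CLIP encoder, which has `encode_image` but no `clip_project`; the projection belongs to the caption model. The sketch below reproduces the situation with two stand-in classes. `FakeClip` and `FakeCaptioner` are hypothetical stubs used only to show the name binding; they are not the real CLIP or ClipCap classes.

```python
class FakeClip:
    """Stand-in for the CLIP encoder: exposes encode_image only."""

    def encode_image(self, image):
        return [0.0] * 512


class FakeCaptioner:
    """Stand-in for the caption model: exposes clip_project only."""

    def clip_project(self, prefix):
        return [prefix] * 10  # placeholder for the projected prefix embeddings


# Keeping the two models under separate names avoids the clash.
clip_model = FakeClip()
caption_model = FakeCaptioner()

prefix = clip_model.encode_image("image placeholder")
prefix_embed = caption_model.clip_project(prefix)

# Re-binding a single name to the encoder reproduces the reported failure.
model = clip_model
try:
    model.clip_project(prefix)
    failed = False
except AttributeError as err:
    failed = True
    print(err)
```

Applied to the snippet in the post, this would mean calling `encode_image` on the CLIP model and `clip_project` (and the generation helpers) on a separately loaded caption model.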
Hi,
This is interesting work, thanks for sharing the code!
I see that you have already directed several users to Oscar's evaluation procedure to generate the JSON file with image captions. I tried that, but it requires many adaptations, as the models and the input data are very different. It looks like a delicate merge rather than just running another script.
Am I missing something? Did you use the run_captioning.py script with the evaluation flag?
Thanks
I found that both your clip_prefix_captioning_inference.ipynb and your demo https://replicate.com/rmokady/clip_prefix_caption worked very well for my images.
I am wondering how you obtained the weights behind your demo, specifically which hyperparameters you used.
For COCO mode, I ran python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ with a batch size of 40 and all other arguments at their defaults. I trained for many epochs and saved many checkpoints. However, the checkpoints I kept didn't work as well as yours.
Could you please tell me the specific hyperparameters and arguments for the weights behind clip_prefix_captioning_inference.ipynb and the COCO model in your demo https://replicate.com/rmokady/clip_prefix_caption? I would be extremely grateful!
Has this paper been accepted yet? Thank you.
Why does running this file keep hanging?
This is very nice work. Can you provide the code for evaluating it? I don't know how to evaluate it. Thank you!
I'm trying to refactor your model based on transformers, but I keep running into an error somewhere. I've tried many solutions and still have no clue what is wrong.

    class ClipCaptionModel(PreTrainedModel):
        def __init__(self, config):
            super(ClipCaptionModel, self).__init__(config)
            self.prefix_length = config.prefix_length
            self.clip_length = config.clip_length
            self.prefix_size = config.prefix_size
            self.num_layers = config.num_layers
            self.mapping_type = config.mapping_type
            decoder = config.decoder
            self.gpt = GPT2LMHeadModel.from_pretrained('uer/gpt2-chinese-cluecorpussmall')
            self.gpt_embedding_size = self.gpt.transformer.wte.weight.shape[1]
            self.clip_project = TransformerMapper(self.prefix_size, self.gpt_embedding_size, self.prefix_length, self.clip_length, self.num_layers) #(512,768,10,8)
            print(self.prefix_size, self.gpt_embedding_size, self.prefix_length, self.clip_length, self.num_layers)

        def get_dummy_token(self, batch_size: int, device: torch.device) -> torch.Tensor:
            return torch.zeros(batch_size, self.prefix_length, dtype=torch.int64, device=device)

        def forward(self,
                    tokens: torch.Tensor,
                    prefix: torch.Tensor,
                    mask: Optional[torch.Tensor] = None,
                    labels: Optional[torch.Tensor] = None):
            embedding_text = self.gpt.transformer.wte(tokens)
            print(prefix.shape)
            prefix_projections = self.clip_project(prefix).view(-1, self.prefix_length, self.gpt_embedding_size)
            embedding_cat = torch.cat((prefix_projections, embedding_text), dim=1)
            if labels is not None:
                dummy_token = self.get_dummy_token(tokens.shape[0], tokens.device)
                labels = torch.cat((dummy_token, tokens), dim=1)
            out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)
            return out

    class ClipCaptionPrefix(ClipCaptionModel):
        def parameters(self, recurse: bool = True):
            return self.clip_project.parameters()

        def train(self, mode: bool = True):
            super(ClipCaptionPrefix, self).train(mode)
            self.gpt.eval()
            return self

Here is the Colab notebook: https://colab.research.google.com/drive/1sEg9HbDwRPs9_SNVjjsPE_sk449P9Svc#scrollTo=3pP_n5oQrXPg&uniqifier=1
Hey, real quick question - I'm putting together a custom dataset of captions in the coco format, and it's pretty obvious how the "image_id" and "caption" values are put to use, but the "id" field isn't too clear. Is it necessary?
Hi - I'm having trouble with a prefix_size mismatch on my own fine-tuned model with CLIP features from ViT-B/32. I'm training only the transformer mapping network, with no GPT-2 fine-tuning, using the commands given in the README.
My understanding is that CLIP with ViT-B/32 uses prefix_size = 512 while ResNet encoders use prefix_size = 640 (e.g. Radford et al., Visual Transformers, Appendix F), which also matches train.py:
Line 355 in 1ad805a
However, in the prediction script, which I've essentially copied from the transformer notebook (https://github.com/rmokady/CLIP_prefix_caption/blob/main/notebooks/transformer_inference.ipynb), the prefix size is set to 640:
model = ClipCaptionPrefix(prefix_length, clip_length=40, prefix_size=640,
num_layers=8, mapping_type='transformer')
This worked for me for your pretrained coco model, but not my finetuned model, where I get a dimensionality mismatch between 512/640. Can you help me out here? Should I be using prefix_size = 640 or 512 for training/inference?
Thank you!
Stella
PS: FYI, there are a few typos in the commands in the README.md, where num_layres should be num_layers.
Hi, can you tell me whether it is ViT-B/32 that you use to obtain the results reported in the paper? Did you run experiments also with a ResNet-based CLIP model and do you have any observations which works better? Thanks!
Are there any important changes that need to be made to train.py to train a model on the Conceptual Captions 3M dataset?
I've been attempting to train a model myself using the author's recommended settings for training an MLP with GPT-2 fine-tuning. Hyperparameters such as learning rate, batch size, and number of layers are all the same as the author's. My only changes have been upgrading CLIP to ViT-B/16 and GPT-2 to gpt2-medium.
Using the train.py script I am able to get my training to run and the reported loss is decreasing through the epochs. The problem comes when I run inference. It always outputs "a model walks the runway at the fashion show during event." no matter what image I give it.
Any guidance would be appreciated.
I have a question about the pad_tokens function. I observe that it saves the padded tokens. If pad_tokens is then called again for the same item, the saved tokens, whose padded positions have already been set to 0, are used to generate the mask with ge(0); as a result, the mask is always one at every position. Am I missing something?
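The behaviour described above can be reproduced with a simplified re-creation in plain Python. This is a sketch based only on the description in the post, not a copy of the repository code: `ToyDataset` and both method names are hypothetical, lists stand in for tensors, and padding is marked with -1 before being overwritten with 0.

```python
class ToyDataset:
    def __init__(self, captions_tokens, max_seq_len):
        self.captions_tokens = captions_tokens
        self.max_seq_len = max_seq_len

    def pad_tokens_in_place(self, item):
        """Caches the padded list, then overwrites the padding markers in it."""
        tokens = self.captions_tokens[item]
        padding = self.max_seq_len - len(tokens)
        if padding > 0:
            tokens = tokens + [-1] * padding
            self.captions_tokens[item] = tokens  # padded list is cached
        mask = [1.0 if t >= 0 else 0.0 for t in tokens]
        for i, t in enumerate(tokens):
            if t < 0:
                tokens[i] = 0  # mutates the cached list as well
        return tokens, mask

    def pad_tokens_copy(self, item):
        """Builds tokens and mask from the original length, leaving the cache alone."""
        tokens = list(self.captions_tokens[item])[:self.max_seq_len]
        real = len(tokens)
        padding = self.max_seq_len - real
        return tokens + [0] * padding, [1.0] * real + [0.0] * padding


buggy = ToyDataset([[5, 6, 7]], max_seq_len=5)
_, first_mask = buggy.pad_tokens_in_place(0)
_, second_mask = buggy.pad_tokens_in_place(0)

fixed = ToyDataset([[5, 6, 7]], max_seq_len=5)
_, fixed_first = fixed.pad_tokens_copy(0)
_, fixed_second = fixed.pad_tokens_copy(0)
print(first_mask, second_mask, fixed_second)
```

In this toy version the second call sees a cached list with no negative entries, so the mask covers the padding too; deriving the mask from the unpadded length, or not caching the mutated list, keeps it stable across epochs.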
Can you help me run inference on a batch of images?
Hello, thank you very much for your excellent work! I tried to apply the model to a new dataset, and I processed the dataset according to the required data format, but there seemed to be some problems with my results.
The generated captions look like this: "12935.jpg ThereĠareĠtwoĠgroundtrackfieldsĠinĠtheĠimageĠaboveĠtheĠimageĠabove….".
I have checked that there are no redundant characters in my annotation. Could you please explain the cause? Could it be GPT2? Thank you very much!
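For background, "Ġ" (U+0120) is how GPT-2's byte-level BPE vocabulary writes a leading space inside a token string. Such markers typically show up when raw token strings are joined directly instead of being decoded by the tokenizer; whether that is what happened here is an assumption. The sketch below shows the mapping on a hand-written token list; `join_bpe_tokens` is a hypothetical helper and only handles the space marker.

```python
def join_bpe_tokens(tokens):
    """Join GPT-2 style token strings, mapping the 'Ġ' marker back to a space."""
    return "".join(tokens).replace("\u0120", " ").strip()


# Hand-written token strings in the style of the output quoted above.
tokens = ["There", "\u0120are", "\u0120two", "\u0120fields"]
caption = join_bpe_tokens(tokens)
print(caption)
```

With a Hugging Face tokenizer, decoding token ids through the tokenizer's `decode` method is the more robust route, since it also handles non-ASCII bytes that this simple replacement does not.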
Hi, I was wondering whether you could share the pretrained model weights, so that it is easier to try the model out without first having to train our own.
Hi,
Really cool work! One thing that could be cool (it's a feature that was recently added to HuggingFace Transformers) is the ability to load any custom PyTorch model from the hub, as introduced in this PR. The idea is that you create a new model on the hub (it's as easy as clicking "new model", then running git add ., git commit, and git push to upload the weights to the repo on the hub). You can also pass in a custom script defining your model. Next, one can load your custom model as follows:
from transformers import AutoModel
model = AutoModel.from_pretrained("rmokday/clip_prefix_caption", trust_remote_code=True)
This could be easier than hosting the models on Google Drive, and you don't need to define the model again every time someone wants to use it in a notebook :)
Do you mind trying this out?
Kind regards,
Niels
I would like to ask how ClipCap can generate multiple different sentences for one image.
I've changed the entry_count argument in the generate2() function, but the output sentence is the same.
I ran parse_conceptual.py, and it ended with the following error:
NotImplementedError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors, or that you (the operator writer) forgot to register a fallback function. Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
Could you please help?
Is there a more direct way to obtain the pretrained weights than by using Colab + Google Drive?
I have trained the model (both MLP and GPT-2) using the CC3M dataset but the loss doesn't seem to decrease very much (stays around 3.0). What loss can I expect for a good model? How many epochs should I run it for? Also, is any specific hyperparameter tuning required for CC? I have a model trained for 5 epochs but it generates a similar caption for every image. I tried fitting on a batch of 512 image-caption pairs and everything works out so I don't think there is any logical issue with the pipeline. Please let me know.
When I ran your code, I got an error with MappingType.