
clip_prefix_caption's People

Contributors

ak391, amirhertz, andreasjansson, neverix, rmokady


clip_prefix_caption's Issues

CUDA out of memory

Hello,

Would you happen to have any tips for training the model? I keep running into a CUDA out-of-memory error despite training on a P100 with 16 GB of GPU RAM:

RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.90 GiB total capacity; 14.75 GiB already allocated; 23.75 MiB free; 14.83 GiB reserved in total by PyTorch)

I've tried decreasing the batch size from 40 to 10 with no improvement; in fact it then fails at a slightly different point:

RuntimeError: CUDA out of memory. Tried to allocate 892.00 MiB (GPU 0; 15.90 GiB total capacity; 14.13 GiB already allocated; 193.75 MiB free; 14.66 GiB reserved in total by PyTorch)

Setting the batch_size to 2 seems to get things running, albeit very slowly. What hardware did you train on?
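For anyone hitting the same wall, gradient accumulation is one way to keep the effective batch size at 40 while each forward pass only sees a small micro-batch. This is only a sketch of how the inner loop of train.py might be adapted (model, optimizer, train_dataloader, device, and prefix_length are names the existing script already uses; accum_steps is invented here), not something the repo provides:

import torch
import torch.nn.functional as nnf

accum_steps = 8  # e.g. micro-batch of 5 with 8 accumulation steps ~ effective batch of 40
optimizer.zero_grad()
for step, (tokens, mask, prefix) in enumerate(train_dataloader):
    tokens, mask = tokens.to(device), mask.to(device)
    prefix = prefix.to(device, dtype=torch.float32)
    outputs = model(tokens, prefix, mask)
    logits = outputs.logits[:, prefix_length - 1: -1]
    loss = nnf.cross_entropy(logits.reshape(-1, logits.shape[-1]), tokens.flatten(), ignore_index=0)
    (loss / accum_steps).backward()  # scale so accumulated gradients match a full batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()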

Getting multiple caption outputs

Hi,
First of all, I want to say thanks a lot for this repo!

My question is: how can I get an output of n captions for an input image, all of similarly high quality to the first one?

For example, if n=3, I'd like to get 3 different captions that each describe the same image as well as possible.
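Not the author, but one possible direction, assuming the generate_beam helper from the inference notebook, which accepts a beam_size argument and returns its candidate captions ordered by score (model, tokenizer, and prefix_embed are the notebook's variables):

n = 3  # number of captions wanted
# Instead of keeping only the top beam ([0]), keep the n best beam candidates.
candidates = generate_beam(model, tokenizer, embed=prefix_embed, beam_size=n)
for rank, caption in enumerate(candidates[:n], start=1):
    print(f"{rank}: {caption}")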

I'm curious about the training dataset to evaluate "nocaps".

Hello. Thanks for sharing your code!

I'm curious about the training dataset to evaluate "nocaps".

I was wondering if the model (the ClipCap model) was trained on the train split of the MS COCO Karpathy splits and evaluated on the validation split of nocaps.

Reproducing validation results

Hello,

Thanks for the great work! I was interested in reproducing the transformer network with frozen GPT-2, and achieved slightly lower performance on COCO so far:

Metric    Reported    Reproduced
BLEU@4    33.53       31.0
METEOR    27.45       27.1
CIDEr     113.08      105.7
SPICE     21.05       20.4

I was wondering if the provided code should be able to reproduce the validation scores or if I am missing something?

have you tried different CLIP models?

Hi @rmokady,
Thank you for your nice work, I learned a lot from it. Since the default CLIP model you are using seems to be the ViT-B/32 version, I am wondering if you have tried other visual features, e.g. from ViT-L or the ResNet models? I can't find it mentioned in the paper. I'm trying to train a similar model at the moment and assume the features extracted from bigger vision encoders would contain more information.

Best, David

Finetuning on custom data (videos)

Hi,

Congrats on the amazing project. I have a video dataset with captions, and I want to fine-tune this model on it to generate custom captions. I am planning to extract frame-wise features and concatenate them during training (see the sketch below). Please let me know how I can fine-tune on a custom dataset.

Thanks in advance.
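To make the frame-wise idea concrete, here is a rough sketch (not from the repo) of pooling per-frame CLIP features into a single prefix vector; the frame file names are placeholders:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# A few frames sampled from the video (paths are illustrative only).
frame_paths = ["frame_000.jpg", "frame_008.jpg", "frame_016.jpg"]

with torch.no_grad():
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    features = clip_model.encode_image(frames)           # (num_frames, 512) for ViT-B/32
    video_prefix = features.mean(dim=0, keepdim=True)    # (1, 512) mean-pooled "video" prefix
# Alternatively, keep the per-frame features and concatenate them along the
# sequence dimension so the transformer mapping network can attend over frames.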

Clip text encoder instead of language model (GPT-2)

Hi thanks for providing great work!
Just in case, did you experiment with using CLIP's text encoder instead of the pretrained language model?
Of course, I realize the CLIP text encoder only learns alignment with visual features, but I'm still curious whether it has any generation ability.

Thanks

Not able to train custom data.

I am trying to train on a dataset of roughly 3000 images on a Google Colab GPU and get the GPU error below. When I give it only 50 images to process, it works fine. But I don't believe it should be a GPU issue, since COCO trains with more than 10,000 images. I checked the format of the data and images, and they look fine according to the COCO data format. Any leads on this would be appreciated.

Downloading: 100% 0.99M/0.99M [00:00<00:00, 7.93MB/s]
Downloading: 100% 446k/446k [00:00<00:00, 5.78MB/s]
Downloading: 100% 665/665 [00:00<00:00, 1.13MB/s]
Data size is 3060
Token indices sequence length is longer than the specified maximum sequence length for this model (1140 > 1024). Running this sequence through the model will result in indexing errors
Downloading: 100% 523M/523M [00:07<00:00, 69.9MB/s]
Train both prefix and GPT
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
FutureWarning,

Training epoch 0
coco_prefix: 0% 0/76 [00:00<?, ?it/s]Traceback (most recent call last):
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 370, in <module>
main()
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 366, in main
train(dataset, model, args, output_dir=args.out_dir, output_prefix=args.prefix)
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 314, in train
outputs = model(tokens, prefix, mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 233, in forward
out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 1061, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 397, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 332, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 189, in _attn
attn_weights = torch.matmul(query, key.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 430.00 MiB (GPU 0; 14.76 GiB total capacity; 13.54 GiB already allocated; 145.75 MiB free; 13.69 GiB reserved in total by PyTorch)
coco_prefix: 0% 0/76 [00:00<?, ?it/s]

LABEL LEAKAGE During Training

In the code at https://github.com/rmokady/CLIP_prefix_caption/blob/main/parse_coco.py#L26, you refer to the validation set when building the training samples. Also, in your file https://drive.google.com/file/d/1D3EzUK1d1lNhD2hAvRiKPThidiVbP2K_/view?usp=sharing there are validation captions (labels) in the training dataset, which results in obvious label leakage, since the model is trained on validation samples. In fact, if we train the model for more than 10 epochs (e.g. 100 epochs), the scores on the validation set become unreasonably high. Can you explain this?

Docker run: {"detail":"Not Found"}

I am following the Replicate docs for this model to run on my own machine https://replicate.com/rmokady/clip_prefix_caption

I can run Docker fine with docker run -d -p 5000:5000 --gpus=all r8.im/rmokady/clip_prefix_caption@sha256:d703881e7b50eb009779b7e1e79d394639df0af71be28c4100f88c03fdccdbf0

However when I run curl, all inputs give {"detail":"Not Found"}

E.g.:

$ curl http://localhost:5000/predict -X POST \
  -F image=@red_panda.jpg \
  -F model=conceptual-captions \
  -F use_beam_search=False
{"detail":"Not Found"}

$ curl http://localhost:5000/predict -X POST \
  -F image=@red_panda.jpg \
  -F model=coco \
  -F use_beam_search=False
{"detail":"Not Found"}

I'm inferring that coco and conceptual-captions are the correct model names to use in this context; however, I get the same output ({"detail":"Not Found"}) when passing other strings as well (e.g. -F model=example). Is there something I'm missing, or could the docs be made clearer?

Evaluation Code

Hello, great project :)

I wanted to try some things with your code and evaluate the results with the same metrics you used on COCO or Conceptual Captions.
Is it possible for you to share the evaluation code for these metrics? And what split of the dataset did you evaluate against?

Thanks!

inference

When I run generate_beam for captioning, there are many spaces in the generated caption. Do you know why? Thank you.
[screenshot attached to the original issue]

Nit: Swapped links in README

Hey, super small minor thing: In your README, you swapped the links for training data and validation data:

Download [training images](http://images.cocodataset.org/zips/val2014.zip) and [validation images](http://images.cocodataset.org/zips/train2014.zip) and unzip (We use Karpathy et el. split).

Obviously a small nit, but also an easy fix in the future.

Issues When Training with WebDatasets/Larger GPT2 Model

I have attempted to train with the gpt2-xl model from huggingface as well as a custom dataloader that can preprocess webdataset archives, but am having issues when it comes to inference.

I have created a small script to test inference (inference_test.py) on my fork but it seems to repeatedly generate the same unrelated caption for any input (below are some samples from the RedCaps dataset):

[screenshot attached to the original issue]

I'm not too sure what has gone wrong, but I believe it's either down to my dataloader not preprocessing the dataset in the correct way, or it's down to the model not supporting GPT2-xl.

  • The preprocessing script is at parse_webdatasets.py and is modified from rom1504/clip-retrieval's inference script. It does more of the transformations usually seen in train.py beforehand to optimise the training speed, and generates many .npy files as a result.
  • These files are loaded with the custom dataloader on train.py.

Do you know what might be causing this?

Thanks in advance

Reproducing loss

Hi @rmokady, what a clever approach!

I'm trying this approach on my custom dataset and have managed to get it to start training. I'm working out how to add evaluation code to better manage the training, but in the meantime, I wonder what your loss was when you stopped training in each mode: train only prefix, and train both prefix and GPT?

Low variety of outputs.

Heihei, here are my issues:

a) I'm getting a very low variety of outputs with ([very] different) custom images, e.g. "...sitting on a cellphone", "...with a cellphone", "...a cellphone and a surfboard".

b) The captions don't change across subsequent runs. Is this an issue with the model or with the weights?

c) Although I altered 'entry_length' and 'stop_token' (and also temperature and p), it has no effect on the captions whatsoever.

Does anyone have an idea what I'm doing wrong, or could you point me in the right direction for research?

Thank you in advance and all the best,
Hidéo

A question about the x dimension.

When I train only the transformer mapping network, I find that the dimension of x is (40, 512), but prefix_dim = 640. I don't know why this is happening. Is it caused by the extraction of the CLIP features? Hope to get your help, thank you.

[screenshot attached to the original issue]

'CLIP' object has no attribute 'clip_project'

I am trying to run inference with a custom CLIP model:


import os
import PIL.Image
import pandas as pd
import torch
from skimage import io

# clip_model, preprocess, tokenizer, generate_beam, generate2, CUDA and is_gpu
# come from the inference notebook.
device = CUDA(0) if is_gpu else "cpu"
model = clip_model.eval().to(device)

use_beam_search = False
prefix_length = 10

df = pd.DataFrame([], columns=['image', 'desc'])
for fname in [i for i in os.listdir('image') if i.endswith('.png')]:
    image = io.imread('image/' + fname)
    pil_image = PIL.Image.fromarray(image)
    image = preprocess(pil_image).unsqueeze(0).to(device)

    with torch.no_grad():
        prefix = model.encode_image(image).to(device, dtype=torch.float32)
        prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)  # error raised here

    if use_beam_search:
        generated_text_prefix = generate_beam(model, tokenizer, embed=prefix_embed)[0]
    else:
        generated_text_prefix = generate2(model, tokenizer, embed=prefix_embed)

    df.loc[len(df)] = [fname, generated_text_prefix]
    print(generated_text_prefix)

The error occurs on this line:

prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)

ModuleAttributeError: 'CLIP' object has no attribute 'clip_project'
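For context, in the inference notebook the CLIP encoder and the ClipCap captioning model are two separate objects, and clip_project is an attribute of the latter. A hedged sketch of that split (ClipCaptionModel, prefix_length, and pil_image follow the notebook; the weights path is illustrative):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

caption_model = ClipCaptionModel(prefix_length)  # defined in the notebook / train.py
caption_model.load_state_dict(torch.load("coco_weights.pt", map_location="cpu"))  # illustrative path
caption_model = caption_model.eval().to(device)

with torch.no_grad():
    image = preprocess(pil_image).unsqueeze(0).to(device)
    prefix = clip_model.encode_image(image).to(device, dtype=torch.float32)
    # clip_project belongs to the ClipCap model, not to the CLIP encoder:
    prefix_embed = caption_model.clip_project(prefix).reshape(1, prefix_length, -1)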

Evaluation script for COCO

Hi,

That's interesting work, thanks for sharing the code!

I see that you already directed several users to the evaluation procedure of Oscar to generate the JSON file with image captions. I tried that, but it requires many adaptations, as the models and the input data are very different. It seems like a delicate merge, rather than just running another script.
Am I missing something? Did you use the run_captioning.py script with the evaluation flag?

Thanks

weight behind clip_prefix_captioning_inference.ipynb and your demo

I found that both your clip_prefix_captioning_inference.ipynb and your demo at https://replicate.com/rmokady/clip_prefix_caption worked very well for my images.
I am wondering how you obtained the weights behind your demo, i.e. the specific hyperparameters you used.
For the COCO model, I ran python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ with batch size 40 and all other arguments at their defaults. I trained for many epochs and saved many checkpoints. However, the weights I got don't work as well as yours.

Could you please tell me the specific hyperparameters and arguments behind the weights used in clip_prefix_captioning_inference.ipynb and in your demo at https://replicate.com/rmokady/clip_prefix_caption with the COCO model? I would be extremely grateful!

Evaluate problem

It's very nice work. Can you provide the code for evaluating it? I don't know how to evaluate it. Thank you!

An image loss error occurred while extracting image features.

Thank you for sharing amazing work.
The following problem occurred when I used parse_coco.py to extract features: I checked the dataset and searched the Internet, and couldn't find coco_val2014_000000116100.jpg. I hope to get your help, thanks.
[screenshot attached to the original issue]

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I'm trying to refactor your model to be based on the transformers library, but I'm having a problem: there's always an error somewhere, and although I've tried a lot of solutions I don't have a clue.
[screenshot attached to the original issue]

from typing import Optional

import torch
from transformers import GPT2LMHeadModel, PreTrainedModel

# TransformerMapper is the mapping network defined in the repo's train.py.


class ClipCaptionModel(PreTrainedModel):
    def __init__(self, config):
        super(ClipCaptionModel, self).__init__(config)
        self.prefix_length = config.prefix_length
        self.clip_length = config.clip_length
        self.prefix_size = config.prefix_size
        self.num_layers = config.num_layers
        self.mapping_type = config.mapping_type
        decoder = config.decoder
        self.gpt = GPT2LMHeadModel.from_pretrained('uer/gpt2-chinese-cluecorpussmall')
        self.gpt_embedding_size = self.gpt.transformer.wte.weight.shape[1]
        self.clip_project = TransformerMapper(self.prefix_size, self.gpt_embedding_size,
                                              self.prefix_length, self.clip_length,
                                              self.num_layers)  # (512,768,10,8)
        print(self.prefix_size, self.gpt_embedding_size, self.prefix_length,
              self.clip_length, self.num_layers)

    def get_dummy_token(self, batch_size: int, device: torch.device) -> torch.Tensor:
        return torch.zeros(batch_size, self.prefix_length, dtype=torch.int64, device=device)

    def forward(self,
                tokens: torch.Tensor,
                prefix: torch.Tensor,
                mask: Optional[torch.Tensor] = None,
                labels: Optional[torch.Tensor] = None):
        embedding_text = self.gpt.transformer.wte(tokens)
        print(prefix.shape)
        prefix_projections = self.clip_project(prefix).view(-1, self.prefix_length, self.gpt_embedding_size)
        embedding_cat = torch.cat((prefix_projections, embedding_text), dim=1)
        if labels is not None:
            dummy_token = self.get_dummy_token(tokens.shape[0], tokens.device)
            labels = torch.cat((dummy_token, tokens), dim=1)
        out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)
        return out


class ClipCaptionPrefix(ClipCaptionModel):

    def parameters(self, recurse: bool = True):
        return self.clip_project.parameters()

    def train(self, mode: bool = True):
        super(ClipCaptionPrefix, self).train(mode)
        self.gpt.eval()
        return self
Here is the Colab notebook: https://colab.research.google.com/drive/1sEg9HbDwRPs9_SNVjjsPE_sk449P9Svc#scrollTo=3pP_n5oQrXPg&uniqifier=1

purpose of "id" element in coco train data?

Hey, real quick question: I'm putting together a custom dataset of captions in the COCO format, and it's pretty obvious how the "image_id" and "caption" values are used, but the purpose of the "id" field isn't clear. Is it necessary?
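For reference, a COCO-style caption annotation looks roughly like the entry below (the values are made up); in the standard COCO format, "id" is simply the unique identifier of the caption record itself, since several captions can share one image_id:

# Illustrative COCO-style caption annotation (placeholder values).
annotation = {
    "image_id": 391895,  # which image the caption describes
    "id": 770337,        # unique id of this caption record (multiple captions per image)
    "caption": "A man riding a motorcycle on a dirt road.",
}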

prefix size for Vit/ResNet

Hi, I'm having trouble with a prefix_size mismatch on my own finetuned model with CLIP features from ViT-B/32. I'm learning just the transformer mapping with no GPT-2 finetuning, using the commands given in the README.

My understanding is that CLIP with ViT-B/32 uses prefix_size = 512 while the ResNet encoders use prefix_size = 640 (e.g. Radford et al.'s appendix on the visual transformers, and also here in train.py):

prefix_dim = 640 if args.is_rn else 512

However, in the prediction script, which I've essentially copied from the transformer notebook (https://github.com/rmokady/CLIP_prefix_caption/blob/main/notebooks/transformer_inference.ipynb), the prefix size is set to 640:

model = ClipCaptionPrefix(prefix_length, clip_length=40, prefix_size=640,
                                  num_layers=8, mapping_type='transformer')

This worked for me for your pretrained coco model, but not my finetuned model, where I get a dimensionality mismatch between 512/640. Can you help me out here? Should I be using prefix_size = 640 or 512 for training/inference?

Thank you!
Stella

PS: FYI, there are a few typos in the commands in the README.md, where num_layres should be num_layers.
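For what it's worth, a small sketch of keeping the two sizes consistent by deriving prefix_size from the CLIP backbone string, combining the train.py line quoted above with the notebook's constructor call (clip_model_type is an assumption; ClipCaptionPrefix and prefix_length come from the notebook):

clip_model_type = "ViT-B/32"  # or e.g. "RN50x4" for a ResNet backbone
is_rn = clip_model_type.startswith("RN")
prefix_size = 640 if is_rn else 512  # mirrors train.py: 640 for ResNet features, 512 for ViT-B/32

model = ClipCaptionPrefix(prefix_length, clip_length=40, prefix_size=prefix_size,
                          num_layers=8, mapping_type='transformer')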

Which CLIP model?

Hi, can you tell me whether it is ViT-B/32 that you use to obtain the results reported in the paper? Did you run experiments also with a ResNet-based CLIP model and do you have any observations which works better? Thanks!

Training on Conceptual Captions

Are there any important changes that need to be made to the train.py file to train a model on the Conceptual Captions 3M dataset?

I've been attempting to train a model myself using the author's recommended settings for training an MLP with GPT-2 finetuning. Hyperparameters such as learning rate, batch size, and number of layers are all the same as the author's. My only changes have been upgrading CLIP to ViT-B/16 and GPT-2 to gpt2-medium.

Using the train.py script I am able to get my training to run and the reported loss is decreasing through the epochs. The problem comes when I run inference. It always outputs "a model walks the runway at the fashion show during event." no matter what image I give it.

Any guidance would be appreciated.

Question About the ClipCocoDataset in train.py

[screenshot of the ClipCocoDataset code attached to the original issue]

I have a question about the code above. I observe that the pad_tokens function saves the padded tokens. However, if we call pad_tokens again for the same item, the saved tokens (whose padded part has already been set to 0) are used to generate the mask with ge(0), so the mask ends up being one at every position. Am I missing something?
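A minimal, self-contained reproduction of the behavior being described (illustrative names, not the exact train.py code): the cached tensor is zeroed in place on the first call, so on the second call ge(0) no longer distinguishes padding.

import torch

max_seq_len = 6
cache = {}

def pad_tokens(item, tokens):
    if item not in cache:
        padding = max_seq_len - tokens.shape[0]
        # pad with -1 and cache the padded tensor (the same object is reused later)
        cache[item] = torch.cat((tokens, torch.full((padding,), -1, dtype=torch.int64)))
    tokens = cache[item]
    mask = tokens.ge(0)    # padding (-1) -> False, but only on the first call
    tokens[~mask] = 0      # in-place write also mutates the cached tensor
    return tokens, mask.float()

t = torch.tensor([5, 7, 9], dtype=torch.int64)
_, mask_first = pad_tokens(0, t)
_, mask_second = pad_tokens(0, t)
print(mask_first)   # tensor([1., 1., 1., 0., 0., 0.])
print(mask_second)  # tensor([1., 1., 1., 1., 1., 1.])  <- padded positions now look valid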

A question about applying models to new datasets

Hello, thank you very much for your excellent work! I tried to apply the model to a new dataset, and I processed the dataset according to the required data format, but there seemed to be some problems with my results.

The resulting captions look like: "12935.jpg ThereĠareĠtwoĠgroundtrackfieldsĠinĠtheĠimageĠaboveĠtheĠimageĠabove…".

I have checked that there are no redundant characters in my annotation. Could you please explain the cause? Could it be GPT2? Thank you very much!

Add model to HuggingFace

Hi,

Really cool work! One thing that could be cool (it's a feature that was recently added to HuggingFace Transformers) is the ability to load any custom PyTorch model from the hub as introduced in this PR. The idea would be that you can create a new model on the hub (it's as easy as clicking "new model", then git add ., git commit, git push to upload the weights to the repo on the hub). You can also pass in a custom script defining your model. Next, one can load your custom model as follows:

from transformers import AutoModel

model = AutoModel.from_pretrained("rmokady/clip_prefix_caption", trust_remote_code=True)

This could be easier compared to hosting the models on Google Drive, and you wouldn't need to redefine the model every time someone wants to use it in a notebook :)

Do you mind trying this out?

Kind regards,

Niels

parse_conceptual.py error

I ran parse_conceptual.py and it ended with the following error:

NotImplementedError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors, or that you (the operator writer) forgot to register a fallback function. Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

Could you please help?

Conceptual Captions Training

I have trained the model (both MLP and GPT-2) using the CC3M dataset, but the loss doesn't seem to decrease very much (it stays around 3.0). What loss can I expect for a good model? How many epochs should I run it for? Also, is any specific hyperparameter tuning required for CC? I have a model trained for 5 epochs, but it generates a similar caption for every image. I tried overfitting on a batch of 512 image-caption pairs and everything works out, so I don't think there is a logic issue in the pipeline. Please let me know.
