rmokady / clip_prefix_caption

Simple image captioning model

License: MIT License

Jupyter Notebook 97.29% Python 2.71%

clip_prefix_caption's Issues

Evaluation Code

Hello, great project :)

I wanted to try some things with your code and evaluate the results with the same metrics you used on COCO or Conceptual Captions.
Is it possible for you to share the evaluation code for these metrics? And what split of the dataset did you check those against?

Thanks!

Final token index when training a model

Hi, thanks a lot for making the code available, it's a great resource to use!

I was wondering why the index when computing the loss from the output of gpt is shifted by 1 on the left:
https://github.com/rmokady/CLIP_prefix_caption/blob/main/train.py#L315

Shouldn't it be logits = outputs.logits[:, dataset.prefix_length:]?

Thanks!
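
A note on the indexing question above: in a causal language model, the output at position i is a prediction for the token at position i + 1. The toy sketch below uses no real model and only shows which output positions line up with the caption tokens when a prefix comes first. The names prefix_length, caption and target_positions are illustrative, and the slice form [prefix_length - 1 : -1] is an assumption about what the linked line does.

```python
# Toy illustration of next-token alignment with a prefix (no real model involved).
prefix_length = 3
caption = ["a", "dog", "runs"]            # caption tokens t0, t1, t2

# Inputs fed to the language model: prefix slots followed by caption tokens.
inputs = [f"<p{i}>" for i in range(prefix_length)] + caption

def target_positions(prefix_length, caption_length):
    """Output positions whose predictions are scored against the caption tokens."""
    # The output at position i predicts input i + 1, so the prediction for
    # caption token t_k sits at position prefix_length - 1 + k.
    start = prefix_length - 1
    return list(range(start, start + caption_length))

positions = target_positions(prefix_length, len(caption))
for pos, tok in zip(positions, caption):
    print(f"output[{pos}] (after seeing {inputs[pos]!r}) is scored against {tok!r}")

# The same positions expressed as a slice over all output positions.
print(list(range(len(inputs)))[prefix_length - 1:-1])
```

Under this reading, starting the slice at prefix_length would score each caption token against the prediction for the token after it, and the final output position has no target at all.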

Has this paper been accepted yet? Thank you.

inference

When I run generate_beam to produce a caption, there are many spaces in the caption. Do you know why? Thank you.

Dear friend, please tell me why I can't run the example. I've tried a lot of solutions, but I don't know if they are right.

I am a newbie.
I can't run the code from CLIP_prefix_caption.
Hope to get your help, thank you.

Training on Conceptual Captions

Are there any important changes that need to be made to the train.py file to train a model with the Conceptual Captions 3M dataset?

I've been attempting to train a model myself using the author's recommended settings for training an MLP with GPT2 finetuning. Hyperparameters such as learning rate, batch size, and number of layers are all the same as the author's. My only changes have been upgrading CLIP to ViT-B/16 and GPT2 to gpt2-medium.

Using the train.py script I am able to get my training to run and the reported loss is decreasing through the epochs. The problem comes when I run inference. It always outputs "a model walks the runway at the fashion show during event." no matter what image I give it.

Any guidance would be appreciated.

Conceptual Captions Training

I have trained the model (both MLP and GPT-2) using the CC3M dataset but the loss doesn't seem to decrease very much (stays around 3.0). What loss can I expect for a good model? How many epochs should I run it for? Also, is any specific hyperparameter tuning required for CC? I have a model trained for 5 epochs but it generates a similar caption for every image. I tried fitting on a batch of 512 image-caption pairs and everything works out so I don't think there is any logical issue with the pipeline. Please let me know.

purpose of "id" element in coco train data?

Hey, real quick question - I'm putting together a custom dataset of captions in the coco format, and it's pretty obvious how the "image_id" and "caption" values are put to use, but the "id" field isn't too clear. Is it necessary?

labels in MLP+GPT2 tuning?

I see that your code does not use labels. Does this not affect the results of MLP+GPT2 tuning?

different caption for the same image

Dear @rmokady ,

Thank you for your great work, it is very interesting.

Is there a way to generate different captions for the same image?

Thank you for your help.

Best Wishes,

Alex

Adding code for batched run with multiple gpu support, for faster training?

Thank you for your work. Do you have any plans to add code that supports multiple GPUs?

Issues When Training with WebDatasets/Larger GPT2 Model

I have attempted to train with the gpt2-xl model from huggingface as well as a custom dataloader that can preprocess webdataset archives, but am having issues when it comes to inference.

I have created a small script to test inference (inference_test.py) on my fork but it seems to repeatedly generate the same unrelated caption for any input (below are some samples from the RedCaps dataset):

I'm not too sure what has gone wrong, but I believe it's either down to my dataloader not preprocessing the dataset in the correct way, or it's down to the model not supporting GPT2-xl.

The preprocessing script is at parse_webdatasets.py and is modified from rom1504/clip-retrieval's inference script. It does more of the transformations usually seen in train.py beforehand to optimise the training speed, and generates many .npy files as a result.
These files are loaded with the custom dataloader on train.py.

Do you know what might be causing this?

Thanks in advance

clip_prefix_captioning_inference.ipynb

Why does this file keep hanging when I run it?

Finetuning on custom data (videos)

Hi,

Congrats on the amazing project. I have a video dataset with captions, and I want to fine-tune this model on it to generate custom captions. I am planning to extract frame-wise features and concatenate them during training. Please let me know how I can fine-tune on a custom dataset.

Thanks in advance.

Can you share the evaluation procedure? Thanks!

Which CLIP model?

Hi, can you tell me whether it is ViT-B/32 that you use to obtain the results reported in the paper? Did you run experiments also with a ResNet-based CLIP model and do you have any observations which works better? Thanks!

'CLIP' object has no attribute 'clip_project'

I am trying to implement a custom clip:


model = clip_model.eval() 
device = CUDA(0) if is_gpu else "cpu"
model = clip_model.to(device)

use_beam_search = False 

import pandas as pd
prefix_length=10


df = pd.DataFrame([],columns=['image','desc'])
for fname in [i for i in os.listdir('image') if i.endswith('.png')]:
  image = io.imread('image/'+fname)
  pil_image = PIL.Image.fromarray(image)  
  image = preprocess(pil_image).unsqueeze(0).to(device)

  with torch.no_grad():
      prefix = model.encode_image(image).to(device, dtype=torch.float32)
      prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)
      
  if use_beam_search:
      generated_text_prefix = generate_beam(model, tokenizer, embed=prefix_embed)[0]
  else:
      generated_text_prefix = generate2(model, tokenizer, embed=prefix_embed)

  df.loc[len(df)] = [fname,generated_text_prefix]
  print(generated_text_prefix)

Getting error in the below code:

#prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)

ModuleAttributeError: 'CLIP' object has no attribute 'clip_project'
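
A possible reading of the error above: in the snippet, model is assigned from clip_model, so model.clip_project is looked up on the CLIP encoder itself. Judging from the ClipCaptionModel code quoted in a later issue on this page, clip_project is an attribute of the captioning model, which is a separate object from the encoder. The sketch below uses hypothetical stub classes (StubClip, StubCaptionModel) rather than the real libraries, just to show the two roles kept apart.

```python
# Hypothetical stand-ins: neither class is the real CLIP or ClipCap code.
class StubClip:
    """Plays the role of the CLIP encoder."""
    def encode_image(self, image):
        return [0.0] * 512               # pretend 512-d image embedding

class StubCaptionModel:
    """Plays the role of the captioning model that owns clip_project."""
    def __init__(self, prefix_length, embed_size):
        self.prefix_length = prefix_length
        self.embed_size = embed_size

    def clip_project(self, prefix):
        # Pretend mapping from one CLIP vector to prefix_length embeddings.
        return [[0.0] * self.embed_size for _ in range(self.prefix_length)]

clip_model = StubClip()
caption_model = StubCaptionModel(prefix_length=10, embed_size=768)

prefix = clip_model.encode_image("image")            # encoder: image -> vector
prefix_embed = caption_model.clip_project(prefix)    # mapper lives on the caption model

print(hasattr(clip_model, "clip_project"))     # False
print(hasattr(caption_model, "clip_project"))  # True
```

In the original snippet this would mean keeping two variables, one for the CLIP encoder and one for the caption model, and passing the caption model to generate2 or generate_beam.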

LABEL LEAKAGE During Training

In the code at https://github.com/rmokady/CLIP_prefix_caption/blob/main/parse_coco.py#L26, you refer to the validation set to get the training samples. Also, in your file https://drive.google.com/file/d/1D3EzUK1d1lNhD2hAvRiKPThidiVbP2K_/view?usp=sharing, there are validation captions (labels) in the training dataset, which results in obvious label leakage, since the model will be trained on the validation samples. In fact, if we train the model for more than 10 epochs (e.g. 100 epochs), the results on the validation set can reach unreasonably high scores. Can you explain this?

Question About the ClipCocoDataset in train.py

I have a question about the ClipCocoDataset code. I observe that the pad_tokens function saves the padded tokens. If pad_tokens is called again for the same item, the saved tokens, whose padded positions have already been set to 0, are used to generate the mask with ge(0). As a result, the mask will always be one at every position. Am I missing something?
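
The behaviour described in this question can be reproduced with plain lists. The sketch below is a toy model of it, not the code from train.py: pad_inplace and pad_copy are hypothetical names, and using -1 as the padding marker is an assumption made so that a ge(0)-style test can tell padding from real tokens.

```python
# Toy reproduction of the caching behaviour described above, using plain lists.
MAX_LEN = 5

def pad_inplace(cache, idx):
    tokens = cache[idx]
    if len(tokens) < MAX_LEN:
        tokens = tokens + [-1] * (MAX_LEN - len(tokens))   # -1 marks padding
        cache[idx] = tokens                                 # padded list is cached
    mask = [t >= 0 for t in tokens]
    for i, keep in enumerate(mask):
        if not keep:
            tokens[i] = 0            # also overwrites the cached list (same object)
    return tokens, mask

def pad_copy(cache, idx):
    tokens = cache[idx]
    if len(tokens) < MAX_LEN:
        tokens = tokens + [-1] * (MAX_LEN - len(tokens))
        cache[idx] = tokens
    mask = [t >= 0 for t in tokens]
    out = [t if keep else 0 for t, keep in zip(tokens, mask)]  # cache keeps its -1s
    return out, mask

cache = {0: [7, 8, 9]}
first_inplace = pad_inplace(cache, 0)[1]
second_inplace = pad_inplace(cache, 0)[1]
print(first_inplace)    # [True, True, True, False, False]
print(second_inplace)   # [True, True, True, True, True]

cache = {0: [7, 8, 9]}
first_copy = pad_copy(cache, 0)[1]
second_copy = pad_copy(cache, 0)[1]
print(second_copy)      # [True, True, True, False, False]
```

If the real function behaves like pad_inplace, the mask is only correct the first time an item is visited. Zeroing the padding on a copy, as in pad_copy, keeps it correct on later visits.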

Incorrect use of labels for GPT2?

Hi,

As far as I understand the model and the usage of GPT2, shouldn't the get_dummy_token function return torch.ones() * -100 instead of torch.zeros()? This is because we should be ignoring the outputs of GPT2 for these prefix inputs. Currently, it's forcing the model to predict token 0 which is the exclamation mark ("!").

Reference lines: https://github.com/rmokady/CLIP_prefix_caption/blob/main/train.py#L222-L223

Thanks!
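
For context on the -100 suggestion: torch.nn.CrossEntropyLoss skips positions whose label equals its ignore_index, which defaults to -100. The pure-Python sketch below mimics that rule on toy numbers to show how labelling prefix positions as token 0 changes the loss compared with ignoring them. The function and the numbers are illustrative only. The choice also only matters when labels are actually passed to GPT-2, and another issue on this page notes that the training code does not use labels.

```python
import math

IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count

# Two prefix positions followed by two caption positions, vocabulary of 3.
logits = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
caption = [0, 2]

labels_zero = [0, 0] + caption                 # prefix labelled as token 0
labels_ignored = [IGNORE_INDEX] * 2 + caption  # prefix excluded from the loss

loss_zero = cross_entropy(logits, labels_zero)
loss_ignored = cross_entropy(logits, labels_ignored)
print(loss_zero > loss_ignored)   # True: the prefix rows add loss for not predicting token 0
```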

I would like to ask how can Clip-Cap generate multiple different sentences for one image?

I've changed the entry_count count in the generate2() function, but the output sentence is the same.
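
One possible explanation, stated as an assumption rather than a confirmed reading of generate2: if the decoding loop always picks the highest-scoring token, every run yields the same sentence no matter how many entries are requested. Sampling from the filtered distribution gives variety. The sketch below shows nucleus (top-p) sampling on toy logits in pure Python; top_p_sample is a hypothetical helper, not a function from the repository.

```python
import math
import random

def top_p_sample(logits, top_p=0.8, temperature=1.0, rng=random):
    """Sample one index from the smallest set whose probability mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.8, 1.5, -3.0]          # toy next-token scores

greedy = max(range(len(logits)), key=lambda i: logits[i])
rng = random.Random(0)
samples = {top_p_sample(logits, rng=rng) for _ in range(200)}

print(greedy)     # 0: argmax is the same on every run
print(samples)    # sampling visits more than one high-probability index
```

Calling a sampling decoder several times, or keeping several beams from beam search, are two common ways to obtain more than one caption per image.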

inference for batch

Can you help me run inference on a batch of images?

Reproducing loss

Hi @rmokady, what a clever approach!

I'm trying this approach on my custom dataset and managed to get training started. I'm working out how to add evaluation code to better monitor training, but in the meantime, I wonder what loss you reached when you stopped training in each mode: training only the prefix, and training both the prefix and GPT?

Can you share how to read the Conceptual Captions dataset?

Can you provide the dataset API?

I'm curious about the training dataset to evaluate "nocaps".

Hello. Thanks for sharing your code!

I'm curious about the training dataset to evaluate "nocaps".

I was wondering if the model (the ClipCap model) was trained on the train split of "MS COCO Karpathy's splits" and evaluated on the validation split of "nocaps".

weight behind clip_prefix_captioning_inference.ipynb and your demo

I found that both your clip_prefix_captioning_inference.ipynb and your demo https://replicate.com/rmokady/clip_prefix_caption worked very well for my images.
I am wondering how you got the weights behind your demo. I mean the specific hyper parameter you used to get it.
For coco mode, I ran python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ with 40 batch_size and all other default arguments. I trained it for many epochs and saved many weights. However, the weights I kept didn't work as well as yours.

Could you please tell me the specific hyperparameters and arguments for the weights behind clip_prefix_captioning_inference.ipynb and your demo https://replicate.com/rmokady/clip_prefix_caption with the coco model? I will be extremely grateful!

Evaluation script for COCO

Hi,

That's an interesting work, thanks for sharing the code!

I see that you already directed several users to the evaluation procedure of Oscar to generate the json file with image captions. I tried that but it requires many adaptations, as the models and the input data are very different. It seems a delicate merge, rather than just running another script.
Am I missing something? Did you use the run_captioning.py script with the evaluation flag?

Thanks

parse_conceptual.py error

I ran parse_conceptual.py, and it ended with the following error:

NotImplementedError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors, or that you (the operator writer) forgot to register a fallback function. Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

Could you please help?

Evaluate problem

This is very nice work. Can you provide the code for evaluating it? I don't know how to evaluate it. Thank you!

Docker run: {"detail":"Not Found"}

I am following the Replicate docs for this model to run on my own machine https://replicate.com/rmokady/clip_prefix_caption

I can run Docker fine with docker run -d -p 5000:5000 --gpus=all r8.im/rmokady/clip_prefix_caption@sha256:d703881e7b50eb009779b7e1e79d394639df0af71be28c4100f88c03fdccdbf0

However when I run curl, all inputs give {"detail":"Not Found"}

E.g.:

$ curl http://localhost:5000/predict -X POST \
  -F image=@red_panda.jpg \
  -F model=conceptual-captions \
  -F use_beam_search=False
{"detail":"Not Found"}

$ curl http://localhost:5000/predict -X POST \
  -F image=@red_panda.jpg \
  -F model=coco \
  -F use_beam_search=False
{"detail":"Not Found"}

I'm inferring that coco/conceptual-captions are the correct model names to use in this context, however I get the same output ({"detail":"Not Found"}) passing other strings (e.g., -F model=example \). Is there something I'm missing/could the docs be made clearer?

Reproducing validation results

Hello,

Thanks for the great work! I was interested in reproducing the transformer network with frozen GPT-2, and achieved slightly lower performance on COCO so far:

Metric	reported	reproduced
Bleu@4	33.53	31.0
METEOR	27.45	27.1
CIDEr	113.08	105.7
SPICE	21.05	20.4

I was wondering if the provided code should be able to reproduce the validation scores or if I am missing something?

Clip text encoder instead of language model (GPT-2)

Hi, thanks for the great work!
Did you by any chance experiment with CLIP's text encoder instead of the pretrained language model?
I realize the CLIP text encoder only learns alignment with visual features, but I'm still curious whether it has any generation ability.

Thanks

prefix size for Vit/ResNet

Hi - I'm having trouble with a prefix_size mismatch on my own finetuned model with CLIP features from ViT-B/32. I'm training just the transformer mapping network, with no GPT-2 finetuning, using the commands given in the README.

My understanding is that CLIP with ViT B-32 uses prefix_size = 512 while ResNet encoders use prefix_size = 640, e.g. Radford et al Visual Transformers Appendix F, and also here in train.py:

CLIP_prefix_caption/train.py

Line 355 in 1ad805a

prefix_dim = 640 if args.is_rn else 512

However, in the prediction script, which I've essentially copied from the transformer notebook (https://github.com/rmokady/CLIP_prefix_caption/blob/main/notebooks/transformer_inference.ipynb)
the prefix size is set to 640.

model = ClipCaptionPrefix(prefix_length, clip_length=40, prefix_size=640,
                                  num_layers=8, mapping_type='transformer')

This worked for me for your pretrained coco model, but not my finetuned model, where I get a dimensionality mismatch between 512/640. Can you help me out here? Should I be using prefix_size = 640 or 512 for training/inference?

Thank you!
Stella

PS: FYI, there are a few typos in the commands in README.md, where num_layres should be num_layers.
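
A note on the mismatch above: the prefix size passed to the model has to equal the width of the CLIP features it receives, both in training and at inference. The sketch below mirrors the quoted train.py line (640 for the ResNet encoder, otherwise 512) and adds a hypothetical check_prefix helper that fails early with a readable message. That the pretrained COCO transformer checkpoint was trained on 640-dimensional features is an inference from the notebook setting, not something confirmed here.

```python
def expected_prefix_size(is_rn: bool) -> int:
    # Mirrors the line quoted from train.py: 640 for the ResNet encoder, else 512.
    return 640 if is_rn else 512

def check_prefix(feature_dim: int, prefix_size: int) -> None:
    """Raise early if the feature width and the model's prefix_size disagree."""
    if feature_dim != prefix_size:
        raise ValueError(
            f"CLIP features are {feature_dim}-d but the model expects {prefix_size}-d"
        )

# ViT-B/32 features (512-d) with a model built for ViT features: passes silently.
check_prefix(feature_dim=512, prefix_size=expected_prefix_size(is_rn=False))

# ViT-B/32 features fed to a model built with prefix_size=640: fails.
try:
    check_prefix(feature_dim=512, prefix_size=640)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Under these assumptions, a model finetuned on ViT-B/32 features would be constructed with prefix_size=512 at inference as well.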

Sharing of pretrained model weights?

Hi, I was wondering if you could share the pretrained model weights, so it is easier to try the model out without first having to train our own?

Object of type MappingType is not JSON serializable

When I ran your code, I got an error with MappingType.
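
The error in the title is what Python's json module raises when it meets a value it cannot encode, such as an Enum member. Assuming MappingType is an Enum (its definition is not shown in this issue, so the member names below are placeholders), one common workaround is to pass a default function that converts Enum members to their values:

```python
import json
from enum import Enum

class MappingType(Enum):       # stand-in definition with assumed member names
    MLP = "mlp"
    Transformer = "transformer"

config = {"prefix_length": 10, "mapping_type": MappingType.Transformer}

def encode_enum(obj):
    """Fallback used by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, Enum):
        return obj.value
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

encoded = json.dumps(config, default=encode_enum)
print(encoded)   # {"prefix_length": 10, "mapping_type": "transformer"}
```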

A question about the x dimension.

When I train only the transformer mapping network, I find that the dimension of x is (40, 512), but prefix_dim = 640. I don't know why this is happening. Is it caused by the extraction of the CLIP features? I hope to get your help, thank you.

The downloaded pretrained weights cannot be imported correctly using the jupyter notebook.

I have downloaded the pretrain weights from Google Drive. But it seems there's something wrong with it. It's not working because of mismatching!

https://drive.google.com/file/d/14pXWwB4Zm82rsDdvbGguLfx9F8aM7ovT/view?usp=sharing
https://drive.google.com/file/d/1IdaBtMSvtyzF0ByVaBHtvM0JYSXRExRX/view?usp=sharing

How can I fix this?

For the transformer (without fine-tuning GPT-2) we provide COCO pretrained model.

prefix_dim = 640 in the pretrained model. But how to translate CLIP's 512 embedding into 640 before forwarding to the net?

FileNotFoundError: [Errno 2] No such file or directory: 'gcloud': 'gcloud'

Has anyone experienced run into this problem? I hope you can help me.

Getting multiple caption outputs

Hi,
First of all, I want to say thanks a lot for this repo!

My question is,
How can I get n captions for an input image, each of similarly high quality to the first one?

For example, if n=3, I'd like to get 3 different captions that describe the same image as well as possible.

Nit: Swapped links in README

Hey, super small minor thing: In your README, you swapped the links for training data and validation data:

Download [training images](http://images.cocodataset.org/zips/val2014.zip) and [validation images](http://images.cocodataset.org/zips/train2014.zip) and unzip (We use Karpathy et el. split).

Obviously a small nit, but also an easy fix in the future.

CUDA out of memory

Hello,

Would you happen to have any tips for training the model? I keep on running into a CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.90 GiB total capacity; 14.75 GiB already allocated; 23.75 MiB free; 14.83 GiB reserved in total by PyTorch) problem despite training on a P-100 with 16GB of RAM. I've tried decreasing the batch size from 40 to 10 with no improvement - in fact it seems to run into a slightly different issue: RuntimeError: CUDA out of memory. Tried to allocate 892.00 MiB (GPU 0; 15.90 GiB total capacity; 14.13 GiB already allocated; 193.75 MiB free; 14.66 GiB reserved in total by PyTorch).

Setting the batch_size to 2 seems to get things running, albeit very slowly. What hardware did you train on?
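
One option sometimes used when only a very small batch fits in memory is gradient accumulation: run several micro-batches, add up their scaled gradients, and update the weights once. It does not make training faster, but it keeps the effective batch size. The toy check below uses a one-parameter model in pure Python to show that the accumulated gradient matches the full-batch gradient. It is an illustration of the idea, not code from train.py.

```python
# Toy check that accumulating gradients over micro-batches matches one large batch.
# Model: y = w * x with squared-error loss; everything here is illustrative.
def grad_mean_sq_error(w, batch):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 3.0 * i + 1.0) for i in range(40)]   # one "batch" of 40 examples
w = 0.5

full = grad_mean_sq_error(w, data)

micro = 2                                               # size that fits in memory
steps = len(data) // micro
accumulated = 0.0
for s in range(steps):
    chunk = data[s * micro:(s + 1) * micro]
    accumulated += grad_mean_sq_error(w, chunk) / steps  # scale each micro-batch

print(abs(full - accumulated) < 1e-6)   # True
```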

have you tried different CLIP models?

Hi @rmokady,
Thank you for your nice work, I learned a lot from it. Since the default CLIP model you are using seems to be the ViT-B32 version, I am wondering if you have tried other visual features e.g. from ViT-L or the resnet models? I can't find it mentioned in the paper. I'm trying to train a similar model at the moment and assume the features extracted from bigger vision encoders would contain more information.

Best, David

Add model to HuggingFace

Hi,

Really cool work! One thing that could be cool (it's a feature that was recently added to HuggingFace Transformers) is the ability to load any custom PyTorch model from the hub as introduced in this PR. The idea would be that you can create a new model on the hub (it's as easy as clicking "new model", then git add ., git commit, git push to upload the weights to the repo on the hub). You can also pass in a custom script defining your model. Next, one can load your custom model as follows:

from transformers import AutoModel

model = AutoModel.from_pretrained("rmokady/clip_prefix_caption", trust_remote_code=True)

This could be easier than hosting the models on Google Drive, and you wouldn't need to define the model again every time someone wants to use it in a notebook :)

Do you mind trying this out?

Kind regards,

Niels

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I'm trying to refactor your model based on transformers, but I'm having a problem: there's always an error somewhere, but I've tried a lot of solutions and I don't have a clue.

class ClipCaptionModel(PreTrainedModel):
    def __init__(self, config):
        super(ClipCaptionModel, self).__init__(config)
        self.prefix_length = config.prefix_length
        self.clip_length = config.clip_length
        self.prefix_size = config.prefix_size
        self.num_layers = config.num_layers
        self.mapping_type = config.mapping_type
        decoder = config.decoder
        self.gpt = GPT2LMHeadModel.from_pretrained('uer/gpt2-chinese-cluecorpussmall')
        self.gpt_embedding_size = self.gpt.transformer.wte.weight.shape[1]
        self.clip_project = TransformerMapper(self.prefix_size, self.gpt_embedding_size, self.prefix_length, self.clip_length, self.num_layers)  # (512,768,10,8)
        print(self.prefix_size, self.gpt_embedding_size, self.prefix_length, self.clip_length, self.num_layers)

    def get_dummy_token(self, batch_size: int, device: torch.device) -> torch.Tensor:
        return torch.zeros(batch_size, self.prefix_length, dtype=torch.int64, device=device)

    def forward(self,
                tokens: torch.Tensor,
                prefix: torch.Tensor,
                mask: Optional[torch.Tensor] = None,
                labels: Optional[torch.Tensor] = None):
        embedding_text = self.gpt.transformer.wte(tokens)
        print(prefix.shape)
        prefix_projections = self.clip_project(prefix).view(-1, self.prefix_length, self.gpt_embedding_size)
        embedding_cat = torch.cat((prefix_projections, embedding_text), dim=1)
        if labels is not None:
            dummy_token = self.get_dummy_token(tokens.shape[0], tokens.device)
            labels = torch.cat((dummy_token, tokens), dim=1)
        out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)
        return out


class ClipCaptionPrefix(ClipCaptionModel):

    def parameters(self, recurse: bool = True):
        return self.clip_project.parameters()

    def train(self, mode: bool = True):
        super(ClipCaptionPrefix, self).train(mode)
        self.gpt.eval()
        return self

Here is the notebook on Colab: https://colab.research.google.com/drive/1sEg9HbDwRPs9_SNVjjsPE_sk449P9Svc#scrollTo=3pP_n5oQrXPg&uniqifier=1

Low variety of outputs.

Heihei, here are my issues:

a) I'm getting a very low variety of outputs with (very) different custom images, e.g. "...sitting on a cellphone", "...with a cellphone", "...a cellphone and a surfboard".

b) The captions don't change on subsequent runs. Is this an issue with the model or with the weights?

c) Although I altered 'entry_length' and 'stop_token' (and also temperature and p), it has no effect on the caption whatsoever.

Does anyone have an idea what I'm doing wrong, or could you point me in the right direction for research?

Thank you in advance and all the best,
Hidéo

Downloading Pretrained Model Without Going Through Colab?

Is there a more direct way to obtain the pretrained weights than by using Colab + Google Drive?

A missing-image error occurred while extracting image features.

Thank you for sharing amazing work.
The following problem occurred when I used parse_coco.py to extract features. I checked the dataset and searched the Internet, but couldn't find coco_val2014_000000116100.jpg. I hope to get your help. Thanks.

A question about applying models to new datasets

Hello, thank you very much for your excellent work! I tried to apply the model to a new dataset, and I processed the dataset into the required data format, but there seem to be some problems with my results.

The results look like: "12935.jpg ThereĠareĠtwoĠgroundtrackfieldsĠinĠtheĠimageĠaboveĠtheĠimageĠabove….".

I have checked that there are no redundant characters in my annotation. Could you please explain the cause? Could it be GPT2? Thank you very much!
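
A likely cause of the "Ġ" characters: GPT-2's byte-level BPE vocabulary writes a leading space as "Ġ" (U+0120), so output that joins raw token strings keeps that marker instead of a space. With the Hugging Face tokenizer, decode() on the token ids or convert_tokens_to_string() on the token strings maps it back. The sketch below is a simplified pure-Python stand-in for that step. tokens_to_text is a hypothetical helper, the token split is made up for illustration, and it only handles plain ASCII text.

```python
# GPT-2's byte-level BPE writes a leading space as the character U+0120.
def tokens_to_text(tokens):
    """Simplified decoder for plain-ASCII tokens; not a full byte-level decoder."""
    return "".join(tokens).replace("\u0120", " ")

# Illustrative token strings; the real split depends on the tokenizer.
raw = ["There", "\u0120are", "\u0120two", "\u0120ground", "track", "fields"]

joined = "".join(raw)            # keeps the U+0120 markers, as in the output above
decoded = tokens_to_text(raw)
print(decoded)                   # There are two groundtrackfields
```

The missing space inside "groundtrackfields" is a separate matter: if the annotation itself has no spaces there, decoding will not add any.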

Not able to train custom data.

I am trying to train on roughly 3000 images on a Google Colab GPU, and it fails with the GPU error below. When I tried with only 50 images, it worked fine. I don't believe it is a GPU capacity issue, since COCO training handles more than 10000 images. I also checked the format of the data and images, and they look fine relative to the COCO data format. Any leads on this are appreciated.

Downloading: 100% 0.99M/0.99M [00:00<00:00, 7.93MB/s]
Downloading: 100% 446k/446k [00:00<00:00, 5.78MB/s]
Downloading: 100% 665/665 [00:00<00:00, 1.13MB/s]
Data size is 3060
Token indices sequence length is longer than the specified maximum sequence length for this model (1140 > 1024). Running this sequence through the model will result in indexing errors
Downloading: 100% 523M/523M [00:07<00:00, 69.9MB/s]
Train both prefix and GPT
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
FutureWarning,

Training epoch 0
coco_prefix: 0% 0/76 [00:00<?, ?it/s]Traceback (most recent call last):
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 370, in <module>
main()
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 366, in main
train(dataset, model, args, output_dir=args.out_dir, output_prefix=args.prefix)
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 314, in train
outputs = model(tokens, prefix, mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/gdrive/MyDrive/CLIP_prefix_caption/train.py", line 233, in forward
out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 1061, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 397, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 332, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 189, in _attn
attn_weights = torch.matmul(query, key.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 430.00 MiB (GPU 0; 14.76 GiB total capacity; 13.54 GiB already allocated; 145.75 MiB free; 13.69 GiB reserved in total by PyTorch)
coco_prefix: 0% 0/76 [00:00<?, ?it/s]
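
One thing worth checking in the log above is the tokenizer warning: a caption of 1140 tokens exceeds GPT-2's 1024-token limit. Attention memory grows roughly with the square of the sequence length, so if batches are padded to a length driven by the longest captions (an assumption about the data pipeline), a single outlier could explain why 50 images work and 3060 do not. The sketch below flags overly long captions before training. Whitespace splitting is a crude stand-in for the GPT-2 tokenizer, and MAX_TOKENS and split_by_length are illustrative names, not values or functions from train.py.

```python
# Flag captions that are far longer than the rest before training.
MAX_TOKENS = 64   # illustrative cap

def split_by_length(entries, max_tokens=MAX_TOKENS):
    """Return (kept, too_long) lists of caption entries."""
    kept, too_long = [], []
    for entry in entries:
        n_tokens = len(entry["caption"].split())   # rough token count
        (kept if n_tokens <= max_tokens else too_long).append(entry)
    return kept, too_long

entries = [
    {"image_id": 1, "caption": "a dog runs on the beach"},
    {"image_id": 2, "caption": "word " * 1140},   # mimics the 1140-token outlier
]
kept, too_long = split_by_length(entries)
print([e["image_id"] for e in kept])       # [1]
print([e["image_id"] for e in too_long])   # [2]
```

Inspecting the entries in too_long usually shows whether they are genuine captions or a parsing problem, such as several captions merged into one field.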
