Dual-Path Convolutional Image-Text Embedding

[Paper] [Slide] ⬅️ I recommend checking the slides first. ⬅️

This repository contains the code for our paper Dual-Path Convolutional Image-Text Embedding. Thank you for your kind attention.

Some News

5 Sep 2021 I love the idea of 'defining yourself by telling what makes you different from others' (exemplar SVM), which is also the spirit of the instance loss.

11 June 2020 People live in a 3D world. We released new person re-ID code, Person Re-identification in the 3D Space, which conducts representation learning in 3D space. You are welcome to check it out.

30 April 2020 We won the AICity Challenge 2020 at CVPR 2020, taking 1st place in the retrieval track 🚗. Check it out here.

01 March 2020 We released a new image retrieval dataset, University-1652, for drone-view target localization and drone navigation 🚁. Its setting is similar to person re-ID. You are welcome to check it out.

What's New: We have updated the paper to a second version, adding more explanation of the mechanism of the proposed instance loss.

Install MatConvNet

I have included my copy of MatConvNet in this repo, so you do not need to download it again. You just need to uncomment and modify some lines in gpu_compile.m and run it in MATLAB. Try it~ (The code does not support cuDNN 6.0. You may just turn off Enablecudnn or try cuDNN 5.1.)

If compilation fails, you may refer to http://www.vlfeat.org/matconvnet/install/
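If you prefer to call MatConvNet's compiler directly instead of gpu_compile.m, the sketch below shows a typical GPU build. The directory name and the CUDA/cuDNN paths are assumptions and must be adapted to your machine; gpu_compile.m in this repo plays the same role.

% A minimal GPU-build sketch using MatConvNet's vl_compilenn
% (directory name and CUDA/cuDNN paths below are assumptions; adapt them).
cd matconvnet;                                   % assumed location of the bundled MatConvNet copy
addpath matlab;                                  % add MatConvNet's matlab/ folder to the path
vl_compilenn('enableGpu', true, ...
             'cudaRoot', '/usr/local/cuda', ...  % assumed CUDA install path
             'enableCudnn', true, ...            % set to false if cuDNN gives trouble (e.g. cuDNN 6.0)
             'cudnnRoot', 'local/cudnn');        % assumed cuDNN 5.1 location
vl_testnn('gpu', true);                          % optional sanity check of the compiled binaries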

Preprocess Datasets

  1. Extract the word2vec weights. Follow the instructions in ./word2vector_matlab.

  2. Preprocess the dataset. Follow the instructions in ./dataset. You can choose one dataset to run; the three datasets need different preprocessing. Instructions are provided for Flickr30k, MSCOCO, and CUHK-PEDES.

  3. Download the model pre-trained on ImageNet and put it into './data':

wget http://www.vlfeat.org/matconvnet/models/imagenet-resnet-50-dag.mat

Alternatively, you may try VGG16 or VGG19.
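As a quick sanity check (a sketch, assuming the file was saved under ./data as above), you can load the downloaded DAG model in MATLAB before training:

% Load the ImageNet-pretrained ResNet-50 DAG model and switch it to test mode.
net = dagnn.DagNN.loadobj(load('data/imagenet-resnet-50-dag.mat'));
net.mode = 'test';
fprintf('Loaded %d layers and %d parameter tensors.\n', numel(net.layers), numel(net.params));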

Your split may differ from mine. (Sorry, this is my fault; I used a random split.) Just as a backup, this is the dictionary archive used in the paper.

Trained Models

You may download the three trained models from GoogleDrive (a newer GoogleDrive link is also available).

Train

  • For Flickr30k, run train_flickr_word2_1_pool.m for Stage I training (a minimal sketch of a full two-stage run follows this list).

Run train_flickr_word_Rankloss_shift_hard for Stage II training.

  • For MSCOCO, run train_coco_word2_1_pool.m for Stage I training.

Run train_coco_Rankloss_shift_hard.m for Stage II training.

  • For CUHK-PEDES, run train_cuhk_word2_1_pool.m for Stage I training.

Run train_cuhk_word_Rankloss_shift for Stage II training.
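As a minimal sketch, a full Flickr30k run simply executes the two stages in order from the MATLAB prompt; the other two datasets follow the same pattern with their own scripts, and the checkpoint path used by the Stage II script is an assumption that may need editing to point at the Stage I output.

% Stage I: train the image and text CNNs with the instance (classification) loss.
train_flickr_word2_1_pool;
% Stage II: fine-tune with the ranking loss, starting from the Stage I model
% (check the model path inside the script before running).
train_flickr_word_Rankloss_shift_hard;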

Test

Select one model and have fun!

  • For Flickr30k, run test/extract_pic_feature_word2_plus_52.m to extract the features from the images and text. Note that you need to change the model path in the code.

  • For MSCOCO, run test_coco/extract_pic_feature_word2_plus.m to extract the features from the images and text. Note that you need to change the model path in the code.

  • For CUHK-PEDES, run test_cuhk/extract_pic_feature_word2_plus_52.m to extract the features from the images and text. Note that you need to change the model path in the code. (A minimal retrieval sketch based on the extracted features follows this list.)
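The extraction scripts above produce one feature vector per image and per sentence. The sketch below ranks gallery images for each text query by cosine similarity; it is not the repository's own evaluation code, and the variable names img_feat, txt_feat, and ground_truth are placeholders for whatever the scripts save.

% Rank gallery images for every text query by cosine similarity (sketch).
% img_feat:     N_img x D matrix of image features   (placeholder name).
% txt_feat:     N_txt x D matrix of text features    (placeholder name).
% ground_truth: N_txt x 1 correct gallery index per query (placeholder name).
img_feat = bsxfun(@rdivide, img_feat, sqrt(sum(img_feat.^2, 2)));  % L2-normalise rows
txt_feat = bsxfun(@rdivide, txt_feat, sqrt(sum(txt_feat.^2, 2)));
score = txt_feat * img_feat';                        % N_txt x N_img cosine similarities
[~, ranking] = sort(score, 2, 'descend');            % ranking(i,:) = gallery order for query i
rank1 = mean(ranking(:, 1) == ground_truth(:));      % Recall@1 for text-to-image retrieval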

Checklist

  • Get word2vec weight

  • Data Preparation (Flickr30k)

  • Train on Flickr30k

  • Test on Flickr30k

  • Data Preparation (MSCOCO)

  • Train on MSCOCO

  • Test on MSCOCO

  • Data Preparation (CUHK-PEDES)

  • Train on CUHK-PEDES

  • Test on CUHK-PEDES

  • Run the code on another machine

Citation

@article{zheng2017dual,
  title={Dual-Path Convolutional Image-Text Embeddings with Instance Loss},
  author={Zheng, Zhedong and Zheng, Liang and Garrett, Michael and Yang, Yi and Xu, Mingliang and Shen, Yi-Dong},
  journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
  doi={10.1145/3383184},
  note={\mbox{doi}:\url{10.1145/3383184}},
  volume={16},
  number={2},
  pages={1--23},
  year={2020},
  publisher={ACM New York, NY, USA}
}

image-text-embedding's Issues

Error in train_flickr_word2_1_pool.m

Hi, I was following your instructions from the README, but while executing train_flickr_word2_1_pool.m an error appeared and I have no idea how to deal with it. Can you please help me?
Error description:

>> train_flickr_word2_1_pool
cnn_train_dag: resetting GPU

ans = 

  CUDADevice with properties:

                      Name: 'GeForce RTX 2060'
                     Index: 1
         ComputeCapability: '7.5'
            SupportsDouble: 1
             DriverVersion: 10.1000
            ToolkitVersion: 8
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 6.2228e+09
           AvailableMemory: 5.5524e+09
       MultiprocessorCount: 30
              ClockRateKHz: 1200000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

train: epoch 01:   1/253:Error using vl_nnconv
The FILTERS depth does not divide the DATA depth.

Error in dagnn.Conv/forward (line 12)
      outputs{1} = vl_nnconv(...

Error in dagnn.Layer/forwardAdvanced (line 85)
      outputs = obj.forward(inputs, {net.params(par).value}) ;

Error in dagnn.DagNN/eval (line 91)
  obj.layers(l).block.forwardAdvanced(obj.layers(l)) ;

Error in cnn_train_dag>processEpoch (line 222)
      net.eval(inputs, params.derOutputs, 'holdOn', s < params.numSubBatches) ;

Error in cnn_train_dag (line 90)
    [net, state] = processEpoch(net, state, params, 'train',opts) ;

Error in train_flickr_word2_1_pool (line 39)
[net,info] = cnn_train_dag(net, imdb, @getBatch,opts) ;

High loss for Text CNN in Stage 1 and COCO dataset questions

Hey layumi, I am trying to replicate your results for MSCOCO in TensorFlow, and I have some questions about data processing and the loss:

  1. At the end of Stage 1, my text CNN loss ('objective_txt') is high, around 5.5. What loss did you get at the end of Stage 1?

  2. In dataset/MSCOCO-prepare/prepare_wordcnn_feature2.m you create
    wordcnn = zeros(32,611765,'int16')
    and then loop over all the captions in MSCOCO, but there are 616,767 captions in MSCOCO. What is the reason for this 5,002 difference? It throws an out-of-range error when I implement it in TensorFlow, because there are more captions than columns in the wordcnn matrix.

  3. The coco_dictionary.mat dimension is 29972 in your code, but my dimensions are different. I wonder if this is why the loss is high, or whether it is because TensorFlow uses a different random generator than MATLAB. Any suggestions would be great.

Thank you!

The loss of the image CNN in training Stage 2 on the CUHK-PEDES dataset suddenly increases a lot

Hi layumi, I ran train_cuhk_Rankloss_shift.m after running train_cuhk_word2_1_pool.m, but after about 40 epochs I found the training result of the image CNN to be bad (Fig. 1: stage2 screenshot). I didn't change any hyperparameters. I found that the parameters of the first few layers of the image CNN became NaN after about 40 epochs. Do you know why this happens?
Before that, the training result looked good when running train_cuhk_word2_1_pool.m for Stage 1 (Fig. 2: stage1 screenshot).

train_cuhk_word2_1_pool.m requires an rgbMean value, and there is a dimension mismatch in subset features

Hello
I was trying to execute your code as I required it urgently:

  1. First, there is a dimension mismatch in subset.features, which is required by line 275 of the file cukh_word2_pool.m:
    net.params(first).value = reshape(single(subset.features'),1,1,7263,300);

However, the subset.features produced by make_dictionary.m at line 29 has dimension 300x7264:
subset.features = w_features(:,sub);
How can this be solved?

Also, you have commented out lines 273 and 274 in the file cukh_word2_pool.m; should they be uncommented?
%m = mean(subset.features,2);
%subset.features = subset.features-repmat(m,1,20074);

  2. Next, when I try to run the training file train_cuhk_word2_1_pool.m, I get the error:
    Reference to non-existent field 'rgbMean'.
    Error in train_cuhk_word2_1_pool (line 16)
    im_mean = imdb.rgbMean;

In prepare_imdb.m, you have commented out lines 24 and 25, which set this variable; should these be uncommented?
%resize_image; % resize image to 256x256, return mm.
%imdb.rgbMean = mm;

CUHK results drop compared to reported numbers.

txt-image rank-1:0.384178 mAP:0.354052 Medr:3.000000
txt-image rank-5:0.610136 rank-10:0.703866

This is about 5-6% lower than reported. I made no changes to the parameters; any ideas what could be the reason for the performance drop?

Trained on Ubuntu 16.04 LTS, MATLAB 2015b (data preprocessed on another machine with a newer version for jsondecode support) + cuDNN 5.0 with a Titan Xp.

How to avoid overfitting

Hello Zhedong, thanks for sharing such good work. I want to reproduce it in PyTorch, but unfortunately I ran into an overfitting problem.
To get results quickly, I randomly chose 10,000 samples as training data, 1,000 as validation data, and 1,000 as test data. I got about 100% recall@5 on the training set but only about half of that on the validation data.
I'm new to image-text embedding; could you share some possible solutions? I suspect the following factors may be relevant:

  1. Data normalization. I don't compute the mean and variance of the training data explicitly; I just divide by 255, subtract 0.5, and then divide by 0.5.

  2. L2 regularization. I just use a regularization strength of 1e-5.

  3. The complexity of the classifier. After the feature generator, I add a classifier with a softmax layer directly. Would more fully connected layers slow down the fitting of the training set?

Finally, I want to ask how to mine hard triplets online efficiently in PyTorch.

Thanks.

Extracting text for a single image

Sorry for duplicating issue #1, but can you please explain how I can extract text for a single image given as input? It is not clear to me what steps I need to take to get a text description of a single image.

Also, I was wondering whether I can extract text for an external image, i.e. an image that was not included in the train and val image sets.

I would really appreciate any help.

Ask you a question

I recently read your paper and ran into a problem while running the code, so I would like to ask for your help.
The following problem occurs when splitting the dataset in Image-Text-Embedding:
Index exceeds matrix dimensions.

Error in prepare_imdb (line 24)
Images = fullfile(imdb.images.data(train(1:end))) ;

The index goes out of bounds, and I don't know what's wrong.

Problem training on MSCOCO

I am trying to train the model on MSCOCO and run into the following issues:

1- When running 'train_coco_word2_1_pool.m' as you suggest, I get the error that the function 'coco_word2_pool_no_w2v' does not exist.

2- I therefore changed it to 'coco_word2_pool', since this function is indeed in the directory (is this what you meant?). Then I get the following error:

Error using reshape
To RESHAPE the number of elements must not change.

Error in coco_word2_pool (line 273)
net.params(first).value = reshape(single(subset.features'),1,1,29972,300);

Error in train_coco_word2_1_pool (line 17)
net = coco_word2_pool();

3- I experience the same issue when running 'train_coco_word2_1_pool_vgg19.m'

4- The reason I am training is that I want to reproduce your results, but the 20 epochs in your pretrained model don't seem to be enough. Are these the parameters that you used to report the test results?

I am running the code on a MacBook Pro with MATLAB R2018b.
Thank you in advance.

The loss on the COCO dataset in training Stage 1 doesn't decrease

Hi, I ran train_coco_word2_1_pool.m, but after more than 10 epochs the training result is still bad (see the attached screenshot). I didn't change any hyperparameters, and I don't know why it doesn't work.
What could cause this result? Also, the learning rate in your code is 0.1, but it is reported as 0.001 in the paper. Which learning rate is correct and better for this task?
