Dual-Path Convolutional Image-Text Embedding

[Paper] [Slide] ⬅️ I recommend checking the slides first. ⬅️

This repository contains the code for our paper Dual-Path Convolutional Image-Text Embedding. Thank you for your kind attention.

Some News

5 Sep 2021 I love the idea of 'defining yourself by telling what makes you different from others' (exemplar SVM), which is also the spirit of the instance loss.

11 June 2020 People live in a 3D world. We released new person re-ID code, Person Re-identification in the 3D Space, which conducts representation learning in 3D space. You are welcome to check it out.

30 April 2020 We won the AICity Challenge 2020 at CVPR 2020, taking 1st place in the retrieval track 🚗. Check it out here.

01 March 2020 We released a new image retrieval dataset, University-1652, for drone-view target localization and drone navigation 🚁. Its setting is similar to person re-ID. You are welcome to check it out.

What's New: We have updated the paper to a second version, adding more explanation of the mechanism of the proposed instance loss.

Install MatConvNet

I have included my copy of MatConvNet in this repo, so you do not need to download it again. You just need to uncomment and modify some lines in gpu_compile.m and run it in MATLAB. Try it~ (The code does not support cuDNN 6.0. You may just turn off Enablecudnn or try cuDNN 5.1.)

If compilation fails, you may refer to http://www.vlfeat.org/matconvnet/install/
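If you prefer to call MatConvNet's compiler directly instead of gpu_compile.m, the sketch below shows a typical GPU build. The directory name and the CUDA/cuDNN paths are assumptions and must be adapted to your machine; gpu_compile.m in this repo plays the same role.

% A minimal GPU-build sketch using MatConvNet's vl_compilenn
% (directory name and CUDA/cuDNN paths below are assumptions; adapt them).
cd matconvnet;                                   % assumed location of the bundled MatConvNet copy
addpath matlab;                                  % add MatConvNet's matlab/ folder to the path
vl_compilenn('enableGpu', true, ...
             'cudaRoot', '/usr/local/cuda', ...  % assumed CUDA install path
             'enableCudnn', true, ...            % set to false if cuDNN gives trouble (e.g. cuDNN 6.0)
             'cudnnRoot', 'local/cudnn');        % assumed cuDNN 5.1 location
vl_testnn('gpu', true);                          % optional sanity check of the compiled binaries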

Preprocess Datasets

  1. Extract the word2vec weights. Follow the instructions in ./word2vector_matlab.

  2. Preprocess the dataset. Follow the instructions in ./dataset. You can choose one dataset to run; the three datasets need different preprocessing. Instructions are provided for Flickr30k, MSCOCO, and CUHK-PEDES.

  3. Download the model pre-trained on ImageNet and put it into './data':

wget http://www.vlfeat.org/matconvnet/models/imagenet-resnet-50-dag.mat

Alternatively, you may try VGG16 or VGG19.
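As a quick sanity check (a sketch, assuming the file was saved under ./data as above), you can load the downloaded DAG model in MATLAB before training:

% Load the ImageNet-pretrained ResNet-50 DAG model and switch it to test mode.
net = dagnn.DagNN.loadobj(load('data/imagenet-resnet-50-dag.mat'));
net.mode = 'test';
fprintf('Loaded %d layers and %d parameter tensors.\n', numel(net.layers), numel(net.params));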

Your split may differ from mine. (Sorry, this is my fault; I used a random split.) Just as a backup, this is the dictionary archive used in the paper.

Trained Models

You may download the three trained models from GoogleDrive (a newer GoogleDrive link is also available).

Train

  • For Flickr30k, run train_flickr_word2_1_pool.m for Stage I training (a minimal sketch of a full two-stage run follows this list).

Run train_flickr_word_Rankloss_shift_hard for Stage II training.

  • For MSCOCO, run train_coco_word2_1_pool.m for Stage I training.

Run train_coco_Rankloss_shift_hard.m for Stage II training.

  • For CUHK-PEDES, run train_cuhk_word2_1_pool.m for Stage I training.

Run train_cuhk_word_Rankloss_shift for Stage II training.
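As a minimal sketch, a full Flickr30k run simply executes the two stages in order from the MATLAB prompt; the other two datasets follow the same pattern with their own scripts, and the checkpoint path used by the Stage II script is an assumption that may need editing to point at the Stage I output.

% Stage I: train the image and text CNNs with the instance (classification) loss.
train_flickr_word2_1_pool;
% Stage II: fine-tune with the ranking loss, starting from the Stage I model
% (check the model path inside the script before running).
train_flickr_word_Rankloss_shift_hard;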

Test

Select one model and have fun!

  • For Flickr30k, run test/extract_pic_feature_word2_plus_52.m to extract the features from the images and text. Note that you need to change the model path in the code.

  • For MSCOCO, run test_coco/extract_pic_feature_word2_plus.m to extract the features from the images and text. Note that you need to change the model path in the code.

  • For CUHK-PEDES, run test_cuhk/extract_pic_feature_word2_plus_52.m to extract the features from the images and text. Note that you need to change the model path in the code. (A minimal retrieval sketch based on the extracted features follows this list.)
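The extraction scripts above produce one feature vector per image and per sentence. The sketch below ranks gallery images for each text query by cosine similarity; it is not the repository's own evaluation code, and the variable names img_feat, txt_feat, and ground_truth are placeholders for whatever the scripts save.

% Rank gallery images for every text query by cosine similarity (sketch).
% img_feat:     N_img x D matrix of image features   (placeholder name).
% txt_feat:     N_txt x D matrix of text features    (placeholder name).
% ground_truth: N_txt x 1 correct gallery index per query (placeholder name).
img_feat = bsxfun(@rdivide, img_feat, sqrt(sum(img_feat.^2, 2)));  % L2-normalise rows
txt_feat = bsxfun(@rdivide, txt_feat, sqrt(sum(txt_feat.^2, 2)));
score = txt_feat * img_feat';                        % N_txt x N_img cosine similarities
[~, ranking] = sort(score, 2, 'descend');            % ranking(i,:) = gallery order for query i
rank1 = mean(ranking(:, 1) == ground_truth(:));      % Recall@1 for text-to-image retrieval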

Checklist

  • Get word2vec weight

  • Data Preparation (Flickr30k)

  • Train on Flickr30k

  • Test on Flickr30k

  • Data Preparation (MSCOCO)

  • Train on MSCOCO

  • Test on MSCOCO

  • Data Preparation (CUHK-PEDES)

  • Train on CUHK-PEDES

  • Test on CUHK-PEDES

  • Run the code on another machine

Citation

@article{zheng2017dual,
  title={Dual-Path Convolutional Image-Text Embeddings with Instance Loss},
  author={Zheng, Zhedong and Zheng, Liang and Garrett, Michael and Yang, Yi and Xu, Mingliang and Shen, Yi-Dong},
  journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
  doi={10.1145/3383184},
  note={\mbox{doi}:\url{10.1145/3383184}},
  volume={16},
  number={2},
  pages={1--23},
  year={2020},
  publisher={ACM New York, NY, USA}
}

image-text-embedding's Issues

Error in train_flickr_word2_1_pool.m

Hi, I was following your instructions from the README, but while executing train_flickr_word2_1_pool.m an error appeared and I have no idea how to deal with it. Can you please help me?
Error description:

>> train_flickr_word2_1_pool
cnn_train_dag: resetting GPU

ans = 

  CUDADevice with properties:

                      Name: 'GeForce RTX 2060'
                     Index: 1
         ComputeCapability: '7.5'
            SupportsDouble: 1
             DriverVersion: 10.1000
            ToolkitVersion: 8
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 6.2228e+09
           AvailableMemory: 5.5524e+09
       MultiprocessorCount: 30
              ClockRateKHz: 1200000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

train: epoch 01:   1/253:Error using vl_nnconv
The FILTERS depth does not divide the DATA depth.

Error in dagnn.Conv/forward (line 12)
      outputs{1} = vl_nnconv(...

Error in dagnn.Layer/forwardAdvanced (line 85)
      outputs = obj.forward(inputs, {net.params(par).value}) ;

Error in dagnn.DagNN/eval (line 91)
  obj.layers(l).block.forwardAdvanced(obj.layers(l)) ;

Error in cnn_train_dag>processEpoch (line 222)
      net.eval(inputs, params.derOutputs, 'holdOn', s < params.numSubBatches) ;

Error in cnn_train_dag (line 90)
    [net, state] = processEpoch(net, state, params, 'train',opts) ;

Error in train_flickr_word2_1_pool (line 39)
[net,info] = cnn_train_dag(net, imdb, @getBatch,opts) ;

High loss for Text CNN in Stage 1 and COCO dataset questions

Hey layumi, I am trying to replicate your results for MSCOCO in TensorFlow, and I have some questions about data processing and the loss:

  1. At the end of Stage 1, my text CNN loss ('objective_txt') is high, around 5.5. What loss did you get at the end of Stage 1?

  2. In dataset/MSCOCO-prepare/prepare_wordcnn_feature2.m you create
    wordcnn = zeros(32,611765,'int16')
    and then loop over all the captions in MSCOCO, but there are 616,767 captions in MSCOCO. What is the reason for this 5,002 difference? It throws an out-of-range error when I implement it in TensorFlow, because there are more captions than columns in the wordcnn matrix.

  3. The coco_dictionary.mat dimension is 29972 in your code, but my dimensions are different. I wonder if this is why the loss is high, or whether it is because TensorFlow uses a different random generator than MATLAB. Any suggestions would be great.

Thank you!

The loss of the image CNN in training Stage 2 on the CUHK-PEDES dataset suddenly increases a lot

Hi layumi, I ran train_cuhk_Rankloss_shift.m after running train_cuhk_word2_1_pool.m, but after about 40 epochs I found the training result of the image CNN to be bad (Fig. 1: stage2 screenshot). I didn't change any hyperparameters. I found that the parameters of the first few layers of the image CNN became NaN after about 40 epochs. Do you know why this happens?
Before that, the training result looked good when running train_cuhk_word2_1_pool.m for Stage 1 (Fig. 2: stage1 screenshot).

train_cuhk_word2_1_pool.m requires an rgbMean value, and there is a dimension mismatch in subset features

Hello
I was trying to execute your code as I required it urgently:

  1. First, there is a dimension mismatch in subset.features, which is required by line 275 of the file cukh_word2_pool.m:
    net.params(first).value = reshape(single(subset.features'),1,1,7263,300);

However, the subset.features produced by make_dictionary.m at line 29 has dimension 300x7264:
subset.features = w_features(:,sub);
How can this be solved?

Also, you have commented out lines 273 and 274 in the file cukh_word2_pool.m; should they be uncommented?
%m = mean(subset.features,2);
%subset.features = subset.features-repmat(m,1,20074);

  2. Next, when I try to run the training file train_cuhk_word2_1_pool.m, I get the error:
    Reference to non-existent field 'rgbMean'.
    Error in train_cuhk_word2_1_pool (line 16)
    im_mean = imdb.rgbMean;

In prepare_imdb.m, you have commented out lines 24 and 25, which set this variable; should these be uncommented?
%resize_image; % resize image to 256x256, return mm.
%imdb.rgbMean = mm;

CUHK results drop compared to reported numbers.

txt-image rank-1:0.384178 mAP:0.354052 Medr:3.000000
txt-image rank-5:0.610136 rank-10:0.703866

This is about 5-6% lower than reported. I made no changes to the parameters; any ideas what could be the reason for the performance drop?

Trained on Ubuntu 16.04 LTS, MATLAB 2015b (data preprocessed on another machine with a newer version for jsondecode support) + cuDNN 5.0 with a Titan Xp.

How to avoid overfitting

Hello Zhedong, thanks for sharing such good work. I want to reproduce it in PyTorch, but unfortunately I ran into an overfitting problem.
To get results quickly, I randomly chose 10,000 samples as training data, 1,000 as validation data, and 1,000 as test data. I got about 100% recall@5 on the training set but only about half of that on the validation data.
I'm new to image-text embedding; could you share some possible solutions? I suspect the following factors may be relevant:

  1. Data normalization. I don't compute the mean and variance of the training data explicitly; I just divide by 255, subtract 0.5, and then divide by 0.5.

  2. L2 regularization. I just use a regularization strength of 1e-5.

  3. The complexity of the classifier. After the feature generator, I add a classifier with a softmax layer directly. Would more fully connected layers slow down the fitting of the training set?

Finally, I want to ask how to mine hard triplets online efficiently in PyTorch.

Thanks.

Extracting text for a single image

Sorry for duplicating issue #1, but can you please explain how I can extract text for a single image given as input? It is not clear to me what steps I need to take to get a text description of a single image.

Also, I was wondering whether I can extract text for an external image, i.e. an image that was not included in the train and val image sets.

I would really appreciate any help.

Ask you a question

I recently read your paper and ran into a problem while running the code, so I would like to ask for your help.
The following problem occurs when splitting the dataset in Image-Text-Embedding:
Index exceeds matrix dimensions.

Error in prepare_imdb (line 24)
Images = fullfile(imdb.images.data(train(1:end))) ;

The index goes out of bounds, and I don't know what's wrong.

Problem training on MSCOCO

I am trying to train the model on MSCOCO and run into the following issues:

1- When running 'train_coco_word2_1_pool.m' as you suggest, I get the error that the function 'coco_word2_pool_no_w2v' does not exist.

2- I therefore changed it to 'coco_word2_pool', since this function is indeed in the directory (is this what you meant?). Then I get the following error:

Error using reshape
To RESHAPE the number of elements must not change.

Error in coco_word2_pool (line 273)
net.params(first).value = reshape(single(subset.features'),1,1,29972,300);

Error in train_coco_word2_1_pool (line 17)
net = coco_word2_pool();

3- I experience the same issue when running 'train_coco_word2_1_pool_vgg19.m'

4- The reason I am training is that I want to reproduce your results, but the 20 epochs in your pretrained model don't seem to be enough. Are these the parameters that you used to report the test results?

I am running the code on a MacBook Pro with MATLAB R2018b.
Thank you in advance.

The loss on the COCO dataset in training Stage 1 doesn't decrease

Hi, I ran train_coco_word2_1_pool.m, but after more than 10 epochs the training result is still bad (see the attached screenshot). I didn't change any hyperparameters, and I don't know why it doesn't work.
What could cause this result? Also, the learning rate in your code is 0.1, but it is reported as 0.001 in the paper. Which learning rate is correct and better for this task?
