
Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

License: MIT

Official PyTorch implementation of the paper Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching. Our codebase is built on the implementation of GPO.

Motivation

Illustration of motivation. (a) For mapped visual region and textual word features in the $d$-dimensional shared representation space, their relation can be represented as a dimensional semantic correspondence vector; the existing paradigm typically aggregates all dimensions independently by default to compose the word-region semantic similarity. Yet, as we investigated in the state-of-the-art model NAAF, the dimensions of that shared space are not mutually independent: some dimensions show a significant tendency, i.e., high statistical co-occurrence probability, to jointly represent specific semantics, e.g., (b) for dog and (c) for man.

Aggregation comparison. Dimensional correspondences with mutual dependencies are marked in the same color. Existing aggregation completely ignores this intrinsic information, which likely leads to limitations, whereas our key idea is to mine and leverage it.
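As a concrete reading of the figure, the snippet below shows the default independent aggregation in a minimal PyTorch form. The element-wise product is used here as one common realization of the dimensional correspondence vector; summing it over dimensions (i.e., the inner product) is the independent aggregation that the figure argues ignores cross-dimensional dependencies.

```python
import torch

d = 1024                      # dimensionality of the shared embedding space
region = torch.randn(d)       # a mapped visual region feature
word = torch.randn(d)         # a mapped textual word feature

# Dimensional semantic correspondence vector: one correspondence score per dimension.
corr = region * word          # shape (d,)

# Existing paradigm: aggregate all dimensions independently, i.e., a plain sum,
# which is exactly the inner product between the region and word features.
# Any dependency between dimensions (the same-colored entries in the figure)
# is ignored by this aggregation.
sim = corr.sum()
print(sim.item())
```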

Introduction

In this paper, we are motivated by an insightful finding that dimensions are not mutually independent; rather, there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information likely leads to suboptimal aggregation of semantic similarity and impairs cross-modal matching learning. To solve this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. X-Dim (1) designs a generalized framework to learn dimensions' semantic dependency degrees, and (2) devises adaptive sparse probabilistic learning that lets the model autonomously capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, achieving 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks.
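The actual X-Dim modules are defined in this repo's code; the sketch below is only an illustrative stand-in for the idea described above, showing how a learnable dependency matrix over dimensions, with a generic entropy-based sparsity penalty in place of the paper's adaptive sparse probabilistic learning, could reweight the correspondence vector before aggregation. The class name, softmax normalization, and penalty are assumptions made for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class DependencyAwareAggregation(nn.Module):
    """Illustrative module: a learnable dimension-dependency matrix that
    reweights the correspondence vector before aggregation. A sketch,
    not the X-Dim architecture."""

    def __init__(self, dim: int):
        super().__init__()
        # Learnable dependency logits between every pair of dimensions.
        self.dep_logits = nn.Parameter(torch.zeros(dim, dim))

    def forward(self, region: torch.Tensor, word: torch.Tensor) -> torch.Tensor:
        corr = region * word                        # (batch, dim) correspondence vectors
        dep = torch.softmax(self.dep_logits, -1)    # row-normalized dependency degrees
        return (corr @ dep.T).sum(dim=-1)           # dependency-weighted similarity

    def sparsity_loss(self) -> torch.Tensor:
        # Entropy penalty: low row entropy means each dimension depends on only a
        # few others (a generic stand-in for the adaptive sparse probabilistic
        # learning described in the paper).
        dep = torch.softmax(self.dep_logits, -1)
        return -(dep * torch.log(dep + 1e-8)).sum(dim=-1).mean()

# Usage: similarity scores for a batch of region/word pairs, plus the sparsity
# term that would be added to the matching objective during training.
agg = DependencyAwareAggregation(dim=1024)
region, word = torch.randn(8, 1024), torch.randn(8, 1024)
sim = agg(region, word)            # shape (8,)
reg = 0.01 * agg.sparsity_loss()   # add to the matching loss
```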

Image-text Matching Results

The following tables show partial image-text retrieval results on the MS-COCO and Flickr30K datasets, covering both image-to-text and text-to-image retrieval. In these experiments, we use BERT-base as the text encoder for our method. This branch provides our code and pre-trained models for using BERT as the text backbone. Some results are better than those reported in the paper.

Results on MS-COCO (1K)

| Method | Visual Backbone | Text Backbone | Image-to-Text R@1 / R@5 / R@10 | Text-to-Image R@1 / R@5 / R@10 | Rsum | Link |
|--------|-----------------|---------------|--------------------------------|--------------------------------|------|------|
| X-Dim  | BUTD region     | BERT-base     | 82.6 / 97.1 / 99.0             | 67.4 / 92.5 / 96.8             | 535.4 | Here |

Results on Flickr30K

| Method | Visual Backbone | Text Backbone | Image-to-Text R@1 / R@5 / R@10 | Text-to-Image R@1 / R@5 / R@10 | Rsum | Link |
|--------|-----------------|---------------|--------------------------------|--------------------------------|------|------|
| X-Dim  | BUTD region     | BERT-base     | 83.5 / 96.9 / 98.0             | 67.5 / 89.1 / 93.3             | 528.2 | Here |
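Rsum is the sum of the six recall values (R@1, R@5, R@10 for both retrieval directions). The quick check below reproduces the MS-COCO (1K) entry from the table above.

```python
# Rsum = sum of R@1, R@5, R@10 over both retrieval directions.
image_to_text = [82.6, 97.1, 99.0]   # R@1, R@5, R@10 on MS-COCO (1K)
text_to_image = [67.4, 92.5, 96.8]
rsum = sum(image_to_text) + sum(text_to_image)
print(round(rsum, 1))                # 535.4, matching the table
```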

Preparation

Environment

We recommend the following dependencies.

Data

You can download the datasets through Baidu Cloud. The download links are Flickr30K and MSCOCO, and the extraction code is: USTC.

Training

sh train_region_f30k.sh
sh train_region_coco.sh

Evaluation

Test on Flickr30K

python test.py

To perform 5-fold cross-validation on MSCOCO, pass fold5=True and use a model trained with --data_name coco_precomp.

python testall.py
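Since this codebase builds on GPO, testall.py presumably wraps an evaluation helper. The sketch below shows what the 5-fold MSCOCO evaluation call might look like; the module path, function name, checkpoint path, and data path are assumptions based on the GPO-style interface, so check this repo's own test scripts for the actual entry point.

```python
# Hedged sketch: assumes a GPO-style evaluation interface (evalrank with a
# fold5 flag). Verify the actual entry point in this repo's test scripts.
from lib import evaluation  # assumed module layout, as in the GPO codebase

MODEL_PATH = "runs/coco_butd_region_bert/model_best.pth"  # hypothetical checkpoint path
DATA_PATH = "data/"                                       # hypothetical data root

# fold5=True evaluates the MSCOCO 5K test set as five 1K folds and reports
# the averaged 1K results alongside the full 5K results.
evaluation.evalrank(MODEL_PATH, data_path=DATA_PATH, split="testall", fold5=True)
```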

Please use the following BibTeX entry to cite this paper if you use any resources from this repo.

@inproceedings{zhang2023unlocking,
  title={Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching},
  author={Zhang, Kun and Zhang, Lei and Hu, Bo and Zhu, Mengxiao and Mao, Zhendong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={4828--4837},
  year={2023}
}
