
vln-gela's Introduction

Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

This repository is the official implementation of Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation (ICCV 2023 Oral).

Cross-modal alignment is a key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or a single sub-instruction to the corresponding trajectory. However, another critical problem, achieving fine-grained alignment at the entity level, is seldom considered. To address it, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To realize this adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and with dialogue instructions (CVDN). Comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.

(Figure: overview of the GELA adaptive pre-training framework)
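
As a rough illustration of how the three objectives fit together, the sketch below combines them into a single adaptive pre-training loss. This is a minimal sketch with illustrative names and loss choices, not the repository's implementation:

import torch
import torch.nn.functional as F

def gela_adaptive_loss(span_logits, span_targets,       # entity phrase prediction
                       bbox_preds, bbox_targets,        # landmark bbox prediction
                       phrase_emb, landmark_emb,        # entity-landmark alignment
                       w_epp=1.0, w_lbp=1.0, w_ela=1.0):
    """Hypothetical combination of the three GELA objectives.

    Shapes are illustrative: span_logits [N, num_tokens] with integer
    span_targets [N]; bbox_preds/bbox_targets [N, 4]; phrase_emb and
    landmark_emb [N, D] with matched pairs at the same index.
    """
    # 1) Entity phrase prediction: classify which instruction tokens
    #    form the grounded entity phrase.
    epp_loss = F.cross_entropy(span_logits, span_targets)

    # 2) Landmark bounding box prediction: regress the landmark's box
    #    in the panorama (L1 regression is one common choice).
    lbp_loss = F.l1_loss(bbox_preds, bbox_targets)

    # 3) Entity-landmark semantic alignment: a contrastive-style loss
    #    that pulls each phrase toward its matched landmark embedding.
    sim = phrase_emb @ landmark_emb.t()                   # [N, N]
    targets = torch.arange(sim.size(0), device=sim.device)
    ela_loss = F.cross_entropy(sim, targets)

    return w_epp * epp_loss + w_lbp * lbp_loss + w_ela * ela_loss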

Requirements

  1. Install the Matterport3D simulator: follow the instructions here, then put the build on your Python path (see the import check after this list):
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install requirements:
conda create --name VLN-GELA python=3.8.5
conda activate VLN-GELA
pip install -r requirements.txt
  3. Download datasets from Baidu Netdisk, including processed annotations, features, and pre-trained models for the R2R and CVDN datasets. Put the data in the datasets directory.

  4. Download the GEL-R2R dataset from Baidu Netdisk. Put the data in the datasets/R2R/annotations/GELR2R directory.
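
To quickly verify that the simulator build is on your Python path, the following minimal check assumes the module is named MatterSim, as in the upstream Matterport3DSimulator repository:

# Sanity check: the Matterport3D simulator should be importable once
# Matterport3DSimulator/build is on PYTHONPATH (module name MatterSim
# per the upstream repository).
import MatterSim

sim = MatterSim.Simulator()
print("Matterport3D simulator imported successfully")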

Adaptive Pre-training

Grounded entity-landmark adaptive pre-training:

bash ada_pretrain_src/pretrain_r2r.sh

Fine-tuning & Evaluation

cd finetune_src
bash scripts/run_r2r.sh  # use scripts/run_cvdn.sh for CVDN

Citation

@InProceedings{Cui_2023_ICCV,
    author    = {Cui, Yibo and Xie, Liang and Zhang, Yakun and Zhang, Meishan and Yan, Ye and Yin, Erwei},
    title     = {Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {12043-12053}
}

Acknowledgments

Our code is based on VLN-HAMT, EnvEdit, and MDETR. Thanks for their great work!


vln-gela's Issues

Descriptions of the GELR2R datasets

Hi Yibo,

Thanks for your great work!
Could I get some clarification on the format of the GELR2R annotations for the landmarks?
The "landmark_bbox_coords" values appear unconventional: they are neither pixel-based nor normalized to [0, 1], with x-values ranging from 0 to 2 and y-values from 0 to 1, which confuses me. Also, the paper says the bounding box format is [x, y, w, h], but the annotations look more like [x1, y1, x2, y2]?
Another question is how to obtain the landmark images. The landmark bounding boxes are defined on the panoramas of the Matterport3D dataset, but most VLN works represent a panorama with 36 discrete views. Do the GELR2R annotations also use discrete views, or was some tool used to generate the panoramas for the bounding boxes?
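
To make the question concrete, this is how I would convert between the two conventions if the corner-based reading is right; both helpers are guesses based on the value ranges above, not the dataset's confirmed specification:

import numpy as np

def xyxy_to_xywh(box):
    # If the annotations are corner-based [x1, y1, x2, y2], this
    # recovers the [x, y, w, h] format described in the paper.
    x1, y1, x2, y2 = box
    return np.array([x1, y1, x2 - x1, y2 - y1])

def normalize_pano_box(box, x_range=2.0, y_range=1.0):
    # Rescale each axis to [0, 1], assuming (unconfirmed) that x spans
    # [0, 2] and y spans [0, 1] on the panorama.
    x1, y1, x2, y2 = box
    return np.array([x1 / x_range, y1 / y_range, x2 / x_range, y2 / y_range])

box = [1.2, 0.3, 1.6, 0.7]  # hypothetical [x1, y1, x2, y2] in the observed ranges
print(xyxy_to_xywh(normalize_pano_box(box)))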

Cheers

Bug Report: RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

Firstly, thanks for your innovative and excellent work! I got an error while trying to reproduce the results of the paper (in the pre-training stage).
Could you please help me? Of course, I'll also keep trying to figure it out and fix it myself.

Environments:

  • OS: Windows Subsystem for Linux 2 (Ubuntu 22.04)
  • CPU: Intel i7-13700KF
  • GPU: NVIDIA RTX 3060 12GB
  • Python: 3.8.18
  • PyTorch: 2.1.1_py3.8_cuda12.1_cudnn8.9.2_0
  • NumPy: 1.24.4
  • CUDA: 12.1
  • cuDNN: 8.9.2
12/11/2023 16:50:43 - WARNING - __main__ -   Output directory (datasets/R2R/exprs/pretrain/test) already exists and is not empty.
12/11/2023 16:50:43 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
  0%|                                                                                                                                                       | 0/200000 [00:00<?, ?it/s]Some weights of MultiStepNavCMTPreTraining were not initialized from the model checkpoint at None and are newly initialized: ['bbox_head.net.2.weight', 'span_head.net.0.bias', 'span_head.net.4.bias', 'con_projection_image.bias', 'land_att.linear_out.weight', 'bbox_head.net.4.bias', 'span_head.net.4.weight', 'con_projection_text.bias', 'span_att.linear_in.weight', 'bbox_head.net.2.bias', 'bbox_head.net.0.bias', 'span_head.net.0.weight', 'span_att.linear_out.weight', 'con_projection_text.weight', 'con_projection_image.weight', 'land_att.linear_in.weight', 'span_head.net.2.weight', 'bbox_head.net.4.weight', 'bbox_head.net.0.weight', 'span_head.net.2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
data_num: 4675 14039 121819 58065
data_num: 182945 1083659 121819 58065
data_num: 340 1021 9017 4304
data_num: 775 2325 20364 9517
12/11/2023 16:52:47 - INFO - __main__ -   mlm: 1083659 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sap: 6565468 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sar: 6565468 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sprel: 6565468 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   mrc: 1083659 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   itm: 1083659 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   gel: 121819 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   mlm: 1021 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sap: 6201 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sar: 6201 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sprel: 6201 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   mrc: 1021 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   itm: 1021 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   gel: 9017 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   mlm: 2325 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sap: 13875 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sar: 13875 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   sprel: 13875 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   mrc: 2325 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   itm: 2325 samples loaded
12/11/2023 16:52:47 - INFO - __main__ -   gel: 20364 samples loaded
/home/zerowing/VLN-GELA/ada_pretrain_src/data/r2r_tasks.py:595: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /opt/conda/conda-bld/pytorch_1699449229234/work/torch/csrc/utils/tensor_new.cpp:261.)
  batch['sp_targets'] = torch.FloatTensor(batch['sp_targets'])
12/11/2023 16:52:50 - INFO - __main__ -   ***** Running training with 1 GPUs *****
12/11/2023 16:52:50 - INFO - __main__ -     Batch size = 64
12/11/2023 16:52:50 - INFO - __main__ -     Accumulate steps = 1
12/11/2023 16:52:50 - INFO - __main__ -     Num steps = 200000
Traceback (most recent call last):
  File "/home/zerowing/miniforge3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zerowing/miniforge3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zerowing/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/zerowing/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/zerowing/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/zerowing/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/zerowing/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/zerowing/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/zerowing/VLN-GELA/ada_pretrain_src/main_r2r.py", line 578, in <module>
    main(args)
  File "/home/zerowing/VLN-GELA/ada_pretrain_src/main_r2r.py", line 249, in main
    loss = model(batch, task=task, compute_loss=True)
  File "/home/zerowing/miniforge3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zerowing/miniforge3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zerowing/VLN-GELA/ada_pretrain_src/model/pretrain_cmt.py", line 210, in forward
    return self.forward_mrc(batch['txt_ids'], batch['txt_masks'],
  File "/home/zerowing/VLN-GELA/ada_pretrain_src/model/pretrain_cmt.py", line 323, in forward_mrc
    hist_mrc_targets = self._compute_masked_hidden(hist_img_probs, hist_mrc_masks)
  File "/home/zerowing/VLN-GELA/ada_pretrain_src/model/pretrain_cmt.py", line 251, in _compute_masked_hidden
    hidden_masked = hidden[mask].contiguous().view(-1, hidden.size(-1))
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
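
For reference, the failure can be reproduced in isolation when the tensor's last dimension is 0, which suggests (my guess, not a confirmed diagnosis) that the image probability features were loaded with an empty feature dimension:

import torch

# Minimal reproduction: a hidden tensor whose last dimension is 0
# cannot be reshaped with view(-1, hidden.size(-1)).
hidden = torch.zeros(4, 10, 0)                 # e.g. probs with 0 classes
mask = torch.zeros(4, 10, dtype=torch.bool)    # MRC mask
try:
    hidden[mask].contiguous().view(-1, hidden.size(-1))
except RuntimeError as e:
    print(e)  # cannot reshape tensor of 0 elements into shape [-1, 0] ...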

Also, I ran into a FileNotFoundError, which I have (probably) already fixed.
It appears at:

File "/home/zerowing/VLN-GELA/ada_pretrain_src/data/r2r_data.py", line 349, in get_image_feature
    with h5py.File(self.img_ft_file, 'r') as f:

Because "datasets/R2R/features/pth_vit_base_patch16_224_imagenet_e2e.hdf5" doesn't exist, my solution was to rename the existing file "datasets/R2R/features/pth_vit_base_patch16_224_imagenet_r2r.e2e.ft.22k.hdf5" so the program can find it.
Alternatively, modify the 42nd line of "ada_pretrain_src/config/pretrain_r2r.json" ("img_ft_file": "datasets/R2R/features/pth_vit_base_patch16_224_imagenet_e2e.hdf5") to point to "pth_vit_base_patch16_224_imagenet_r2r.e2e.ft.22k.hdf5" instead.
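
To catch the missing file before a long run, a small pre-flight check like this can help; it assumes img_ft_file is a top-level key in the config, so adjust the lookup if it is nested:

import json
import os

# Verify that the image feature file referenced by the pre-training
# config exists before launching. The key location is an assumption;
# adjust if the config nests it differently.
with open("ada_pretrain_src/config/pretrain_r2r.json") as f:
    cfg = json.load(f)

img_ft_file = cfg["img_ft_file"]
if not os.path.exists(img_ft_file):
    raise FileNotFoundError(
        f"{img_ft_file} not found; rename the downloaded HDF5 file "
        "or update img_ft_file in pretrain_r2r.json")
print("Found image features:", img_ft_file)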

Schedule for open-sourcing the code?

I greatly admire your impactful and innovative work, and I would like to kindly ask about the expected timeline for open-sourcing the code. It has been a while since the ICCV 2023 acceptance notifications.
