
adapt's Introduction

ADAPT: Action-aware Driving Caption Transformer

This repository is an official implementation of ADAPT: Action-aware Driving Caption Transformer, accepted by ICRA 2023.

Created by Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou and Jingjing Liu from the Institute for AI Industry Research (AIR), Tsinghua University.

Introduction

We propose an end-to-end transformer-based architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides user-friendly natural language narrations and reasoning for autonomous vehicular control and action. ADAPT jointly trains both the driving caption task and the vehicular control prediction task, through a shared video representation.
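For intuition, here is a minimal, illustrative PyTorch sketch of the joint-training idea. This is not the ADAPT implementation: the backbone, heads, sizes, and losses below are simplified stand-ins.

import torch
import torch.nn as nn

class JointDrivingModel(nn.Module):
    # Toy illustration of the multi-task idea: one shared video encoder
    # feeding a captioning head and a control-prediction head.
    def __init__(self, vocab_size=30522, hidden=768, n_signals=2):
        super().__init__()
        # Shared video representation (ADAPT uses a Video Swin backbone;
        # a generic transformer encoder stands in for it here).
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2)
        # Driving Caption Generation (DCG) head: token logits per position.
        self.caption_head = nn.Linear(hidden, vocab_size)
        # Control Signal Prediction (CSP) head: e.g. speed and course.
        self.control_head = nn.Linear(hidden, n_signals)

    def forward(self, video_tokens):
        shared = self.video_encoder(video_tokens)             # (B, T, hidden)
        caption_logits = self.caption_head(shared)            # (B, T, vocab)
        control_pred = self.control_head(shared.mean(dim=1))  # (B, n_signals)
        return caption_logits, control_pred

# One joint training step: the captioning loss and the control loss are
# summed so both tasks update the shared video representation.
model = JointDrivingModel()
video_tokens = torch.randn(2, 32, 768)                        # 2 clips x 32 frame tokens
caption_targets = torch.randint(0, 30522, (2, 32))
control_targets = torch.randn(2, 2)

caption_logits, control_pred = model(video_tokens)
loss = nn.CrossEntropyLoss()(caption_logits.flatten(0, 1), caption_targets.flatten()) \
       + nn.MSELoss()(control_pred, control_targets)
loss.backward()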

This repository contains the training and testing code for the proposed framework in the paper, as well as demos in both a simulator environment and the real world.

Note

This repository will be updated soon, including:

  • Uploading the Preprocessed Data of BDDX.
  • Uploading the Raw Data of BDDX, along with an easier processing script.
  • Uploading the Visualization Codes of raw data and results.
  • Updating the Experiment Code to make it easier to get started with.
  • Uploading the Conda Environments of ADAPT.


Getting Started

1. Installation as Conda

Create conda environment:

conda create --name ADAPT python=3.8

Install torch:

pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
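Optionally (not part of the original instructions), you can verify that the CUDA build of torch is active before proceeding:

import torch, torchvision, torchaudio

# Confirm the CUDA-enabled wheels are installed before building apex.
print("torch:", torch.__version__, "torchvision:", torchvision.__version__,
      "torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())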

Install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..
rm -rf apex
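A quick, unofficial sanity check that the apex CUDA extensions built and import correctly:

# These imports fail if the C++/CUDA extensions were not built.
from apex import amp                    # mixed-precision API (used for the Apex O2 mode mentioned below)
from apex.optimizers import FusedAdam   # provided by the --cuda_ext build
print("apex OK")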

Install mpi4py:

conda install -c conda-forge mpi4py openmpi

Then install the other dependencies:

pip install -r requirements.txt

2. Launch Docker Container (Recommended)

We provide a Docker image to make it easy to get started. Before you run launch_container.sh, please make sure the directory names in launch_container.sh match your current directory.

sh launch_container.sh

Our latest docker image jxbbb/adapt:latest is adapted from linjieli222/videocap_torch1.7:fairscale, which supports the following mixed-precision training options; a minimal torch.amp example follows the list.

  • Torch.amp
  • Nvidia Apex O2
  • deepspeed
  • fairscale
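For reference, a minimal torch.cuda.amp training step (plain PyTorch, not ADAPT-specific code) looks like this:

import torch

# Requires a CUDA-capable GPU; the linear layer is a stand-in for the real model.
model = torch.nn.Linear(768, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(4, 768, device="cuda")
targets = torch.randn(4, 2, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                   # forward pass in mixed precision
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
scaler.scale(loss).backward()                     # scale the loss to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()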

Models

  • We release our best performing checkpoints. You can download these models at [ Google Drive ] and place them under the checkpoints directory. If the directory does not exist, you can create it.

  • We release the base video-swin models we used during training in [ Google Drive ]. If you want to use other pretrained video-swin models, you can refer to Video-Swin-Transformer.

Requirements

We provide a Docker image for easier reproduction. Please make sure the NVIDIA driver, Docker, and the NVIDIA Container Toolkit are installed.

We only support Linux with NVIDIA GPUs. We have tested on Ubuntu 18.04 with V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended. Our scripts require the user to be in the docker group so that docker commands can be run without sudo.

Then download the code and files for caption evaluation from Google Drive and put them under src/evalcap.

Dataset Preparation

You can either download the preprocessed data from this site, or download the raw videos and car information from this site and preprocess them with the code in src/prepro.

The resulting data structure should follow the hierarchy below (a small sanity-check script is given after the tree).

${REPO_DIR}
|-- checkpoints
|-- datasets  
|   |-- BDDX
|   |   |-- frame_tsv
|   |   |-- captions_BDDX.json
|   |   |-- training_32frames_caption_coco_format.json
|   |   |-- training_32frames.yaml
|   |   |-- training.caption.lineidx
|   |   |-- training.caption.lineidx.8b
|   |   |-- training.caption.linelist.tsv
|   |   |-- training.caption.tsv
|   |   |-- training.img.lineidx
|   |   |-- training.img.lineidx.8b
|   |   |-- training.img.tsv
|   |   |-- training.label.lineidx
|   |   |-- training.label.lineidx.8b
|   |   |-- training.label.tsv
|   |   |-- training.linelist.lineidx
|   |   |-- training.linelist.lineidx.8b
|   |   |-- training.linelist.tsv
|   |   |-- validation...
|   |   |-- ...
|   |   |-- validation...
|   |   |-- testing...
|   |   |-- ...
|   |   |-- testing...
|-- datasets_part
|-- docs
|-- models
|   |-- basemodel
|   |-- captioning
|   |-- video_swin_transformer
|-- scripts 
|-- src
|-- README.md 
|-- ... 
|-- ... 
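If it helps, here is a small, unofficial check that the expected training files are in place. The paths are taken from the tree above; extend it with the validation and testing splits as needed.

import os

required = [
    "datasets/BDDX/frame_tsv",
    "datasets/BDDX/captions_BDDX.json",
    "datasets/BDDX/training_32frames_caption_coco_format.json",
    "datasets/BDDX/training_32frames.yaml",
    "datasets/BDDX/training.caption.tsv",
    "datasets/BDDX/training.img.tsv",
    "datasets/BDDX/training.label.tsv",
    "datasets/BDDX/training.linelist.tsv",
]

# Run this from ${REPO_DIR}; it only reports which expected files are missing.
missing = [p for p in required if not os.path.exists(p)]
if missing:
    print("Missing dataset files/directories:", *missing, sep="\n  ")
else:
    print("All expected BDD-X training files found.")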

Quick Demo

We provide a demo to run end-to-end inference on the test video.

Our inference code takes a video as input and generates a video caption.

sh scripts/inference.sh

The prediction should look like:

Prediction: The car is stopped because the traffic light turns red.

Evaluation

We provide example scripts to evaluate pre-trained checkpoints.

# Assume in the docker container 
sh scripts/BDD_test.sh

Training

We provide example scripts to train our model in different settings.

Basic Model

# Assume in the docker container 
sh scripts/BDDX_multitask.sh

Only DCG (Driving Caption Generation) Head

# Assume in the docker container 
sh scripts/BDDX_only_caption.sh

Only CSP (Control Signal Prediction) Head

# Assume in the docker container 
sh scripts/BDDX_only_signal.sh

Only Predicting One Sentence (instead of both narration/description and reasoning/explanation)

# Assume in the docker container 
sh scripts/BDDX_multitask_des.sh
sh scripts/BDDX_multitask_exp.sh

Note that these two commands require two additional test datasets. The data structure should be:

${REPO_DIR} 
|-- datasets
|   |-- BDDX
|   |-- BDDX_des
|   |-- BDDX_exp

Qualitative results

Citation

If you find our work useful in your research, please consider citing:

@article{jin2023adapt,
  title={ADAPT: Action-aware Driving Caption Transformer},
  author={Jin, Bu and Liu, Xinyu and Zheng, Yupeng and Li, Pengfei and Zhao, Hao and Zhang, Tong and Zheng, Yuhang and Zhou, Guyue and Liu, Jingjing},
  journal={arXiv preprint arXiv:2302.00673},
  year={2023}
}

Acknowledgments

Our code is built on top of open-source GitHub repositories. We thank all the authors who made their code public, which tremendously accelerates our project progress. If you find these works helpful, please consider citing them as well.

Microsoft/SwinBERT

JinkyuKimUCB/BDD-X-dataset

huggingface/transformers

Microsoft/DeepSpeed

Nvidia/Apex

FAIR/FairScale

adapt's People

Contributors

jxbbb, lyx997


adapt's Issues

It seems the code has not been updated yet?

Has the code been updated? I tried to run the provided checkpoint, but it outputs this: Prediction: ##ウ ɲウmind doctorウ bumped unlessiz negotiating includeウウウ28? intending cutler leah contacts? cutler?? stands backstroke protect leah. Is anything wrong?

Experimental results

[two screenshots of experimental results]

Hello, you have done a good job! There are some gaps between the results I obtained and those in the paper. Experimental configuration: 4 RTX 3090 GPUs, batch size = 2. The experimental results are shown in the two screenshots above. May I ask what the likely cause of the problem is?

CARLA demo and results

Hi, nice work and quite inspiring! I noticed that you mentioned you have deployed your system in CARLA and real-world scenarios, so could you provide some demos or quantitative results? Or did I miss anything? I just want to know the generalization performance of your system.

Thanks a lot for your sincere help.

About the hdf5 file for the car signal

Thanks for your great work!
I downloaded the processed data from the provided link, but cannot find the h5 file of car signals used as model inputs.
Specifically, in this code the loader loads the h5 file of sensor data from the root path (I think it should be BDDX/processed_video_info/), but I cannot find it in the downloaded processed data. Did I make a mistake, or can the h5 file not be released yet?
Looking forward to hearing from you!

About control signal prediction

Thanks for sharing the code. Recently I read your excellent work ADAPT, and I have some questions about the control prediction part.

  1. I wonder whether you use past control signals for control signal prediction, i.e., use a_{t-1}, a_{t-2}, ..., a_{0} to predict a_t, or whether you directly predict control signals from the input video images.

  2. If you directly predict controls from input videos, since you do not take the time length of each video clip as input, how can you predict speed from the input video? In other words, how can you know whether the input video is 3 seconds or 10 seconds long? It seems that you do not feed this information into your network.

Control signal output

Thanks for sharing the code and data. May I know how to reproduce your control signal output? I cannot find any related information. For a fair comparison, could you please provide the raw predicted control signals or scripts for reproduction?

missing variables in demo.py

Here, what do caps and infos refer to?

ADAPT/demo.py

Lines 130 to 135 in 66353c2

for anno in caps:
    if cnt >= int(anno['sTime'])*30 and cnt < int(anno['eTime'])*30:
        cap = anno['action'] + ' ' + anno['justification']
        break
accelerator, accuracy, course, curvature, goaldir, latitude, longitude, speed = infos['accelerator'], infos['accuracy'], infos['course'], infos['curvature'], infos['goaldir'], infos['latitude'], infos['longitude'], infos['speed']

  • It would be great if you could provide the Dockerfile.

control signal output

Hi, I wonder whether the model you provide now can output the control signals for both speed and course. If I set signal_type to course speed, it outputs a dimension error (it requires [1, 768] and I give [2, 768]). It only has one dimension; what does this one dimension correspond to?

About BDD-X

I would like to ask a question about the BDD-X dataset: where can I get the video or image files of BDD-X? I found the training_32frames_img_size256.img.tsv file you provide, which seems to be base64 encoded? I can't be sure because I can't decode it to open it as an image.

How to inference the motion control signal?

Could you please show a simple demo of predicting the motion control signal? I checked the repo code and the paper, and understand that I need to set args.use_car_sensor = True and set the value of the car_info key of the inputs.

inputs = {'is_decode': True,

But I have no idea what I should use as the value of car_info, or exactly which signals the model will predict. Could you give some specific instructions? Thanks a lot.

Request Dataset Authorization

I am very interested in your work; could you provide the raw video data?

I have sent you an email but have not received a reply yet; looking forward to your reply!

Thank you!

About control signal

Hello, I tried to get the control signal output from the model by feeding it a video, but when I refer to the code of the training part, I find that the output dimension is not the same. Has the code not been updated? How do you get the control signal from the input? Thank you for your answer.

About Quick Demo

Thank you for your excellent work!

However, I want to confirm that the Quick Demo does not include control signal output, right (because the multitask parameter is not set)? So is the model.bin file in the checkpoint you provided also trained only for the driving caption output?

If this is the case, then is there a script provided that can output the results of the control signal prediction?

Or what modifications and settings do I need to make in order to output both the driving caption and the control signal from the input video?

[Installation as Conda] Mismatch between python version and pandas version

I created a Python 3.8 environment called ADAPT following Getting Started, but encountered an error when installing pandas.
The original requirements file asks for pandas == 2.1.3; however, version 2.1.3 requires Python >= 3.9.
Did I make a mistake somewhere?
What should I do: upgrade Python to 3.9 or downgrade pandas to 2.0.3?

How to solve the dataloader problem when training the model?

When I try to use the official code to train the model, I get this problem when loading the training data:

Original Traceback (most recent call last):
  File "/home/mayue/miniconda3/envs/ADAPT/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/mayue/miniconda3/envs/ADAPT/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/mayue/miniconda3/envs/ADAPT/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/csr/ADAPT/src/datasets/vision_language_tsv.py", line 395, in __getitem__
    raw_frames, is_video = self.get_visual_data(img_idx, start, end)
  File "/home/csr/ADAPT/src/datasets/vision_language_tsv.py", line 322, in get_visual_data
    row = self.get_row_from_tsv(self.visual_tsv, idx)
  File "/home/csr/ADAPT/src/datasets/vision_language_tsv.py", line 171, in get_row_from_tsv
    assert row[0].split('/')[0] == self.image_keys[img_idx].split('/')[-1] or row[0].split('/')[-1] == self.image_keys[img_idx].split('/')[-1]
AssertionError

I have looked at the original code and I guess you have faced the same problem. Could you tell me how to fix it? Thank you very much!

Annotation Tools

Could you please provide the annotation tools you use when annotating datasets?
Thanks!
