
Visual Abductive Reasoning

This repository is an official PyTorch implementation of paper:
Visual Abductive Reasoning.
Chen Liang, Wenguan Wang, Tianfei Zhou, Yi Yang
CVPR 2022.

News & Update Logs:

  • [2022-03-25] Repo created. Paper, code, and data will come in a few days. Stay tuned.
  • [2022-03-26] VAR dataset v1.0 released; Evaluation toolkit uploaded.
  • [2022-04-07] Full paper available at arXiv.
  • [2022-06-22] Pre-extracted features released; available at Baidu Net Disk (code: dvar) and OneDrive.
  • [2022-06-23] Feature extraction code with brief instructions available here; full code released.

Abstract

Abductive reasoning seeks the likeliest possible explanation for partial observations. Although abduction is frequently employed in human daily reasoning, it is rarely explored in the computer vision literature. In this paper, we propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations. Given an incomplete set of visual events, AI systems are required to not only describe what is observed, but also infer the hypothesis that can best explain the visual premise. Based on our large-scale VAR dataset, we devise a strong baseline model, Reasoner (causal-and-cascaded reasoning Transformer). First, to capture the causal structure of the observations, a contextualized directional position embedding strategy is adopted in the encoder, which yields discriminative representations for the premise and hypothesis. Then, multiple decoders are cascaded to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasoner surpasses many famous video-language models, while still being far behind human performance. This work is expected to foster future efforts in the reasoning-beyond-observation paradigm.

VAR Dataset

Data Preparation

Note: You can either download videos from YouTube with youtube-dl and then extract raw frames with OpenCV (Steps 1-1 and 1-2), or directly download pre-extracted features (Step 1-3).

Step 1-1: Prepare videos.

First, run the following script to download the videos. The code is adapted from here.

bash data/tools/download_videos.sh

Note: You may fail to download some videos due to geographical restrictions or other reasons. We maintain a copy of the full dataset; please fill out this form to request access.
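
If the provided script does not work in your environment, the download step essentially amounts to fetching every YouTube ID referenced in the annotation files. Below is a minimal, hypothetical Python sketch of that idea; the output naming and paths are assumptions, and download_videos.sh remains the authoritative script.

# Minimal sketch (not the official script): download the videos referenced in
# an annotation file with youtube-dl. Output naming is an assumption.
import json
import subprocess

with open("data/VAR/data/var_val_v1.0.json") as f:
    annotations = json.load(f)

video_ids = {ev["video_id"] for ex in annotations.values() for ev in ex["events"]}

for vid in sorted(video_ids):
    # youtube-dl picks the best available format; some videos may be unavailable.
    subprocess.run(
        ["youtube-dl", "-o", f"data/VAR/videos/{vid}.%(ext)s",
         f"https://www.youtube.com/watch?v={vid}"],
        check=False,  # keep going even if one video cannot be downloaded
    )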

Step 1-2: Prepare raw RGB frames.

Then, use the following script to extract RGB frames.

bash data/tools/extract_video_frames.sh
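
For reference, the core of the extraction is a plain OpenCV decode loop. The sketch below is a simplified, hypothetical version: the frame-naming scheme, output layout, and sampling (every frame) are assumptions, and the provided script is authoritative.

# Minimal sketch: dump all frames of one video to JPEG files with OpenCV.
# The naming scheme "img_%06d.jpg" is an assumption, not the official one.
import os
import cv2

def extract_frames(video_path: str, out_dir: str) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

n = extract_frames("data/VAR/videos/X0Tj0nItuZQ.mp4", "data/VAR/rawframes/X0Tj0nItuZQ")
print(f"extracted {n} frames")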

Step 1-3: Prepare pre-extracted features.

If you are not interested in end-to-end training or in extracting features with customized models, a quick option is to download our pre-extracted features. Both video and vocabulary features are available at Baidu Net Disk (code: dvar) and OneDrive. The feature extraction code is also released here.

Step 2: Prepare VAR annotations.

For annotations, you may clone this GitHub repository; the VAR annotation files are released under data. Alternatively, download the annotation files from Baidu Net Disk (code: dvar) or OneDrive.

Step 3: Check directory structure.

code_root/
└── data/
    └── VAR/
        ├── data/
        │   ├── var_video_duration_v1.0.csv
        │   ├── var_train_v1.0.json
        │   ├── var_val_v1.0.json
        │   └── var_test_v1.0.json
        ├── videos/
        ├── rawframes/
        ├── video_feature/
        └── vocab_feature/

Annotations

The VAR dataset contains 3 subsets:

Split   #Examples   Filename               Description
train   7,053       var_train_v1.0.json    Model training
val     460         var_val_v1.0.json      Hyperparameter tuning
test    1,093       var_test_v1.0.json     Model testing

Here is an annotated example from the VAR test split:

"wsr_4yS8jc8":                         # example id
{
    "events": [
        {
            "video_id": "X0Tj0nItuZQ",
            "timestamp": [80, 103],    # start/end timestamp of a event
            "clip_idx": 0,
            "clip_tot": 4,             # total number of events in an example 
            "duration": 145,           # duration time of the entire video 
            "sentence": "A couple walks in from outside and then kisses each other."
        },
        ... # omit...
        {
            "video_id": "uirnMHR7IG8", # an example might contain multiple videos 
            "timestamp": [104, 130],
            "clip_idx": 3,
            "clip_tot": 4,
            "duration": 167,
            "sentence": "The wife is screaming and crying on the ground."
        }
    ],
    "hypothesis": 2,                   # index of the hypothesis
    "split": "test"                    # split of this example
}
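
To make the format concrete, here is a small illustrative sketch (not part of the released code) showing how an example can be split into its premise (observed) events and its hypothesis event using the hypothesis index; the field names follow the annotation above.

# Minimal sketch: read one split and separate premise events from the hypothesis event.
import json

with open("data/VAR/data/var_test_v1.0.json") as f:
    examples = json.load(f)

example = examples["wsr_4yS8jc8"]
hyp_idx = example["hypothesis"]

hypothesis = example["events"][hyp_idx]
premise = [ev for i, ev in enumerate(example["events"]) if i != hyp_idx]

print("hypothesis:", hypothesis["sentence"])
print("premise sentences:", [ev["sentence"] for ev in premise])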

Evaluation

We provide a toolkit for model evaluation. If you are interested in comparing performance with Reasoner, we strongly recommend testing VAR models with our static BERTScore implementation. If not, you may skip Step 1.

Step 0: Prepare model predictions

Prediction results (JSON files) are expected to follow this format:

{
    "EXAMPLE_ID": 
    [       # list of events
        {
            "sentence": str,
            "gt_sentence": str,
            "is_hypothesis": bool
        },
        ... # omit...
    ],
    ...
}
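
As an illustration only, a prediction file in this format can be assembled as sketched below; generate_sentence is a hypothetical placeholder for your own model, not part of the toolkit.

# Minimal sketch: write model outputs in the format expected by the evaluation toolkit.
import json

def generate_sentence(example: dict, event_idx: int) -> str:
    """Hypothetical placeholder for the sentence your model generates for one event."""
    return "a generated sentence"

with open("data/VAR/data/var_test_v1.0.json") as f:
    examples = json.load(f)

predictions = {}
for example_id, example in examples.items():
    hyp_idx = example["hypothesis"]
    predictions[example_id] = [
        {
            "sentence": generate_sentence(example, i),
            "gt_sentence": event["sentence"],
            "is_hypothesis": i == hyp_idx,
        }
        for i, event in enumerate(example["events"])
    ]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)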

Step 1: Prepare static version of BERTScore

First, install bert-score from PyPI with pip:

pip install bert-score

Next, download the RoBERTa model from Baidu Net Disk (code: dvar) or OneDrive, and extract the tar file:

tar xzf roberta_large_619fd8c.tar.gz -C ./eval_kit
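
For orientation, this is roughly how bert-score scores sentence pairs against a locally stored model. The sketch is illustrative only: the extracted directory name and num_layers=17 (bert-score's default layer for roberta-large) are assumptions, and eval_kit/evaluate_models.py is the authoritative implementation.

# Minimal sketch: score candidates against references with bert-score,
# pointing model_type at the locally extracted RoBERTa checkpoint.
from bert_score import BERTScorer

scorer = BERTScorer(
    model_type="./eval_kit/roberta_large_619fd8c",  # assumed extraction path
    num_layers=17,                                  # default layer for roberta-large
    lang="en",
)

cands = ["A couple walks in and kisses each other."]
refs = ["A couple walks in from outside and then kisses each other."]
P, R, F1 = scorer.score(cands, refs)
print(f"BERTScore F1: {F1.mean().item():.4f}")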

Step 2: Evaluate the models

python eval_kit/evaluate_models.py ${PREDICTION_FILE} 

Baseline: Reasoner Usage

Note: Please first prepare both the VAR Dataset and pre-extracted features following the instructions above.

Prerequisites

Step 1: Prepare conda environment.

conda create -n var python=3.7.10
conda activate var
pip install -r requirements.txt
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Step 2 (optional): Build the CUDA operators for iRPE.

This speeds up backpropagation through the relative position module, which is directly forked from iRPE. Find more information in the original repository. Note: nvcc is required to build the CUDA operators.

cd src/model/utils/rpe_ops
python setup.py install --user

Training

bash scripts/train.sh ${N_GPUS} ${MODEL_NAME} 

Note: Training with two NVIDIA GeForce RTX 2080 Ti GPUs is recommended.

FAQ

I get better results (~3-4 CIDEr points higher) on the Explanation set or slightly worse results (~1-2 CIDEr points lower) on the Premise set. Why?

This is expected behavior. We do observe that the results vary when the model is trained on a different machine (with two NVIDIA GeForce RTX 3090s or one NVIDIA Tesla V100) or with a different PyTorch version (1.9.1 or 1.11.0); cf. this discussion. Fortunately, only a relatively small performance perturbation is observed on a single machine with a fixed seed.
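
If you want to reduce run-to-run variance on a single machine, make sure all random seeds are fixed before training. The following is only a generic PyTorch seeding sketch, not the repository's own seeding code.

# Generic reproducibility sketch (not the repository's own code).
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)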

License

Code License

The implementation code is released under the MIT license. See the LICENSE file for details.

Data License

Annotations of the VAR dataset are released under the CC BY 4.0 license. See the LICENSE file for details.

Citation

If you find the dataset or code useful, please consider citing our paper:

@inproceedings{liang2022var,
  title      = {Visual Abductive Reasoning},
  author     = {Liang, Chen and Wang, Wenguan and Zhou, Tianfei and Yang, Yi},
  booktitle  = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year       = {2022}
}

Acknowledgment

Our implementation of Reasoner is partly based on the following codebases. We thank the authors for their wonderful work: mmaction, BERTScore, MART, vlep, densecap, iRPE.

Contact

This repository is currently maintained by Chen Liang.


Issues

Some data are weird.

Some examples come from the same video and overlap heavily, e.g., the start time of the second event falls within the time span of the first event.


Cannot find where to test or where to load the trained model

Hello, I am reproducing your code for my graduation project. I have trained a model with run.py, but I cannot find where testing happens or where the trained model is loaded. Should I simply change the evaluate_mode argument in run.py to test? That does not feel right to me either, since the training set is also loaded.

Problems with reproducing the results in the paper

Hi! How can I reproduce the results reported in the paper? I get a CIDEr of 34.97 for observed events and 37.68 for explanation events. I have run the code several times, and the results are similar.

Some data in the training set seem weird.

First, thank you for sharing the code and models with the community. I think this is an exciting task that is worth deeper research.

However, while glancing through the dataset, I noticed some samples, especially in the last part of the training set, that look weird. Here is an example:

{'events': [{'video_id': 'UY0nYr-dXEI',
   'timestamp': [0, 157],
   'clip_idx': 0,
   'clip_tot': 10,
   'sentence': 'After a meeting with his pyschiatrist, the doctor, the man participates in an office conga line and talks to his pets about the woman he has fallen for.',
   'duration': 157},
  {'video_id': 'pYmo3PXF_T4',
   'timestamp': [0, 155],
   'clip_idx': 1,
   'clip_tot': 10,
   'sentence': 'After he hits a deer with his car, the woman runs away from the man, causing him to lose control.',
   'duration': 155},
  {'video_id': 'H5I1DyJ3w1g',
   'timestamp': [0, 151],
   'clip_idx': 2,
   'clip_tot': 10,
   'sentence': 'the man cleans up the mess he made from killing the woman, but her severed head begins speaking to him.',
   'duration': 151},
  {'video_id': 'bJSDrRcwwKQ',
   'timestamp': [0, 149],
   'clip_idx': 3,
   'clip_tot': 10,
   'sentence': 'the man recalls his involvement in the gruesome death of his mother.',
   'duration': 149},
  {'video_id': 'xllpnvAmnHE',
   'timestamp': [0, 157],
   'clip_idx': 4,
   'clip_tot': 10,
   'sentence': "the man talks to his cat, dog, and the woman's severed head to determine if he is truly a serial killer.",
   'duration': 157},
  {'video_id': 'ml_zSw6yWOE',
   'timestamp': [0, 170],
   'clip_idx': 5,
   'clip_tot': 10,
   'sentence': 'the man tries to get the woman to calm down after she discovers that he murdered a woman.',
   'duration': 170},
  {'video_id': 'SaaTUj7m8e4',
   'timestamp': [0, 150],
   'clip_idx': 6,
   'clip_tot': 10,
   'sentence': 'After the man kills the woman, all the voices in his head offer an opinion.',
   'duration': 150},
  {'video_id': 'ZMplRnotp8M',
   'timestamp': [0, 148],
   'clip_idx': 7,
   'clip_tot': 10,
   'sentence': 'the man kidnaps the doctor and takes her to a remote field, where he forces her to give him the therapy he needs.',
   'duration': 148},
  {'video_id': 'Ax-iwIoIxjY',
   'timestamp': [0, 132],
   'clip_idx': 8,
   'clip_tot': 10,
   'sentence': 'When the police arrive at his apartment, the man escapes through the vents, causing a gas leak and explosion.',
   'duration': 132},
  {'video_id': 'Gk_2euKF9MY',
   'timestamp': [0, 160],
   'clip_idx': 9,
   'clip_tot': 10,
   'sentence': 'the man passes on and succumbs to the voices in his head, who perform a cheery song and dance number.',
   'duration': 160}],
 'hypothesis': 2,
 'split': 'train'}

As shown, all clips start at 0 and end at different timestamps. Is this expected, or are these mistaken samples?

Thanks!

A question about temporal overlap between events.

Thanks for sharing the code and the data of this interesting work.

I have a question after checking some samples in var_*_v1.0.json. The goal of VAR is to infer the explanation from incomplete observations. That is to say, the video content of the "explanation" event should not be exposed to the model, am I right? However, there seems to be temporal overlap between the "explanation" and "observation" events. For example, in the following sample, the index of the "explanation" event is 1, whose timestamp is [3.09, 61.72], while the timestamp of the next clip is [15.43, 55.24]. Is this acceptable?

{
  "events": [
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        0,
        2.78
      ],
      "clip_idx": 0,
      "clip_tot": 6,
      "sentence": "The video starts with a title logo sequence.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        3.09,
        61.72
      ],
      "clip_idx": 1,
      "clip_tot": 6,
      "sentence": "A man and woman are in a living room demonstrating exercises.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        15.43,
        55.24
      ],
      "clip_idx": 2,
      "clip_tot": 6,
      "sentence": "The woman lays on the ground.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        17.59,
        54
      ],
      "clip_idx": 3,
      "clip_tot": 6,
      "sentence": "The man starts pointing to different areas of the woman's body as she does an exercise.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        39.81,
        54.62
      ],
      "clip_idx": 4,
      "clip_tot": 6,
      "sentence": "The woman begins to do small sit ups.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        56.47,
        61.72
      ],
      "clip_idx": 5,
      "clip_tot": 6,
      "sentence": "The woman ends with a final title logo sequence.",
      "duration": 61.72
    }
  ],
  "hypothesis": 1,
  "split": "train"
}
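
For anyone who wants to quantify this, here is a small illustrative sketch (not part of the released toolkit) that flags overlapping event timestamps within each example of an annotation file.

# Minimal sketch: report examples whose events have overlapping timestamps.
import json

with open("data/VAR/data/var_train_v1.0.json") as f:
    examples = json.load(f)

for example_id, example in examples.items():
    events = example["events"]
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            # Overlap is only meaningful for events cut from the same source video.
            if events[i]["video_id"] != events[j]["video_id"]:
                continue
            s1, e1 = events[i]["timestamp"]
            s2, e2 = events[j]["timestamp"]
            if min(e1, e2) - max(s1, s2) > 0:
                print(f"{example_id}: events {i} and {j} overlap")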

The raw videos

I have downloaded the raw videos and concatenated them, but I could not obtain all of the videos. Could you provide a way to get them as a zipped archive?
