
Visual Abductive Reasoning

This repository is an official PyTorch implementation of paper:
Visual Abductive Reasoning.
Chen Liang, Wenguan Wang, Tianfei Zhou, Yi Yang
CVPR 2022.

News & Update Logs:

  • [2022-03-25] Repo created. Paper, code, and data will come in a few days. Stay tuned.
  • [2022-03-26] VAR dataset v1.0 released; Evaluation toolkit uploaded.
  • [2022-04-07] Full paper available at arXiv.
  • [2022-06-22] Pre-extracted features released; available at Baidu Net Disk (code: dvar) and OneDrive.
  • [2022-06-23] Feature extraction code with brief instructions available here; full code released.

Abstract

Abductive reasoning seeks the likeliest possible explanation for partial observations. Although abduction is frequently employed in human daily reasoning, it is rarely explored in the computer vision literature. In this paper, we propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations. Given an incomplete set of visual events, AI systems are required to not only describe what is observed, but also infer the hypothesis that can best explain the visual premise. Based on our large-scale VAR dataset, we devise a strong baseline model, Reasoner (causal-and-cascaded reasoning Transformer). First, to capture the causal structure of the observations, a contextualized directional position embedding strategy is adopted in the encoder, which yields discriminative representations for the premise and hypothesis. Then, multiple decoders are cascaded to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasoner surpasses many famous video-language models, while still being far behind human performance. This work is expected to foster future efforts in the reasoning-beyond-observation paradigm.

VAR Dataset

Data Preparation

Note: You can either download videos from YouTube with youtube-dl and then extract raw frames with OpenCV (Steps 1-1 and 1-2), or directly download pre-extracted features (Step 1-3).

Step 1-1: Prepare videos.

First, run the following script to download the videos. The code is adapted from here.

bash data/tools/download_videos.sh

Note: You may fail to download some videos due to geographical restrictions or other reasons. We maintain a copy of the full dataset; please fill out this form to request access.
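
If the provided script does not work in your environment, the download step essentially amounts to fetching every YouTube ID referenced in the annotation files. Below is a minimal, hypothetical Python sketch of that idea; the output naming and paths are assumptions, and download_videos.sh remains the authoritative script.

# Minimal sketch (not the official script): download the videos referenced in
# an annotation file with youtube-dl. Output naming is an assumption.
import json
import subprocess

with open("data/VAR/data/var_val_v1.0.json") as f:
    annotations = json.load(f)

video_ids = {ev["video_id"] for ex in annotations.values() for ev in ex["events"]}

for vid in sorted(video_ids):
    # youtube-dl picks the best available format; some videos may be unavailable.
    subprocess.run(
        ["youtube-dl", "-o", f"data/VAR/videos/{vid}.%(ext)s",
         f"https://www.youtube.com/watch?v={vid}"],
        check=False,  # keep going even if one video cannot be downloaded
    )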

Step 1-2: Prepare raw RGB frames.

Then, use the following script to extract RGB frames.

bash data/tools/extract_video_frames.sh
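
For reference, the core of the extraction is a plain OpenCV decode loop. The sketch below is a simplified, hypothetical version: the frame-naming scheme, output layout, and sampling (every frame) are assumptions, and the provided script is authoritative.

# Minimal sketch: dump all frames of one video to JPEG files with OpenCV.
# The naming scheme "img_%06d.jpg" is an assumption, not the official one.
import os
import cv2

def extract_frames(video_path: str, out_dir: str) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

n = extract_frames("data/VAR/videos/X0Tj0nItuZQ.mp4", "data/VAR/rawframes/X0Tj0nItuZQ")
print(f"extracted {n} frames")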

Step 1-3: Prepare pre-extracted features.

If you are not interested in end-to-end training or in extracting features with customized models, a quick option is to download our pre-extracted features. Both video and vocabulary features are available at Baidu Net Disk (code: dvar) and OneDrive. The feature extraction code is also released here.

Step 2: Prepare VAR annotations.

For annotations, you may clone this GitHub repository; the VAR annotation files are released under data. Alternatively, download the annotation files from Baidu Net Disk (code: dvar) or OneDrive.

Step 3: Check directory structure.

code_root/
└── data/
    └── VAR/
        ├── data/
        │   ├── var_video_duration_v1.0.csv
        │   ├── var_train_v1.0.json
        │   ├── var_val_v1.0.json
        │   └── var_test_v1.0.json
        ├── videos/
        ├── rawframes/
        ├── video_feature/
        └── vocab_feature/

Annotations

The VAR dataset contains 3 subsets:

Split   #Examples   Filename               Description
train   7,053       var_train_v1.0.json    Model training
val     460         var_val_v1.0.json      Hyperparameter tuning
test    1,093       var_test_v1.0.json     Model testing

Here is an annotated example from the VAR test split:

"wsr_4yS8jc8":                         # example id
{
    "events": [
        {
            "video_id": "X0Tj0nItuZQ",
            "timestamp": [80, 103],    # start/end timestamp of a event
            "clip_idx": 0,
            "clip_tot": 4,             # total number of events in an example 
            "duration": 145,           # duration time of the entire video 
            "sentence": "A couple walks in from outside and then kisses each other."
        },
        ... # omit...
        {
            "video_id": "uirnMHR7IG8", # an example might contain multiple videos 
            "timestamp": [104, 130],
            "clip_idx": 3,
            "clip_tot": 4,
            "duration": 167,
            "sentence": "The wife is screaming and crying on the ground."
        }
    ],
    "hypothesis": 2,                   # index of the hypothesis
    "split": "test"                    # split of this example
}
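
To make the format concrete, here is a small illustrative sketch (not part of the released code) showing how an example can be split into its premise (observed) events and its hypothesis event using the hypothesis index; the field names follow the annotation above.

# Minimal sketch: read one split and separate premise events from the hypothesis event.
import json

with open("data/VAR/data/var_test_v1.0.json") as f:
    examples = json.load(f)

example = examples["wsr_4yS8jc8"]
hyp_idx = example["hypothesis"]

hypothesis = example["events"][hyp_idx]
premise = [ev for i, ev in enumerate(example["events"]) if i != hyp_idx]

print("hypothesis:", hypothesis["sentence"])
print("premise sentences:", [ev["sentence"] for ev in premise])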

Evaluation

We provide a toolkit for model evaluation. If you are interested in comparing performance with Reasoner, we strongly recommend testing VAR models with our static BERTScore implementation. If not, you may skip Step 1.

Step 0: Prepare model predictions

Prediction results (JSON files) are expected to follow this format:

{
    "EXAMPLE_ID": 
    [       # list of events
        {
            "sentence": str,
            "gt_sentence": str,
            "is_hypothesis": bool
        },
        ... # omit...
    ],
    ...
}
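
As an illustration only, a prediction file in this format can be assembled as sketched below; generate_sentence is a hypothetical placeholder for your own model, not part of the toolkit.

# Minimal sketch: write model outputs in the format expected by the evaluation toolkit.
import json

def generate_sentence(example: dict, event_idx: int) -> str:
    """Hypothetical placeholder for the sentence your model generates for one event."""
    return "a generated sentence"

with open("data/VAR/data/var_test_v1.0.json") as f:
    examples = json.load(f)

predictions = {}
for example_id, example in examples.items():
    hyp_idx = example["hypothesis"]
    predictions[example_id] = [
        {
            "sentence": generate_sentence(example, i),
            "gt_sentence": event["sentence"],
            "is_hypothesis": i == hyp_idx,
        }
        for i, event in enumerate(example["events"])
    ]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)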

Step 1: Prepare static version of BERTScore

First, install bert-score from PyPI with pip:

pip install bert-score

Next, download the RoBERTa model from Baidu Net Disk (code: dvar) or OneDrive, and extract the tar file:

tar xzf roberta_large_619fd8c.tar.gz -C ./eval_kit
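
For orientation, this is roughly how bert-score scores sentence pairs against a locally stored model. The sketch is illustrative only: the extracted directory name and num_layers=17 (bert-score's default layer for roberta-large) are assumptions, and eval_kit/evaluate_models.py is the authoritative implementation.

# Minimal sketch: score candidates against references with bert-score,
# pointing model_type at the locally extracted RoBERTa checkpoint.
from bert_score import BERTScorer

scorer = BERTScorer(
    model_type="./eval_kit/roberta_large_619fd8c",  # assumed extraction path
    num_layers=17,                                  # default layer for roberta-large
    lang="en",
)

cands = ["A couple walks in and kisses each other."]
refs = ["A couple walks in from outside and then kisses each other."]
P, R, F1 = scorer.score(cands, refs)
print(f"BERTScore F1: {F1.mean().item():.4f}")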

Step 2: Evaluate the models

python eval_kit/evaluate_models.py ${PREDICTION_FILE} 

Baseline: Reasoner Usage

Note: Please first prepare both the VAR Dataset and pre-extracted features following the instructions above.

Prerequisites

Step 1: Prepare conda environment.

conda create -n var python=3.7.10
conda activate var
pip install -r requirements.txt
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Step 2 (optional): Build the CUDA operators for iRPE.

This speeds up backpropagation through the relative position module, which is directly forked from iRPE. Find more information in the original repository. Note: nvcc is required to build the CUDA operators.

cd src/model/utils/rpe_ops
python setup.py install --user

Training

bash scripts/train.sh ${N_GPUS} ${MODEL_NAME} 

Note: Training with two NVIDIA GeForce RTX 2080 Ti GPUs is recommended.

FAQ

I get better results (~3-4 CIDEr points higher) on the Explanation set or slightly worse results (~1-2 CIDEr points lower) on the Premise set. Why?

This is expected behavior. We do observe that the results vary when the model is trained on a different machine (with two NVIDIA GeForce RTX 3090s or one NVIDIA Tesla V100) or with a different PyTorch version (1.9.1 or 1.11.0); cf. this discussion. Fortunately, only a relatively small performance perturbation is observed on a single machine with a fixed seed.
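
If you want to reduce run-to-run variance on a single machine, make sure all random seeds are fixed before training. The following is only a generic PyTorch seeding sketch, not the repository's own seeding code.

# Generic reproducibility sketch (not the repository's own code).
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)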

License

Code License

The implementation code is released under the MIT license. See the LICENSE file for details.

Data License

Annotations of the VAR dataset are released under the CC BY 4.0 license. See the LICENSE file for details.

Citation

If you find the dataset or code useful, please consider citing our paper:

@inproceedings{liang2022var,
  title      = {Visual Abductive Reasoning},
  author     = {Liang, Chen and Wang, Wenguan and Zhou, Tianfei and Yang, Yi},
  booktitle  = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year       = {2022}
}

Acknowledgment

Our implementation of Reasoner is partly based on the following codebases. We thank the authors for their wonderful work: mmaction, BERTScore, MART, vlep, densecap, iRPE.

Contact

This repository is currently maintained by Chen Liang.


Issues

Some data are weird.

Some examples come from the same video and overlap heavily, e.g., the start time of the second event falls within the time span of the first event.


Cannot find where to test or where to load the trained model

Hello, I am reproducing your code for my graduation project. I have trained a model with run.py, but I cannot find where testing happens or where the trained model is loaded. Should I simply change the evaluate_mode argument in run.py to test? That does not feel right to me either, since the training set is also loaded.

Problems with reproducing the results in the paper

Hi! How can I reproduce the results reported in the paper? I get a CIDEr of 34.97 for observed events and 37.68 for explanation events. I have run the code several times, and the results are similar.

Some data in the training set seem weird.

First, thank you for sharing the code and models with the community. I think this is an exciting task that is worth deeper research.

However, while glancing through the dataset, I noticed some samples, especially in the last part of the training set, that look weird. Here is an example:

{'events': [{'video_id': 'UY0nYr-dXEI',
   'timestamp': [0, 157],
   'clip_idx': 0,
   'clip_tot': 10,
   'sentence': 'After a meeting with his pyschiatrist, the doctor, the man participates in an office conga line and talks to his pets about the woman he has fallen for.',
   'duration': 157},
  {'video_id': 'pYmo3PXF_T4',
   'timestamp': [0, 155],
   'clip_idx': 1,
   'clip_tot': 10,
   'sentence': 'After he hits a deer with his car, the woman runs away from the man, causing him to lose control.',
   'duration': 155},
  {'video_id': 'H5I1DyJ3w1g',
   'timestamp': [0, 151],
   'clip_idx': 2,
   'clip_tot': 10,
   'sentence': 'the man cleans up the mess he made from killing the woman, but her severed head begins speaking to him.',
   'duration': 151},
  {'video_id': 'bJSDrRcwwKQ',
   'timestamp': [0, 149],
   'clip_idx': 3,
   'clip_tot': 10,
   'sentence': 'the man recalls his involvement in the gruesome death of his mother.',
   'duration': 149},
  {'video_id': 'xllpnvAmnHE',
   'timestamp': [0, 157],
   'clip_idx': 4,
   'clip_tot': 10,
   'sentence': "the man talks to his cat, dog, and the woman's severed head to determine if he is truly a serial killer.",
   'duration': 157},
  {'video_id': 'ml_zSw6yWOE',
   'timestamp': [0, 170],
   'clip_idx': 5,
   'clip_tot': 10,
   'sentence': 'the man tries to get the woman to calm down after she discovers that he murdered a woman.',
   'duration': 170},
  {'video_id': 'SaaTUj7m8e4',
   'timestamp': [0, 150],
   'clip_idx': 6,
   'clip_tot': 10,
   'sentence': 'After the man kills the woman, all the voices in his head offer an opinion.',
   'duration': 150},
  {'video_id': 'ZMplRnotp8M',
   'timestamp': [0, 148],
   'clip_idx': 7,
   'clip_tot': 10,
   'sentence': 'the man kidnaps the doctor and takes her to a remote field, where he forces her to give him the therapy he needs.',
   'duration': 148},
  {'video_id': 'Ax-iwIoIxjY',
   'timestamp': [0, 132],
   'clip_idx': 8,
   'clip_tot': 10,
   'sentence': 'When the police arrive at his apartment, the man escapes through the vents, causing a gas leak and explosion.',
   'duration': 132},
  {'video_id': 'Gk_2euKF9MY',
   'timestamp': [0, 160],
   'clip_idx': 9,
   'clip_tot': 10,
   'sentence': 'the man passes on and succumbs to the voices in his head, who perform a cheery song and dance number.',
   'duration': 160}],
 'hypothesis': 2,
 'split': 'train'}

As shown, all clips start at 0 and end at different timestamps. Is this expected, or are these mistaken samples?

Thanks!

A question about temporal overlap between events.

Thanks for sharing the code and the data of this interesting work.

I have a question after checking some samples in var_*_v1.0.json. The goal of VAR is to infer the explanation from incomplete observations. That is to say, the video content of the "explanation" event should not be exposed to the model, am I right? However, there seems to be temporal overlap between the "explanation" and "observation" events. For example, in the following sample, the index of the "explanation" event is 1, whose timestamp is [3.09, 61.72], while the timestamp of the next clip is [15.43, 55.24]. Is this acceptable?

{
  "events": [
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        0,
        2.78
      ],
      "clip_idx": 0,
      "clip_tot": 6,
      "sentence": "The video starts with a title logo sequence.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        3.09,
        61.72
      ],
      "clip_idx": 1,
      "clip_tot": 6,
      "sentence": "A man and woman are in a living room demonstrating exercises.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        15.43,
        55.24
      ],
      "clip_idx": 2,
      "clip_tot": 6,
      "sentence": "The woman lays on the ground.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        17.59,
        54
      ],
      "clip_idx": 3,
      "clip_tot": 6,
      "sentence": "The man starts pointing to different areas of the woman's body as she does an exercise.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        39.81,
        54.62
      ],
      "clip_idx": 4,
      "clip_tot": 6,
      "sentence": "The woman begins to do small sit ups.",
      "duration": 61.72
    },
    {
      "video_id": "ehGHCYKzyZ8",
      "timestamp": [
        56.47,
        61.72
      ],
      "clip_idx": 5,
      "clip_tot": 6,
      "sentence": "The woman ends with a final title logo sequence.",
      "duration": 61.72
    }
  ],
  "hypothesis": 1,
  "split": "train"
}
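
For anyone who wants to quantify this, here is a small illustrative sketch (not part of the released toolkit) that flags overlapping event timestamps within each example of an annotation file.

# Minimal sketch: report examples whose events have overlapping timestamps.
import json

with open("data/VAR/data/var_train_v1.0.json") as f:
    examples = json.load(f)

for example_id, example in examples.items():
    events = example["events"]
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            # Overlap is only meaningful for events cut from the same source video.
            if events[i]["video_id"] != events[j]["video_id"]:
                continue
            s1, e1 = events[i]["timestamp"]
            s2, e2 = events[j]["timestamp"]
            if min(e1, e2) - max(s1, s2) > 0:
                print(f"{example_id}: events {i} and {j} overlap")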

The raw videos

I have downloaded the raw videos and concatenated them, but I could not obtain all of the videos. Could you provide a way to get them as a zipped archive?
