This is our implementation of the Multi-level Scene Description Network (MSDN) from Scene Graph Generation from Objects, Phrases and Region Captions. The project is based on a PyTorch implementation of Faster R-CNN. (Update: the model links have been updated. Sorry for the inconvenience.)
We have released our newly proposed scene graph generation model in our ECCV-2018 paper:
Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation.
Check the GitHub repo Factorizable Net if you are interested.
- README for training
- README for project settings
- our trained RPN
- our trained Full Model
- Our cleansed Visual Genome Dataset
- training code
- evaluation code
- Model acceleration (please refer to our ECCV project).
- Multi-GPU support
We are still working on the project. If you are interested, please follow our project.
- Install the requirements (you can use pip or Anaconda):
conda install pip pyyaml sympy h5py cython numpy scipy
conda install -c menpo opencv3
conda install -c soumith pytorch torchvision cuda80
pip install easydict
- Clone the MSDN repository:
git clone git@github.com:yikang-li/MSDN.git
- Build the Cython modules for nms and the roi_pooling layer:
cd MSDN/faster_rcnn
./make.sh
cd ..
- Download the trained full model and the trained RPN, and place them in output/trained_model.
- Download our cleansed Visual Genome dataset and extract it:
tar xzvf top_150_50.tgz
- p.s. Our IPython scripts for data cleansing are also released.
- Download the Visual Genome images.
- Place the images and cleansed annotations in the corresponding folders:
mkdir -p data/visual_genome
cd data/visual_genome
ln -s /path/to/VG_100K_images_folder VG_100K_images
ln -s /path/to/downloaded_folder top_150_50
- p.s. You can change the default data directory by modifying __C.IMG_DATA_DIR in faster_rcnn/fast_rcnn/config.py.
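As a quick sanity check on the data layout described above, the following Python sketch reports which of the expected entries are still missing. The function name `missing_data_paths` is hypothetical and not part of the repository; the default paths simply mirror the `mkdir` and `ln -s` commands above and would need adjusting if `__C.IMG_DATA_DIR` has been changed.

```python
import os


def missing_data_paths(root="data/visual_genome",
                       names=("VG_100K_images", "top_150_50")):
    """Return the expected entries under `root` that do not exist.

    Hypothetical helper: the defaults are assumptions taken from the
    setup commands, not values read from the project's config.
    A symlink whose target is missing is also reported, because
    os.path.exists follows symlinks.
    """
    paths = [os.path.join(root, name) for name in names]
    return [p for p in paths if not os.path.exists(p)]


if __name__ == "__main__":
    missing = missing_data_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

Running it from the repository root before training can catch a mistyped symlink target early.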
- Training proceeds in multiple stages. (Single-GPU training may take about one week.)
- Train the RPN for object proposals and caption region proposals (the shared conv layers are fixed). We also provide our pretrained RPN model.
By default, training is done on a small part of the full dataset:
CUDA_VISIBLE_DEVICES=0 python train_rpn.py
To train on the full dataset:
CUDA_VISIBLE_DEVICES=0 python train_rpn_region.py --max_epoch=10 --step_size=2 --dataset_option=normal --model_name=RPN_full_region
--step_size sets the number of epochs between learning-rate decays, and --dataset_option selects the [ small | fat | normal ] subset.
- Training MSDN
Here, we use SGD by default (controlled by --optimizer):
CUDA_VISIBLE_DEVICES=0 python train_hdn.py --load_RPN --saved_model_path=./output/RPN/RPN_region_full_best.h5 --dataset_option=normal --enable_clip_gradient --step_size=2 --MPS_iter=1 --caption_use_bias --caption_use_dropout --rnn_type LSTM_normal
- Alternatively, the model can be trained end-to-end from scratch, but this is not recommended because the results are not good.
CUDA_VISIBLE_DEVICES=0 python train_hdn.py --dataset_option=normal --enable_clip_gradient --step_size=3 --MPS_iter=1 --caption_use_bias --caption_use_dropout --max_epoch=11 --optimizer=1 --lr=0.001
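The `--step_size` option in the training commands above controls how often the learning rate is decayed. A minimal sketch of such a step schedule follows; the decay factor `gamma=0.1` and the base learning rate in the demo loop are assumptions chosen for illustration, and the function `step_decay_lr` is hypothetical rather than taken from the training scripts.

```python
def step_decay_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate at `epoch` when it is multiplied by `gamma`
    once every `step_size` epochs.

    `gamma` is an assumed decay factor, not a value confirmed
    by the repository.
    """
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # With step_size=2, the rate drops after epochs 2, 4, ...
    for epoch in range(6):
        print(epoch, step_decay_lr(0.01, epoch, step_size=2))
```

With `step_size=2`, epochs 0 and 1 share the base rate, epochs 2 and 3 share the first decayed rate, and so on.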
Our pretrained full model is provided for evaluation and further development. (Please download the related files in advance.)
./eval.sh
Currently, the accuracy of our released version differs slightly from the results reported in the paper: Recall@50: 11.705%; Recall@100: 14.085%.
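For readers unfamiliar with the Recall@K figures quoted above, the sketch below shows the general idea for a single image: take the K highest-scoring predicted relationship triplets and measure what fraction of the ground-truth triplets they cover. It is a deliberate simplification that matches triplets by labels only and ignores bounding-box overlap, so it is not the evaluation code behind `./eval.sh`. The function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found in the top-k predictions.

    predictions: list of (score, triplet) pairs.
    ground_truth: iterable of triplets.
    Simplification: triplets match on labels only (no box IoU check).
    """
    gt = set(ground_truth)
    if not gt:
        return 0.0
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = gt & {triplet for _, triplet in top_k}
    return len(hits) / len(gt)


if __name__ == "__main__":
    preds = [
        (0.9, ("man", "riding", "horse")),
        (0.8, ("man", "wearing", "hat")),
        (0.3, ("horse", "on", "grass")),
    ]
    gt = [("man", "riding", "horse"), ("horse", "on", "grass")]
    print(recall_at_k(preds, gt, k=2))  # 0.5
    print(recall_at_k(preds, gt, k=3))  # 1.0
```

A dataset-level figure would typically aggregate this quantity over all test images.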
We thank longcw for generously releasing the PyTorch implementation of Faster R-CNN.
@inproceedings{li2017msdn,
author={Li, Yikang and Ouyang, Wanli and Zhou, Bolei and Wang, Kun and Wang, Xiaogang},
title={Scene graph generation from objects, phrases and region captions},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision},
year = {2017}
}
The pre-trained models and the MSDN technique are released for non-commercial use.
Contact Yikang LI if you have questions.