Authors: Yi Zhang, Fang-Yi Chao, Ge-Peng Ji, Deng-Ping Fan, Lu Zhang, Ling Shao.
Figure 1: Annotation examples from the proposed ASOD60K dataset. (a) Illustration of head movement (HM). The subjects wear Head-Mounted Displays (HMDs) and observe 360° scenes by moving their heads to control a field-of-view (FoV) within the 360°×180° range. (b) Each subject (i.e., Subject 1 to Subject N) watches the video without restriction. (c) The HMD-embedded eye tracker records their eye fixations. (d) Guided by the fixations, we provide coarse-to-fine annotations for each FoV, including (e) super-/sub-classes, instance-level masks, and attributes (e.g., GD-Geometrical Distortion).
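As the caption notes, an equirectangular (ER) frame covers the full 360°×180° sphere. A minimal sketch of the standard mapping from a viewing direction to ER pixel coordinates (an illustrative assumption, not code from this repository; the function name is hypothetical):

```python
def dir_to_er_pixel(lon_deg, lat_deg, width, height):
    """Map a viewing direction -- longitude in [-180, 180] and latitude
    in [-90, 90] degrees -- to (x, y) pixel coordinates on an
    equirectangular frame. Assumes (lon=0, lat=0) is the image centre
    and latitude increases upward."""
    x = (lon_deg + 180.0) / 360.0 * (width - 1)
    y = (90.0 - lat_deg) / 180.0 * (height - 1)
    return x, y
```

For a 4K ER frame (3840×1920), the forward-looking direction (0°, 0°) lands at the image centre, while (-180°, 90°) maps to the top-left corner.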
Exploring what humans pay attention to in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD), which mimics the human attention mechanism by segmenting salient objects with the guidance of audio-visual cues. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, distinguishing itself by its richness, diversity, and quality. Specifically, each sequence is marked with both its super-/sub-class, with objects of each sub-class further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye fixation behaviors of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study can serve as a good starting point for advancing SOD research towards panoramic videos.
🏃 🏃 🏃 KEEP UPDATING.
Figure 2: Summary of widely used salient object detection (SOD) datasets and the proposed panoramic video SOD (PV-SOD) dataset. #Img: The number of images/frames. #GT: The number of ground-truth masks. Pub. = Publication. Obj.-Level = Object-Level. Ins.-Level = Instance-Level. Fix.GT = Fixation-guided ground truths. † denotes equirectangular (ER) images.
Figure 3: Examples of challenging attributes on equirectangular (ER) images from our ASOD60K, with instance-level GT and fixations as annotation guidance. f_k, f_l, and f_m denote random frames of a given video.
Figure 4: Annotation quality control. Examples of passed and rejected annotations.
Figure 5: Attribute descriptions and statistics. (a)/(b) show the correlation and frequency of ASOD60K's attributes, respectively.
Figure 6: Statistics of the proposed ASOD60K. (a) Super-/sub-category information. (b) Instance density of each sub-class. (c) Main components of ASOD60K scenes.
Figure 7: Performance comparison of 7/3 state-of-the-art conventional I-SOD/V-SOD methods and one PI-SOD method on ASOD60K. ↑/↓ indicates that a larger/smaller value is better. The best result in each column is bolded.
Figure 8: Performance comparison of 7/3/1 state-of-the-art I-SOD/V-SOD/PI-SOD methods based on each of the attributes.
No. | Year | Pub. | Title | Links |
---|---|---|---|---|
01 | 2019 | IEEE CVPR | Cascaded Partial Decoder for Fast and Accurate Salient Object Detection | Paper/Project |
02 | 2019 | IEEE ICCV | Stacked Cross Refinement Network for Edge-Aware Salient Object Detection | Paper/Project |
03 | 2020 | AAAI | F3Net: Fusion, Feedback and Focus for Salient Object Detection | Paper/Project |
04 | 2020 | IEEE CVPR | Multi-scale Interactive Network for Salient Object Detection | Paper/Project |
05 | 2020 | IEEE CVPR | Label Decoupling Framework for Salient Object Detection | Paper/Project |
06 | 2020 | ECCV | Highly Efficient Salient Object Detection with 100K Parameters | Paper/Project |
07 | 2020 | ECCV | Suppress and Balance: A Simple Gated Network for Salient Object Detection | Paper/Project |
08 | 2019 | IEEE CVPR | See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks | Paper/Project |
09 | 2019 | IEEE ICCV | Semi-Supervised Video Salient Object Detection Using Pseudo-Labels | Paper/Project |
10 | 2020 | AAAI | Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection | Paper/Project |
11 | 2020 | IEEE SPL | FANet: Features Adaptation Network for 360° Omnidirectional Salient Object Detection | Paper/Project |
All quantitative results were computed with the one-click Python evaluation toolbox: https://github.com/zzhanghub/eval-co-sod .
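The toolbox computes the standard SOD metrics reported in Figure 7. As an illustration only (not the toolbox's implementation), two of the common metrics, MAE and F-measure, can be sketched as follows:

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between a predicted saliency map and a
    ground-truth mask, both scaled to [0, 1] (lower is better)."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F-measure at a fixed binarization threshold, with beta^2 = 0.3
    as is conventional in the SOD literature (higher is better)."""
    binarized = pred >= thresh
    tp = np.logical_and(binarized, gt > 0.5).sum()
    precision = tp / max(binarized.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    denom = beta2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta2) * precision * recall / denom
```

A perfect prediction yields MAE of 0 and F-measure of 1; the toolbox additionally reports structure- and enhanced-alignment-based measures, which are more involved.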
The whole object-/instance-level ground truth with the default split can be downloaded from Baidu Drive (fetch code: k3h8) or Google Drive.
The videos with default split can be downloaded from Google Drive or OneDrive.
The head movement and eye fixation data can be downloaded from Google Drive.
To generate video frames, please refer to video_to_frames.py.
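The repo's `video_to_frames.py` handles frame extraction; purely as an illustrative sketch (it may differ from the actual script), frames can be dumped with ffmpeg like so, where `extract_frames_cmd` and its parameters are hypothetical names:

```python
import subprocess
from pathlib import Path

def extract_frames_cmd(video_path, out_dir, fps=None):
    """Build an ffmpeg command that dumps a video to numbered PNG
    frames in out_dir; pass fps to down-sample via the fps filter."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = ["ffmpeg", "-i", str(video_path)]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]  # keep only `fps` frames per second
    cmd += [str(out_dir / "frame_%05d.png")]
    return cmd

def extract_frames(video_path, out_dir, fps=None):
    """Run the command; requires ffmpeg on PATH."""
    subprocess.run(extract_frames_cmd(video_path, out_dir, fps), check=True)
```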
To get access to raw videos on YouTube, please refer to video_seq_link.
To check basic information regarding the raw videos, please refer to video_information.txt (continuously updated).
Please feel free to drop an e-mail to [email protected] for questions or further discussion.
If you have any questions about the head movement and eye fixation data, please contact [email protected].
@article{zhang2021asod60k,
title={ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos},
author={Zhang, Yi and Chao, Fang-Yi and Ji, Ge-Peng and Fan, Deng-Ping and Zhang, Lu and Shao, Ling},
journal={arXiv preprint arXiv:2107.11629},
year={2021}
}