CVPR 2021 collection:



🌟 ICCV 2021: a continuously updated list of the latest papers and their open-source code!

🚗 ICCV 2021 accepted paper list

🚂 ICCV 2021 talks and demo videos

🚗 Official website:

⏲️ Dates ⌚ Paper acceptance announcement: July 23, 2021

✋ Note: everyone is welcome to open an issue to share ICCV 2021 papers and open-source projects and help improve this project!

✈️ For easy downloading, the papers are stored in a folder; ✔️ indicates that a paper has been downloaded / Paper Download

🔨 Table of Contents (click to jump)


✔️Conformer: Local Features Coupling Global Representations for Visual Recognition

Contextual Convolutional Neural Networks

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Reg-IBP: Efficient and Scalable Neural Network Robustness Training via Interval Bound Propagation

Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?



✔️FineAction: A Fined Video Dataset for Temporal Action Localization

KoDF: A Large-scale Korean DeepFake Detection Dataset

LLVIP: A Visible-infrared Paired Dataset for Low-light Vision

Meta Self-Learning for Multi-Source Domain Adaptation: A Benchmark

✔️MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions

Semantically Coherent Out-of-Distribution Detection

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach

Who's Waldo? Linking People Across Text and Images (Oral)



Asymmetric Loss For Multi-Label Classification

Bias Loss for Mobile Neural Networks

Focal Frequency Loss for Image Reconstruction and Synthesis

Orthogonal Projection Loss

Rank & Sort Loss for Object Detection and Instance Segmentation (Oral)



BN-NAS: Neural Architecture Search with Batch Normalization

BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

CONet: Channel Optimization for Convolutional Neural Networks

FOX-NAS: Fast, On-device and Explainable Neural Architecture Search

Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training Consistency Shift

RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving

Single-DARTS: Towards Stable Architecture Search


Image Classification

Tune It or Don't Use It: Benchmarking Data-Efficient Image Classification


Vision Transformer

AutoFormer: Searching Transformers for Visual Recognition

BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

Conditional DETR for Fast Training Convergence

Fast Convergence of DETR with Spatially Modulated Co-Attention

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers (Oral)

GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer

HiFT: Hierarchical Feature Transformer for Aerial Tracking

High-Fidelity Pluralistic Image Completion with Transformers

Improving 3D Object Detection with Channel-wise Transformer

Is it Time to Replace CNNs with Transformers for Medical Images?

Learning Spatio-Temporal Transformer for Visual Tracking

MUSIQ: Multi-scale Image Quality Transformer

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (Oral)

PlaneTR: Structure-Guided Transformers for 3D Plane Recovery

PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (Oral)

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Rethinking and Improving Relative Position Encoding for Vision Transformer

Rethinking Spatial Dimensions of Vision Transformers

Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer

SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

SOTR: Segmenting Objects with Transformers

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers

The Right to Talk: An Audio-Visual Transformer Approach

TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

TransFER: Learning Relation-aware Facial Expression Representations with Transformers

TransPose: Keypoint Localization via Transformer

✔️Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

✔️Visual Transformer with Statistical Test for COVID-19 Classification

Vision Transformer with Progressive Sampling

Visual Saliency Transformer

Vision-Language Transformer and Query Generation for Referring Segmentation


Object Detection

Active Learning for Deep Object Detection via Probabilistic Modeling

Boosting Weakly Supervised Object Detection via Learning Bounding Box Adjusters

Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery

Conditional Variational Capsule Network for Open Set Recognition

DetCo: Unsupervised Contrastive Learning for Object Detection

DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection

Deployment of Deep Neural Networks for Object Detection on Edge AI Devices with Runtime Optimization

Detecting Invisible People

FMODetect: Robust Detection and Trajectory Estimation of Fast Moving Objects

GraphFPN: Graph Feature Pyramid Network for Object Detection

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Oriented R-CNN for Object Detection

Rank & Sort Loss for Object Detection and Instance Segmentation (Oral)

Reconcile Prediction Consistency for Balanced Object Detection

Vector-Decomposed Disentanglement for Domain-Invariant Object Detection


Salient Object Detection

Disentangled High Quality Salient Object Detection

Specificity-preserving RGB-D Saliency Detection


3D Object Detection

Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather

LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector

Improving 3D Object Detection with Channel-wise Transformer

Is Pseudo-Lidar needed for Monocular 3D Object detection?

ODAM: Object Detection, Association, and Mapping using Posed RGB Video (Oral)

RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection

Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency


Object Tracking

DepthTrack: Unveiling the Power of RGBD Tracking

Exploring Simple 3D Multi-Object Tracking for Autonomous Driving

Is First Person Vision Challenging for Object Tracking?

Learning to Track Objects from Unlabeled Videos

Learn to Match: Automatic Matching Network Design for Visual Tracking

Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths

Saliency-Associated Object Tracking

Video Annotation for Visual Tracking via Selection and Refinement


Image Semantic Segmentation

Complementary Patch for Weakly Supervised Semantic Segmentation

Calibrated Adversarial Refinement for Stochastic Semantic Segmentation

Deep Metric Learning for Open World Semantic Segmentation

Dual Path Learning for Domain Adaptation of Semantic Segmentation

Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Exploring Cross-Image Pixel Contrast for Semantic Segmentation (Oral)

Enhanced Boundary Learning for Glass-like Object Segmentation

From Contexts to Locality: Ultra-high Resolution Image Segmentation via Locality-aware Contextual Correlation

ISNet: Integrate Image-Level and Semantic-Level Context for Semantic Segmentation

Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation

Labels4Free: Unsupervised Segmentation using StyleGAN

LabOR: Labeling Only if Required for Domain Adaptive Semantic Segmentation

Learning Meta-class Memory for Few-Shot Semantic Segmentation

Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation

Mining Contextual Information Beyond Image for Semantic Segmentation

Mining Latent Classes for Few-shot Segmentation (Oral)

Multi-Target Adversarial Frameworks for Domain Adaptation in Semantic Segmentation

Multi-Anchor Active Domain Adaptation for Semantic Segmentation (Oral)

Personalized Image Semantic Segmentation

Pixel Contrastive-Consistent Semi-Supervised Semantic Segmentation

Pseudo-mask Matters in Weakly-supervised Semantic Segmentation

RECALL: Replay-based Continual Learning in Semantic Segmentation

Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation (Oral)

Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models

Self-Regulation for Semantic Segmentation

Semantic Concentration for Domain Adaptation

ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation

Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer

SOTR: Segmenting Objects with Transformers

Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation

The Marine Debris Dataset for Forward-Looking Sonar Semantic Segmentation

Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals

Weakly Supervised Temporal Anomaly Segmentation with Dynamic Time Warping


Semantic Scene Segmentation

BiMaL: Bijective Maximum Likelihood Approach to Domain Adaptation in Semantic Scene Segmentation


3D Semantic Segmentation

VMNet: Voxel-Mesh Network for Geodesic-aware 3D Semantic Segmentation


3D Instance Segmentation

Hierarchical Aggregation for 3D Instance Segmentation

Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks


Instance Segmentation

CDNet: Centripetal Direction Network for Nuclear Instance Segmentation

✔️Crossover Learning for Fast Online Video Instance Segmentation

✔️Instances as Queries

Rank & Sort Loss for Object Detection and Instance Segmentation (Oral)


Video Segmentation

Domain Adaptive Video Segmentation via Temporal Consistency Regularization

Full-Duplex Strategy for Video Object Segmentation

Hierarchical Memory Matching Network for Video Object Segmentation

Joint Inductive and Transductive Learning for Video Object Segmentation


Medical Image Segmentation

Recurrent Mask Refinement for Few-Shot Medical Image Segmentation


Medical Image Analysis

Studying the Effects of Self-Attention for Medical Image Analysis



3DStyleNet: Creating 3D Shapes with Geometric and Texture Style Variations (Oral)

AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer

Click to Move: Controlling Video Generation with Sparse Motion

Disentangled Lifespan Face Synthesis

Dual Projection Generative Adversarial Networks for Conditional Image Generation

EigenGAN: Layer-Wise Eigen-Learning for GANs

GAN Inversion for Out-of-Range Images with Geometric Transformations

Generative Models for Multi-Illumination Color Constancy

Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs

InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images

Learning to Diversify for Single Domain Generalization

Manifold Matching via Deep Metric Learning for Generative Modeling

Meta Gradient Adversarial Attack

Online Multi-Granularity Distillation for GAN Compression

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation

PixelSynth: Generating a 3D-Consistent Experience from a Single Image

SemIE: Semantically-Aware Image Extrapolation

SketchLattice: Latticed Representation for Sketch Manipulation

Sketch Your Own GAN

Target Adaptive Context Aggregation for Video Scene Graph Generation

Toward Spatially Unbiased Generative Models

Towards Vivid and Diverse Image Colorization with Generative Color Prior

Unconditional Scene Graph Generation

Unsupervised Geodesic-preserved Generative Adversarial Networks for Unconstrained 3D Pose Transfer


Style Transfer

Domain-Aware Universal Style Transfer


Fine-Grained Visual Categorization

Benchmark Platform for Ultra-Fine-Grained Visual Categorization Beyond Human Performance

Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach


Multi-Label Recognition

Residual Attention: A Simple but Effective Method for Multi-Label Recognition


Long-Tailed Recognition

ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot (Oral)


Geometric Deep Learning

Manifold Matching via Deep Metric Learning for Generative Modeling

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation


Zero/Few Shot

Binocular Mutual Learning for Improving Few-shot Classification

Boosting the Generalization Capability in Cross-Domain Few-shot Learning via Noise-enhanced Supervised Autoencoder

Discriminative Region-based Multi-Label Zero-Shot Learning

Domain Generalization via Gradient Surgery

Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Few-Shot Batch Incremental Road Object Detection via Detector Fusion

Field-Guide-Inspired Zero-Shot Learning

Few-shot Visual Relationship Co-localization

Generalized Source-free Domain Adaptation

Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting

Relational Embedding for Few-Shot Classification

SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation

Transductive Few-Shot Classification on the Oblique Manifold

Visual Domain Adaptation for Monocular Depth Estimation on Resource-Constrained Hardware



Adversarial Robustness for Unsupervised Domain Adaptation

Collaborative Unsupervised Visual Representation Learning from Decentralized Data

Instance Similarity Learning for Unsupervised Feature Representation

Skeleton Cloud Colorization for Unsupervised 3D Action Representation Learning

Unsupervised Dense Deformation Embedding Network for Template-Free Shape Correspondence

Tune it the Right Way: Unsupervised Validation of Domain Adaptation via Soft Neighborhood Density



Digging into Uncertainty in Self-supervised Multi-view Stereo

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

Focus on the Positives: Self-Supervised Learning for Biodiversity Monitoring

Improving Self-supervised Learning with Hardness-aware Dynamic Curriculum Learning: An Application to Digital Pathology

Meta Self-Learning for Multi-Source Domain Adaptation: A Benchmark

Reducing Label Effort: Self-Supervised meets Active Learning

Self-supervised Neural Networks for Spectral Snapshot Compressive Imaging

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Self-Supervised Video Representation Learning with Meta-Contrastive Network

SSH: A Self-Supervised Framework for Image Harmonization


Semi Supervised

Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning


Weakly Supervised

A Weakly Supervised Amodal Segmenter with Boundary Uncertainty Estimation

Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization


Active Learning

Influence Selection for Active Learning

Action Recognition

Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Elaborative Rehearsal for Zero-shot Action Recognition

✔️FineAction: A Fined Video Dataset for Temporal Action Localization

✔️MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions

Spatio-Temporal Dynamic Inference Network for Group Activity Recognition

Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition


Temporal Action Localization

Enriching Local and Global Contexts for Temporal Action Localization

Boundary-sensitive Pre-training for Temporal Localization in Videos


Sign Language Recognition

Visual Alignment Constraint for Continuous Sign Language Recognition


Hand Pose Estimation

HandFoldingNet: A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton


Pose Estimation

2D Pose Estimation

Hand-Object Contact Consistency Reasoning for Human Grasps Generation

Human Pose Regression with Residual Log-likelihood Estimation (Oral)

Online Knowledge Distillation for Efficient Pose Estimation

TransPose: Keypoint Localization via Transformer

3D Pose Estimation

EventHPE: Event-based 3D Human Pose and Shape Estimation

DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders (Oral)

FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration

Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation


PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop


6D Object Pose Estimation

RePOSE: Real-Time Iterative Rendering and Refinement for 6D Object Pose Estimation

SO-Pose: Exploiting Self-Occlusion for Direct 6D Pose Estimation


Human Reconstruction

ARCH++: Animation-Ready Clothed Human Reconstruction Revisited

imGHUM: Implicit Generative Models of 3D Human Shape and Articulated Pose

Learning Motion Priors for 4D Human Body Capture in 3D Scenes (Oral)

Probabilistic Modeling for Human Mesh Recovery


3D Scene Understanding

DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization (Oral)


Face Recognition

Masked Face Recognition Challenge: The InsightFace Track Report

Masked Face Recognition Challenge: The WebFace260M Track Report

PASS: Protected Attribute Suppression System for Mitigating Bias in Face Recognition

SynFace: Face Recognition with Synthetic Data

Unravelling the Effect of Image Distortions for Biased Prediction of Pre-trained Face Recognition Models


Face Reconstruction

Towards High Fidelity Monocular Face Reconstruction with Rich Reflectance using Self-supervised Learning and Ray Tracing


Facial Expression Recognition

TransFER: Learning Relation-aware Facial Expression Representations with Transformers

Understanding and Mitigating Annotation Bias in Facial Expression Recognition



ASMR: Learning Attribute-Based Person Search with Adaptive Semantic Margin Regularizer

Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

IDM: An Intermediate Domain Module for Domain Adaptive Person Re-ID (Oral)

Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences

Learning Instance-level Spatial-Temporal Patterns for Person Re-identification

Learning Compatible Embeddings

Multi-Expert Adversarial Attack Detection in Person Re-identification Using Context Inconsistency

Towards Discriminative Representation Learning for Unsupervised Person Re-identification

TransReID: Transformer-based Object Re-Identification

Video-based Person Re-identification with Spatial and Temporal Memory Networks


Pedestrian Detection

MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?


Crowd Counting

Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework (Oral)

Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting

Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting


Motion Forecasting

Generating Smooth Pose Sequences for Diverse Human Motion Prediction

MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction

RAIN: Reinforced Hybrid Attention Inference Network for Motion Forecasting


Pedestrian Trajectory Prediction

DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets

MG-GAN: A Multi-Generator Model Preventing Out-of-Distribution Samples in Pedestrian Trajectory Prediction




3D High-Fidelity Mask Face Presentation Attack Detection Challenge

Exploring Temporal Coherence for More General Video Face Forgery Detection




Adversarial Attacks

A Hierarchical Assessment of Adversarial Severity

AdvDrop: Adversarial Attack to DNNs by Dropping Information

AGKD-BML: Defense Against Adversarial Attack by Attention Guided Knowledge Distillation and Bi-directional Metric Learning

Optical Adversarial Attack

Sample Efficient Detection and Classification of Adversarial Attacks via Self-Supervised Embeddings

TkML-AP: Adversarial Attacks to Top-k Multi-Label Learning


Cross-Modal Retrieval

Wasserstein Coupled Graph Learning for Cross-Modal Retrieval

Depth Estimation

AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network

Fine-grained Semantics-aware Representation Enhancement for Self-supervised Monocular Depth Estimation (Oral)

Motion Basis Learning for Unsupervised Deep Homography Estimation with Subspace Projection

Regularizing Nighttime Weirdness: Efficient Self-supervised Monocular Depth Estimation in the Dark

Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers

Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation

SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting (Oral)

StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation


Video Frame Interpolation

Asymmetric Bilateral Motion Estimation for Video Frame Interpolation

✔️XVFI: eXtreme Video Frame Interpolation (Oral)


Video Reasoning

The Multi-Modal Video Reasoning and Analyzing Competition



GNeRF: GAN-based Neural Radiance Field without Posed Camera

In-Place Scene Labelling and Understanding with Implicit Scene Representation (Oral)

KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo (Oral)

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis

Self-Calibrating Neural Radiance Fields

UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction (Oral)


Shadow Removal

CANet: A Context-Aware Network for Shadow Removal


Image Retrieval

DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models



Designing a Practical Degradation Model for Deep Blind Image Super-Resolution

Dual-Camera Super-Resolution with Aligned Attention Modules

Generalized Real-World Super-Resolution through Adversarial Robustness

Learning for Scale-Arbitrary Super-Resolution from Scale-Specific Networks

Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation


Image Reconstruction

Equivariant Imaging: Learning Beyond the Range Space (Oral)

Spatially-Adaptive Image Restoration using Distortion-Guided Networks


Image Deblurring

Single Image Defocus Deblurring Using Kernel-Sharing Parallel Atrous Convolutions


Image Denoising

Deep Reparametrization of Multi-Frame Super-Resolution and Denoising (Oral)

ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models (Oral)

Rethinking Deep Image Prior for Denoising


Image Desnowing

ALL Snow Removed: Single Image Desnowing Algorithm Using Hierarchical Dual-tree Complex Wavelet Representation and Contradict Channel Loss


Image Enhancement

Gap-closing Matters: Perceptual Quality Assessment and Optimization of Low-Light Image Enhancement

Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables


Image Matching

Effect of Parameter Optimization on Classical and Learning-based Image Matching Methods


Image Quality

MUSIQ: Multi-scale Image Quality Transformer


Image Compression

Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform


Image Inpainting

Image Inpainting via Conditional Texture and Structure Dual Generation


Video Inpainting

Internal Video Inpainting by Implicit Long-range Propagation

Occlusion-Aware Video Object Inpainting


Video Recognition

Searching for Two-Stream Models in Multivariate Space for Video Recognition



Multi-scale Matching Networks for Semantic Correspondence


Hand-object Interaction

✔️CPF: Learning a Contact Potential Field to Model the Hand-object Interaction

Exploiting Scene Graphs for Human-Object Interaction Detection

Spatially Conditioned Graphs for Detecting Human–Object Interactions


Gaze Estimation

Generalizing Gaze Estimation with Outlier-guided Collaborative Adaptation



Improving Contrastive Learning by Visualizing Feature Transformation

Social NCE: Contrastive Learning of Socially-aware Motion Representations

Parametric Contrastive Learning


Graph Convolution Networks

MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction



Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks



Distance-aware Quantization

Dynamic Network Quantization for Efficient Video Inference

Generalizable Mixed-Precision Quantization via Attribution Rank Preservation


Knowledge Distillation

Distilling Holistic Knowledge with Graph Neural Networks

Lipschitz Continuity Guided Knowledge Distillation

G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-guided Feature Imitation


Point Cloud

A Robust Loss for Point Cloud Registration

A Technical Survey and Evaluation of Traditional Point Cloud Clustering Methods for LiDAR Panoptic Segmentation

(Just) A Spoonful of Refinements Helps the Registration Error Go Down (Oral)

ABD-Net: Attention Based Decomposition Network for 3D Point Cloud Decomposition

AdaFit: Rethinking Learning-based Normal Estimation on Point Clouds

Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

CPFN: Cascaded Primitive Fitting Networks for High-Resolution Point Clouds

DRINet: A Dual-Representation Iterative Learning Network for Point Cloud Segmentation

Learning Inner-Group Relations on Point Clouds

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

ME-PCN: Point Completion Conditioned on Mask Emptiness

MVP Benchmark: Multi-View Partial Point Clouds for Completion and Registration

Out-of-Core Surface Reconstruction via Global TGV Minimization

PICCOLO: Point Cloud-Centric Omnidirectional Localization

PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (Oral)

ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation

SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer

Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds

Towards Efficient Point Cloud Graph Neural Networks Through Architectural Simplification

Unsupervised Learning of Fine Structure Generation for 3D Point Clouds by 2D Projection Matching

Unsupervised Point Cloud Pre-Training via View-Point Occlusion, Completion

Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Learned Virtual View Visibility

Voxel-based Network for Shape Completion by Leveraging Edge Generation

Walk in the Cloud: Learning Curves for Point Clouds Shape Analysis


3D Reconstruction

3D Shapes Local Geometry Codes Learning with SDF

3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces

DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to the Third Dimension

Learning Anchored Unsigned Distance Functions with Gradient Direction Alignment for Single-view Garment Reconstruction

Pixel-Perfect Structure-from-Motion with Featuremetric Refinement (Oral)

VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction


Font Generation

✔️Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts


Text Detection

Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection


Text Recognition

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition


Scene Text Recognizer

Data Augmentation for Scene Text Recognition

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network



End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

FOVEA: Foveated Image Magnification for Autonomous Navigation

Learning to drive from a world on rails

MultiSiam: Self-supervised Multi-instance Siamese Representation Learning for Autonomous Driving


Safety-aware Motion Prediction with Unseen Vehicles for Autonomous Driving





Anomaly Detection

DRÆM -- A discriminatively trained reconstruction embedding for surface anomaly detection

Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning


Cross-Camera Convolutional Color Constancy

Learnable Boundary Guided Adversarial Training

Prior-Enhanced network with Meta-Prototypes (PEMP)

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Generalized-Shuffled-Linear-Regression (Oral)

VLGrammar: Grounded Grammar Induction of Vision and Language

A New Journey from SDRTV to HDRTV

IICNet: A Generic Framework for Reversible Image Conversion

Structure-Preserving Deraining with Residue Channel Prior Guidance

Learning with Noisy Labels via Sparse Regularization

Neural Strokes: Stylized Line Drawing of 3D Shapes

COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

RINDNet: Edge Detection for Discontinuity in Reflectance, Illumination, Normal and Depth

ELLIPSDF: Joint Object Pose and Shape Optimization with a Bi-level Ellipsoid and Signed Distance Function Description

Unlimited Neighborhood Interaction for Heterogeneous Trajectory Prediction

CanvasVAE: Learning to Generate Vector Graphic Documents

Refining activation downsampling with SoftPool

Aligning Latent and Image Spaces to Connect the Unconnectable

Unifying Nonlocal Blocks for Neural Networks

SLAMP: Stochastic Latent Appearance and Motion Prediction

TransForensics: Image Forgery Localization with Dense Self-Attention

Learning Facial Representations from the Cycle-consistency of Face

NASOA: Towards Faster Task-oriented Online Fine-tuning with a Zoo of Models

Impact of Aliasing on Generalization in Deep Convolutional Networks

Learning Canonical 3D Object Representation for Fine-Grained Recognition

UniNet: A Unified Scene Understanding Network and Exploring Multi-Task Relationships through the Lens of Adversarial Attacks

SUNet: Symmetric Undistortion Network for Rolling Shutter Correction

Learning to Cut by Watching Movies

Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Towards Interpretable Deep Metric Learning with Structural Matching

m-RevNet: Deep Reversible Neural Networks with Momentum

DiagViB-6: A Diagnostic Benchmark Suite for Vision Models in the Presence of Shortcut and Generalization Opportunities

perf4sight: A toolflow to model CNN training performance on Edge GPUs

MT-ORL: Multi-Task Occlusion Relationship Learning

ProAI: An Efficient Embedded AI Hardware for Automotive Applications - a Benchmark Study

SPACE: A Simulator for Physical Interactions and Causal Learning in 3D Environments

CODEs: Chamfer Out-of-Distribution Examples against Overconfidence Issue

Towards Real-World Prohibited Item Detection: A Large-Scale X-ray Benchmark

Pixel Difference Networks for Efficient Edge Detection

Online Continual Learning For Visual Food Classification

DICOM Imaging Router: An Open Deep Learning Framework for Classification of Body Parts from DICOM X-ray Scans

PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation

Learning to Automatically Diagnose Multiple Diseases in Pediatric Chest Radiographs Using Deep Convolutional Neural Networks

FaPN: Feature-aligned Pyramid Network for Dense Image Prediction

Finding Representative Interpretations on Convolutional Neural Networks

Investigating transformers in the decomposition of polygonal shapes as point collections

Self-Supervised Pretraining and Controlled Augmentation Improve Rare Wildlife Recognition in UAV Images

Group-aware Contrastive Regression for Action Quality Assessment

End-to-End Dense Video Captioning with Parallel Decoding

PR-RRN: Pairwise-Regularized Residual-Recursive Networks for Non-rigid Structure-from-Motion

Scene Designer: a Unified Model for Scene Search and Synthesis from Sketch

Structured Outdoor Architecture Reconstruction by Exploration and Classification

Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision

Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Deep Hybrid Self-Prior for Full 3D Mesh Generation

FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning

Thermal Image Processing via Physics-Inspired Deep Networks

A New Journey from SDRTV to HDRTV

Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs

Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates

LOKI: Long Term and Key Intentions for Trajectory Prediction

Stochastic Scene-Aware Motion Prediction

Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes

Social Fabric: Tubelet Compositions for Video Relation Detection

Causal Attention for Unbiased Visual Recognition

Universal Cross-Domain Retrieval: Generalizing Across Classes and Domains

Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain

Learning to Match Features with Seeded Graph Matching Network

A Unified Objective for Novel Class Discovery

How to cheat with metrics in single-image HDR reconstruction

Towards Understanding the Generative Capability of Adversarially Robust Classifiers (Oral)

Airbert: In-domain Pretraining for Vision-and-Language Navigation

Out-of-boundary View Synthesis Towards Full-Frame Video Stabilization

PatchMatch-RL: Deep MVS with Pixelwise Depth, Normal, and Visibility

Continual Learning for Image-Based Camera Localization

Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data

Detecting and Segmenting Adversarial Graphics Patterns from Images

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

BlockCopy: High-Resolution Video Processing with Block-Sparse Feature Propagation and Online Policies

Learning Signed Distance Field for Multi-view Surface Reconstruction (Oral)

Deep Relational Metric Learning

Ranking Models in Unlabeled New Environments

Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

LSD-StructureNet: Modeling Levels of Structural Detail in 3D Part Hierarchies

BiaSwap: Removing dataset bias with bias-tailored swapping augmentation

LoOp: Looking for Optimal Hard Negative Embeddings for Deep Metric Learning

Learning of Visual Relations: The Devil is in the Tails

Bridging Unsupervised and Supervised Depth from Focus via All-in-Focus Supervision

Support-Set Based Cross-Supervision for Video Grounding

Fast Robust Tensor Principal Component Analysis via Fiber CUR Decomposition

Improving Generalization of Batch Whitening by Convolutional Unit Optimization

CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing

NGC: A Unified Framework for Learning with Open-World Noisy Data

LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation

Learning Cross-modal Contrastive Features for Video Domain Adaptation

Lifelong Infinite Mixture Model Based on Knowledge-Driven Dirichlet Process

A Dual Adversarial Calibration Framework for Automatic Fetal Brain Biometry

LUAI Challenge 2021 on Learning to Understand Aerial Images

Embedding Novel Views in a Single JPEG Image

Learning to Discover Reflection Symmetry via Polar Matching Convolution

Deep 3D Mask Volume for View Synthesis of Dynamic Scenes

Cross-category Video Highlight Detection via Set-based Learning

Sparse to Dense Motion Transfer for Face Image Animation

SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos

4D-Net for Learned Multi-Modal Alignment

The Power of Points for Modeling Humans in Clothing

The Functional Correspondence Problem

On the Limits of Pseudo Ground Truth in Visual Camera Re-localisation

Towards Learning Spatially Discriminative Feature Representations


