Code Monkey home page Code Monkey logo

semseg's Introduction

semseg

master

语义图像分割,为图像中的每个像素分配语义标签(例如“道路”,“天空”,“人”,“狗”)的任务使得能够实现许多新应用,例如Pixel 2和Pixel 2 XL智能手机的纵向模式中提供的合成浅景深效果和移动实时视频分割。

引用自Semantic Image Segmentation with DeepLab in TensorFlow

本仓库的开发计划见项目下一步开发计划

下面将近期主要的论文整理表格以供后面进一步总结。


网络实现

  • FCN(VGG和ResNet的骨干网络),已实现,参考fcn_understanding
  • RefineNet,已实现,参考refinenet_understanging
  • DUC,参考duc_understanding
  • DRN,已实现
  • PSPNet,参考pspnet_understanding
  • ENet,已实现
  • ErfNet,已实现
  • EDANet,已实现
  • LinkNet,已实现,参考pytorch-linknet
  • FC-DenseNet,已实现,参考fcdensenet_understanding
  • LRN,已实现,但是没有增加多分辨率loss训练,后期增加。
  • BiSeNet,已实现,主要是ResNet-18和ResNet-101,其余类似。
  • FRRN,已实现,FRRN A和FRRN B。
  • 增加YOLO-V1多任务学习,还未完全测试。
  • GCN
  • ...

semantic segmentation algorithms

这个仓库旨在实现常用的语义分割算法,主要参考如下:


相关论文


弱监督语义分割

  • Generating Self-Guided Dense Annotations for Weakly Supervised Semantic Segmentation

实例分割

目前暂且收集相关实例分割到语义分割目录中,待综述完成单独分离。

  • Semantic Instance Segmentation with a Discriminative Loss Function

数据集实现


数据集增加

通过仿射变换来实现数据集增加的方法扩充语义分割数据集。


依赖

  • pytorch
  • ...

数据


用法

# 在tmux或者另一个终端中开启可视化服务器visdom
python -m visdom.server
# 然后在浏览器中查看127.0.0.1:9097
  • 训练
# 训练模型
python train.py
  • 校验
# 校验模型
python validate.py

ENet可视化结果

以下是相关语义分割论文粗略整理。


ShuffleSeg: Real-time Semantic Segmentation Network

摘要
Real-time semantic segmentation is of significant importance for mobile and robotics related applications. We propose a computationally efficient segmentation network which we term as ShuffleSeg. The proposed architecture is based on grouped convolution and channel shuffling in its encoder for improving the performance. An ablation study of different decoding methods is compared including Skip architecture, UNet, and Dilation Frontend. Interesting insights on the speed and accuracy tradeoff is discussed. It is shown that skip architecture in the decoding method provides the best compromise for the goal of real-time performance, while it provides adequate accuracy by utilizing higher resolution feature maps for a more accurate segmentation. ShuffleSeg is evaluated on CityScapes and compared against the state of the art real-time segmentation networks. It achieves 2x GFLOPs reduction, while it provides on par mean intersection over union of 58.3% on CityScapes test set. ShuffleSeg runs at 15.7 frames per second on NVIDIA Jetson TX2, which makes it of great potential for real-time applications.
会议/期刊 作者 论文 代码
arXiv: 1803.03816 Mostafa Gamal, Mennatullah Siam, Moemen Abdel-Razek ShuffleSeg: Real-time Semantic Segmentation Network TFSegmentation

本文提出了一种基于ShuffleNet的实时语义分割网络,通过在编码器中使用grouped convolution和channle shuffling(ShuffleNet基本结构),同时用不同的解码方法,包括Skip架构,UNet和Dilation前端探索了精度和速度的平衡。

主要动机是:

It was shown in [4][2][3] that depthwise separable convolution or grouped convolution reduce the computational cost, while maintaining good representation capability.

训练的trciks:充分利用CityScapes数据集,将其中粗略标注的图像作为网络预训练,然后基于精细标注的图像作为网络微调。


RTSeg: Real-time Semantic Segmentation Comparative Study

摘要
Semantic segmentation benefits robotics related applications especially autonomous driving. Most of the research on semantic segmentation is only on increasing the accuracy of segmentation models with little attention to computationally efficient solutions. The few work conducted in this direction does not provide principled methods to evaluate the different design choices for segmentation. In this paper, we address this gap by presenting a real-time semantic segmentation benchmarking framework with a decoupled design for feature extraction and decoding methods. The framework is comprised of different network architectures for feature extraction such as VGG16, Resnet18, MobileNet, and ShuffleNet. It is also comprised of multiple meta-architectures for segmentation that define the decoding methodology. These include SkipNet, UNet, and Dilation Frontend. Experimental results are presented on the Cityscapes dataset for urban scenes. The modular design allows novel architectures to emerge, that lead to 143x GFLOPs reduction in comparison to SegNet.
会议/期刊 作者 论文 代码
arXiv: 1803.02758 Mennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin Jagersand RTSeg: Real-time Semantic Segmentation Comparative Study TFSegmentation

和ShuffleSeg: Real-time Semantic Segmentation Network同一作者。

本文整体思路和ShuffleSeg类同,只不过更加抽象了编码器解码器,这里的编码器不再仅仅是ShuffleNet,而是增加了VGG16,Resnet18,MobileNet,方便了后期不同基础网络性能的比较。


SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling

摘要
We propose a novel deep architecture, SegNet, for semantic pixel wise image labelling. SegNet has several attractive properties; (i) it only requires forward evaluation of a fully learnt function to obtain smooth label predictions, (ii) with increasing depth, a larger context is considered for pixel labelling which improves accuracy, and (iii) it is easy to visualise the effect of feature activation(s) in the pixel label space at any depth.
SegNet is composed of a stack of encoders followed by a corresponding decoder stack which feeds into a soft-max classification layer. The decoders help map low resolution feature maps at the output of the encoder stack to full input image size feature maps. This addresses an important drawback of recent deep learning approaches which have adopted networks designed for object categorization for pixel wise labelling. These methods lack a mechanism to map deep layer feature maps to input dimensions. They resort to ad hoc methods to upsample features, e.g. by replication. This results in noisy predictions and also restricts the number of pooling layers in order to avoid too much upsampling and thus reduces spatial context. SegNet overcomes these problems by learning to map encoder outputs to image pixel labels. We test the performance of SegNet on outdoor RGB scenes from CamVid, KITTI and indoor scenes from the NYU dataset. Our results show that SegNet achieves state-of-the-art performance even without use of additional cues such as depth, video frames or post-processing with CRF models.
会议/期刊 作者 论文 代码
arXiv: 1505.07293 Vijay Badrinarayanan, Ankur Handa, Roberto Cipolla SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling caffe-segnet

本文为SegNet-Basic,基本思路就是编码器-解码器架构,指出当前语义分割方法都缺少一个机制将深度特征图map到输入维度的机制,基本都是特定的上采样特征方法,比如复制。


Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

摘要
We present a deep learning framework for probabilistic pixel-wise semantic segmentation, which we term Bayesian SegNet. Semantic segmentation is an important tool for visual scene understanding and a meaningful measure of uncertainty is essential for decision making. Our contribution is a practical system which is able to predict pixelwise class labels with a measure of model uncertainty. We achieve this by Monte Carlo sampling with dropout at test time to generate a posterior distribution of pixel class labels. In addition, we show that modelling uncertainty improves segmentation performance by 2-3% across a number of state of the art architectures such as SegNet, FCN and Dilation Network, with no additional parametrisation. We also observe a significant improvement in performance for smaller datasets where modelling uncertainty is more effective. We benchmark Bayesian SegNet on the indoor SUN Scene Understanding and outdoor CamVid driving scenes datasets.
会议/期刊 作者 论文 代码
arXiv: 1511.02680 Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understandin caffe-segnet

本文主要提出了一种基于概率的像素级语义分割框架Bayesian SegNet,通过建模模型不确定性能够在许多网络中都提升2-3%性能,如SegNet,FCN和Dilation网络。


SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

摘要
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance.
SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
会议/期刊 作者 论文 代码
arXiv: 1511.00561 Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation caffe-segnet

本文提出的SegNet是应用最为广泛的架构,其中SegNet-VGG16在性能和精度上都获得了较大的提升,主要指出了解码器使用的反池化操作。


U-Net: Convolutional Networks for Biomedical Image Segmentation

摘要
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffee) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.
会议/期刊 作者 论文 代码
arXiv: 1505.04597 Olaf Ronneberger, Philipp Fischer, Thomas Brox U-Net: Convolutional Networks for Biomedical Image Segmentation unet第三方

本文提出的U-Net网络能够有效利用标注样本,通过a symmetric expanding path提升分割精度。

semseg's People

Contributors

guanfuchen avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

semseg's Issues

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

related paper

摘要
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18* faster, requires 75* less FLOPs, has 79* less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

Dual Attention Network for Scene Segmentation

related paper

摘要
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the selfattention mechanism. Unlike previous works that capture contexts by multi-scale features fusion, we propose a Dual Attention Networks (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data. we make the code and trained models publicly available at https://github.com/junfu1115/DANet

3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

related paper

摘要
We present 3DMV, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network. In contrast to existing methods that either use geometry or RGB data as input for this task, we combine both data modalities in a joint, end-to-end network architecture. Rather than simply projecting color data into a volumetric grid and operating solely in 3D – which would result in insufficient detail – we first extract feature maps from associated RGB images. These features are then mapped into the volumetric feature grid of a 3D network using a differentiable back-projection layer. Since our target is 3D scanning scenarios with possibly many frames, we use a multi-view pooling approach in order to handle a varying number of RGB input views. This learned combination of RGB and geometric features with our joint 2D-3D architecture achieves significantly better results than existing baselines. For instance, our final result on the ScanNet 3D segmentation benchmark [1] increases from 52.8% to 75% accuracy compared to existing volumetric architectures.

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

related paper

摘要
State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model and, optionally, (c) a post-processing module (e.g. Conditional Random Fields) to refine the model predictions.
Recently, a new CNN architecture, Densely Connected Convolutional Networks (DenseNets), has shown excellent results on image classification tasks. The idea of DenseNets is based on the observation that if each layer is directly connected to every other layer in a feed-forward fashion then the network will be more accurate and easier to train.
In this paper, we extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining. Moreover, due to smart construction of the model, our approach has much less parameters than currently published best entries for these datasets. Code to reproduce the experiments is publicly available here : https://github.com/SimJeg/FC-DenseNet

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

related paper

摘要
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context atmultiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphicalmodels. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected ConditionalRandom Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed “DeepLab” system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU inthe test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our codeis made publicly available online.

项目下一步开发计划

删除线表示任务完成或基本完成。

  • 增加DALI处理数据增广
  • 增加多机多卡训练
  • 代码重构,参考Python风格规范
  • 增加学习率调整率策略和优化器选择,增加step学习率,由于要多个step间隔,可以用list但是比较麻烦,预期后面增加文本配置文件
    参考schedulersoptimizers
    参考StepLR
  • 将LRN模型多loss损失进行复现
  • 总结3D相关的语义分割综述
  • add benchmark for different models #9
  • 增加模型DeConvolution,参考 #48
  • 增加模型FRRN A和FRRN B,参考 #50
  • 增加实现另一种权重损失函数bootstrapped_cross_entropy2d,参考 #49
  • 增加GPU资源不足时用梯度累加近似多batch训练,参考博客PyTorch 大批量数据在单个或多个 GPU 训练指南,仍在调试测试性能,测试没有问题,但是训练需要注意,这种操作相当于增加了batch,需要设置相当的学习率,原先batch size为1时学习率设置较小1e-4,如果用了grad_acc_steps=5那么可以设置稍大的学习率1e-5
  • 查看LightNet,看看相关实现以及其他相关论文架构代码。
  • 预计增加future semantic segmentation, 参考Future Semantic Segmentation with Convolutional LSTM
  • 目前仓库由于增加太多内容比较乱,将主要重构一下,增加input_channel适应不同输入通道
  • 增加DenseASPP模块,参考 #27 ,参考实现DenseASPP
  • 阅读Wider or Deeper: Revisiting the ResNet Model for Visual Recognition论文并实现相关结构,参考 #15 ,下一步实现Wider or Deeper分类网络和对应的分割网络
  • 阅读Large Kernel Matters Improve Semantic Segmentation by Global Convolutional Network并实现相关网络结构,参考 #51
  • 参考resnet.py,将原先的fcn_resnet的思路用反卷积,DeconvBottleneck和Bottleneck的方法实现
  • 阅读Exploring Context with Deep Structured models for Semantic Segmentation论文
  • 阅读Context Encoding for Semantic Segmentation论文并实现相关结构,参考 #52
  • 重构代码PSPNet,可参考模型PyTorch-Encoding,可以参考pytorch-segmentation,其中有PSPNet,UNet和DeepLabV3的实现
  • 参考(pt_mobilenetv2_deeplabv3)[https://github.com/jinfagang/pt_mobilenetv2_deeplabv3]实现基于mobilenetv2的deeplabv3模型
  • ...

Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds

related paper

摘要
Deep learning approaches have made tremendous progress in the field of semantic segmentation over the past few years. However, most current approaches operate in the 2D image space. Direct semantic segmentation of unstructured 3D point clouds is still an open research problem. The recently proposed PointNet architecture presents an interesting step ahead in that it can operate on unstructured point clouds, achieving decent segmentation results. However, it subdivides the input points into a grid of blocks and processes each such block individually. In this paper, we investigate the question how such an architecture can be extended to incorporate larger-scale spatial context. We build upon PointNet and propose two extensions that enlarge the receptive field over the 3D scene. We evaluate the proposed strategies on challenging indoor and outdoor datasets and show improved results in both scenarios.

image

Scale-adaptive Convolutions for Scene Parsing

related paper

摘要
Many existing scene parsing methods adopt Convolutional Neural Networks with fixed-size receptive fields, which frequently result in inconsistent predictions of large objects and invisibility of small objects. To tackle this issue, we propose a scale-adaptive convolution to acquire flexible-size receptive fields during scene parsing. Through adding a new scale regression layer, we can dynamically infer the position-adaptive scale coefficients which are adopted to resize the convolutional patches. Consequently, the receptive fields can be adjusted automatically according to the various sizes of the objects in scene images. Thus, the problems of invisible small objects and inconsistent large-object predictions can be alleviated. Furthermore, our proposed scale-adaptive convolutions are not only differentiable to learn the convolutional parameters and scale coefficients in an end-to-end way, but also of high parallelizability for the convenience of GPU implementation. Additionally, since the new scale regression layers are learned implicitly, any extra training supervision of object sizes is unnecessary. Extensive experiments on Cityscapes and ADE20K datasets well demonstrate the effectiveness of the proposed scale-adaptive convolutions.

image


概述

许多已有的场景解析方法都采用了固定大小感受野的CNN,导致了大目标分割的不一致性和小目标不可见的问题。本文提出了尺度自适应卷积来获取灵活大小的感受野。

Tree-structured Kronecker Convolutional Networks for Semantic Segmentation

related paper

摘要
Most existing semantic segmentation methods employ atrous convolution to enlarge the receptive field of filters, but neglect partial information. To tackle this issue, we firstly propose a novel Kronecker convolution which adopts Kronecker product to expand the standard convolutional kernel for taking into account the partial feature neglected by atrous convolutions. Therefore, it can capture partial information and enlarge the receptive field of filters simultaneously without introducing extra parameters. Secondly, we propose Tree-structured Feature Aggregation (TFA) module which follows a recursive rule to expand and forms a hierarchical structure. Thus, it can naturally learn representations of multi-scale objects and encode hierarchical contextual information in complex scenes. Finally, we design Tree-structured Kronecker Convolutional Networks (TKCN) which employs Kronecker convolution and TFA module. Extensive experiments on three datasets, PASCAL VOC 2012, PASCAL-Context and Cityscapes, verify the effectiveness of our proposed approach. We make the code and the trained model publicly available at https://github.com/wutianyiRosun/TKCN.

image


概述

动机是空洞卷积虽然能增大感受野,但是忽略了局部信息,本文提出的Kronecker卷积能扩展标准卷积考虑被空洞卷积忽略的局部信息。另外还提出了TFA模块(树形结构特征聚合)遵循recursive rule来expand和form级联结构。

ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network

related paper

摘要
We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2 , for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the PenTree bank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MSCOCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2.

Weakly- and Semi-Supervised Panoptic Segmentation

related paper

摘要
We present a weakly supervised model that jointly performs both semantic- and instance-segmentation – a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotation for these tasks. In contrast to many popular instance segmentation approaches based on object detectors, our method does not predict any overlapping instances. Moreover, we are able to segment both “thing” and “stuff” classes, and thus explain all the pixels in the image. “Thing” classes are weakly-supervised with bounding boxes, and “stuff” with image-level tags. We obtain state-of-the-art results on Pascal VOC, for both full and weak supervision (which achieves about 95% of fullysupervised performance). Furthermore, we present the first weakly-supervised results on Cityscapes for both semanticand instance-segmentation. Finally, we use our weakly supervised framework to analyse the relationship between annotation quality and predictive performance, which is of interest to dataset creators.

Multi-Scale Context Aggregation by Dilated Convolutions

related paper

摘要
State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction problems such as semantic segmentation are structurally different from image classification. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.

The UAVid Dataset for Video Semantic Segmentation

related paper

摘要
Video semantic segmentation has been one of the research focus in computer vision recently. It serves as a perception foundation for many fields such as robotics and autonomous driving. The fast development of semantic segmentation attributes enormously to the large scale datasets, especially for the deep learning related methods. Currently, there already exist several semantic segmentation datasets for complex urban scenes, such as the Cityscapes and CamVid datasets. They have been the standard datasets for comparison among semantic segmentation methods. In this paper, we introduce a new high resolution UAV video semantic segmentation dataset as complement, UAVid. Our UAV dataset consists of 30 video sequences capturing high resolution images. In total, 300 images have been densely labelled with 8 classes for urban scene understanding task. Our dataset brings out new challenges. We provide several deep learning baseline methods, among which the proposed novel Multi-Scale-Dilation net performs the best via multi-scale feature extraction. We have also explored the usability of sequence data by leveraging on CRF model in both spatial and temporal domain.

DenseASPP for Semantic Segmentation in Street Scenes

related paper

摘要
Semantic image segmentation is a basic street scene understanding task in autonomous driving, where each pixel in a high resolution image is categorized into a set of semantic labels. Unlike other scenarios, objects in autonomous driving scene exhibit very large scale changes, which poses great challenges for high-level feature representation in a sense that multi-scale information must be correctly encoded. To remedy this problem, atrous convolution[14] was introduced to generate features with larger receptive fields without sacrificing spatial resolution. Built upon atrous convolution, Atrous Spatial Pyramid Pooling (ASPP)[2] was proposed to concatenate multiple atrous-convolved features using different dilation rates into a final feature representation. Although ASPP is able to generate multi-scale features, we argue the feature resolution in the scale-axis is not dense enough for the autonomous driving scenario. To this end, we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size. We evaluate DenseASPP on the street scene benchmark Cityscapes[4] and achieve state-of-the-art performance.

CCNet: Criss-Cross Attention for Semantic Segmentation

related paper

摘要
Long-range dependencies can capture useful contextual information to benefit visual understanding problems. In this work, we propose a Criss-Cross Network (CCNet) for obtaining such important information through a more effective and efficient way. Concretely, for each pixel, our CCNet can harvest the contextual information of its surrounding pixels on the criss-cross path through a novel crisscross attention module. By taking a further recurrent operation, each pixel can finally capture the long-range dependencies from all pixels. Overall, our CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the nonlocal block in computing long-range dependencies. 3) The state-of-the-art performance. We conduct extensive experiments on popular semantic segmentation benchmarks including Cityscapes, ADE20K, and instance segmentation benchmark COCO. In particular, our CCNet achieves the mIoU score of 81.4 and 45.22 on Cityscapes test set and ADE20K validation set, respectively, which are the new state-of-the-art results. We make the code publicly available at https://github.com/speedinghzl/CCNet.

Speeding up Semantic Segmentation for Autonomous Driving

related paper

摘要
Deep learning has considerably improved semantic image segmentation. However, its high accuracy is traded against larger computational costs which makes it unsuitable for embedded devices in self-driving cars. We propose a novel deep network architecture for image segmentation that keeps the high accuracy while being efficient enough for embedded devices. The architecture consists of ELU activation functions, a SqueezeNet-like encoder, followed by parallel dilated convolutions, and a decoder with SharpMask-like refinement modules. On the Cityscapes dataset, the new network achieves higher segmentation accuracy than other networks that are tailored to embedded devices. Simultaneously the frame-rate is still sufficiently high for the deployment in autonomous vehicles.

Pyramid Attention Network for Semantic Segmentation

related paper

摘要
A Pyramid Attention Network(PAN) is proposed to exploit the impact of global contextual information in semantic segmentation. Different from most existing works, we combine attention mechanism and spatial pyramid to extract precise dense features for pixel labeling instead of complicated dilated convolution and artificially designed decoder networks. Specifically, we introduce a Feature Pyramid Attention module to perform spatial pyramid attention structure on high-level output and combine global pooling to learn a better feature representation, and a Global Attention Upsample module on each decoder layer to provide global context as a guidance of low-level features to select category localization details. The proposed approach achieves state-of-the-art performance on PASCAL VOC 2012 and Cityscapes benchmarks with a new record of mIoU accuracy 84.0% on PASCAL VOC 2012, while training without COCO dataset.

Rethinking Atrous Convolution for Semantic Image Segmentation

related paper

摘要
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed ‘DeepLabv3’ system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

CGNet: A Light-weight Context Guided Network for Semantic Segmentation

related paper

摘要
The demand of applying semantic segmentation model on mobile devices has been increasing rapidly. Current state-of-the-art networks have enormous amount of parameters hence unsuitable for mobile devices, while other small memory footprint models ignore the inherent characteristic of semantic segmentation. To tackle this problem, we propose a novel Context Guided Network (CGNet), which is a light-weight network for semantic segmentation on mobile devices. We first propose the Context Guided (CG) block, which learns the joint feature of both local feature and surrounding context, and further improves the joint feature with the global context. Based on the CG block, we develop Context Guided Network (CGNet), which captures contextual information in all stages of the network and is specially tailored for increasing segmentation accuracy. CGNet is also elaborately designed to reduce the number of parameters and save memory footprint. Under an equivalent number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing, CGNet achieves 64.8% mean IoU on Cityscapes with less than 0.5 M parameters, and has a frame-rate of 50 fps on one NVIDIA Tesla K80 card for 2048 × 1024 high-resolution images. The source code for the complete system are publicly available.

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

related paper

摘要
Recently, very deep convolutional neural networks (CNNs) have shown outstanding performance in object recognition and have also been the first choice for dense classification problems such as semantic segmentation. However, repeated subsampling operations like pooling or convolution striding in deep CNNs lead to a significant decrease in the initial image resolution. Here, we present RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions. The individual components of RefineNet employ residual connections following the identity mapping mindset, which allows for effective end-to-end training. Further, we introduce chained residual pooling, which captures rich background context in an efficient manner. We carry out comprehensive experiments and set new state-of-the-art results on seven public datasets. In particular, we achieve an intersection-over-union score of 83.4 on the challenging PASCAL VOC 2012 dataset, which is the best reported result to date.

Semantic object classes in video: A high-definition ground truth database

related paper

摘要
Visual object analysis researchers are increasingly experimenting with video, because it is expected that motion cues should help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.
The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high quality 30 Hz footage is being provided, with corresponding semantically labeled images at 1 Hz and in part, 15 Hz.
The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.

Path Aggregation Network for Instance Segmentation

related paper

摘要
The way that information propagates in neural networks is of great importance. In this paper, we propose Path Aggregation Network (PANet) aiming at boosting information flow in proposal-based instance segmentation framework. Specifically, we enhance the entire feature hierarchy with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. We present adaptive feature pooling, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.
These improvements are simple to implement, with subtle extra computational overhead. Our PANet reaches the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes. Code is available at https://github.com/ShuLiu1993/PANet.

Autofocus Layer for Semantic Segmentation

related paper

摘要
We propose the autofocus convolutional layer for semantic segmentation with the objective of enhancing the capabilities of neural networks for multi-scale processing. Autofocus layers adaptively change the size of the effective receptive field based on the processed context to generate more powerful features. This is achieved by parallelising multiple convolutional layers with different dilation rates, combined by an attention mechanism that learns to focus on the optimal scales driven by context. By sharing the weights of the parallel convolutions we make the network scale-invariant, with only a modest increase in the number of parameters. The proposed autofocus layer can be easily integrated into existing networks to improve a model’s representational power. We evaluate our models on the challenging tasks of multi-organ segmentation in pelvic CT and brain tumor segmentation in MRI and achieve very promising performance.

add benchmark for different models

add scripts for different models performance comparing, maybe like below

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
XXX XXX XXX XXX XXX
GTX 1080 Ti XXX XXX XXX XXX
CPU: Dual Xeon E5-2630 v3 XXX XXX XXX XXX

convlstm

请问你有用pytorch复现convlstm了吗?我最近在做这方面的事情,希望交流一下。邮箱:[email protected]

Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes

related paper

摘要
Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pretrained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pretraining, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.

Experiment About Different Models On CamVid

drnseg_a_asymmetric_18

model dataset data_augment solver lr mIoU on test mIoU on val
drnseg_a_asymmetric_18 CamVid True Adam 1e-4 Polynomial 0.52 0.64
test
('FreqW Acc : \t', 0.7480140451512624)
('Overall Acc: \t', 0.8485693848354793)
('Mean Acc : \t', 0.6691773717182845)
('Mean IoU : \t', 0.5200477414567214)
(0, 0.8703268482111521)
(1, 0.7444923275238022)
(2, 0.220143968582406)
(3, 0.9006616267566392)
(4, 0.7280678329717994)
(5, 0.67441106348017)
(6, 0.30791116914622374)
(7, 0.27702273663274146)
(8, 0.7131527308405102)
(9, 0.3732300272115494)
(10, 0.43115256612366437)
(11, 0.0)

val
('FreqW Acc : \t', 0.8724396532274968)
('Overall Acc: \t', 0.926297384946828)
('Mean Acc : \t', 0.7673188682618776)
('Mean IoU : \t', 0.6425360284469198)
(0, 0.9255498218344776)
(1, 0.875076442288688)
(2, 0.16499666386237707)
(3, 0.9663822151276108)
(4, 0.8731120297982072)
(5, 0.9046534009726946)
(6, 0.4412606220136763)
(7, 0.6346996998696124)
(8, 0.7766318674956714)
(9, 0.43558687104360516)
(10, 0.7124827070564164)
(11, 0.0)

loss_iteration

image

loss_epoch

image

mIoU_epoch

image

example visualization

image

Improving Semantic Segmentation via Video Propagation and Label Relaxation

related paper

摘要
Semantic segmentation requires large amounts of pixelwise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models’ ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018. Our code and videos can be found at https://nv-adlr.github.io/publication/2018-Segmentation.

The PASCAL Visual Object Classes Challenge: A Retrospective

related paper

摘要
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

CFUN: Combining Faster R-CNN and U-net Network for Efficient Whole Heart Segmentation

related paper

摘要
In this paper,we propose a novel heart segmentation pipeline Combining Faster R-CNN and U-net Network (CFUN). Due to Faster R-CNN’s precise localization ability and U-net’s powerful segmentation ability, CFUN needs only one-step detection and segmentation inference to get the whole heart segmentation result, obtaining good results with significantly reduced computational cost. Besides, CFUN adopts a new loss function based on edge information named 3D Edge-loss as an auxiliary loss to accelerate the convergence of training and improve the segmentation results. Extensive experiments on the public dataset show that CFUN exhibits competitive segmentation performance in a sharply reduced inference time. Our source code and the model are publicly available at https://github.com/Wuziyi616/CFUN.

image


概述

组合Faster R-CNN(定位能力)和U-Net(分割能力)用于心脏分割,提出了3D Edge-loss作为辅助损失加速收敛提升分割结果。

Bridging Category-level and Instance-level Semantic Image Segmentation

related paper

摘要
We propose an approach to instance-level image segmentation that is built on top of category-level segmentation. Specifically, for each pixel in a semantic category mask, its corresponding instance bounding box is predicted using a deep fully convolutional regression network. Thus it follows a different pipeline to the popular detect-then-segment approaches that first predict instances’ bounding boxes, which are the current state-of-the-art in instance segmentation. We show that, by leveraging the strength of our state-of-the-art semantic segmentation models, the proposed method can achieve comparable or even better results to detect-then-segment approaches. We make the following contributions. (i) First, we propose a simple yet effective approach to semantic instance segmentation. (ii) Second, we propose an online bootstrapping method during training, which is critically important for achieving good performance for both semantic category segmentation and instance-level segmentation. (iii) As the performance of semantic category segmentation has a significant impact on the instance-level segmentation, which is the second step of our approach, we train fully convolutional residual networks to achieve the best semantic category segmentation accuracy. On the PASCAL VOC 2012 dataset, we obtain the currently best mean intersection-over-union score of 79.1%. (iv) We also achieve state-of-the-art results for instance-level segmentation.

ShelfNet for Real-time Semantic Segmentation

related paper

摘要
In this project, we present ShelfNet, a lightweight convolutional neural network for accurate real-time semantic segmentation. Different from the standard encoder-decoder structure, ShelfNet has multiple encoder-decoder branch pairs with skip connections at each spatial level, which looks like a shelf with multiple columns. The shelf-shaped structure provides multiple paths for information flow and improves segmentation accuracy. Inspired by the success of recurrent convolutional neural networks, we use modified residual blocks where two convolutional layers share weights. The shared-weight block enables efficient feature extraction and model size reduction. We tested ShelfNet with ResNet50 and ResNet101 as the backbone respectively: they achieved 59 FPS and 42 FPS respectively on a GTX 1080Ti GPU with a 512 × 512 input image. ShelfNet achieved high accuracy: on PASCAL VOC 2012 test set, it achieved 84.2% mIoU with ResNet101 backbone and 82.8% mIoU with ResNet50 backbone; it achieved 75.8% mIoU with ResNet50 backbone on Cityscapes dataset. ShelfNet achieved both higher mIoU and faster inference speed compared with state-of-the-art real-time semantic segmentation models. We provide the implementation https://github.com/juntang-zhuang/ShelfNet.

Light-Weight RefineNet for Real-Time Semantic Segmentation

related paper

摘要
We consider an important task of effective and efficient semantic image segmentation. In particular, we adapt a powerful semantic segmentation architecture, called RefineNet [46], into the more compact one, suitable even for tasks requiring real-time performance on high-resolution inputs. To this end, we identify computationally expensive blocks in the original setup, and propose two modifications aimed to decrease the number of parameters and floating point operations. By doing that, we achieve more than twofold model reduction, while keeping the performance levels almost intact. Our fastest model undergoes a significant speed-up boost from 20 FPS to 55 FPS on a generic GPU card on 512 × 512 inputs with solid 81.1% mean iou performance on the test set of PASCAL VOC [18], while our slowest model with 32 FPS (from original 17 FPS) shows 82.7% mean iou on the same dataset. Alternatively, we showcase that our approach is easily mixable with light-weight classification networks: we attain 79.2% mean iou on PASCAL VOC using a model that contains only 3.3M parameters and performs only 9.3B floating point operations.

Semantic image segmentation with deep convolutional nets and fully connected CRFs

related paper

摘要
Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called ”semantic image segmentation”). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our “DeepLab” system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy in the test set. We show how these results can be obtained efficiently: Careful network re-purposing and a novel application of the ’hole’ algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

related paper

摘要
Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048×1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.

image


概述

本文的动机是语义分割通常同时要求空间信息和相当大的感受野,然而现代的方法通常为了获得实时推理速度妥协了空间分辨率导致了较差的性能。本文提出双向分割网络(BiSeNet),即一个空间Path(较小的stride)保留空间信息生成高分辨率特征,另一个上下文Path(快速下采样)高效获取感受野。基于这两个Path,引入了一种新的特征融合模块高效组合特征。

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net

related paper

摘要
Convolutional neural networks (CNNs) have achieved great successes in many computer vision problems. Unlike existing works that designed CNN architectures to improve performance on a single task of a single domain and not generalizable, we present IBN-Net, a novel convolutional architecture, which remarkably enhances a CNN’s modeling ability on one domain (e.g. Cityscapes) as well as its generalization capacity on another domain (e.g. GTA5) without finetuning. IBN-Net carefully integrates Instance Normalization (IN) and Batch Normalization (BN) as building blocks, and can be wrapped into many advanced deep networks to improve their performances. This work has three key contributions. (1) By delving into IN and BN, we disclose that IN learns features that are invariant to appearance changes, such as colors, styles, and virtuality/reality, while BN is essential for preserving content related information. (2) IBN-Net can be applied to many advanced deep architectures, such as DenseNet, ResNet, ResNeXt, and SENet, and consistently improve their performance without increasing computational cost. 1 (3) When applying the trained networks to new domains, e.g. from GTA5 to Cityscapes, IBN-Net achieves comparable improvements as domain adaptation methods, even without using data from the target domain. With IBN-Net, we won the 1st place on the WAD 2018 Challenge Drivable Area track, with an mIoU of 86.18%.

Adaptive Affinity Fields for Semantic Segmentation

related paper

摘要
Semantic segmentation has made much progress with increasingly powerful pixel-wise classifiers and incorporating structural priors via Conditional Random Fields (CRF) or Generative Adversarial Networks (GAN). We propose a simpler alternative that learns to verify the spatial structure of segmentation during training only. Unlike existing approaches that enforce semantic labels on individual pixels and match labels between neighbouring pixels, we propose the concept of Adaptive Affinity Fields (AAF) to capture and match the semantic relations between neighbouring pixels in the label space. We use adversarial learning to select the optimal affinity field size for each semantic category. It is formulated as a minimax problem, optimizing our segmentation neural network in a best worst-case learning scenario. AAF is versatile for representing structures as a collection of pixel-centric relations, easier to train than GAN and more efficient than CRF without run-time inference. Our extensive evaluations on PASCAL VOC 2012, Cityscapes, and GTA5 datasets demonstrate its above-par segmentation performance and robust generalization across domains.

ICNet for Real-Time Semantic Segmentation on High-Resolution Images

related paper

摘要
We focus on the challenging task of realtime semantic segmentation in this paper. It finds many practical applications and yet is with fundamental difficulty of reducing a large portion of computation for pixel-wise label inference. We propose an compressed-PSPNet-based image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide in-depth analysis of our framework and introduce the cascade feature fusion to quickly achieve high-quality segmentation. Our system yields realtime inference on a single GPU card with decent quality results evaluated on challenging Cityscapes dataset.

Wider or deeper: Revisiting the resnet model for visual recognition

related paper

摘要
The trend towards increasingly deep neural networks has been driven by a general observation that increasing depth increases the performance of a network. Recently, however, evidence has been amassing that simply increasing depth may not be the best way to increase performance, particularly given other limitations. Investigations into deep residual networks have also suggested that they may not in fact be operating as a single deep network, but rather as an ensemble of many relatively shallow networks. We examine these issues, and in doing so arrive at a new interpretation of the unravelled view of deep residual networks which explains some of the behaviours that have been observed experimentally. As a result, we are able to derive a new, shallower, architecture of residual networks which significantly outperforms much deeper models such as ResNet-200 on the ImageNet classification dataset. We also show that this performance is transferable to other problem domains by developing a semantic segmentation approach which outperforms the state-of-the-art by a remarkable margin on datasets including PASCAL VOC, PASCAL Context, and Cityscapes. The architecture that we propose thus outperforms its comparators, including very deep ResNets, and yet is more efficient in memory use and sometimes also in training time. The code and models are available at https://github.com/itijyou/ademxapp.

The Cityscapes Dataset for Semantic Urban Scene Understanding

related paper

摘要
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20 000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.

Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation

related paper

摘要
Real-time semantic segmentation plays an important role in practical applications such as self-driving and robots. Most research working on semantic segmentation focuses on accuracy with little consideration for efficiency. Several existing studies that emphasize high-speed inference often cannot produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named Efficient Dense modules with Asymmetric convolution (EDANet), which employs an asymmetric convolution structure incorporating the dilated convolution and the dense connectivity to attain high efficiency at low computational cost, inference time, and model size. Compared to FCN, EDANet is 11 times faster and has 196 times fewer parameters, while it achieves a higher the mean of intersection-over-union (mIoU) score without any additional decoder structure, context module, post-processing scheme, and pretrained model. We evaluate EDANet on Cityscapes and CamVid datasets to evaluate its performance and compare it with the other state-of-art systems. Our network can run on resolution 512×1024 inputs at the speed of 108 and 81 frames per second on a single GTX 1080Ti and Titan X, respectively.

semantic segmentation benchmark

Here is some paper about semantic segmentation, and I list some performance for comparing. I will check the performance again.

image

Learning Deconvolution Network for Semantic Segmentation

related paper

摘要
We propose a novel semantic segmentation algorithm by learning a deep deconvolution network. We learn the network on top of the convolutional layers adopted from VGG 16-layer net. The deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. We apply the trained network to each proposal in an input image, and construct the final semantic segmentation map by combining the results from all proposals in a simple manner. The proposed algorithm mitigates the limitations of the existing methods based on fully convolutional networks by integrating deep deconvolution network and proposal-wise prediction; our segmentation method typically identifies detailed structures and handles objects in multiple scales naturally. Our network demonstrates outstanding performance in PASCAL VOC 2012 dataset, and we achieve the best accuracy (72.5%) among the methods trained without using Microsoft COCO dataset through ensemble with the fully convolutional network.

image

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.