Neural Semantic Segmentation

Implementations of neural network papers for semantic segmentation using Keras and TensorFlow.

Predictions from Tiramisu on CamVid video stream.

Installation

To install requirements for the project:

python -m pip install -r requirements.txt

Hardware Specification

Results were generated using a machine equipped with 128GB RAM, nVidia P100 GPU, and Intel Xeon CPU @ 2.10GHz. All results shown are from the testing dataset.

CamVid

32 classes generalized to 11 classes using mapping in 11_class.txt
- use 12 labels and ignore the Void class (i.e., 11 labels)
960 x 720 scaled down by factor of 2 to 480 x 360

Models

SegNet

The following table describes training hyperparameters.

Crop Size	Epochs	Batch Size	Patience	Optimizer	α	𝛃	α Decay
352 x 480	200	8	50	SGD	1e-3	0.9	0.95

batch normalization statistics computed per batch during training and using a rolling average computed over input batches for validation and testing
- original paper uses a static statistics computed over the training data
encoder transfer learning from VGG16 trained on ImageNet
best model in terms of validation accuracy is kept as final model
median frequency balancing of class labels (Eigen et al. 2014)
- weighted categorical cross-entropy loss function
local contrast normalization of inputs (Jarrett et al. 2009)
pooling indexes (Badrinarayanan et al. 2015)

Quantitative Results

The following table outlines the testing results from SegNet.

Metric	Value
Accuracy	0.888625
Mean Per Class Accuracy	0.722078
Mean I/U	0.577455
Bicyclist	0.435263
Building	0.743735
Car	0.700505
Column/Pole	0.254089
Fence	0.385431
Pedestrian	0.393298
Road	0.895652
Sidewalk	0.747693
Sign	0.219355
Sky	0.888208
Vegetation	0.688779

Qualitative Results

Bayesian SegNet

The following table describes training hyperparameters.

Crop Size	Epochs	Batch Size	Patience	Optimizer	α	𝛃	α Decay	Dropout	Samples
352 x 480	200	8	50	SGD	1e-3	0.9	0.95	50%	40

batch normalization statistics computed per batch during training and using a rolling average computed over input batches for validation and testing
- original paper uses a static statistics computed over the training data
encoder transfer learning from VGG16 trained on ImageNet
- note that VGG16 does not have any dropout by default; transfer from a Bayesian VGG16 model could improve results
best model in terms of validation accuracy is kept as final model
median frequency balancing of class labels (Eigen et al. 2014)
- weighted categorical cross-entropy loss function
local contrast normalization of inputs (Jarrett et al. 2009)
pooling indexes (Badrinarayanan et al. 2015)

Quantitative Results

The following table outlines the testing results from Bayesian SegNet.

Metric	Value
Accuracy	0.863547
Mean Per Class Accuracy	0.769486
Mean I/U	0.547227
Bicyclist	0.407042
Building	0.68995
Car	0.678854
Column/Pole	0.206012
Fence	0.376584
Pedestrian	0.305958
Road	0.88796
Sidewalk	0.727901
Sign	0.155895
Sky	0.888182
Vegetation	0.69516

Qualitative Results

The One Hundred Layers Tiramisu

The following table describes training hyperparameters.

Crop Size	Epochs	Batch Size	Patience	Optimizer	α	α Decay	Dropout
224 x 224	200	3	100	RMSprop	1e-3	0.995	20%
352 x 480	200	1	50	RMSprop	1e-4	1.000	20%

random horizontal flips of images during training
- the paper says vertical, but their implementation clearly shows horizontal flips (likely a typo). Horizontal make more sense than vertical anyway and produces empirically better test results
batch normalization statistics computed per batch during training, validation, and testing
skip connections between encoder and decoder (Jégou et al. 2016)

Quantitative Results

The following table outlines the testing results from 103 Layers Tiramisu.

Metric	Value
Accuracy	0.908092
Mean Per Class Accuracy	0.716523
Mean I/U	0.585788
Bicyclist	0.34839
Building	0.775576
Car	0.689861
Column/Pole	0.312897
Fence	0.261254
Pedestrian	0.4299
Road	0.918804
Sidewalk	0.802591
Sign	0.253895
Sky	0.91806
Vegetation	0.732444

Qualitative Results

Bayesian The One Hundred Layers Tiramisu

Bayesian Tiramisu

Aleatoric Uncertainty

The following table describes training hyperparameters.

Crop Size	Epochs	Batch Size	Patience	Optimizer	α	α Decay	Dropout
352 x 480	100	1	10	RMSprop	1e-4	1.000	20%

network split to predict targets and loss attenuation
- custom loss function to train the second head of the network (Kendall et al. 2017)
- our loss function samples through the Softmax function like their paper says (but contrary to the mathematics they present?). without applying the Softmax function, the loss is unstable and goes negative
pre-trained with fine weights from original Tiramisu
pre-trained network frozen while head to predict sigma is trained

Quantitative Results

The quantitative results are the same as the standard Tiramisu model.

Qualitative Results

Epistemic Uncertainty

pre-trained with fine weights from original Tiramisu
50 samples for Monte Carlo Dropout sampling at test time

Quantitative Results

The following table outlines the testing results from Epistemic Tiramisu.

Metric	Value
Accuracy	0.881144
Mean Per Class Accuracy	0.59509
Mean I/U	0.506473
Bicyclist	0.280771
Building	0.734256
Car	0.587708
Column/Pole	0.124245
Fence	0.164669
Pedestrian	0.322883
Road	0.886696
Sidewalk	0.724571
Sign	0.165528
Sky	0.88297
Vegetation	0.696909

Qualitative Results

Wall Clock Inference Time Metrics

The following box plot describes the mean and standard deviation in wall clock time execution of different segmentation models performing inference on images of size 352 x 480 pixels.

The following box plot describes the mean and standard deviation in wall clock time execution of different Bayesian segmentation models performing inference on images of size 352 x 480 pixels. Note that in this case, inference is probabilistic due to the test time dropout and Monte Carlo simulation over 50 network samples.