To solve this challenge I developed a novel approach that combines modern deep learning with classical computer vision techniques, achieving the highest score on the leaderboard at the time of submission. I also implemented a highly parallel CPU/GPU pipeline that achieved the highest FPS among the top 10 contestants.
I evaluated the following methods to use with this challenge:
I used Cityscapes as a common starting point. DeepLab supports Cityscapes out of the box. FCN-Alexnet is my own customized TensorFlow implementation of the paper from fcn.berkeleyvision.org. I had adapted it to train on Cityscapes for earlier work, before this challenge.
This challenge required us to label only three unique classes, so I settled on my FCN approach. Even though FCN is not state of the art for semantic segmentation, I have found that it performs reasonably well with a small number of unique labels. FCN also has modest GPU requirements compared to some other state-of-the-art approaches, which makes it easier to train and faster at inference on the older Nvidia K80 GPU made available for this challenge. For comparison, Google's DeepLabv3+ required 4x Nvidia P100 GPUs before it started to train reasonably on the Lyft dataset.
Please see Appendix A for details about the Experimental Framework.
I implemented a preprocessing routine in Python that applies the same training label to road markings and to the road surface, and then sets the ego car's label to background. I saved the ground truth files with Cityscapes-compatible labels so that they also work with Google DeepLabv3 and its pre-trained Cityscapes weights.
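A minimal sketch of such a relabeling step, assuming the ground truth is a 2-D array of integer class IDs. All ID values and the hood row below are hypothetical placeholders, not the values used in this project:

```python
import numpy as np

# Hypothetical label IDs: replace with the simulator's and Cityscapes' real values.
SIM_ROAD_LINE, SIM_ROAD, SIM_VEHICLE = 6, 7, 10   # assumed simulator IDs
OUT_BACKGROUND, OUT_ROAD, OUT_VEHICLE = 0, 7, 26  # assumed Cityscapes-style IDs
HOOD_TOP_ROW = 496  # assumed first image row covered by the ego car's hood


def remap_labels(sim_labels, hood_top_row=HOOD_TOP_ROW):
    """Map a 2-D array of simulator label IDs to road / vehicle / background."""
    out = np.full(sim_labels.shape, OUT_BACKGROUND, dtype=np.uint8)
    # Road markings receive the same label as the road surface.
    out[(sim_labels == SIM_ROAD) | (sim_labels == SIM_ROAD_LINE)] = OUT_ROAD
    vehicle = sim_labels == SIM_VEHICLE
    # Vehicle pixels in the hood region belong to the ego car: keep them as background.
    vehicle[hood_top_row:, :] = False
    out[vehicle] = OUT_VEHICLE
    return out
```

Everything that is neither road nor vehicle falls through to background, which is what restricts the problem to three classes.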
I modified my FCN Cityscapes implementation to restrict detection to:
- Road
- Vehicle
- Background
I took the initial 1000 training images made available to us and trained for 40 epochs. The result was an average F score of 0.81 at 5 FPS. This established that the training dataset was in the correct format and that the implementation had no major bugs.
I tried classical data augmentation by randomly mirroring the images and jittering them with translations of up to 32 pixels. I found little benefit, and augmentation is also unnecessary here because more training data can easily be generated from the freely available CARLA simulator. I did notice that more training data improves the score. Please see Appendix A.
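For reference, the mirroring and translation jitter described above can be sketched as follows. This is an illustrative version, not the code used for training. The function names and the zero fill value are assumptions:

```python
import numpy as np


def shift(arr, dy, dx, fill=0):
    """Translate arr by (dy, dx) pixels, filling uncovered pixels with `fill`."""
    out = np.full_like(arr, fill)
    h, w = arr.shape[:2]
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = arr[src_y, src_x]
    return out


def augment(image, label, rng, max_shift=32):
    """Randomly mirror and translate an image and its label map together."""
    if rng.random() < 0.5:
        # Horizontal mirror: reverse the width axis of both arrays.
        image, label = image[:, ::-1], label[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return shift(image, dy, dx), shift(label, dy, dx)
```

The image and its label map must receive the same flip and the same offset, otherwise the ground truth no longer lines up with the pixels.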
Eventually I ended up with 10k+ images trained for 200+ epochs, with the following scores:
Class | F score | Precision | Recall |
---|---|---|---|
Car | 0.778 | 0.914 | 0.751 |
Road | 0.987 | 0.998 | 0.944 |
Averaged | 0.883 | | |
This hinted that to move further up the leaderboard I needed to change something else: adding more data and epochs alone was not enough.
Cars occupy a much smaller area of the images than the road does. As I train longer, the network develops a bias towards road.
An FCN for semantic segmentation essentially downsamples an image via pooling and strided convolution and then upsamples it. This causes problems when classifying labels that occupy a small area of the image. My neural network accepts an input size of 576x160. Resizing from the original image dimensions of 800x600 introduces further quantization noise, in which small details in the image are lost. The results above showed that I needed to be smarter about how I resize the image.
I noticed that there is no meaningful detail at the bottom of the images because of the ego vehicle's hood. Similarly, the sky carries no information about cars or road. So I decided to crop my training data and inference images as follows:
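A crop of this kind can be sketched as below. The row bounds are hypothetical placeholders, since the exact crop is not stated in the text. The nearest-neighbour resize is a dependency-free stand-in for a library resize:

```python
import numpy as np

# Hypothetical crop bounds for an 800x600 frame; the real values may differ.
SKY_BOTTOM_ROW = 170   # assumed first row kept below the sky
HOOD_TOP_ROW = 520     # assumed first row showing the hood


def crop_rows(frame, top=SKY_BOTTOM_ROW, bottom=HOOD_TOP_ROW):
    """Keep only the rows between the sky and the hood."""
    return frame[top:bottom]


def resize_nearest(frame, out_h=160, out_w=576):
    """Nearest-neighbour resize to the network input size (height x width)."""
    in_h, in_w = frame.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return frame[rows[:, None], cols]
```

With these assumed bounds, the 160 network rows cover 350 source rows instead of 600, so less vertical detail is thrown away by the resize.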
I trained again and got better results:
Class | F score | Precision | Recall |
---|---|---|---|
Car | 0.841 | 0.928 | 0.821 |
Road | 0.993 | 0.999 | 0.971 |
Averaged | 0.917 | | |
The car F score improved far more than the road score. Even so, cars' relatively poor recall is still pulling the overall score down.
Typically at this point I would try to address the class imbalance in the training data; fully convolutional training can balance classes by weighting or sampling the loss. Considering the final goal of this challenge, however, I thought it might be faster to use a classical computer vision technique, as discussed below.
The score calculation places a higher penalty on failing to classify pixels that belong to a car. Because cars account for far fewer pixels in the training data, and because the FCN introduces quantization, the network struggles to label all car pixels in fine detail. Car has a high precision score, meaning that when there is a car we do label it, but only part of it. I also observed that the detected pixels tend to be in the center of the car.
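To make the recall/precision weighting concrete, here is a sketch of an F-beta score. The beta values (2 for car, 0.5 for road) are an assumption, as they are not stated in this write-up. They do reproduce the first results table to within rounding:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Assumed weighting: recall-heavy for cars, precision-heavy for road.
car_f = f_beta(0.914, 0.751, beta=2.0)
road_f = f_beta(0.998, 0.944, beta=0.5)
average = (car_f + road_f) / 2
```

Under this weighting a missed car pixel costs more than a false car pixel, which is why trading some car precision for recall can raise the overall score.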
So I decided to expand the detected car pixels using a classical computer vision technique called dilation, which expands the detection area uniformly around the detection center. I perform dilation on the final predicted binary image for car only, just before encoding it to JSON. The scores below are with 1 iteration of dilation.
Class | F score | Precision | Recall |
---|---|---|---|
Car | 0.852 | 0.907 | 0.839 |
Road | 0.993 | 0.999 | 0.971 |
Averaged | 0.922 | | |
This improves car recall at the cost of reduced car precision. This is acceptable since the scoring formula favors recall for car and precision for road. I can improve the score further by running multiple iterations of dilation.
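For illustration, binary dilation with a 3x3 cross-shaped neighbourhood can be written in a few lines of NumPy. This is a sketch of the technique, not necessarily the routine used here. In practice a library function such as `scipy.ndimage.binary_dilation` or OpenCV's `cv2.dilate` would normally be used:

```python
import numpy as np


def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 cross: a pixel is set if it or any of its
    four direct neighbours was set in the previous iteration."""
    out = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(out, 1, mode="constant", constant_values=False)
        out = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
               | p[1:-1, :-2] | p[1:-1, 2:])
    return out
```

Each iteration grows every detected region by one pixel in each direction, so the number of iterations directly controls the recall/precision trade-off.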
Other ideas
The final class scores from my FCN are produced by a softmax function, which gives the probability of each pixel belonging to each class that we care about. I feed these to an argmax function to choose the highest-probability class. The next item on my list was to tweak the final class decision to use a lower threshold for cars. Another option was to change my cross-entropy loss function to match Lyft's ranking formula. Since I had already made it to the top at this point, I didn't try these; instead I decided to invest my time in improving the run-time performance of the model.
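The lower-threshold idea was not tried, so the following is only a sketch of what it might look like. The channel order and the threshold value are assumptions:

```python
import numpy as np

BACKGROUND, ROAD, CAR = 0, 1, 2   # assumed channel order of the softmax output


def decide(probs, car_threshold=0.3):
    """probs: (H, W, 3) softmax output. Returns an (H, W) array of class ids."""
    labels = probs.argmax(axis=-1)
    # Override the argmax wherever the car probability clears the lower bar.
    labels[probs[..., CAR] >= car_threshold] = CAR
    return labels
```

A pixel with probabilities (0.5, 0.1, 0.4) would be background under plain argmax but car under this rule, which again trades car precision for car recall.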
Running inference on the training graph gives less than 5 fps. I used the following techniques to improve performance.
First I froze the graph, which is the process of converting TensorFlow variables into constants. During inference, variables become unnecessary since the values they store don't change, and the computer can work faster with constants. Additional benefits of freezing the graph are:
- Unnecessary nodes related to training are removed
- The model can be contained entirely in one protobuf file
- Simpler graph structure
- Easier to deploy due to everything being in one file.
Then I used the TensorFlow tool that optimizes a graph for inference, which does the following:
- Removes training-specific and debug-specific nodes
- Fuses common operations
- Removes entire sections of the graph that are never reached
Using the optimized graph nearly doubled the speed to ~9 fps.
I also tried quantizing the graph but noticed a slight reduction in performance on my laptop, so I didn't try it on the challenge workspace.
I profiled GPU activity and noticed that the GPU was idle 50% of the time!
Sending one frame at a time to the GPU for inference is really inefficient considering the high latency of uploading data to the GPU and fetching results. The problem was even worse on GCP, with a virtualized GPU in a shared setup.
I re-coded my inference logic to accept a batch of images of configurable size. Sending 10~50 frames at once improved performance to more than 11 fps.
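The batching logic can be sketched independently of TensorFlow. Here `infer_batch` is a hypothetical callable standing in for the session call that maps a list of frames to a list of results:

```python
def batches(frames, batch_size):
    """Yield lists of up to batch_size consecutive frames from any iterable."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # final, possibly shorter batch
        yield batch


def run_inference(frames, infer_batch, batch_size=32):
    """Run infer_batch once per batch and collect the results in frame order."""
    results = []
    for batch in batches(frames, batch_size):
        results.extend(infer_batch(batch))
    return results
```

One GPU round trip is then amortized over `batch_size` frames instead of being paid per frame.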
When I ran nvidia-smi, I still noticed patches of time when the GPU was idle, so there was room for improvement.
I started with a naive pipeline that did not take advantage of multi-core CPU and GPU parallelism. This is evident from the GPU utilization graph shown above.
Input to the neural network came via scikit-video, which uses FFmpeg under the hood. I needed to crop and resize the images before feeding them to the FCN. Doing this in Python after the video has already been read into memory is inefficient, since the GPU is blocked waiting for the CPU to perform these preprocessing operations. To fix this, I start FFmpeg as a Python subprocess early on, with a custom FFmpeg command line that crops and resizes the video while reading it. FFmpeg continues to decode the video in parallel as I go on to initialize TensorFlow and load the graph. When I am finally ready to run inference, I read the video from FFmpeg through a pipe. This bumps the speed up to ~13 fps.
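A rough sketch of this arrangement follows. The crop geometry is a hypothetical placeholder and the helper names are my own. The FFmpeg options used (`-vf` with the `crop` and `scale` filters, `-f rawvideo`, `-pix_fmt rgb24`, output to `pipe:1`) are standard:

```python
import subprocess

import numpy as np


def ffmpeg_command(path, crop_w, crop_h, crop_x, crop_y, out_w, out_h):
    """Build an FFmpeg command that crops, scales and writes raw RGB frames to stdout."""
    filters = f"crop={crop_w}:{crop_h}:{crop_x}:{crop_y},scale={out_w}:{out_h}"
    return ["ffmpeg", "-loglevel", "error", "-i", path, "-vf", filters,
            "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1"]


def read_frames(stream, out_w, out_h):
    """Yield (out_h, out_w, 3) uint8 frames from a raw RGB byte stream."""
    frame_bytes = out_w * out_h * 3
    while True:
        buf = stream.read(frame_bytes)
        if len(buf) < frame_bytes:   # end of stream (or a truncated last frame)
            break
        yield np.frombuffer(buf, dtype=np.uint8).reshape(out_h, out_w, 3)


def start_decoder(path, out_w=576, out_h=160):
    """Start decoding in the background; the crop values here are illustrative only."""
    cmd = ffmpeg_command(path, 800, 350, 0, 170, out_w, out_h)
    return subprocess.Popen(cmd, stdout=subprocess.PIPE)
```

Usage would be to call `start_decoder(path)` first, then load the graph, and only then iterate over `read_frames(proc.stdout, 576, 160)`, so that decoding overlaps with TensorFlow start-up.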
nvidia-smi still reports some patches of time when the GPU is idle, so there is still room for improvement.
After the segmentation map has been returned by the GPU, the following operations still need to be performed:
- Uncrop, by appending the background label for the hood and the sky
- Resize
- Binary image creation for car and road
- Binary Dilation for car
- JSON encode and print
This holds up the GPU while the CPU finishes these operations. I decided to use Python threading here instead of multiprocessing because the main thread spends much of its time waiting for the GPU to finish, which keeps contention for the GIL low. Threading also allows me to share numpy arrays easily.
I implemented it so that as soon as the TensorFlow session returns a batch of results, I post it to a queue. Another thread waits on this queue and performs the CPU-bound operations listed above in parallel while the GPU goes on to process the next batch.
This gives me a whopping 16.6 fps.
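The queue hand-off described above can be sketched with the standard library. `infer_batch` and `postprocess` are hypothetical callables standing in for the session call and the CPU-side steps. The queue size is an arbitrary choice:

```python
import queue
import threading

import numpy as np

BACKGROUND = 0


def uncrop(labels, top, full_h):
    """Pad a cropped label map back to full height with background rows."""
    bottom = full_h - top - labels.shape[0]
    return np.pad(labels, ((top, bottom), (0, 0)),
                  mode="constant", constant_values=BACKGROUND)


def run_pipeline(batches, infer_batch, postprocess):
    """Run infer_batch on the calling thread and postprocess on a worker thread."""
    todo = queue.Queue(maxsize=4)   # bounded so inference cannot run far ahead
    done = []

    def worker():
        while True:
            batch = todo.get()
            if batch is None:       # sentinel: no more work
                break
            done.extend(postprocess(x) for x in batch)

    t = threading.Thread(target=worker)
    t.start()
    for batch in batches:
        todo.put(infer_batch(batch))  # worker handles this while the next batch runs
    todo.put(None)
    t.join()
    return done
```

A single worker keeps the results in frame order, which matters because the output is keyed by frame number.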
I used the Adam optimizer with a minibatch size of 13 and a learning rate of 1e-4. keep_prob was increased gradually from 0.5 to 1.0 during training.
Source | Link | count |
---|---|---|
Official | https://s3-us-west-1.amazonaws.com/udacity-selfdrivingcar/Lyft_Challenge/Training+Data/lyft_training_data.tar.gz | 1000 |
chinkiat | https://github.com/ongchinkiat/LyftPerceptionChallenge/releases/ | 3000 |
Mohamed Eltohamy | https://drive.google.com/open?id=1NimO26IH_Y8DziDMsgBCZeHlT3duj4-e | 1000 |
phmagic | https://www.dropbox.com/s/1etgf32uye2iy8q/world_2_w_cars.tar.gz?dl=0 | 2535 (after cleaning) |
My own | | 3000 |
I set aside 200 images as validation set.
L2 regularization (1e-3) was added to the transposed convolution layer.
- Ong Chin-Kiat (chinkiat), Phu Nguyen (phmagic), and Mohamed Eltohamy shared extra training data from CARLA.
- Phu Nguyen (phmagic) shared a nice tip that OpenCV was faster than PIL for encoding to PNG format.
To switch datasets, look at the start of run() in main.py and replace helper.KittiDataset with, for example, helper.LyftData
Kitti
Download the Kitti Road dataset from here. Extract the dataset in the data folder. This will create the folder data_road with all the training and test images.
Cityscape
Register at The Cityscapes Dataset. Download gtFine_trainvaltest.zip and leftImg8bit_trainvaltest.zip. Extract both to the data folder such that you have data/gtFine and data/leftImg8bit.
kitti2015
Download data_semantics.zip and provide path to the folder containing image_2.
Robust Vision Challenge
Download and set up the devkit and provide the path to rob_devkit/segmentation/datasets_kitti2015. It brings in additional support for the following datasets.
Run
Run the following command to run the project:
python main.py
Kitti Results
Mean IOU: 0.960
Cityscape Results
Cross-entropy Loss
Tips
- The link for the frozen VGG16 model is hardcoded into helper.py. The model can be found here.
- The model is not vanilla VGG16, but a fully convolutional version, which already contains the 1x1 convolutions to replace the fully connected layers.