
Stanford Cars Dataset Image Classification and Localization

Input Pipeline

For preprocessing, zero-centering the data would require loading all images and calculating the mean, which is computationally expensive. Therefore we just scale the pixels to the range [-0.5, 0.5] and use batch normalization between layers. When a transfer learning model is used for training, the corresponding preprocess_input function is applied instead.
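
A minimal sketch of this scaling step inside a tf.data pipeline (train_ds is an assumed dataset of (image, label) pairs; the function name is illustrative):

import tensorflow as tf

def scale_pixels(image, label):
    # Map raw [0, 255] pixels into [-0.5, 0.5]; no dataset-wide mean needed.
    image = tf.cast(image, tf.float32) / 255.0 - 0.5
    return image, label

# train_ds is assumed to yield (image, label) pairs.
train_ds = train_ds.map(scale_pixels, num_parallel_calls=tf.data.AUTOTUNE)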

There are only a few (~30-50) images per class, so to combat overfitting, heavy augmentation was applied through the imgaug library. See data_loader.py for implementation details.
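
A sketch of the kind of heavy augmentation imgaug enables (the augmenters below are illustrative, not the exact ones in data_loader.py; imgaug can also apply the same geometric transforms to the bounding boxes):

import imgaug.augmenters as iaa

# Illustrative pipeline; see data_loader.py for the real one.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                     # horizontal flip half the time
    iaa.Affine(rotate=(-15, 15),         # small rotations
               scale=(0.9, 1.1)),        # mild zoom
    iaa.AddToBrightness((-30, 30)),      # brightness jitter
    iaa.GaussianBlur(sigma=(0.0, 1.0)),  # occasional blur
])

# images_batch is an assumed batch of uint8 numpy arrays shaped (N, H, W, C).
images_aug = augmenter(images=images_batch)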

The data has a class imbalance. One way to handle this is oversampling: creating copies of the minority classes to match the majority ones. Fortunately we do not have to do this explicitly, since we get it for free by modifying the way the tf.data generator outputs images. Oversampling is achieved by splitting the data by class label and sampling from the resulting per-class datasets uniformly. This preserves the underlying distribution within each minority class but evens out the dataset without needing to collect more data!
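
This behavior is available through tf.data's dataset sampling (tf.data.Dataset.sample_from_datasets in recent TF versions, tf.data.experimental.sample_from_datasets in older ones); a sketch, where class_datasets is an assumed list holding one dataset per class:

import tensorflow as tf

# class_datasets is assumed: one tf.data.Dataset per class label.
per_class = [ds.repeat() for ds in class_datasets]
weights = [1.0 / len(per_class)] * len(per_class)

# Uniform sampling across classes oversamples the minority classes.
balanced_ds = tf.data.Dataset.sample_from_datasets(per_class, weights=weights)
balanced_ds = balanced_ds.batch(32).prefetch(tf.data.AUTOTUNE)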

The following is the imbalanced distribution of classes

The following is an example of the distribution after oversampling. The generator was tested for 10,000 iterations at a batch size of 32.

Streamlit

To see an interactive exploration of the data, run the following command

streamlit run streamlit/streamlit.py

Sanity Checks

These checks were done to help monitor training and guide hyperparameter adjustments toward good learning results.

  1. When using softmax, the value of the loss when the weights are small and no regularization is used can be approximated by -ln(1/C) = ln(C) where C is the number of classes.

The entire dataset has 196 classes which means the softmax loss should be approximately ln(196)=5.278. After running one epoch on a neural net with 1 hidden layer, the loss did in fact match.

217/217 [==============================] - 17s 80ms/step - loss: 5.2780 - accuracy: 0.0049 - val_loss: 5.2947 - val_accuracy: 0.0032

The same process was repeated for a subset of the dataset using 2 labels. The loss should be ln(2)=0.693.

3/3 [==============================] - 1s 233ms/step - loss: 0.6933 - accuracy: 0.5625 - val_loss: 0.5985 - val_accuracy: 0.6875
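
These expected values are just the cross-entropy of a uniform prediction, which is easy to verify:

import math

# With small random weights the softmax output is roughly uniform,
# so the expected initial loss is -ln(1/C) = ln(C).
print(math.log(196))  # 5.278..., matching the full 196-class run
print(math.log(2))    # 0.693..., matching the 2-class run
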
  2. Adding regularization should make the loss go up. The following test adds l2 regularization of magnitude 1e2, which made the loss jump from 0.693 to 2.9.

3/3 [==============================] - 1s 322ms/step - loss: 2.9040 - accuracy: 0.4375 - val_loss: 2.9195 - val_accuracy: 0.6875
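
A minimal sketch of attaching that penalty in Keras (the layer shown is illustrative; only the regularizer magnitude comes from the test above):

import tensorflow as tf

# An l2 penalty of magnitude 1e2 lifts the initial loss well above ln(2),
# which is exactly the jump this check looks for.
dense = tf.keras.layers.Dense(
    2,
    activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(1e2),
)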

Model Architectures

Custom Model

Traditional:

  1. Conv(64 filters, 5x5 kernel, 2 strides)|BatchNorm|Relu|MaxPool(2 pool size)
  2. [Conv(128 filters, 3x3 kernel, 1 strides)|BatchNorm|Relu]*2|MaxPool(2 pool size)
  3. [Conv(256 filters, 3x3 kernel, 1 strides)|BatchNorm|Relu]*2|MaxPool(2 pool size)
  4. [Drop|Dense(512 units)|BatchNorm|Relu]*2
  5. Classifier head: Drop|Dense(196 units)
  6. Localizer head: Drop|Dense(4 units)
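
A sketch of this traditional variant as a Keras functional model with both heads (the input size and dropout rate are assumptions; the head names match the training logs below):

import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel, strides=1):
    x = layers.Conv2D(filters, kernel, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = layers.Input(shape=(224, 224, 3))  # input size is an assumption
x = conv_bn_relu(inputs, 64, 5, strides=2)  # step 1
x = layers.MaxPool2D(2)(x)
for filters in (128, 256):                  # steps 2 and 3
    x = conv_bn_relu(x, filters, 3)
    x = conv_bn_relu(x, filters, 3)
    x = layers.MaxPool2D(2)(x)
x = layers.Flatten()(x)
for _ in range(2):                          # step 4
    x = layers.Dropout(0.5)(x)              # dropout rate is an assumption
    x = layers.Dense(512)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
# Steps 5-6: the two heads share the trunk above.
classifier = layers.Dense(196, activation="softmax", name="classifier")(layers.Dropout(0.5)(x))
localizer = layers.Dense(4, name="localizer")(layers.Dropout(0.5)(x))
model = tf.keras.Model(inputs, [classifier, localizer])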

Residual:

  1. Conv(64 filters, 3x3 kernel, 2 strides)|BatchNorm|Relu
  2. [Conv(64 filters, 3x3 kernel, 1 strides)|BatchNorm|Relu]*2|MaxPool(2 pool size)
  3. Residual(64 filters)*3|Residual(128 filters)*4|Residual(256 filters)*4|Residual(512 filters)*3
  4. GlobalAvgPool2D|[Drop|Dense(512 units)|BatchNorm|Relu]*2
  5. Classifier head: Drop|Dense(196 units)
  6. Localizer head: Drop|Dense(4 units)
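
A hedged sketch of the Residual(filters) block assumed above, written as a standard two-convolution identity block (the project's exact block may differ):

from tensorflow.keras import layers

def residual_block(x, filters):
    # Standard two-convolution identity block; the shortcut is projected
    # with a 1x1 conv whenever the channel count changes.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))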

Transfer Learning

  • ResNet50
  • MobileNetV2
  • EfficientNet-B3
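
A sketch of how one of these backbones can be wired to the same two heads, using ResNet50 as the example (freezing the base and the input size are assumptions; preprocess_input is applied per the Input Pipeline section):

import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # freezing the base first is an assumption

inputs = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
classifier = layers.Dense(196, activation="softmax", name="classifier")(x)
localizer = layers.Dense(4, name="localizer")(x)
model = tf.keras.Model(inputs, [classifier, localizer])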

Training Process

  1. Train on a small subset of the data (e.g. 20 samples), which should be easy to overfit to a high training accuracy. The subset used for this step was 73 images over 2 classes, trained for 200 epochs, which resulted in 100% classifier accuracy (see the sketch after this list).
Epoch 197/200
3/3 [==============================] - 3s 1s/step - loss: 0.0555 - classifier_loss: 0.0281 - localizer_loss: 0.1651 - classifier_accuracy: 1.0000 - localizer_accuracy: 0.4271 - val_loss: 15.2211 - val_classifier_loss: 11.0812 - val_localizer_loss: 31.7805 - val_classifier_accuracy: 0.3125 - val_localizer_accuracy: 0.8125
Epoch 198/200
3/3 [==============================] - 3s 1s/step - loss: 0.0425 - classifier_loss: 0.0163 - localizer_loss: 0.1473 - classifier_accuracy: 1.0000 - localizer_accuracy: 0.4479 - val_loss: 15.2499 - val_classifier_loss: 11.0812 - val_localizer_loss: 31.9246 - val_classifier_accuracy: 0.3125 - val_localizer_accuracy: 0.9062
Epoch 199/200
3/3 [==============================] - 3s 1s/step - loss: 0.0487 - classifier_loss: 0.0264 - localizer_loss: 0.1382 - classifier_accuracy: 1.0000 - localizer_accuracy: 0.3542 - val_loss: 15.2735 - val_classifier_loss: 11.0812 - val_localizer_loss: 32.0426 - val_classifier_accuracy: 0.3125 - val_localizer_accuracy: 0.8125
Epoch 200/200
3/3 [==============================] - 3s 1s/step - loss: 0.0572 - classifier_loss: 0.0329 - localizer_loss: 0.1546 - classifier_accuracy: 1.0000 - localizer_accuracy: 0.3958 - val_loss: 15.3056 - val_classifier_loss: 11.0812 - val_localizer_loss: 32.2035 - val_classifier_accuracy: 0.3125 - val_localizer_accuracy: 0.8125
  2. Train using the full dataset; start with small regularization and find the learning rate that makes the loss go down. The model is able to overfit with a train accuracy of 1, implying that it has enough capacity to learn the image features.

  3. Now that we know the model can overfit, we can increase regularization and tune hyperparameters.
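
A minimal sketch of the step-1 overfit check (the optimizer and classifier loss are assumptions; MSLE for the localizer follows the Challenges section, and the output names match the logs above):

# train_ds is assumed to yield (image, {"classifier": label, "localizer": box}).
small_ds = train_ds.unbatch().take(73).batch(32)

model.compile(
    optimizer="adam",  # assumption; not specified in the text
    loss={"classifier": "sparse_categorical_crossentropy",  # assumption
          "localizer": "msle"},  # MSLE per the Challenges section
    metrics={"classifier": "accuracy", "localizer": "accuracy"},
)
# 73 images at batch size 32 gives the 3 steps per epoch seen in the logs.
model.fit(small_ds, epochs=200)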

Wandb was used for logging all experiments on the full dataset:
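
A minimal sketch of that logging setup (the project name and config keys are illustrative):

import wandb
from wandb.keras import WandbCallback

# model/train_ds/val_ds are assumed from the sections above.
wandb.init(project="car_classification_localization",
           config={"learning_rate": 3e-4, "batch_size": 32})
model.fit(train_ds, validation_data=val_ds, epochs=50,
          callbacks=[WandbCallback()])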

Initially, custom models were used for training, but these proved difficult to train to a good solution. Each experiment was time consuming since the validation loss converged slowly, and the best validation accuracy achieved was only 50%. Swapping to a model pretrained on ImageNet dramatically improved both the results and the time it took to reach a decent accuracy.

As shown in the image above, the blue line represents the transfer learning model, which had already reached 65% label accuracy by epoch 5.

Results

The best results so far were achieved by EfficientNet-B3 with the following hyperparameters, using Focal Loss, which decreases the loss contribution of easy examples so the model can focus on harder images.
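
A hedged sketch of a categorical focal loss of this kind (the gamma value is illustrative, and libraries such as TensorFlow Addons ship ready-made focal losses):

import tensorflow as tf

def categorical_focal_loss(gamma=2.0):  # gamma value is illustrative
    # Scales cross-entropy by (1 - p)^gamma so well-classified (easy)
    # examples contribute less and hard images dominate the gradient.
    def loss_fn(y_true, y_pred):  # y_true is assumed one-hot
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        cross_entropy = -y_true * tf.math.log(y_pred)
        focal_weight = tf.pow(1.0 - y_pred, gamma)
        return tf.reduce_sum(focal_weight * cross_entropy, axis=-1)
    return loss_fn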

Training and validation metrics stayed close throughout training, and both reached high accuracy, indicating neither underfitting nor overfitting.

Labels Accuracy

Bounding Box Accuracy

Sample Activation Heatmap

Test Set Results

  • Test Labels Accuracy: 0.9013
  • Test Bounding Box Accuracy: 0.7350
  • Test Loss: 0.3241

Test Set Classification Evaluations

Sample Test Predictions (True Bounding Box: Blue, Predicted Bounding Box: Red)

Challenges

  • Having a large number of classes but only a few (~30) images per class made it very difficult to train a custom model. To combat this, augmentation served as a means to artificially inflate the number of images. The most effective method, however, was to use transfer learning with a model already pre-trained on cars. Other potential approaches would be to gather more data, either manually (e.g. scraping/APIs) or synthetically (e.g. GANs). Another option that could work effectively without needing to gather more data is few-shot learning.

  • Another minor challenge of the dual-headed model was finding a good loss-weighting balance and the right metrics between classification and localization. In the current setup, I decided to use mean squared log error (MSLE) for the bounding box loss in order to minimize its effect on the overall loss. The downside is that MSLE is biased toward penalizing underestimates more than overestimates. One option to counter this is to scale the bounding box targets to [0, 1] and use an MSE loss with IoU (Intersection over Union) as a metric, as sketched below.
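
A sketch of such an IoU metric (the [x_min, y_min, x_max, y_max] coordinate convention, scaled to [0, 1], is an assumption):

import tensorflow as tf

def iou_metric(y_true, y_pred):
    # Boxes are assumed to be [x_min, y_min, x_max, y_max] scaled to [0, 1].
    x1 = tf.maximum(y_true[..., 0], y_pred[..., 0])
    y1 = tf.maximum(y_true[..., 1], y_pred[..., 1])
    x2 = tf.minimum(y_true[..., 2], y_pred[..., 2])
    y2 = tf.minimum(y_true[..., 3], y_pred[..., 3])
    intersection = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area_true = (y_true[..., 2] - y_true[..., 0]) * (y_true[..., 3] - y_true[..., 1])
    area_pred = (y_pred[..., 2] - y_pred[..., 0]) * (y_pred[..., 3] - y_pred[..., 1])
    union = area_true + area_pred - intersection
    return tf.reduce_mean(intersection / (union + 1e-7))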

Dataset Citation

Krause, Jonathan, et al. "3d object representations for fine-grained categorization." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2013.
