
datasetculling's Introduction


MIT License

What is this?

This is the implementation of the paper:

Dataset Culling: Towards efficient training of distillation based domain specific models.

Kentaro Yoshioka, Edward Lee, Simon Wong, Mark Horowitz (Stanford)

IEEE ICIP 2019

https://arxiv.org/abs/1902.00173

Repo Progress

Upload initial commits (1/30/2019)

Update readme (2/10/2019)

Update initial models and dataset (2/10/2019)

Support PyTorch 1.0 (2/17/2019)

  • Enable optResolution in pipeline (in progress)

  • Clean up install files.

  • Fixed the download link for the models.

What is Dataset Culling?

By training a domain-specific model (say, one model per traffic camera), we can get high accuracy even with a small model. However, training such models can be quite expensive.

So, we aim to speed up the training of domain-specific models 50x by Dataset Culling!

The idea behind this is simple: when training with domain-specific data, many easy samples do not contribute to training. We simply cull those easy-to-classify samples in the proposed pipeline, gaining significant training speedups without accuracy loss.

Interestingly, for some data, Dataset Culling even improves accuracy.

Results of Dataset Culling

Compared to large teacher models, we get:

  1. Up to 18x computation efficiency

  2. Similar or better detection accuracy

What does this repo do?

This repo lets you try the pipeline with some domain-specific data (traffic camera footage from YouTube Live) and some pretrained models (COCO-trained res18- and res101-based Faster R-CNNs).

Pipeline

Dataset Culling conducts three operations in its pipeline.

  1. Reduction of the dataset size (~50x) by confidence loss.
  2. Reduction of the dataset size (~6x) by precision loss.
  3. Optimization of the image resolution (optResolution).
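
The first culling step can be sketched roughly as follows. This is a minimal illustration, not the paper's confidence loss: it assumes binary entropy of each detection confidence as a stand-in difficulty score, and the names `binary_entropy`, `image_difficulty`, and `cull_top_x` are hypothetical. The `top_x` argument mirrors the `--topx` option of dataset-culling.py.

```python
import math
from typing import Dict, List


def binary_entropy(p: float) -> float:
    """Uncertainty of one detection confidence; peaks at p = 0.5."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def image_difficulty(confidences: List[float]) -> float:
    """Assumed score: sum of per-detection uncertainty for one image."""
    return sum(binary_entropy(c) for c in confidences)


def cull_top_x(scores_by_image: Dict[str, List[float]], top_x: int) -> List[str]:
    """Keep the top_x images with the highest difficulty, hardest first."""
    ranked = sorted(
        scores_by_image,
        key=lambda name: image_difficulty(scores_by_image[name]),
        reverse=True,
    )
    return ranked[:top_x]


# Hypothetical student confidences per frame.
detections = {
    "frame_000.jpg": [0.99, 0.98],  # confident detections: easy
    "frame_001.jpg": [0.55, 0.48],  # uncertain detections: hard
    "frame_002.jpg": [0.90],
}
kept = cull_top_x(detections, top_x=2)
print(kept)
```

Frames whose detections are all near 0 or 1 score low and are dropped first, which is the "cull the easy data" idea in its simplest form.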

FYI: If you are interested in domain-specific models themselves, my previous project, training domain specific models, may be worth a look.

Github: https://github.com/kentaroy47/training-domain-specific-models

arXiv: https://arxiv.org/abs/1811.02689

Setting up Dataset Culling environment

Requirements

Python3

PyTorch 0.4.0

If you want to use PyTorch 1.0, please go to the PyTorch 1.0 branch.

A GPU environment is required. CPU support could be added but is not scheduled.

Don't hesitate to post issues or PRs if you find bugs. Thanks.

Installation

  1. Clone this repo and install the dependencies with pip.
git clone https://github.com/kentaroy47/DatasetCulling.git
cd DatasetCulling
pip install -r requirements.txt
  2. Download pretrained student and teacher models + image files to get started.
wget https://www.dropbox.com/s/ew47jhdu67bdocf/files.tar.gz

# extract in the repo dir.
tar -zxvf files.tar.gz

  3. Compile Cython scripts.
cd lib
sh make.sh

As pointed out by ruotianluo/pytorch-faster-rcnn, choose the right -arch in the make.sh file to compile the CUDA code:

GPU model                     Architecture
TitanX (Maxwell/Pascal)       sm_52
GTX 960M                      sm_50
GTX 1080 (Ti)                 sm_61
Grid K520 (AWS g2.2xlarge)    sm_30
Tesla K80 (AWS p2.xlarge)     sm_37
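
If your GPU is not in the table, the flag can usually be derived from the device's compute capability. The sketch below is an illustration and not part of this repo: `arch_flag` and `detect_capability` are hypothetical helper names, and it assumes PyTorch is installed and can see the GPU. It uses `torch.cuda.get_device_capability`, which returns a `(major, minor)` tuple.

```python
from typing import Optional, Tuple


def arch_flag(major: int, minor: int) -> str:
    """Format a compute capability such as (6, 1) as an -arch value."""
    return f"sm_{major}{minor}"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA device visible; pick -arch from the table by hand.")
else:
    print("Suggested -arch:", arch_flag(*capability))
```

For example, a GTX 1080 reports capability (6, 1), which matches the sm_61 entry above.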

The Faster R-CNN implementation is largely based on jwyang's repo and requires compiling Cython scripts.

Cython parts must be compiled using lib/make.sh. Please look at jwyang's readme for the details.

https://github.com/jwyang/faster-rcnn.pytorch

  4. Apply Dataset Culling and train student models! Everything is in the script.

The dataset will be constructed inside directory.

# Construct the dataset with Dataset Culling. This takes about 15 minutes on a 1080Ti.
# Training uses horizontal-flip data augmentation.
python dataset-culling.py

# Change the number of training samples like this. The default is 256.
python dataset-culling.py --topx 64

# Train without Dataset Culling. This takes more than 3 hours on a 1080Ti.
python dataset-culling.py --topx 3600
  5. Evaluate the results. The test is also done in dataset-culling.py.

To run only the test:

python dataset-culling.py --notrain

Try cleaning up data/cache or output/ folder if training does not start.
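
One cautious way to do that cleanup is sketched below. It is an illustration rather than a script shipped with this repo: `clear_dirs` is a hypothetical helper, and it defaults to a dry run so nothing is deleted until you opt in.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def clear_dirs(dirs: Iterable[str], dry_run: bool = True) -> List[Path]:
    """List (and, if dry_run is False, delete) everything inside dirs."""
    targets: List[Path] = []
    for d in dirs:
        root = Path(d)
        if not root.is_dir():
            continue  # folder does not exist, nothing to clean
        targets.extend(sorted(root.iterdir()))
    if not dry_run:
        for path in targets:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path)
            else:
                path.unlink()
    return targets


# Dry run: print what would be removed without deleting anything.
for path in clear_dirs(["data/cache", "output"]):
    print("would remove", path)
```

Call `clear_dirs(["data/cache", "output"], dry_run=False)` to actually delete. Check the dry-run listing first, since output/ may also hold results you want to keep.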

Contact

Ken Yoshioka ([email protected])


datasetculling's Issues

script fails when culling with precision

wrote out training data
Traceback (most recent call last):
  File "xml_makelabels_domain.py", line 253, in <module>
    vals2 = pickle.load(open(valdir, "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'output/baseline/200-jackson2val-res101.pkl'

Errors during demo-and-eval-save.py

The script fails in the last part of dataset-culling.py, during demo-and-eval-save.py.

load checkpoint models/res18/pascal_voc_200-jackson2/faster_rcnn_1_20.pth
load model successfully!
load checkpoint models/res18/pascal_voc_200-jackson2/faster_rcnn_1_20.pth
demo-and-eval-save.py:384: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  im_data = Variable(im_data, volatile=True)
demo-and-eval-save.py:385: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  im_info = Variable(im_info, volatile=True)
demo-and-eval-save.py:386: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  num_boxes = Variable(num_boxes, volatile=True)
demo-and-eval-save.py:387: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  gt_boxes = Variable(gt_boxes, volatile=True)
Loaded Photo: 10806 images.
detection for  __background__02s       
Traceback (most recent call last):
  File "demo-and-eval-save.py", line 694, in <module>
    BBGTs = truth_boxes[ncls][:]
IndexError: list index out of range
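
A defensive lookup like the one below would avoid the crash, though it is only a sketch. It assumes `truth_boxes` is a list indexed by class id, as the traceback suggests. `boxes_for_class` is a hypothetical helper, and the report does not establish why the list is shorter than expected, so returning an empty list may hide the underlying cause rather than fix it.

```python
from typing import List, Sequence


def boxes_for_class(truth_boxes: Sequence[Sequence], ncls: int) -> List:
    """Return a copy of the boxes for class ncls, or [] if ncls is absent."""
    if 0 <= ncls < len(truth_boxes):
        return list(truth_boxes[ncls])
    return []


# Hypothetical ground truth covering two class ids only.
truth_boxes = [[], [[10, 20, 50, 80]]]
print(boxes_for_class(truth_boxes, 1))
print(boxes_for_class(truth_boxes, 5))
```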
