Code Monkey home page Code Monkey logo

iciar2018's Introduction

ICIAR 2018 Grand Challenge

Our solution for ICIAR 2018 Grand Challenge on Breast Cancer Histology Images.

In this work, we propose a simple and effective method for the classification of H&E stained histological breast cancer images in the situation of very small training data (few hundred samples). To increase the robustness of the classifier we use strong data augmentation and deep convolutional features extracted at different scales with publicly available CNNs pretrained on ImageNet. On top of it, we apply highly accurate and prone to overfitting implementation of the gradient boosting algorithm. Unlike some previous works, we purposely avoid training neural networks on this amount of data to prevent suboptimal generalization.

Alexander Rakhlin, Alexey Shvets, Vladimir Iglovikov, Alexandr A. Kalinin

Rakhlin, A., Shvets, A., Iglovikov, V., Kalinin, A.: Deep Convolutional Neural Networks for Breast Cancer Histology Image Analysis. arXiv:1802.00752 [cs.CV], link

If you find this work useful for your publications, please consider citing:

@article{rakhlin2018deep,
  title={Deep Convolutional Neural Networks for Breast Cancer Histology Image Analysis},
  author={Rakhlin, Alexander and Shvets, Alexey and Iglovikov, Vladimir and Kalinin, Alexandr A},
  journal={arXiv preprint arXiv:1802.00752},
  year={2018}
}

Breast cancer is one of the main causes of cancer death worldwide. Early diagnostics significantly increases the chances of correct treatment and survival, but this process is tedious and often leads to a disagreement between pathologists. Computer-aided diagnosis systems showed potential for improving the diagnostic accuracy. In this challenge, we developed the computational approach based on deep convolution neural networks for breast cancer histology image classification. Hematoxylin and eosin stained breast histology microscopy image dataset is provided as a part of the ICIAR Grand Challenge on Breast Cancer Histology Images. Our approach utilizes several deep neural network architectures and gradient boosted trees classifier. For 4-class classification task, we report 87.2% accuracy. For 2-class classification task to detect carcinomas we report 93.8% accuracy, AUC 97.3%, and sensitivity/specificity 96.5/88.0% at the high-sensitivity operating point. To our knowledge, this approach outperforms other common methods in automated histopathological image classification.

The image dataset consists of 400 H&E stain images (2048 × 1536 pixels). All the images are digitized with the same acquisition conditions, with a magnification of 200 × and pixel size of 0.42 µm × 0.42 µm. Each image is labeled with one of the four balanced classes: normal, benign, in situ, and invasive, where class is defined as a predominant cancer type in the image. The image-wise annotation was performed by two medical experts. The goal of the challenge is to provide an automatic classification of each input image.

pics/classes_horizontal.png

Examples of microscopic biopsy images in the dataset: (A) normal; (B) benign; (C) in situ carcinoma; and (D) invasive carcinoma

Very deep CNN architectures that contain millions of parameters such as VGG, Inception and ResNet have achieved the state-of-the-art results in many computer vision tasks. However, the limited size of the dataset (400 images of 4 classes) poses a significant challenge for the training of a deep learning model. We purposely avoid training neural networks on this small amount of data to prevent suboptimal generalization. Instead, we employ 2-stage process using deep convolutional feature representation.

In the first stage deep CNNs, trained on large and general datasets like ImageNet (10M images, 20K classes), are used for unsupervised feature extraction. This unsupervised dimensionality reduction step mitigates the risk of overfitting in the next stage of supervised learning.

In the second stage we use LightGBM as a fast, distributed, high performance implementation of gradient boosted trees for supervised classification. Gradient boosting models are being extensively used in machine learning due to their speed, accuracy, and robustness against overfitting.

pics/nn_diagram.png

To bring the microscopy images into a common space, we normalize the amount of H&E stained on the tissue. For each image, we perform 50 random color augmentations. From every image we extract 20 random crops and encode them into 20 descriptors. Then, the set of 20 descriptors is combined into a single descriptor. For features extraction, we use standard pre-trained ResNet-50, InceptionV3 and VGG-16 networks from Keras distribution.

pics/pipeline.png

For cross-validation we split the data into 10 stratified folds to preserve class distribution. Augmentations increase the size of the dataset × 300 (2 patch sizes × 3 encoders × 50 color/affine augmentations). To prevent information leakage, all descriptors of an image must be contained in the same fold. For each combination of the encoder, crop size and scale we train 10 gradient boosting models with 10-fold cross-validation. Furthermore, we recycle each dataset 5 times with different random seeds in LightGBM adding augmentation on the model level. For the test data, we similarly extract 50 descriptors for each image and use them with all models trained for particular patch size and encoder. The predictions are averaged over all augmentations and models.

To validate the approach we use 10-fold stratified cross-validation. For 2-class non-carcinomas (normal and benign) vs. carcinomas (in situ and invasive) classification accuracy was 93.8 ± 2.3%, the area under the ROC curve was 0.973. Out of 200 carcinomas cases only 9 in situ and 5 invasive were missed. For 4-class classification accuracy averaged across all folds was 87.2 ± 2.6%.


pics/roc_conf.png

Left: non-carcinoma vs. carcinoma classification, ROC. 96.5% sensitivity at high sensitivity setpoint (green)
Right: Confusion matrix, without normalization. Vertical axis - ground truth, horizontal - predictions.


model f 1 f 2 f 3 f 4 f 5 f 6 f 7 f 8 f 9 f 10 mean std
ResNet-400 92.0 77.5 86.5 87.5 79.5 84.0 85.0 83.0 84.0 82.5 84.2 4.2
ResNet-650 91.0 77.5 86.0 89.5 81.0 74.0 85.5 83.0 84.5 82.5 83.5 5.2
VGG-400 87.5 83.0 81.5 84.0 84.0 82.5 80.5 82.0 87.5 83.0 83.6 2.9
VGG-650 89.5 85.5 78.5 85.0 81.0 78.0 81.5 85.5 89.0 80.5 83.4 4.4
Inception-400 93.0 86.0 71.5 92.0 85.0 84.5 82.5 79.0 79.5 76.5 83.0 6.5
Inception-650 91.0 84.5 73.5 90.0 84.0 81.0 82.0 84.5 78.0 77.0 82.5 5.5
std (models) 1.8 3.5 5.7 2.8 2.0 3.7 1.8 2.1 3.9 2.7 3.0  
Model fusion 92.5 82.5 87.5 87.5 87.5 90.0 85.0 87.5 87.5 85.0 87.2 2.6

Accuracy (%) and standard deviation for 4-class classification evaluated over 10 folds via cross-validation.
Results for the blended model is in the bottom. Model name represented as (CNN)-(crop size).

  • Python 3
  • Keras and Theano libraries. We did not test with Tensorflow backend, however it should work too.
  • LightGBM package.
  • Standard scientific Python stack: NumPy, Pandas, SciPy, scikit-learn.
  • Other libraries: tqdm, six

For feature extraction we used Nvidia GeForce GTX 980 Graphics Card with 4GB memory. For less powerful GPU please consider decreasing BATCH_SIZE in feature_extractor.py

For command line options use -h, --help. If you use default directory structure, you can stick with default command line options. Default directory structure is:

└── ICIAR2018
    ├── submission
    ├── data
    │   ├── train
    │   │   ├── Benign
    │   │   └── ......
    │   ├── test
    │   └── preprocessed
    │       ├── train
    │       │   ├── Inception0.5-400
    │       │   └── ................
    │       └── test
    │           ├── Inception-0.5-400
    │           └── .................
    ├── models
    │   ├── LGBMs
    │   │   ├── Inception
    │   │   └── .........
    │   └── CNNs
    └── predictions
        ├── Inception
        └── .........

You can preprocess the data independently, or use downloaded features. In the former case place the competition microscopy images into data\train|test directories. Please note the competition rules disallow us to redistribute the data.

  1. Download feature files, trained models, and individual folded predictions and skip to 4:

    python download_models.py
    

Downloaded LightGBM models are being unpacked in ./models/LGBMs, CNN models - in ./models/CNNs directories. We provide CNN models just for reference: Keras loads them with its own distribution. Preprocessed features reside in ./data/preprocessed/train|test subdirectories. Crossvalidated predictions reside in ./predictions subdirs. Alternatively, you can skip this step and extract features and train models yourself.

  1. To extract features run this. You can skip this step if you are using preprocessed features:

    python feature_extractor.py --images <directory/containing/images/> --features <directory/to/store/features/>
    

By default preprocessed feature files are contained in directory data/preprocessed/[test|train]/model_name/.

  1. To train LightGBM models using cross-validation and to generate predictions for all models, crop sizes, seeds, augmentations and folds run this. You can skip this step if you are using LightGBM models we provided:

    python train_lgbm.py
    
  2. To combine predictions across all models, seeds and augmentations, and crossvalidate across all folds run:

    python crossvalidate_blending.py
    

In this step you can use predictions pre-saved in step 3 during training (or provided with our data). Or you can have LightGBM models generate predictions anew with command line option --predict. The latter increases running time, but does not affect result.

  1. To generate solution:

    python submission.py --features <directory/to/store/features/> --submission <path/to/submission.csv>
    

iciar2018's People

Contributors

alexander-rakhlin avatar jmsmkn avatar shvetsiya avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

iciar2018's Issues

Input size of images

Hi, I want to transfer learn your trained models to another network, But from the paper I cannot grasp the size of the input data you have used to train your networks. If the data you used is of arbitrary size then what should be size of my dataset if I want to use your trained model weights? Or the weights cannot be used for transfer learning in other networks?

No such file or directory: 'data/folds-10.pkl'

python crossvalidate_blending.py
Traceback (most recent call last):
File "crossvalidate_blending.py", line 24, in
with open("data/folds-10.pkl", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/folds-10.pkl'

Model weight files required

Disclaimer: Technically this doesn't qualify for an issue.

I am looking for histology images pre-trained networks, I am trying to figure out how much a breast cancer pretrained model can help in predicting prostate cancer. I'd really appreciate if you can share the model weights, given you still have them.

About train_lgbm.py

When I run the train_lgbm.py, I got a warning——[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Could you tell me how to solve it??

Breast cancer

I have breast cancer images, benign and malignant, is it possible to use this code?

Is there a bug in H&E augmentation?

I've notice a possible bug in line 127 of feature_extractor.py. img_aug = np.dot(C, M) * r
I guess it should be img_aug = np.dot(C * r, M)
where C is in the H&E color space and multiplication of HE channel should be applied on that.

Model folder in source code

Dear all,

I can not see models folder in your source code structure. Could you help to share it?

├── models
│ ├── LGBMs
│ │ ├── Inception
│ │ └── .........
│ └── CNNs

Best regards,
Canh Nguyen

about the dataset

hello,rakhlin, your have done a very great job, this is very helpful project. and i am wondering if you could share the dataset of iciar2018 , as download from officail web is not possible anymore.
thank you so much !!

question

Why is each picture represented by 20 crops?

Share the data

I have missed the registration deadline, so now I can't download the data. Could you please share the data? Thanks a lot !

Using 2 classes instead of 4

Hi! This is a really nice project. I'm trying to execute the pipeline with the trained models, but I want to use 2 classes. I have added that argument on step 4 (python crossvalidate_blending.py --n_classes 2) but I'm wondering how should I add this on the last step (5) where the file submission.py is used. I don't see that "n_classes" is an option, and I don't see either if there's any reference to the previous step.
Thanks in advance!

Could you give me the datasets in convenience?

Because I need to join with the project as a participant,then I can download the dataset .But the project has been closed! So could you give me the datasets in convenience?
Thanks a lot !

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.