histomorphological-phenotype-learning

Forked from adalbertocq/histomorphological-phenotype-learning.

Corresponding code of 'Quiros A.C.+, Coudray N.+, Yeaton A., Yang X., Chiriboga L., Karimkhan A., Narula N., Pass H., Moreira A.L., Le Quesne J.*, Tsirigos A.*, and Yuan K.* Self-supervised learning discovers novel morphological clusters linked to patient outcome and molecular phenotypes. 2022'


Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes


Abstract:

Histopathological images provide the definitive source of cancer diagnosis, containing information used by pathologists to identify and subclassify malignant disease, and to guide therapeutic choices. These images contain vast amounts of information, much of which is currently unavailable to human interpretation. Supervised deep learning approaches have been powerful for classification tasks, but they are inherently limited by the cost and quality of annotations. Therefore, we developed Histomorphological Phenotype Learning, an unsupervised methodology, which requires no annotations and operates via the self-discovery of discriminatory image features in small image tiles. Tiles are grouped into morphologically similar clusters which appear to represent recurrent modes of tumor growth emerging under natural selection. These clusters have distinct features which can be identified using orthogonal methods. Applied to lung cancer tissues, we show that they align closely with patient outcomes, with histopathologically recognised tumor types and growth patterns, and with transcriptomic measures of immunophenotype.


Citation

@misc{QuirosCoudray2022,
      title={Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes},
      author={Adalberto Claudio Quiros and Nicolas Coudray and Anna Yeaton and Xinyu Yang and Luis Chiriboga and Afreen Karimkhan and Navneet Narula and Harvey Pass and Andre L. Moreira and John Le Quesne and Aristotelis Tsirigos and Ke Yuan},
      year={2022},
      eprint={2205.01931},
      archivePrefix={arXiv},
      primaryClass={cs.CV}        
}

Demo Materials

Slides summarizing methodology and results:


Repository overview

In this repository you will find the following sections:

  1. WSI tiling process: Instructions on how to create H5 files from WSI tiles.
  2. Workspace setup: Details on H5 file content and directory structure.
  3. HPL instructions: Step-by-step instructions on how to run the complete methodology.
    1. Self-supervised Barlow Twins training.
    2. Tile vector representations.
    3. Combination of all sets into one H5.
    4. Fold cross validation files.
    5. Include metadata in H5 file.
    6. Leiden clustering.
    7. Removing background tiles.
    8. Logistic regression for lung type WSI classification.
    9. Cox proportional hazards for survival regression.
    10. Correlation between annotations and clusters.
    11. Get tiles and WSI samples for HPCs.
  4. Frequently Asked Questions.
  5. TCGA HPL files: HPL output files of paper results.
  6. Python Environment: Python version and packages.
  7. Dockers: Docker environments to run HPL steps.

WSI tiling process

This step divides whole slide images (WSIs) into 224x224 tiles and stores them in H5 files. At the end of this step, you should have three H5 files, one each for the training, validation, and test sets. The training set is used to train the self-supervised CNN; in our work, this corresponded to 60% of the TCGA LUAD & LUSC WSIs.

We used the framework provided in Coudray et al., 'Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning', Nature Medicine, 2018. The steps to run the framework are 0.1, 0.2.a, and 4 (end of readme). In our work we used Reinhard normalization, which can be applied at the same time as the tiling through the '-N' option in step 0.1.
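
The tiling itself is handled by the DeepPATH framework referenced above. To illustrate the core operation only, here is a minimal sketch using openslide-python; the tile_slide function and its near-white background heuristic are illustrative assumptions, not the actual DeepPATH implementation (which also handles magnification selection, background filtering, and Reinhard normalization):

import numpy as np
import openslide

def tile_slide(wsi_path, tile_size=224, level=0, max_background=0.6):
    # Yield (x, y, tile) for tiles that are not mostly background,
    # analogous to the '60% background max' criterion used for the
    # TCGA datasets referenced later in this readme.
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.level_dimensions[level]
    scale = int(slide.level_downsamples[level])
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            # read_region expects level-0 coordinates.
            region = slide.read_region((x * scale, y * scale), level,
                                       (tile_size, tile_size))
            tile = np.array(region.convert('RGB'))
            # Crude background estimate: fraction of near-white pixels.
            background = (tile.mean(axis=-1) > 220).mean()
            if background <= max_background:
                yield x, y, tile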

Workspace setup

This section specifies the H5 file content and directory structure required to run the flow.

In the instructions below we use the following variables and names:

  • dataset_name: TCGAFFPE_LUADLUSC_5x_60pc
  • marker_name: he
  • tile_size: 224

H5 file content specification

If you are not familiar with H5 files, you can find documentation for the Python package (h5py) here.

This framework assumes that the datasets inside each H5 file follow the naming format 'set_labelname'. In addition, all H5 files are required to have the same number of datasets. Example (a quick way to verify this is sketched after the list):

  • File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
    • Dataset names: train_img, train_tiles, train_slides, train_samples
  • File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_validation.h5
    • Dataset names: valid_img, valid_tiles, valid_slides, valid_samples
  • File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_test.h5
    • Dataset names: test_img, test_tiles, test_slides, test_samples
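
To verify that your files follow this convention, a minimal inspection sketch with h5py (using the example training file above) looks like this:

import h5py

# List every dataset in the file; names should follow 'set_labelname',
# and all three files should contain the same number of datasets.
with h5py.File('hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5', 'r') as f:
    for name, dataset in f.items():
        print(name, dataset.shape, dataset.dtype)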

Directory Structure

The code makes the following assumptions about where the datasets, model training outputs, and image representations are stored (a path-construction sketch follows the list):

  • Datasets:
    • Dataset folder.
    • Follows the structure:
      • datasets/{dataset_name}/{marker_name}/patches_h{tile_size}_w{tile_size}
      • E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224
    • Train, validation, and test sets:
      • Each dataset is assumed to have at least a training set.
      • Naming convention:
        • hdf5_{dataset_name}_{marker_name}_{set_name}.h5
        • E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
  • Data_model_output:
    • Output folder for self-supervised trained models.
    • Follows the structure:
      • data_model_output/{model_name}/{dataset_name}/h{tile_size}_w{tile_size}_n3_zdim{latent_space_size}
      • E.g.: data_model_output/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
  • Results:
    • Output folder for self-supervised representations results.
    • This folder will contain the representation, clustering data, and logistic/cox regression results.
    • Follows the structure:
      • results/{model_name}/{dataset_name}/h{tile_size}_w{tile_size}_n3_zdim{latent_space_size}
      • E.g.: results/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
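
As a concrete sketch, the paths above can be built from the variables defined at the start of this section (the model_name and latent_space_size values are taken from the examples):

from pathlib import Path

dataset_name      = 'TCGAFFPE_LUADLUSC_5x_60pc'
marker_name       = 'he'
tile_size         = 224
model_name        = 'BarlowTwins_3'
latent_space_size = 128

patches = f'patches_h{tile_size}_w{tile_size}'
run_dir = f'h{tile_size}_w{tile_size}_n3_zdim{latent_space_size}'

dataset_dir = Path('datasets') / dataset_name / marker_name / patches
train_h5    = dataset_dir / f'hdf5_{dataset_name}_{marker_name}_train.h5'
model_dir   = Path('data_model_output') / model_name / dataset_name / run_dir
results_dir = Path('results') / model_name / dataset_name / run_dir

print(train_h5)
# datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5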

HPL Instructions

The flow consists of the following steps:

  1. Self-supervised Barlow Twins training.
  2. Tile vector representations.
  3. Combination of all sets into one H5.
  4. Fold cross validation files.
  5. Include metadata in H5 file.
  6. Leiden clustering.
  7. Removing background tiles.
  8. Logistic regression for lung type WSI classification.
  9. Cox proportional hazards for survival regression.
  10. Correlation between annotations and clusters.
  11. Get tiles and WSI samples for HPCs.

You can find the full details here.


Frequently Asked Questions

I want to reproduce the paper results.

You can find TCGA files, results, and commands to reproduce them here. For any questions regarding the New York University cohorts, please address reasonable requests to the corresponding authors.

I have my own cohort and I want to assign existing clusters to my own WSI.

You can follow the steps on how to assign existing clusters here. These instructions will assign your tiles to the same clusters reported in the publication.

When I run the Leiden clustering step, I get a 'TypeError: can't pickle weakref objects' error in some folds.

In our experience, this error occurs with incompatible versions of numba, umap-learn, and scanpy. The package versions in the Python Environment section should work, but this alternative package combination also works:

scanpy==1.7.1 
pynndescent==0.5.0 
numba==0.51.2
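
To check which versions you actually have installed, a quick sanity check (a minimal sketch) is:

# Print the installed versions of the packages involved in this error.
import numba
import pynndescent
import scanpy

print('scanpy     ', scanpy.__version__)
print('pynndescent', pynndescent.__version__)
print('numba      ', numba.__version__)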

If you have any issues running these scripts, please leave a message on the GitHub Issues tab.


TCGA HPL files

This section contains the following TCGA files produced by HPL:

  1. TCGA LUAD & LUSC WSI tile image datasets.
  2. TCGA Self-supervised trained weights.
  3. TCGA tile projections.
  4. TCGA cluster configurations.
  5. TCGA WSI & patient representations.

For the New York University cohorts, please send reasonable requests to the corresponding authors.

TCGA LUAD & LUSC WSI tile image datasets

You can find the WSI tile images at:

  1. LUAD & LUSC 60% Background max
  2. LUAD & LUSC 60% Background max 250K subsample for self-supervised model training.

TCGA Pretrained Models

Self-supervised model weights:

  1. Lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) model.
  2. PanCancer: BRCA, HNSC, KICH, KIRC, KIRP, LUSC, LUAD.

TCGA tile vector representations

You can find tile projections for TCGA LUAD and LUSC cohorts at the following locations. These are the projections used in the publication results.

  1. TCGA LUAD & LUSC tile vector representations (background and artifact tiles unfiltered)
  2. TCGA LUAD & LUSC tile vector representations

TCGA clusters

You can find cluster configurations used in the publication results at:

  1. Background and artifact removal
  2. LUAD vs LUSC type classification
  3. LUAD survival

TCGA WSI & patient vector representations

You can find WSI and patient vector representations used in the publication results at:

  1. LUAD vs LUSC type classification
  2. LUAD survival

Python Environment

The code uses Python 3.8, and the necessary packages can be found in requirements.txt.

The flow uses TensorFlow 1.15. According to TensorFlow's specifications, the closest compatible versions are cudatoolkit==10.0 and cudnn==7.6.0. However, depending on your GPU card you might need cudatoolkit==11.7 and cudnn==8.0 instead. Newer cards with the Ampere architecture (NVIDIA 30-series or A100) only work with CUDA 11.x; NVIDIA maintains nvidia-tensorflow, a build that lets you run TensorFlow 1.15 with newer CUDA versions (installed in the commands below).

These commands should get the right environment to run HPL:

conda create -n HPL python=3.8
conda activate HPL
python3 -m pip install --user nvidia-pyindex
python3 -m pip install --user nvidia-tensorflow
python3 -m pip install -r requirements.txt
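
Once the environment is set up, a quick sanity check (a minimal sketch) that TensorFlow 1.15 is installed and sees your GPU:

import tensorflow as tf

# Should print 1.15.x and True if the CUDA/cuDNN setup matches your card.
print(tf.__version__)
print(tf.test.is_gpu_available())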

Dockers

These are the Docker images with the environments to run the steps of HPL. The 'Leiden clustering' step needs to be run with Docker image [2]; all other steps can be run with Docker image [1]:

  1. Self-Supervised models training and projections:
  2. Leiden clustering:

If you want to run the Docker image on your local machine, these commands should get you up and running. Please take into account that the image aclaudioquiros/tf_package:v16 uses CUDA 10.0; if your GPU card uses the Ampere architecture (NVIDIA 30-series or A100), it won't work appropriately.
In addition, if you want to run Step 6 - Leiden clustering in HPL, you need to change the image name accordingly:

docker run -it --mount src=`pwd`,target=/tmp/Workspace,type=bind aclaudioquiros/tf_package:v16
cd Workspace
# Command you want to run here.
