histomorphological-phenotype-learning

Forked from adalbertocq/histomorphological-phenotype-learning.

Corresponding code of 'Quiros A.C.+, Coudray N.+, Yeaton A., Yang X., Chiriboga L., Karimkhan A., Narula N., Pass H., Moreira A.L., Le Quesne J.*, Tsirigos A.*, and Yuan K.* Self-supervised learning discovers novel morphological clusters linked to patient outcome and molecular phenotypes. 2022'


Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes


Abstract:

Histopathological images provide the definitive source of cancer diagnosis, containing information used by pathologists to identify and subclassify malignant disease, and to guide therapeutic choices. These images contain vast amounts of information, much of which is currently unavailable to human interpretation. Supervised deep learning approaches have been powerful for classification tasks, but they are inherently limited by the cost and quality of annotations. Therefore, we developed Histomorphological Phenotype Learning, an unsupervised methodology, which requires no annotations and operates via the self-discovery of discriminatory image features in small image tiles. Tiles are grouped into morphologically similar clusters which appear to represent recurrent modes of tumor growth emerging under natural selection. These clusters have distinct features which can be identified using orthogonal methods. Applied to lung cancer tissues, we show that they align closely with patient outcomes, with histopathologically recognised tumor types and growth patterns, and with transcriptomic measures of immunophenotype.


Citation

@misc{QuirosCoudray2022,
      title={Self-supervised learning in non-small cell lung cancer discovers novel morphological clusters linked to patient outcome and molecular phenotypes},
      author={Adalberto Claudio Quiros and Nicolas Coudray and Anna Yeaton and Xinyu Yang and Luis Chiriboga and Afreen Karimkhan and Navneet Narula and Harvey Pass and Andre L. Moreira and John Le Quesne and Aristotelis Tsirigos and Ke Yuan},
      year={2022},
      eprint={2205.01931},
      archivePrefix={arXiv},
      primaryClass={cs.CV}        
}

Demo Materials

Slides summarizing methodology and results:


Repository overview

In this repository you will find the following sections:

  1. WSI tiling process: Instructions on how to create H5 files from WSI tiles.
  2. Workspace setup: Details on H5 file content and directory structure.
  3. HPL instructions: Step-by-step instructions on how to run the complete methodology.
    1. Self-supervised Barlow Twins training.
    2. Tile vector representations.
    3. Combination of all sets into one H5.
    4. Fold cross validation files.
    5. Include metadata in H5 file.
    6. Leiden clustering.
    7. Removing background tiles.
    8. Logistic regression for lung type WSI classification.
    9. Cox proportional hazards for survival regression.
    10. Correlation between annotations and clusters.
    11. Get tiles and WSI samples for HPCs.
  4. Frequently Asked Questions.
  5. TCGA HPL files: HPL output files of paper results.
  6. Python Environment: Python version and packages.
  7. Dockers: Docker environments to run HPL steps.

WSI tiling process

This step divides whole slide images (WSIs) into 224x224 tiles and stores them in H5 files. At the end of this step, you should have three H5 files, one each for the training, validation, and test sets. The training set is used to train the self-supervised CNN; in our work, this corresponded to 60% of the TCGA LUAD & LUSC WSIs.

We used the framework provided in Coudray et al., 'Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning', Nature Medicine, 2018. The steps to run the framework are 0.1, 0.2.a, and 4 (end of readme). In our work we used Reinhard normalization, which can be applied at the same time as the tiling through the '-N' option in step 0.1.
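
The tiling itself is handled by the DeepPATH framework referenced above. To illustrate the core operation only, here is a minimal sketch using openslide-python; the tile_slide function and its near-white background heuristic are illustrative assumptions, not the actual DeepPATH implementation (which also handles magnification selection, background filtering, and Reinhard normalization):

import numpy as np
import openslide

def tile_slide(wsi_path, tile_size=224, level=0, max_background=0.6):
    # Yield (x, y, tile) for tiles that are not mostly background,
    # analogous to the '60% background max' criterion used for the
    # TCGA datasets referenced later in this readme.
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.level_dimensions[level]
    scale = int(slide.level_downsamples[level])
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            # read_region expects level-0 coordinates.
            region = slide.read_region((x * scale, y * scale), level,
                                       (tile_size, tile_size))
            tile = np.array(region.convert('RGB'))
            # Crude background estimate: fraction of near-white pixels.
            background = (tile.mean(axis=-1) > 220).mean()
            if background <= max_background:
                yield x, y, tile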

Workspace setup

This section specifies the H5 file content and directory structure required to run the flow.

In the instructions below we use the following variables and names:

  • dataset_name: TCGAFFPE_LUADLUSC_5x_60pc
  • marker_name: he
  • tile_size: 224

H5 file content specification

If you are not familiar with H5 files, you can find documentation for the Python package (h5py) here.

This framework assumes that the datasets inside each H5 file follow the naming format 'set_labelname'. In addition, all H5 files are required to have the same number of datasets. Example (a quick way to verify this is sketched after the list):

  • File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
    • Dataset names: train_img, train_tiles, train_slides, train_samples
  • File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_validation.h5
    • Dataset names: valid_img, valid_tiles, valid_slides, valid_samples
  • File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_test.h5
    • Dataset names: test_img, test_tiles, test_slides, test_samples
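
To verify that your files follow this convention, a minimal inspection sketch with h5py (using the example training file above) looks like this:

import h5py

# List every dataset in the file; names should follow 'set_labelname',
# and all three files should contain the same number of datasets.
with h5py.File('hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5', 'r') as f:
    for name, dataset in f.items():
        print(name, dataset.shape, dataset.dtype)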

Directory Structure

The code makes the following assumptions about where the datasets, model training outputs, and image representations are stored (a path-construction sketch follows the list):

  • Datasets:
    • Dataset folder.
    • Follows the structure:
      • datasets/{dataset_name}/{marker_name}/patches_h{tile_size}_w{tile_size}
      • E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224
    • Train, validation, and test sets:
      • Each dataset is assumed to have at least a training set.
      • Naming convention:
        • hdf5_{dataset_name}_{marker_name}_{set_name}.h5
        • E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
  • Data_model_output:
    • Output folder for self-supervised trained models.
    • Follows the structure:
      • data_model_output/{model_name}/{dataset_name}/h{tile_size}_w{tile_size}_n3_zdim{latent_space_size}
      • E.g.: data_model_output/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
  • Results:
    • Output folder for self-supervised representations results.
    • This folder will contain the representation, clustering data, and logistic/cox regression results.
    • Follows the structure:
      • results/{model_name}/{dataset_name}/h{tile_size}_w{tile_size}_n3_zdim{latent_space_size}
      • E.g.: results/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
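
As a concrete sketch, the paths above can be built from the variables defined at the start of this section (the model_name and latent_space_size values are taken from the examples):

from pathlib import Path

dataset_name      = 'TCGAFFPE_LUADLUSC_5x_60pc'
marker_name       = 'he'
tile_size         = 224
model_name        = 'BarlowTwins_3'
latent_space_size = 128

patches = f'patches_h{tile_size}_w{tile_size}'
run_dir = f'h{tile_size}_w{tile_size}_n3_zdim{latent_space_size}'

dataset_dir = Path('datasets') / dataset_name / marker_name / patches
train_h5    = dataset_dir / f'hdf5_{dataset_name}_{marker_name}_train.h5'
model_dir   = Path('data_model_output') / model_name / dataset_name / run_dir
results_dir = Path('results') / model_name / dataset_name / run_dir

print(train_h5)
# datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5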

HPL Instructions

The flow consists of the following steps:

  1. Self-supervised Barlow Twins training.
  2. Tile vector representations.
  3. Combination of all sets into one H5.
  4. Fold cross validation files.
  5. Include metadata in H5 file.
  6. Leiden clustering.
  7. Removing background tiles.
  8. Logistic regression for lung type WSI classification.
  9. Cox proportional hazards for survival regression.
  10. Correlation between annotations and clusters.
  11. Get tiles and WSI samples for HPCs.

You can find the full details here.


Frequently Asked Questions

I want to reproduce the paper results.

You can find TCGA files, results, and commands to reproduce them here. For any questions regarding the New York University cohorts, please address reasonable requests to the corresponding authors.

I have my own cohort and I want to assign existing clusters to my own WSI.

You can follow the steps on how to assign existing clusters here. These instructions will assign your tiles to the same clusters reported in the publication.

When I run the Leiden clustering step, I get a 'TypeError: can't pickle weakref objects' error in some folds.

In our experience, this error occurs with incompatible versions of numba, umap-learn, and scanpy. The package versions in the Python Environment section should work, but this alternative package combination also works:

scanpy==1.7.1 
pynndescent==0.5.0 
numba==0.51.2
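
To check which versions you actually have installed, a quick sanity check (a minimal sketch) is:

# Print the installed versions of the packages involved in this error.
import numba
import pynndescent
import scanpy

print('scanpy     ', scanpy.__version__)
print('pynndescent', pynndescent.__version__)
print('numba      ', numba.__version__)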

If you have any issues running these scripts, please leave a message on the GitHub Issues tab.


TCGA HPL files

This section contains the following TCGA files produced by HPL:

  1. TCGA LUAD & LUSC WSI tile image datasets.
  2. TCGA Self-supervised trained weights.
  3. TCGA tile projections.
  4. TCGA cluster configurations.
  5. TCGA WSI & patient representations.

For the New York University cohorts, please send reasonable requests to the corresponding authors.

TCGA LUAD & LUSC WSI tile image datasets

You can find the WSI tile images at:

  1. LUAD & LUSC 60% Background max
  2. LUAD & LUSC 60% Background max 250K subsample for self-supervised model training.

TCGA Pretrained Models

Self-supervised model weights:

  1. Lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) model.
  2. PanCancer: BRCA, HNSC, KICH, KIRC, KIRP, LUSC, LUAD.

TCGA tile vector representations

You can find tile projections for TCGA LUAD and LUSC cohorts at the following locations. These are the projections used in the publication results.

  1. TCGA LUAD & LUSC tile vector representations (background and artifact tiles unfiltered)
  2. TCGA LUAD & LUSC tile vector representations

TCGA clusters

You can find cluster configurations used in the publication results at:

  1. Background and artifact removal
  2. LUAD vs LUSC type classification
  3. LUAD survival

TCGA WSI & patient vector representations

You can find WSI and patient vector representations used in the publication results at:

  1. LUAD vs LUSC type classification
  2. LUAD survival

Python Environment

The code uses Python 3.8, and the necessary packages can be found in requirements.txt.

The flow uses TensorFlow 1.15. According to TensorFlow's specifications, the closest compatible versions are cudatoolkit==10.0 and cudnn==7.6.0. However, depending on your GPU card you might need cudatoolkit==11.7 and cudnn==8.0 instead. Newer cards with the Ampere architecture (NVIDIA 30-series or A100) only work with CUDA 11.x; NVIDIA maintains nvidia-tensorflow, a build that lets you run TensorFlow 1.15 with newer CUDA versions (installed in the commands below).

These commands should get the right environment to run HPL:

conda create -n HPL python=3.8
conda activate HPL
python3 -m pip install --user nvidia-pyindex
python3 -m pip install --user nvidia-tensorflow
python3 -m pip install -r requirements.txt
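
Once the environment is set up, a quick sanity check (a minimal sketch) that TensorFlow 1.15 is installed and sees your GPU:

import tensorflow as tf

# Should print 1.15.x and True if the CUDA/cuDNN setup matches your card.
print(tf.__version__)
print(tf.test.is_gpu_available())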

Dockers

These are the Docker images with the environments to run the steps of HPL. The 'Leiden clustering' step needs to be run with Docker image [2]; all other steps can be run with Docker image [1]:

  1. Self-Supervised models training and projections:
  2. Leiden clustering:

If you want to run the Docker image on your local machine, these commands should get you up and running. Please take into account that the image aclaudioquiros/tf_package:v16 uses CUDA 10.0; if your GPU card uses the Ampere architecture (NVIDIA 30-series or A100), it won't work appropriately.
In addition, if you want to run Step 6 - Leiden clustering in HPL, you need to change the image name accordingly:

docker run -it --mount src=`pwd`,target=/tmp/Workspace,type=bind aclaudioquiros/tf_package:v16
cd Workspace
# Command you want to run here.
