jamesdolezal / slideflow Goto Github PK

Deep learning library for digital pathology, with both Tensorflow and PyTorch support.

License: GNU General Public License v3.0

Python 89.13% Groovy 0.15% Makefile 0.02% HTML 0.78% Shell 0.09% Dockerfile 0.11% JavaScript 2.24% CSS 7.46% Jinja 0.04%

python machine-learning deep-learning pathology histology computational-pathology whole-slide-imaging histopathology

slideflow's Introduction

ArXiv | Docs | Slideflow Studio | Cite

🔬 Overview

Slideflow is a deep learning library for digital pathology, offering a user-friendly interface for model development.

Designed at University of Chicago for both medical researchers and AI enthusiasts, the goal of Slideflow is to provide an accessible, easy-to-use interface for developing state-of-the-art pathology models. Slideflow has been built with the future in mind, offering a scalable platform for digital biomarker development that bridges the gap between ever-evolving, sophisticated methods and the needs of a clinical researcher. For developers, Slideflow provides multiple endpoints for integration with other packages and external training paradigms, allowing you to leverage highly optimized, pathology-specific processes with the latest ML methodologies.

🚀 Features

Easy-to-use, highly customizable training pipelines
Robust slide processing and stain normalization toolkit
Support for training with weakly-supervised or strongly-supervised labels
Multiple-instance learning (MIL)
Self-supervised learning (SSL)
Generative adversarial networks (GANs)
Explainability tools: Heatmaps, mosaic maps, saliency maps, synthetic histology
Robust layer activation analysis tools
Uncertainty quantification
Interactive user interface for model deployment
... and more!

Full documentation with example tutorials can be found at slideflow.dev.

Requirements

Python >= 3.7 (<3.10 if using cuCIM)
Tensorflow 2.5-2.11 or PyTorch 1.9-2.1
- GAN functions require PyTorch <1.13

Optional

Libvips >= 8.9 (alternative slide reader, adds support for *.scn, *.mrxs, *.ndpi, *.vms, and *.vmu files).
Linear solver (for preserved-site cross-validation)
- CPLEX 20.1.0 with Python API
- or Pyomo with Bonmin solver

📥 Installation

Slideflow can be installed with PyPI, as a Docker container, or run from source.

Method 1: Install via pip

pip3 install --upgrade setuptools pip wheel
pip3 install slideflow[cucim] cupy-cuda11x

The cupy package name depends on the installed CUDA version; see here for installation instructions. cupy is not required if using Libvips.

Method 2: Docker image

Alternatively, pre-configured docker images are available with OpenSlide/Libvips and the latest version of either Tensorflow and PyTorch. To install with the Tensorflow backend:

docker pull jamesdolezal/slideflow:latest-tf
docker run -it --gpus all jamesdolezal/slideflow:latest-tf

To install with the PyTorch backend:

docker pull jamesdolezal/slideflow:latest-torch
docker run -it --shm-size=2g --gpus all jamesdolezal/slideflow:latest-torch

Method 3: From source

To run from source, clone this repository, install the conda development environment, and build a wheel:

git clone https://github.com/jamesdolezal/slideflow
cd slideflow
conda env create -f environment.yml
conda activate slideflow
python setup.py bdist_wheel
pip install dist/slideflow* cupy-cuda11x

⚙️ Configuration

Deep learning (PyTorch vs. Tensorflow)

Slideflow supports both PyTorch and Tensorflow, defaulting to PyTorch if both are available. You can specify the backend to use with the environmental variable SF_BACKEND. For example:

export SF_BACKEND=tensorflow

Slide reading (cuCIM vs. Libvips)

By default, Slideflow reads whole-slide images using cuCIM. Although much faster than other openslide-based frameworks, it supports fewer slide scanner formats. Slideflow also includes a Libvips backend, which adds support for *.scn, *.mrxs, *.ndpi, *.vms, and *.vmu files. You can set the active slide backend with the environmental variable SF_SLIDE_BACKEND:

export SF_SLIDE_BACKEND=libvips

Getting started

Slideflow experiments are organized into Projects, which supervise storage of whole-slide images, extracted tiles, and patient-level annotations. The fastest way to get started is to use one of our preconfigured projects, which will automatically download slides from the Genomic Data Commons:

import slideflow as sf

P = sf.create_project(
    root='/project/destination',
    cfg=sf.project.LungAdenoSquam,
    download=True
)

After the slides have been downloaded and verified, you can skip to Extract tiles from slides.

Alternatively, to create a new custom project, supply the location of patient-level annotations (CSV), slides, and a destination for TFRecords to be saved:

import slideflow as sf
P = sf.create_project(
  '/project/path',
  annotations="/patient/annotations.csv",
  slides="/slides/directory",
  tfrecords="/tfrecords/directory"
)

Ensure that the annotations file has a slide column for each annotation entry with the filename (without extension) of the corresponding slide.

Extract tiles from slides

Next, whole-slide images are segmented into smaller image tiles and saved in *.tfrecords format. Extract tiles from slides at a given magnification (width in microns size) and resolution (width in pixels) using sf.Project.extract_tiles():

P.extract_tiles(
  tile_px=299,  # Tile size, in pixels
  tile_um=302   # Tile size, in microns
)

If slides are on a network drive or a spinning HDD, tile extraction can be accelerated by buffering slides to a SSD or ramdisk:

P.extract_tiles(
  ...,
  buffer="/mnt/ramdisk"
)

Training models

Once tiles are extracted, models can be trained. Start by configuring a set of hyperparameters:

params = sf.ModelParams(
  tile_px=299,
  tile_um=302,
  batch_size=32,
  model='xception',
  learning_rate=0.0001,
  ...
)

Models can then be trained using these parameters. Models can be trained to categorical, multi-categorical, continuous, or time-series outcomes, and the training process is highly configurable. For example, to train models in cross-validation to predict the outcome 'category1' as stored in the project annotations file:

P.train(
  'category1',
  params=params,
  save_predictions=True,
  multi_gpu=True
)

Evaluation, heatmaps, mosaic maps, and more

Slideflow includes a host of additional tools, including model evaluation and prediction, heatmaps, analysis of layer activations, mosaic maps, and more. See our full documentation for more details and tutorials.

📚 Publications

Slideflow has been used by:

Dolezal et al, Modern Pathology, 2020
Rosenberg et al, Journal of Clinical Oncology [abstract], 2020
Howard et al, Nature Communications, 2021
Dolezal et al Nature Communications, 2022
Storozuk et al, Modern Pathology [abstract], 2022
Partin et al Front Med, 2022
Dolezal et al Journal of Clinical Oncology [abstract], 2022
Dolezal et al Mediastinum [abstract], 2022
Howard et al npj Breast Cancer, 2023
Dolezal et al npj Precision Oncology, 2023
Hieromnimon et al [bioRxiv], 2023
Carrillo-Perez et al Cancer Imaging, 2023

🔓 License

This code is made available under the GPLv3 License and is available for non-commercial academic purposes.

🔗 Reference

If you find our work useful for your research, or if you use parts of this code, please consider citing as follows:

Dolezal, J.M., Kochanny, S., Dyer, E. et al. Slideflow: deep learning for digital histopathology with real-time whole-slide visualization. BMC Bioinformatics 25, 134 (2024). https://doi.org/10.1186/s12859-024-05758-x

@Article{Dolezal2024,
    author={Dolezal, James M. and Kochanny, Sara and Dyer, Emma and Ramesh, Siddhi and Srisuwananukorn, Andrew and Sacco, Matteo and Howard, Frederick M. and Li, Anran and Mohan, Prajval and Pearson, Alexander T.},
    title={Slideflow: deep learning for digital histopathology with real-time whole-slide visualization},
    journal={BMC Bioinformatics},
    year={2024},
    month={Mar},
    day={27},
    volume={25},
    number={1},
    pages={134},
    doi={10.1186/s12859-024-05758-x},
    url={https://doi.org/10.1186/s12859-024-05758-x}
}

slideflow's People

Contributors

Stargazers

Watchers

slideflow's Issues

KeyError: 'acc' on end of epoch

Started training fine, but at end of first epoch ran into KeyError. Looks like something with the new exponential moving validation. Thoughts?

logging not currently working (not urgent)

I thought we talked about this at some point, but the log capability is currently broken. doesn't look like anything is being logged to log.log at all. This has not worked in some time, actually, possibly since January/early spring (I think that is the last time stamp I have in the H&N project log).
Given that nothing is being logged, I think it might just be an issue with actually opening the file and writing to it, or else we'd see stuff partially being logged. But i'm not seeing the tile extractions logging either.
we just had a conversation about how you're busy but I'm logging this issue now before I forget (as I have kept forgetting for the past few months). So this is just a heads up for the next time you are working on bugs.

fix early_stop = False

Right now I think you only have the early_stop T/F hyperparameter to be set for early stopping after a set number of epochs, but if I don't want to do the early_stop exp moving average either, how do I turn that off? Can you have early stop HP set that as well? Hanna is training models now and I don't think she'll need that.

Include examples of more complex functions in documentation

The documentation should be updated to 1.4 and include more examples of how the different functions can be used. Perhaps an “examples” page with subheaders for different types of functions (Eg training, activations visualizations, etc).

Adding L2 regularization and other miscellaneous functions

Goals per discussion with Sara

Add options for L2 regularization
Enable filtering of TFRecords during training based on minimum and maximum number of tiles
Function to extract tiles from TFRecords, callable from actions.py

First implementation performed by Sara, in branch sara

Reverse order GPU selection

Have it always select to run on GPU #1 first instead of GPU #0 so we can continue to work on the workstation and not have it be so slow

UMAP parameter customization

Support should be implemented for passing custom parameters to UMAPs, both directly to the TFRecordMap object as well as through the higher-level SlideflowProject.generate_mosaic() function. Parameters to support include:

n_neighbors
min_dist

Docker environment

Less of a python code base issue and more of a project goal - we should make a Docker environment for easier, cleaner, and more rapid deployment.

Improvements to activations calculation

There are several improvements to the calculation of activations that should be implemented, including:

Calculations from model. Current implementation only uses 30-50% of GPU, likely due to the manual batching (~line 289). Keras predict() supports entire datasets at once, so manual batching shouldn't be required, and using automatic backend batching is likely to increase efficiency / GPU utilization.
Memory footprint. Current implementation has very high memory footprint, higher than is expected for the objects in memory. Footprint is reduced when loading from pre-calculated PKL. This should be investigated and improved.
Tile-point calculations. When doing these calculations for "expanded" mosaic maps, tile-point calculations need to be performed on a long array of data. Multiprocessing should be implemented but failed at last attempt, as too many threads were spawned.
Expanded maps. Current "expanded" option forces tile-point calculations for all points across the map, which is a tremendous waste of resources. Instead, calculations should only be done for points in the current grid space and surrounding grids.

evaluation stats generated at early stop?

As seen in the screenshot above, early stop triggers here, a checkpoint is saved, and we go onto the next k-fold. However, it doesn't look like validation statistics are created/results generated at this step. That might be totally normal, but would it be useful to save a model at that point? The most recent results/model save was from epoch 1. But let's say we just want the best model before things become redundant > should we be saving model/results at early stop end?

logits before softmax

Which do you consider as the logits layer when you get the raw predictions before softmax? I just copy pasted the last several tensors.

name: import/xception_2/global_max_pooling2d/Max/reduction_indices
values: (<tf.Tensor 'import/xception_2/global_max_pooling2d/Max/reduction_indices:0' shape=(2,) dtype=int32>,)

name: import/xception_2/global_max_pooling2d/Max
values: (<tf.Tensor 'import/xception_2/global_max_pooling2d/Max:0' shape=(?, 2048) dtype=float32>,)

name: import/dense_3/kernel
values: (<tf.Tensor 'import/dense_3/kernel:0' shape=(2048, 500) dtype=float32>,)

name: import/dense_3/bias
values: (<tf.Tensor 'import/dense_3/bias:0' shape=(500,) dtype=float32>,)

name: import/dense_3/MatMul/ReadVariableOp
values: (<tf.Tensor 'import/dense_3/MatMul/ReadVariableOp:0' shape=(2048, 500) dtype=float32>,)

name: import/dense_3/MatMul
values: (<tf.Tensor 'import/dense_3/MatMul:0' shape=(?, 500) dtype=float32>,)

name: import/dense_3/BiasAdd/ReadVariableOp
values: (<tf.Tensor 'import/dense_3/BiasAdd/ReadVariableOp:0' shape=(500,) dtype=float32>,)

name: import/dense_3/BiasAdd
values: (<tf.Tensor 'import/dense_3/BiasAdd:0' shape=(?, 500) dtype=float32>,)

name: import/dense_3/Relu
values: (<tf.Tensor 'import/dense_3/Relu:0' shape=(?, 500) dtype=float32>,)

name: import/dense_1_2/kernel
values: (<tf.Tensor 'import/dense_1_2/kernel:0' shape=(500, 2) dtype=float32>,)

name: import/dense_1_2/bias
values: (<tf.Tensor 'import/dense_1_2/bias:0' shape=(2,) dtype=float32>,)

name: import/dense_1_2/MatMul/ReadVariableOp
values: (<tf.Tensor 'import/dense_1_2/MatMul/ReadVariableOp:0' shape=(500, 2)
dtype=float32>,)

name: import/dense_1_2/MatMul
values: (<tf.Tensor 'import/dense_1_2/MatMul:0' shape=(?, 2) dtype=float32>,)

name: import/dense_1_2/BiasAdd/ReadVariableOp
values: (<tf.Tensor 'import/dense_1_2/BiasAdd/ReadVariableOp:0' shape=(2,) dtype=float32>,)

name: import/dense_1_2/BiasAdd
values: (<tf.Tensor 'import/dense_1_2/BiasAdd:0' shape=(?, 2) dtype=float32>,)

name: import/dense_1_2/Softmax
values: (<tf.Tensor 'import/dense_1_2/Softmax:0' shape=(?, 2) dtype=float32>,)

Thanks!

Precision-recall curves

Precision-recall curves should be implemented in statistics package as part of routine model evaluation, as many categories are sparse and typical ROCs are not well-optimized for sparse data.

This should also include additional helpful statistics including average precision, F1 score, and cohen kappa.

export flag for summarize.py

"so your summarize.py script obviously summarizes all of the results very prettily across all of the nested folders in the PROJECTS folder. I would like to play around with the results for the projects myself, could you potentially export all of the model folders and the csvs/jsons/manifest from the model folders into one folder for me? i don’t need any of the actual models or the jpegs/pngs i just need the raw results for each model.
maybe like an export flag for the summarize.py script and then instead of displaying it just sends the results to a specific directory"

Thanks! :)

Segmentation fault during tile extraction

When tile extraction is performed with a high thread count, segmentation faults are common. Traceback highlights the fault as originating in slide/__init__.py, line 729:

region = region.resize(float(self.size_px) / self.extract_px)

Tracing the call through a debugger identifies vips/voperation.py, line 290 as the culprit:

vop = vips_lib.vips_cache_operation_build(op.pointer)

Which gives the following backtrace:

#1  0x00007fc9b55e3859 in __GI_abort () at abort.c:79
#2  0x00007fc9b564e3ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fc9b5778285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007fc9b565647c in malloc_printerr (str=str@entry=0x7fc9b577a670 "double free or corruption (out)") at malloc.c:5347
#4  0x00007fc9b5658120 in _int_free (av=0x7fc9b57a9b80 <main_arena>, p=0x7fc85c057fb0, have_lock=<optimized out>) at malloc.c:4314
#5  0x00007fc87cf056c5 in vips_cache_trim () at cache.c:775
#6  0x00007fc87cf05f44 in vips_cache_operation_add (operation=<optimized out>) at cache.c:879
#7  0x00007fc87cf06080 in vips_cache_operation_buildp (operation=0x7fc8f2ffb460) at cache.c:921
#8  vips_cache_operation_buildp (operation=operation@entry=0x7fc8f2ffb460) at cache.c:894
#9  0x00007fc87cf0c44d in vips_call_required_optional (operation=operation@entry=0x7fc8f2ffb460, required=required@entry=0x7fc8f2ffb490, optional=optional@entry=0x7fc8f2ffb570)
    at operation.c:880
#10 0x00007fc87cf0cb77 in vips_call_by_name
    (operation_name=<optimized out>, option_string=option_string@entry=0x0, required=required@entry=0x7fc8f2ffb490, optional=optional@entry=0x7fc8f2ffb570) at operation.c:920
#11 0x00007fc87cf0cfe0 in vips_call_split (operation_name=operation_name@entry=0x7fc87cf41e0c "shrinkh", optional=optional@entry=0x7fc8f2ffb570) at operation.c:1024
#12 0x00007fc87cdddbcb in vips_shrinkh (in=in@entry=0x7fc84c3f54d0, out=out@entry=0x7fc84c3f3898, hshrink=hshrink@entry=2) at shrinkh.c:367
#13 0x00007fc87cddaa2a in vips_resize_build (object=0x7fc79828e2d0) at resize.c:195
#14 0x00007fc87cef9c8d in vips_object_build (object=0x7fc79828e2d0) at object.c:367
#15 0x00007fc87cf06070 in vips_cache_operation_buildp (operation=0x7fc8f2ffb718) at cache.c:918
#16 vips_cache_operation_buildp (operation=0x7fc8f2ffb718) at cache.c:894
#17 0x00007fc87cf060c0 in vips_cache_operation_build (operation=<optimized out>) at cache.c:948
#18 0x00007fc882952dec in ffi_call_unix64 () at /home/shawarma/venv/lib/python3.8/site-packages/cffi.libs/libffi-806b1a9d.so.6.0.4
#19 0x00007fc882951f55 in ffi_call () at /home/shawarma/venv/lib/python3.8/site-packages/cffi.libs/libffi-806b1a9d.so.6.0.4
#20 0x00007fc882b74a96 in cdata_call (cd=0x7fc95722e330, args=<optimized out>, kwds=<optimized out>) at c/_cffi_backend.c:3153

The issue appears to be that the vips resize() method is somehow thread unsafe. I think the libvips operation cache may be the core issue.

I have confirmed that disabling the resize() call in line 729 of slide/__init__.py prevents the segmentation faults.

Things to try include:

Disabling the libvips operation cache with pyvips.cache_set_max(0) (per libvips/pyvips#41)
Using vips thumbnail instead of resize

Single slide per patient

Current implementation of the pipeline only utilizes one slide per patient. While the current model training / evaluation structure can handle multiple slides per patient, the tile extraction phase ignores multiple slides per patient.

AttributeError: 'list' object has no attribute 'lookup' - categorical models in most recent master commit

I tried running a LUAD categorical model and got the following error:
"AttributeError: 'list' object has no attribute 'lookup'
Looks like an issue with the annotations file. I rolled back to previous commit 166bd12 and it worked then.

Traceback:

SFP.extract_tiles_from_tfrecords getting "module 'slideflow.io' has no attribute 'tfrecords'" error

trying to run this function to extract tiles from tfrecords but running into no module error. even though tfrecords.py is clearly in the slideflow.io folder and should function as a module. very strange and I have never run into this error from what I can remember.

Saving calculations during mosaic map & final layer activations

Currently, each mosaic map generation re-calculates final layer activations even if these have been calculated previously. In order to save time, calculated final layer activations should be saved to disk and re-used if the model and tfrecords list are the same.

Proposal: record model & tfrecord list into a JSON file and associate this combination with a filename for the saved final layer activations (CSV). During subsequent calls, this file should be loaded directly instead of re-calculated

Automatic project backups

Lower priority.

Have automatic backups for projects. Like a function to run to backup our results, with or without the models (because they are so big).
We have the lab share available for a reason, which is to serve as a backup for our important results and models we do not want to have to redo.
I don’t know if this would be using CPU power or just transferring to the labshare with read/write speeds separate from the pipeline running… that is, can we have the backups being written while the GPUs are running the models? We can do them simultaneously then.

An issue with this is that the lab share write speed is VERY slow right now. Only 4MB/s, so we'd want to have the option of only saving the bare minimum data we'd need.

Might be good to have an option to delete models that did not work.

Questions on function generate_heatmaps

Hi James,

This is Shenglai Li from CTDS (Dr. Grossman's group). We are trying to take over the HistoXai project from Justina.

I believe Justina was working on an extremely old version of slideflow, which I just pushed it as justina-dev.

Recently, I was able to run her code with the legacy version of slideflow to generate the final concept visualization report.

Right now, I am trying to update the slideflow package to the latest stable version (v1.10.1) as part of the improvements to the project.

However, I encountered a problem when I tried to generate the report. The legacy code ran the function called generate_concept_viz but the latest slideflow had the function called generate_heatmaps. The two functions have pretty different inputs, especially seg_to_concepts and n_segments_show are missing from the generate_heatmaps.

I'm wondering if now these two inputs should be presented in filters, or if you could find the change log you made that I could check. I want to learn more about how we used to run it with generate_concept_viz and now generate_heatmaps so I could adjust the input.

I was trying to run the code as much as it could, so I simply adjust the inputs for generate_heatmaps by removing the old ones, and making something like this

SFP.generate_heatmaps("/path/to/trained_model.h5", filters={"slide": [slide]}, directory=discover_dir, resolution="low", model_format="slideflow.model.MODEL_FORMAT_LEGACY")

Another issue occurred

 ! [WARN] Unable to find TFRecords in the directory /mnt/gpucluster/hnsc/data/UCH_HNSC_HPV/tfrecord/299px_302um
Spawning heatmaps process (PID: 18015)
Generating heatmap... |--------------------| 0.0% (ETA: ?)
Generating heatmap... |████████████████████| 100.0% 15.5 tiles/sec (ETA: 00:00:00)
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/gpucluster/packages/slideflow/source/slideflow/__init__.py", line 184, in _heatmap_generator
    model_format=model_format)
  File "/mnt/gpucluster/packages/slideflow/source/slideflow/activations.py", line 1321, in __init__
    postconv, logits = self.model.predict(batch_images)
TypeError: cannot unpack non-iterable NoneType object

I'm not sure if it's related to that the filters was set wrong, but when I check the batch_images above, it showed

<PrefetchDataset shapes: (None, 299, 299, 3), types: tf.float32>

If it's not related to generate_heatmaps, then I'm wondering if it means I would need to go back and get new activation/clustering stuff with the requirements from new slideflow so that they can be compatible.

Sorry if it doesn't make a lot sense to you, but please let me know if you have any questions on my side so I could help debug and bring her code alive.

Thanks a ton!

Shenglai

Update Documentation for v1.3

Documentation (jmd172.bitbucket.io) needs to be updated to reflect changes in versions 1.1 - 1.3. Documentation should also include steps for updating between versions, as well as a description of the "dataset" configuration for organizing slides, ROIs, TFRecords, and extracted tiles.

README should be updated to include instructions on how to add a dataset.

Separate module for final layer weights generation

So we are looking at all of these various ways of doing explainable AI, right? We may want to do these things on the fly, but not want to have to wait for the results to be generated again like we do for heatmaps and mosaic maps right now. Also, both of those currently generate final layer weights, which is redundant, so we were thinking that final layer weight generation be its own method we could call on the Slideflow project or model, separate from heatmaps & mosaic maps.
This is related to the first issue below about saving final layer weights during model training.

Remove global variables

Remove all non-static global variables, as multiprocessing will reset these variables and relying on them could be dangerous. Many of these global variables should be moved into a class and be instanced.

Examples:

sf.slide
- SKIP_MISSING_ROI
sf.trainer.model
- TEST_MODE
sf.util
- PROJECT_DIR
- ANNOTATIONS
- log (keeping as global)
- LOGGING_LEVEL.SILENT
sf
- GPU_LOCK
- USE_COMET
- SKIP_VERIFICATION
- EVAL_BATCH_SIZE

Tiles as individual tfrecord strings?

We were discussing improving the tiling from tfrecords and you commented how it took time to go through all of tfrecords sequentially to find the tile that we need. Is there any benefit to having individual tiles as tfrecords themselves? I don't know anything about the tfrecord format though so this might just not work.

Tile Extraction Stopping due to KeyError

Tile extraction will randomly stop due to a KeyError. This will leave a portion of slides with tiles extracted.
Best way would be to fix error, but easy fix is to just reconcile which slides should have tiles extracted and don't yet (use ROI list) and which were already done, and only do the incomplete ones.

Tile extraction hangs / incomplete if no tiles can be extracted from slide

*.unfinished file left intact
Empty tfrecord remains

Should be treated as corrupt or otherwise unavailable slide (no tfrecord)

Implement Cox proportional hazards outcome

As per https://github.com/mexchy1000/DeepSurv_Keras/blob/master/Survival_Keras_lifelineExample.py

heatmap generation folders

When generating heatmaps for one dataset (like HNSC), the heatmaps are saved in the "heatmaps" folder in the dataset project page, by slide name & integer outcome value. However, when generating heatmaps for multiple models for the same slide (returning multiple heatmaps), the subsequent heatmap will overwrite the prior heatmap. When generating heatmaps, it is possible to create a subfolder for the specific model within the general heatmaps folder? All you would need to do is either create a new folder using the "model_name" variable, which we must provide in the actions.py file (alternatively we can provide the full path, which I have been doing) and then just drop the heatmaps into there.

also related, each time a heatmap is generated a final layer weights csv is also created. Can you make that an optional argument True/False to generate with the generate_heatmaps function?

Multiple slide/TFRecord sources

Currently, projects use a single directory for slide and TFRecord sources. This requires unnecessary duplication of data for projects that will use shared sources (e.g. projects using slides from other cohorts). Projects using slides from multiple cohorts should be able to use slide/TFRecord data from multiple directories.

Example:

LUSC cohort has 600 slides, stored in ~/slides/LUSC. The extracted TFRecords are stored in ~/tfrecords/LUSC.
HNSC cohort has 400 slides, stored in ~/slides/HNSC. Extracted TFRecords are stored in ~/tfrecords/HNSC.
A project using slides from both HNSC and LUSC should be able to use the TFRecords stored in both ~/tfrecords/HNSC and ~/tfrecords/LUSC

Challenges:

Validation plans (k-folds) are currently stored in the TFRecord folders. If multiple folders are used as input, these validation plans should be moved to the project folder.
Tile extraction could not be coordinated across multiple dataset sources within a combined project without linking slides and TFRecords into a dataset object (a separate project)

Restart models from checkpoint to finish epochs and and k-fold validation

Right now we are able to use the checkpoint feature to reload model and weights from a checkpoint. However when it does this it will restart from epoch 1 again and do the full set of epochs specified in the batch_train file. It doesn't look like the model will restart from that same epoch and continue finishing the k-folds. So if we're in the middle of k-fold 2 it is not straightforward to restart from there.

Possible issue with AUCs at slide/patient level?

Hey James, this is related to what I was just texting you about. Hanna described the issue as follows:

"Hey, I think there may be an issue with the pipeline/my actions file. All of the discrete models I have run have an AUC of .5 and are assigning all patients to false. I tried re-runing for PI3K pathway activation, which previously had an AUC of .7, but I got the same issues."

And a second update:
"So I looked at the other k folds for PI3K pathway activation and the AUC for the tile level is actually still .7 but still all of the predictions are negative except 2 tiles. When I look at the results from the run back in October the AUC at the tile level is similar but slide/patient level are much better (~.8 in Oct, .5 now). I also double checked and the annotations are the same."

The early stop method was set to "accuracy" (I wonder if that had something to do with it, I am going to rerun with method as "loss" and see what we get).
I took a look at the results log and indeed the averages were all around 0.5 but training acc was in the 0.8s
This is TCGA_HNSC_598px_604um, model rna_PIK3activated_HPSweep0_kfold1

Do you have any thoughts?

De-identify slide images

We need to de-identify our slide images. Most of them are scanned with the slide part as the main image and the identifying label included as part of the meta-data, as a .jpeg thumbnail type thing. We need to remove those in order to properly de-identify our slides as they can contain patient identifying information. This will be required as part of the IRB.

There will be multiple different file types that need to be accounted for (.svs, .tiff, .mrxs). It may be that someone has already written this code for Openslide and it just needs to be found.

This does not matter for TCGA, only institutional images.

All results summarized in one report

Want all of our results in one report, along with the best summaries we can produce for the model.
For example: ROCs for k-folds overlaid, cleaned up prettier & more interpretable (like with the category values re-assigned as labels instead of being 0 and 1).
All results summarized in one report, for example. Confusion matrices included, along with other relevant summary statistics (precision, recall), we can define these in the .evaluation module.
We could send these to comet.ml and save them there as they allow HTML reports to be tracked.

Auto HP optimizations

Automatic hyperparameter optimizations should be implemented, as per https://github.com/autonomio/talos

Finish development of JPG slide compatibility

On branch dev-overflow, refinement and bug fixing of JPG slide compatibility has started. We are moving from imageio to PIL Image for consistency. X/Y coordinate systems are slightly different, so region extraction will need to be tested.

Mixed precision

Implement mixed precision as described in https://www.tensorflow.org/guide/keras/mixed_precision

"Unable to use multiple outcome variables with categorical model type"

I tried running a prostate model (TFRecords were created properly & had a manifest), I double-checked the settings files and all and things seemed in order but it's saying that multiple outcome variables with categorical model type aren't allowed. In the actions file, I definitely only had one outcome variable assigned and it looks like it correctly recognized that, per the successful output saying "for each of 1 outcome variable". In your new documentation it says default for multi-outcome should be False but I'm wondering if there's an error somewhere with that.

UMAP accuracy heatmap

On UMAP, different areas will have different accuracies. As we move across the UMAP, we'll want to make a heatmap of those accuracies using Euclidean distance between the prediction and the actual.

Slide scanner file support (MRXS)

Include support for all openslide compatible slides, including MRXS, NDPI, and more: https://openslide.org/api/python/

Stain normalization failure when installed from wheel

When slideflow is installed from a wheel, stain normalization fails with the following traceback:

  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/slideflow/__init__.py", line 293, in _trainer
    num_slide_input=num_slide_input)
  File "/usr/local/lib/python3.7/dist-packages/slideflow/model.py", line 342, in __init__
    self.normalizer = None if not normalizer else StainNormalizer(method=normalizer, source=normalizer_source)
  File "/usr/local/lib/python3.7/dist-packages/slideflow/util/__init__.py", line 82, in __init__
    self.n.fit(cv2.imread(source))
  File "/usr/local/lib/python3.7/dist-packages/slideflow/slide/stainNorm_Reinhard.py", line 75, in fit
    target = ut.standardize_brightness(target)
  File "/usr/local/lib/python3.7/dist-packages/slideflow/slide/stain_utils.py", line 123, in standardize_brightness
    p = np.percentile(I, 90)
  File "<__array_function__ internals>", line 6, in percentile
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 3733, in percentile
    a, q, axis, out, overwrite_input, interpolation, keepdims)
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 3853, in _quantile_unchecked
    interpolation=interpolation)
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 3429, in _ureduce
    r = func(a, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 3967, in _quantile_ureduce_func
    x1 = take(ap, indices_below, axis=axis) * weights_below
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'```

Multiprocessing / multithreading during tile-point calculation (mosaic maps)

Multiprocessing or multithreading should be implemented. This step currently takes up to 20 minutes and uses only a single core.

Poor validation performance (GPU underutilization)

During final performance assessment during training (sf.statistics.generate_performance_metrics()), GPU utilization is around 50% during generation of predictions. Alleviation of the bottleneck will help speed up training iterations.

unhashable list when training models with two datasets

I tried running a LUAD LUSC model on binary 1 year survival on Friday and I was met with this bug (see below). I double-checked everything but nothing changed and I saw no errors. Any ideas? I am almost positive that I've seen this error somewhere before.

Automate the pipeline

The queue with actions.py files functions to automatically run multiple projects at once. However, it requires a decent bit of leg work to prepare everything, and then it has been running into some issues. There are a few different aspects to this.
Ideally, I image that we may have multiple annotations files that will contain columns of variables that will either be outcome variables or variables we use to subset our patients. I imagine that we will want to be able to loop through and train models for multiple outcome vars in these annotation files.
So issues:

Currently, for each different data subset or outcome variable, we need to have a separate action.py file and batch_train file. We can't have separate batch train files, and so when we have many models it becomes super confusing. The folder gets very crowded with files. It sucks. Is there a way we can combine actions.py and batch train? Shailesh suggested perhaps we could autowrite the actions.py files. That way we would not need to do away with them entirely, but rather make it easier to generate them and prevent typos.
We can only have one annotation file per project right now, since we can only list one annotation file in the settings. However, we want to schedule projects with outcomes from multiple annotations files.
Then, for each annotation file, preferably, we'd want to be able to loop through multiple outcome variables and train models for each question that annotations file contains data for.

Like I mentioned, I would love just an Excel file that I can enter everything into, then run a verification check for typos or initialization errors, then it just goes off and does its thing.

Too many files open

When training with validate_on_batch > 0, a “too many files open” OSError is often eventually triggered. This is probably due to unclosed TFRecords files after evaluate() is called.

One possible solution may be to call the evaluate() function in a separate thread, so all associated open files are closed once the evaluation is complete.

Train() with a single HP combination

Create a train() function in SlideflowProject to allow for training with a single hyperparameter object (slideflow.model.HyperParameters), rather than loading a batch training CSV file.

This will allow direct calling of a training function and allow for direct return of raw Keras training results, which will be used when another class or function is supervising training coordination and expects raw Keras training results (e.g. CANDLE's run() function).

Terminology fixes

Ensure final layer "weights" are renamed to "activations" thoughout the code

Publish documentation with a more official domain name (slideflow.io)

Rather than leaving the documentation on a personal repository (jmd172.bitbucket.io), the official documentation should be moved to a more official project-specific domain (e.g. slideflow.io or slideflow.dev)

'percent_tiles_positive0' not in list - summarize.py error

when running summarize.py, this is a linear model but something wacky is going on with the outcomes.
you may have already fixed this, but I have not updated to latest nightly yet.

SARC
mutation_load_stand
Subset 0 (3-fold cross-validation): 425 slides
Filters: None
Outcomes: {'0': 'm', '1': 'u', '2': 't', '3': 'a', '4': 't', '5': 'i', '6': 'o', '7': 'n', '8': '', '9': 'l', '10': 'o', '11': 'a', '12': 'd', '13': '', '14': 's', '15': 't', '16': 'a',
'17': 'n', '18': 'd'}
Traceback (most recent call last):
File "summarize.py", line 605, in
load_from_directory(args.dir, args.nested, args.since, args.names)
File "summarize.py", line 585, in load_from_directory
dataset.print_summary(grouped=True, show_model_names=shownames)
File "summarize.py", line 63, in print_summary
outcome.print_summary(grouped=grouped, show_model_names=show_model_names)
File "summarize.py", line 110, in print_summary
subset.print_summary(grouped=grouped, show_model_names=show_model_names)
File "summarize.py", line 200, in print_summary
group.gen_combined_roc()
File "summarize.py", line 353, in gen_combined_roc
pred = model.get_predictions(epochs[0])
File "summarize.py", line 486, in get_predictions
predictions = sfstats.read_predictions(predictions, level)
File "/lambda_stor/homes/skochanny/slideflow/source/slideflow/util/statistics.py", line 118, in read_predictions
ypi = header.index(f'{y_pred_label}{label}')
ValueError: 'percent_tiles_positive0' is not in list

Ensure that balance_by_category is not possible with linear outcomes

This needs to be enforced