dana-farber-aios / pathml

Tools for computational pathology

License: GNU General Public License v2.0

Python 99.49% Dockerfile 0.51%

machine-learning digital-pathology computational-pathology biomedical-image-processing pathology histopathology spatial-transcriptomics image-analysis microscopy fluorescence-microscopy-imaging

pathml's Introduction

🤖🔬 PathML: Tools for computational pathology

⭐ PathML's objective is to lower the barrier to entry to digital pathology

Imaging datasets in cancer research are growing exponentially in both quantity and information density. These massive datasets may enable derivation of insights for cancer research and clinical care, but only if researchers are equipped with the tools to leverage advanced computational analysis approaches such as machine learning and artificial intelligence. In this work, we highlight three themes to guide development of such computational tools: scalability, standardization, and ease of use. We then apply these principles to develop PathML, a general-purpose research toolkit for computational pathology. We describe the design of the PathML framework and demonstrate applications in diverse use cases.

🚀 The fastest way to get started?

docker pull pathml/pathml && docker run -it -p 8888:8888 pathml/pathml

done, what analyses can I write now? 👉

This AI will:

🤖 write digital pathology analyses for you
🔬 walk you through the code, step-by-step
🎓 be your teacher, as you embark on your digital pathology journey ❤️

More usage examples here.

📖 Official PathML Documentation

View the official PathML Documentation on readthedocs

🔥 Examples! Examples! Examples!

↴ Jump to the gallery of examples below

1. Installation

There are several ways to install PathML:

1. pip install from PyPI (recommended for users)
2. Clone repo to local machine and install from source (recommended for developers/contributors)
3. Use the PathML Docker container
4. Install in Google Colab

Options (1), (2), and (4) require that you first install all external dependencies:

openslide
JDK 8

We recommend using conda for environment management. Download Miniconda here

1.1 Installation option 1: pip install

Create a conda environment (this step is common to all platforms: Linux, Mac, Windows):

conda create --name pathml python=3.8
conda activate pathml

Install external dependencies (for Linux) with Apt:

sudo apt-get install openslide-tools g++ gcc libblas-dev liblapack-dev

Install external dependencies (for MacOS) with Brew:

brew install openslide

Install external dependencies (for Windows) with vcpkg:

vcpkg install openslide

Install OpenJDK 8 (this step is common to all platforms: Linux, Mac, Windows):

conda install openjdk==8.0.152

Optionally install CUDA (instructions here)

Install PathML from PyPI:

pip install pathml

1.2 Installation option 2: clone repo and install from source

Clone repo:

git clone https://github.com/Dana-Farber-AIOS/pathml.git
cd pathml

Create conda environment:

conda env create -f environment.yml
conda activate pathml

Optionally install CUDA (instructions here)

Install PathML from source:

pip install -e .

1.3 Installation option 3: Docker

First, download or build the PathML Docker container:

Step 1: download PathML container from Docker Hub
```
docker pull pathml/pathml:latest
```
Optionally specify a tag for a particular version, e.g. docker pull pathml/pathml:2.0.2. To view possible tags, please refer to the PathML DockerHub page.

Alternative Step 1 if you have custom hardware: build docker container from source

git clone https://github.com/Dana-Farber-AIOS/pathml.git
cd pathml
docker build -t pathml/pathml .

Step 2: Then connect to the container:

docker run -it -p 8888:8888 pathml/pathml

The above command runs the container, which is configured to spin up a jupyter lab session and expose it on port 8888. The terminal should display a URL to the jupyter lab session starting with http://127.0.0.1:8888/lab?token=<.....>. Navigate to that page and you should connect to the jupyter lab session running on the container with the pathml environment fully configured. If a password is requested, copy the string of characters following the token= in the url.

Note that the docker container requires extra configurations to use with GPU.
Note that these instructions assume that there are no other processes using port 8888.

Please refer to the Docker run documentation for further instructions on accessing the container, e.g. for mounting volumes to access files on a local machine from within the container.

1.4 Installation option 4: Google Colab

To get PathML running in a Colab environment:

import os
!pip install openslide-python
!apt-get install openslide-tools
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version
!pip install pathml

PathML Tutorials we published in Google Colab

PathML Tutorial Colab #1 - Load an SVS image in PathML and see the image descriptors
Now that you have PathML installed, all our other examples should work too. Just make sure you select an appropriately sized backend or VM in Colab (i.e., RAM, CPU, disk, and GPU if necessary).

Thanks to all of our open-source collaborators for helping maintain these installation instructions!
Please open an issue for any bugs or other problems during the installation process.

1.5 CUDA (optional)

To use GPU acceleration for model training or other tasks, you must install CUDA. This guide should work, but for the most up-to-date instructions, refer to the official PyTorch installation instructions.

Check the version of CUDA:

nvidia-smi

Install correct version of cudatoolkit:

# update this command with your CUDA version number
conda install cudatoolkit=11.0

After installing PyTorch, optionally verify successful PyTorch installation with CUDA support:

python -c "import torch; print(torch.cuda.is_available())"

2. Using with Jupyter (optional)

Jupyter notebooks are a convenient way to work interactively. To use PathML in Jupyter notebooks:

2.1 Set JAVA_HOME environment variable

PathML relies on Java to enable support for reading a wide range of file formats. Before using PathML in Jupyter, you may need to manually set the JAVA_HOME environment variable specifying the path to Java. To do so:

Get the path to Java by running echo $JAVA_HOME in the terminal in your pathml conda environment (outside of Jupyter)

Set that path as the JAVA_HOME environment variable in Jupyter:

import os
os.environ["JAVA_HOME"] = "/opt/conda/envs/pathml" # change path as needed

2.2 Register environment as an IPython kernel

conda activate pathml
conda install ipykernel
python -m ipykernel install --user --name=pathml

This makes the pathml environment available as a kernel in jupyter lab or notebook.

3. Examples

Now that you are all set with PathML installation, let's get started with some analyses you can easily replicate:

4. Citing & known uses

If you use PathML please cite:

J. Rosenthal et al., "Building tools for machine learning and artificial intelligence in cancer research: best practices and a case study with the PathML toolkit for computational pathology." Molecular Cancer Research, 2022.

So far, PathML has been referenced in 20+ manuscripts:

5. Users

This is where in the world our most enthusiastic supporters are located:

and this is where they work:

Source: https://ossinsight.io/analyze/Dana-Farber-AIOS/pathml#people

6. Contributing

PathML is an open source project. Consider contributing to benefit the entire community!

There are many ways to contribute to PathML, including:

Submitting bug reports
Submitting feature requests
Writing documentation and examples
Fixing bugs
Writing code for new features
Sharing workflows
Sharing trained model parameters
Sharing PathML with colleagues, students, etc.

See contributing for more details.

7. License

The GNU GPL v2 version of PathML is made available via Open Source licensing. The user is free to use, modify, and distribute under the terms of the GNU General Public License version 2.

Commercial license options are also available.

8. Contact

Questions? Comments? Suggestions? Get in touch!

[email protected]

pathml's People

Contributors

Stargazers

Watchers

Forkers

mohamedomar2020 collinarnett jiesun1990 antoniofaneite srinidhipy dana-farber lindvalllab re73 grenkoca surya-narayanan dnbaker iphyer al3n70rn visionpathology nauyan curlup aditya964 mubashermohammed akhil4rajan nhatipoglu mehbob ghislainadon derekkaknes wiherewini hugging-face-supporter dfci-codex-group irtyamine msk-mph zolkko yu-anchen raghavendrasri kevingalacha kanedev m081429 lxc-dolphin astorfi dmbrundage sancakozdemir doc-r2 mike575 rect-war sreekarreddydfci beegass musc-pathology-informatics histopathology jnirschl ydeh22 paulscemama eng-rsmy venkatapathy yirenheihei tddough98 gmnamra ckv1110 daniya-sohail26 xellnaga abdulkarimab akihikoueda jackzhousz drsei shbrief ge-yl priyanshumahey zaloch imlxw jamesgwen chiaracorti geeks-sid shitoudidi sgoggins xiachenrui onerai fantashi099 shatadg krejiba rimanb varunullanat cowmonkeybrain hungvo304ml

pathml's Issues

Hosting Model Weights

We want to be able to share pre-trained models. The trained model weights can be saved to disk, e.g. in .pth files for PyTorch. However, these files can be quite big, too big to put in the GitHub repo itself.

We need to find a solution for hosting these large files of model parameters.
E.g. we could have a GCP bucket, or S3 bucket.
Need to evaluate the costs of different options.

Transition to Google-/numpy-style docstrings

Docstrings are currently written in basic Sphinx format. However, basic Sphinx doesn't support a References section so I had to start using the Napoleon extension. Since we are already using Napoleon, we may as well stick with Google or numpy docstring format moving forward, since it is more readable for humans.

Improve example notebooks

The example notebooks should be more comprehensive and better organized. For many people, the example notebooks will be their first experience with PathML, so we want them to be really good.

Some ideas:

Update the example pipeline notebooks to use chunk processing (much more efficient than the current basic example notebook which loads the whole slide into memory)
Create a notebook that shows examples of every transform
Create a notebook showing how to use DataModules with Pytorch

Making docs in Linux fails without additional dependencies

Provided instructions:
conda install sphinx # install sphinx package for generating docs
cd docs # enter docs directory
make html # build docs in html format

fail in Linux (tested Linux Mint 19.2). Additionally required:
pip install nbsphinx
pip install nbsphinx_link
pip install sphinx_rtd_theme
pandoc https://pandoc.org/installing.html

Add test for multiprocessing

Pipeline.run() uses concurrent.futures.ProcessPoolExecutor to implement multiprocessing when run on a dataset.
I'm not sure how to write a test for this though. Everything I've tried so far has caused Pytest to hang.

Publish package on PyPI

This will allow people to pip install pathml

Review tutorial here: https://packaging.python.org/tutorials/packaging-projects/
Make sure that README is correctly formatted: https://packaging.python.org/guides/making-a-pypi-friendly-readme/
Set up a github actions workflow to automatically prepare package and upload to PyPI: https://packaging.python.org/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/

We should look into using release tags: https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-releases

create class Mask

wrap dict of masks
each mask stores pixel-wise int8
repr method (keys, dimensions)
getitem method
len method

by default should be multiparametric single plane
subclass volumetric

Pannuke masks and images don't match

The masks in PanNuke dataloaders don't match the images.
This is obviously a big problem for training models...

import matplotlib.pyplot as plt
import numpy as np
from pathml.datasets.pannuke import PanNukeDataModule

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/", 
    download=False,
    nucleus_type_labels=True, 
    batch_size=8, 
    hovernet_preprocess=True,
    split=1
)

train = pannuke.train_dataloader

images, masks, hvs, types = next(iter(train))

fig, ax = plt.subplots(nrows=1, ncols=2)
im = np.moveaxis(images[0, ...].numpy(), 0, 2)
ax[0].imshow(im)
mask = masks.argmax(dim=1)[0, ...]
ax[1].imshow(mask)
plt.show()

I think this may be happening because the lists of filepaths for masks and images are created separately using pathlib.Path.glob(), but glob is unordered.
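
If unordered globbing is indeed the cause, one possible fix is to sort both lists and check that they pair up by filename. The sketch below is illustrative only: `paired_paths`, the directory layout, and the `*.png` pattern are assumptions, not the actual PanNuke loader code.

```python
from pathlib import Path


def paired_paths(image_dir, mask_dir, pattern="*.png"):
    """Return (image, mask) path pairs matched by filename.

    Hypothetical helper: assumes each image has a mask with the same filename.
    """
    # Path.glob() makes no ordering guarantee, so sort both lists explicitly
    images = sorted(Path(image_dir).glob(pattern))
    masks = sorted(Path(mask_dir).glob(pattern))
    # fail loudly instead of silently pairing the wrong files
    if [p.name for p in images] != [p.name for p in masks]:
        raise ValueError("image and mask filenames do not match")
    return list(zip(images, masks))
```

Checking the filenames as well as sorting means a missing or extra file raises an error rather than shifting every later pair by one.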

Set up automated testing

Use GitHub actions to automatically run tests when code is pushed

See:

Fold and out-of-focus detection

Add support for tile-level

fold detection
out of focus detection

wsi.py: pad and black edge issue

When the stride is small, the last few tiles at the edge of a slide would be smaller than (tile size, tile size). OpenSlide zero-pads them automatically, so with pad=False the output includes undesired tiles with black edges. The number of tiles is now recalculated to solve this issue.

Haven't got a chance to use this specific svs data yet. Randomly picked tile size and a small stride for now. Will double check this.

Pseudocode:

example_image_path = "../data/CMU-1.svs"

class MySlideLoader(BaseSlideLoader):
    def apply(self, path):
        return HESlide(path).chunks(level=0, size=1024, stride=400)

data = MySlideLoader().apply(example_image_path)
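
The recalculation described in this issue can be sketched as follows. With padding disabled, only tiles that fit entirely inside the slide are kept, so along each axis the count is floor((dim - size) / stride) + 1. `n_full_tiles` is a hypothetical helper for illustration, not the function in wsi.py.

```python
def n_full_tiles(dim, size, stride):
    """Number of tiles of length `size` that fit fully in an axis of length `dim`."""
    if dim < size:
        # the axis is shorter than a single tile
        return 0
    # the last valid origin is the largest multiple of stride <= dim - size
    return (dim - size) // stride + 1


# a 1000-pixel axis, 256-pixel tiles, stride 100: origins 0, 100, ..., 700
print(n_full_tiles(1000, 256, 100))  # 8
```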

Tile repr wrong when i or j = 0

Tile repr incorrectly shows "i=None" or "j=None" when i or j = 0.
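
A common cause of this symptom is a truthiness test, which treats 0 the same as None. Whether that is the actual cause here is an assumption; the `Tile` class below is a stripped-down stand-in, not the PathML class.

```python
class Tile:
    """Minimal stand-in for a tile with optional grid indices i and j."""

    def __init__(self, i=None, j=None):
        self.i = i
        self.j = j

    def buggy_repr(self):
        # `x or None` turns a valid index 0 into None because 0 is falsy
        return f"Tile(i={self.i or None}, j={self.j or None})"

    def __repr__(self):
        # formatting the value directly keeps 0 as 0 and None as None
        return f"Tile(i={self.i}, j={self.j})"


print(Tile(0, 3).buggy_repr())  # Tile(i=None, j=3)
print(repr(Tile(0, 3)))         # Tile(i=0, j=3)
```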

Add pip to environment.yml

Pip is currently not listed in environment.yml file.
Conda gives the following warning:

(base) jupyter@rosenthal-dxvm:~/pathml$ conda env create -f environment.yml

Warning: you have pip-installed dependencies in your environment file, but you do not list pip 
itself as one of your conda dependencies. Conda may not use the correct pip to install your 
packages, and they may end up in the wrong place.  Please add an explicit pip dependency. 
I'm adding one for you, but still nagging you.

MultiparametricSlide java

I am able to import bioformats and javabridge, although javabridge.start_vm(class_path=bioformats.JARS) fails.
This error should be caught so that we can give a message to the user telling them how to resolve

On MacOS 10.15.4

>>> from pathml.preprocessing.multiparametricslide import MultiparametricSlide
Could not find Java JRE compatible with x86_64 architecture
>>> wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
Could not find Java JRE compatible with x86_64 architecture
Could not find Java JRE compatible with x86_64 architecture
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 114, in _find_mac_lib
    cmd = ["find", os.path.dirname(jvm_dir), "-name", library+extension]
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/posixpath.py", line 156, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 278, in start_thread
    library_path = _find_mac_lib("libjvm")
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 125, in _find_mac_lib
    (cmd, library), exc_info=1)
UnboundLocalError: local variable 'cmd' referenced before assignment
Failed to create Java VM
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-81b83c29b5ca>", line 1, in <module>
    wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
  File "/Users/jacobrosenthal/PycharmProjects/pathml/pathml/preprocessing/multiparametricslide.py", line 46, in __init__
    javabridge.start_vm(class_path=bioformats.JARS)
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 319, in start_vm
    raise RuntimeError("Failed to start Java VM")
RuntimeError: Failed to start Java VM
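
One way to catch this failure and point users toward a fix is sketched below. `start_jvm_or_explain` is a hypothetical wrapper; in PathML it would wrap the `javabridge.start_vm(class_path=bioformats.JARS)` call shown in the traceback, which raises RuntimeError when the VM cannot be started. The wording of the message is also an assumption.

```python
def start_jvm_or_explain(start_vm):
    """Call `start_vm` and re-raise failures with installation guidance.

    `start_vm` is any zero-argument callable that starts the JVM, e.g.
    lambda: javabridge.start_vm(class_path=bioformats.JARS)
    """
    try:
        start_vm()
    except RuntimeError as err:
        # keep the original exception attached as the cause for debugging
        raise RuntimeError(
            "Could not start the Java VM. Check that a JDK is installed and "
            "that the JAVA_HOME environment variable points to it."
        ) from err
```

Passing the start function in as an argument keeps the wrapper testable without a Java installation.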

Call transforms on tiles by getattr

This will let us do things like tile.Blur(kernel_size = 7) for arbitrary transforms

Here's a code snippet that I was trying but couldn't get to work:

class Transform:
    def __init__(self, test):
        self.test = test

    def apply(self, target):
        print(f"applying on target of type {type(target)}. kwargs: {self.test}")


class Target:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        print(f"type of item: {type(item)}")
        print(str(item))
        t = item(**kwargs)
        t.apply(self)

target = Target(name = "testtarget")

target.Transform(test = "testitem")

See: https://rosettacode.org/wiki/Respond_to_an_unknown_method_call#Python
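
The snippet fails because `__getattr__` receives the attribute name as a string ("Transform"), not the class, and `kwargs` is undefined at that point. A minimal sketch of one way around this is to look the name up in a registry and return a function that builds and applies the transform. The `TRANSFORMS` registry and the placeholder `Blur` class are assumptions made for illustration.

```python
class Blur:
    """Placeholder transform; a real one would modify image data."""

    def __init__(self, kernel_size):
        self.kernel_size = kernel_size

    def apply(self, target):
        return f"blurred {target.name} with kernel_size={self.kernel_size}"


# hypothetical registry mapping transform names to classes
TRANSFORMS = {"Blur": Blur}


class Target:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # only called when normal attribute lookup fails; `item` is a str
        try:
            transform_cls = TRANSFORMS[item]
        except KeyError:
            raise AttributeError(item) from None

        def call(**kwargs):
            # build the transform with the caller's kwargs, then apply it
            return transform_cls(**kwargs).apply(self)

        return call


target = Target(name="testtarget")
print(target.Blur(kernel_size=7))  # blurred testtarget with kernel_size=7
```

Raising AttributeError for unknown names keeps `hasattr` and similar introspection behaving normally.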

Create DataSet class

DataSet object for whole-slide images.

This should be:

Lightweight, i.e. only holding paths to the images rather than the entire images themselves.
Also link paths to corresponding tiles, after preprocessing is applied and the tiles are saved to disk.
Also support masks and other types of labels.

When a dataset is downloaded from the datasets module, it should return a DataSet object. Users should also be able to create a DataSet object from files that they have locally.

see: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class

CAMELYON Datasets

Datamodule and Dataloader for https://camelyon17.grand-challenge.org/

Multiple Instance Learning

We need to support multiple instance learning. For example, if we only have slide-level labels, we can treat each slide as a bag of tiles and use the slide label as the bag label.
Need to do more research on what the best way to implement this in pytorch is.
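
The bag-of-tiles idea can be sketched independently of any framework: group tile records by slide and attach the slide-level label to each bag. The record layout (slide_id, tile_path) and the function name `make_bags` are assumptions.

```python
from collections import defaultdict


def make_bags(tiles, slide_labels):
    """Group (slide_id, tile_path) records into (tile_paths, label) bags."""
    bags = defaultdict(list)
    for slide_id, tile_path in tiles:
        bags[slide_id].append(tile_path)
    # each bag inherits the label of the slide its tiles came from
    return [(paths, slide_labels[sid]) for sid, paths in bags.items()]


tiles = [("s1", "a.png"), ("s2", "b.png"), ("s1", "c.png")]
print(make_bags(tiles, {"s1": 1, "s2": 0}))
# [(['a.png', 'c.png'], 1), (['b.png'], 0)]
```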

Reorganize slide classes

Slide classes should be reorganized based on dimensions and slide type.
This hierarchical class structure is more logical and will also help with making sure that the transforms work properly (#18). For example, some transforms may work for all 2d images regardless of number of channels, but others may only be applicable for RGB images, and others may be specific to certain types (e.g. H&E stain deconvolution).

Test ML models

Add some tests to make sure that the ML models are working correctly.
For example, this may involve overfitting on a toy dataset and verifying that performance is above some threshold.

Create Style Guidelines

Guidelines for code

Guidelines for commits

Chunk generator

Slide objects should have a method that returns an iterator over "chunks" so that the image can be processed chunk-wise instead of loading the entire thing into memory.
The abstract method should be declared in BaseSlide, but each slide type (e.g. HESlide, MultiparametricSlide) may have to be implemented differently based on its backend (e.g. openslide or bioformats).

Pseudocode:

slide = HESlide("/path/to/image.svs")

for chunk in slide.generate_chunks(level=0, size=1024, ...):
  # operate on each 1024x1024 chunk
  preprocess(chunk)
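
A backend-agnostic sketch of the coordinate logic behind such a generator follows. `chunk_origins` is a hypothetical helper that only yields top-left corners; an actual implementation would pass each origin to the backend's region reader (e.g. OpenSlide's `read_region`). Chunks that would overhang the image edge are skipped here, which is one choice among several.

```python
def chunk_origins(width, height, size, stride=None):
    """Yield (x, y) top-left corners of size x size chunks inside the image."""
    # default to non-overlapping chunks
    stride = size if stride is None else stride
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield x, y


# a 2500 x 2048 image cut into non-overlapping 1024 x 1024 chunks
print(list(chunk_origins(2500, 2048, size=1024)))
# [(0, 0), (1024, 0), (0, 1024), (1024, 1024)]
```

Because this is a generator, only one chunk's coordinates exist in memory at a time, which matches the goal of processing the slide chunk-wise.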

Create DataLoader class

Create a dataloader class similar to torch.utils.data.DataLoader

TMA support

Add support for tissue microarray (TMA) images.
This probably means adding functionality to take an input image and identify the separate cores.

We may be able to use TMAJ software, either directly through javabridge or as inspiration:

Example of TMA slides (source here):

Trouble installing dependencies for multiparametricslide

I just tried installing PathML and running the tests on a new VM and ran into problems with multiparametricslide.

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ python -m pytest
============================================= test session starts =============================================
platform linux -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/pathml
collected 74 items / 1 error / 73 selected                                                                    

=================================================== ERRORS ====================================================
___________________ ERROR collecting tests/preprocessing_tests/test_multiparametricslide.py ___________________
ImportError while importing test module '/home/jupyter/pathml/tests/preprocessing_tests/test_multiparametricslide.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pathml/preprocessing/multiparametricslide.py:10: in <module>
    import bioformats
E   ModuleNotFoundError: No module named 'bioformats'

During handling of the above exception, another exception occurred:
/opt/conda/envs/pathml/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/preprocessing_tests/test_multiparametricslide.py:5: in <module>
    from pathml.preprocessing.multiparametricslide import MultiparametricSlide2d
pathml/preprocessing/multiparametricslide.py:25: in <module>
    raise ImportError("MultiparametricSlide2d requires javabridge and bioformats")
E   ImportError: MultiparametricSlide2d requires javabridge and bioformats
============================================== warnings summary ===============================================
pathml/preprocessing/multiparametricslide.py:16
  /home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
              See: https://pythonhosted.org/javabridge/installation.html. You can install using:
                  
                  sudo apt-get install openjdk-8-jdk
                  pip install javabridge
                  pip install python-bioformats
          
    warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================================== short test summary info ===========================================
ERROR tests/preprocessing_tests/test_multiparametricslide.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================= 1 warning, 1 error in 0.62s =========================================

So then I tried to install openjdk using the instructions, but that didn't work:

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ sudo apt-get install openjdk-8-jdk
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package openjdk-8-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'openjdk-8-jdk' has no installation candidate

OS info:

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ uname -a
Linux shared-dxvm-gpu 4.19.0-12-cloud-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux

After some searching online, it seems like openjdk-8-jdk is not supported anymore (see here for example). I think the issue is that python-javabridge is not really being actively developed (see here). We need to find a better solution - either give users very detailed instructions for how to install openjdk-8 (which doesn't seem like a great solution since it isn't officially supported anymore) or drop bioformats/javabridge dependency and use a different tool to support multiparametric slides.

I am also confused about why the tests pass successfully, since they also use sudo apt-get install openjdk-8-jdk.

Run pipelines on datasets

Implement a method to run a preprocessing pipeline on a dataset.
Should basically be a convenience function for running pipeline on each individual image.

Pseudocode:

mydataset = pathml.datasets.PESO.download()
mypipeline = pathml.preprocessing.default_pipeline()

mypipeline.run(mydataset)
### should be equivalent to:
for wsi in mydataset:
    mypipeline.run(wsi)

Add repr for every class

These should be informative and clear

Specify output directory in Pipeline init

Currently, tiles are written to disk in the tile_level_preprocessor component of the Pipeline.
It would be better to pass a path to the output directory when running the Pipeline object, and then write all tiles to that directory. This would allow for better integration with the DataModule class, since the entire DataModule could be initialized pointing to one directory and can then:

download images there
pass the directory path as input to Pipeline.run() and write all the tiles there
create dataset and dataloader objects, since the full filepath is known.

Pseudocode:

# initialize pipeline
my_pipeline = Pipeline(
    slide_loader       = MySlideLoader(),
    slide_preprocessor = MySlidePreprocessor(),
    tile_extractor     = SimpleTileExtractor(tile_size=224),
    tile_preprocessor  = MyTilePreprocessor()
)

# initialize slide
slide = HESlide("/path/to/image.svs")

# run pipeline on slide
my_pipeline.run(slide, out_dir = "./data/preprocessed")

Support for DICOM Integration

Digital Imaging and Communications in Medicine (DICOM) is the standard for the representation, storage, and communication of medical images and related information. A DICOM file format and communication protocol for pathology have been defined. Whole slide image data can be encoded together with relevant patient and specimen-related metadata as DICOM objects.

As DICOM becomes more widely adopted in digital pathology, support for this file format may need to be included in PathML. One approach is to create a class that inherits from BaseSlide and can ingest DICOM files. The class could also implement methods specific to DICOM, like reading metadata.

Use abstract classes

Classes which aren't meant to be instantiated should be abstract classes (i.e. inherit from abc.ABC).
This is probably cleaner than current implementation of raising NotImplementedError, since abstract classes can't be initialized by users (they will get an error).
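
A minimal sketch of the pattern using the standard library's `abc` module is shown below. The class names mirror those mentioned in these issues, but the method name and bodies are placeholders rather than PathML code.

```python
from abc import ABC, abstractmethod


class BaseSlide(ABC):
    @abstractmethod
    def read_image(self):
        """Subclasses must return the image data."""


class HESlide(BaseSlide):
    def read_image(self):
        return "image data"  # placeholder


HESlide().read_image()  # concrete subclass works

try:
    BaseSlide()
except TypeError as err:
    # instantiating a class with unimplemented abstract methods fails early
    print(err)
```

The error is raised at instantiation rather than when the missing method is first called, which is the main advantage over raising NotImplementedError.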

Documentation doesn't compile

Need to fix bugs in documentation. Should also add tests to make sure that docs compile successfully.

(pathml) jupyter@shared-dxvm-gpu:~/pathml/docs$ make html
Running Sphinx v3.4.2
making output directory... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 17 source files that are out of date
updating environment: [new config] 17 added, 0 changed, 0 removed
/home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
            See: https://pythonhosted.org/javabridge/installation.html. You can install using:
                
                sudo apt-get install openjdk-8-jdk
                pip install javabridge
                pip install python-bioformats
        
  warn(
WARNING: autodoc: failed to import module 'multiparametricslide' from module 'pathml.preprocessing'; the following exception was raised:
MultiparametricSlide2d requires javabridge and bioformats

Notebook error:
Problems with linked notebook "examples/link_advanced_HE_chunks" path:
InputError: [Errno 2] No such file or directory: '../examples/advanced_HE_chunks.ipynb'.
make: *** [Makefile:20: html] Error 2

Note that this issue is about the error in compilation. The javabridge warning should be fixed when we fix #48

Add CONTRIBUTING file

The CONTRIBUTING file gives instructions and guidelines for contributors. We can start with something basic and expand as needed down the road.

There is a contributing.rst file in the documentation, but we should have it in the root directory so it is easily accessible. Also, if it is in the root directory, GitHub will automatically link to this file when a contributor creates an issue or opens a pull request.

Helpful resources:

Convenience methods for slide manipulation

Currently, the SlideData design emphasizes the methods required to run a pipeline.

Implement convenience functions for plotting, slicing, and otherwise manipulating slides.

Create Issue Templates

Create templates for certain types of issues (e.g. feature request, bug report, etc.)

Codecov badge

We can set up an automated workflow to measure code coverage and add it in a badge on the project readme.

https://github.com/codecov/codecov-action

This is not high priority at the moment but filing here to do later

Extend multiparametric support to large images

Currently bioformats limits file size to 2GB because of java array size limitations.

Two options:

Instantiate multiple 2GB chunks and build numpy array piecewise
Support common filetypes (like .tif) with python dependencies, revert to java only when user provides a rare proprietary microscope file format.

Pipeline save method

We need a way to share pipelines by writing them to a file

Pseudocode:

my_pipeline = Pipeline(**kwargs)
my_pipeline.save("/path/to/disk/pipeline.pickle")

## someone else can then load and use:

pipeline = load("/path/to/local/downloads/pipeline.pickle")
pipeline.run(local_slide)

Add save_tiles method to SlideData class

Writing tiles to disk should be a method for SlideData class. This makes more sense than writing tiles as part of the tile-level preprocessor in a Pipeline. Directory to write tiles to should be specified in argument

Pseudocode:

data = HESlide("/path/to/image.svs").load_data()
data = my_pipeline.run(data)
data.write_tiles("/path/to/tiled/images/")

repr for notebooks

We can define a SlideData._repr_html_() method (or maybe SlideData._repr_jpeg_()) which would let us do pretty outputs in Jupyter Notebook.
For example we could make this method display a thumbnail of the image by default, along with some text describing it.
This would be nice for users since you could see the slide without having to call any methods.

This is lower priority but seems straightforward to implement

see: https://ipython.readthedocs.io/en/stable/config/integrating.html#rich-display
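
A minimal sketch of the rich-display hook follows. IPython calls `_repr_html_()` when an object is the last expression in a notebook cell. The `SlideData` fields and the base64-encoded thumbnail are assumptions for illustration.

```python
import base64
import html


class SlideData:
    def __init__(self, name, shape, thumbnail_png=None):
        self.name = name
        self.shape = shape
        self.thumbnail_png = thumbnail_png  # raw PNG bytes, optional

    def _repr_html_(self):
        # escape user-provided text before embedding it in HTML
        text = f"<p><b>{html.escape(self.name)}</b>, shape={self.shape}</p>"
        if self.thumbnail_png is None:
            return text
        # embed the thumbnail inline as a data URI
        data = base64.b64encode(self.thumbnail_png).decode("ascii")
        return text + f'<img src="data:image/png;base64,{data}"/>'
```

Outside of a notebook the method is simply never called, so defining it does not affect the plain-text repr.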

Sprint SlideData Class

Rewrite SlideData Class.

name: str
shape <- dict of shapes of slide, masks, tiles
slide: Slide (if slide loaded using Bioformats)
masks: pathml.Masks
tiles: pathml.Tiles
labels: (Masks, str, int, floats)
history: list
slidetype: str (e.g. "HE" or "IHC"). Set when SlideData class is initialized

init - use appropriate backend (openslide or bioformats)
repr
read_region(level)
make_tiles(Pipeline, optional)
chunks(shape, stride) --> generator of chunk objects
plot() --> matplotlib (also handle masks in plot)
save()

Refactor

Make SlideData the core pathml object, combine pipeline and transforms into methods in SlideData

Preprocessing has become a catch-all directory, improve directory structure

HoVer-Net

Implement HoVer-Net (https://arxiv.org/pdf/1812.06499.pdf)

Standardize Openslide pixel resolution level

Different slides may have different microns per pixel (MPP) depending on the physical parameters of the scanner.
This means that for any two slides, level 0 may be at different pixel resolution.
We should provide a way to standardize pixel resolution of slides, so that we know that all images in a dataset are the same resolution.

Openslide objects have slide.properties["openslide.mpp-x"] and slide.properties["openslide.mpp-y"], which we can use.
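
A sketch of how those properties could be turned into a resize factor is shown below. OpenSlide stores property values as strings, and the MPP keys may be absent for some slides. `scale_to_target_mpp` and the plain dict standing in for `slide.properties` are assumptions.

```python
def scale_to_target_mpp(properties, target_mpp):
    """Return (scale_x, scale_y) to resize level-0 pixels to `target_mpp`."""
    try:
        mpp_x = float(properties["openslide.mpp-x"])
        mpp_y = float(properties["openslide.mpp-y"])
    except KeyError as err:
        raise ValueError(f"slide has no {err.args[0]} property") from err
    # a slide scanned at 0.25 um/px must shrink by 0.5 to reach 0.5 um/px
    return mpp_x / target_mpp, mpp_y / target_mpp


props = {"openslide.mpp-x": "0.25", "openslide.mpp-y": "0.25"}  # stand-in
print(scale_to_target_mpp(props, target_mpp=0.5))  # (0.5, 0.5)
```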

Should we use PyTorchLightning? [discussion]

Should we use PyTorch Lightning in PathML?

Pros:

More logical code organization structure
May be easier for less technical users
- Don't need to write training loops
- Automatically handles multiGPU
- Automatically handles mixed precision training
Popular framework actively being developed

Cons:

Overhead to refactor code to be compatible
One more external dependency
Committing to a specific framework may make PathML less flexible, decreasing utility

Other thoughts:

Would it be easy to support both? i.e. have models in base PyTorch, but also have lightning-compatible versions?
I haven't used pytorch-lightning myself, but @jmnyman has made the switch and it sounds like he has really benefited from it
If we decide to go the pytorch-lightning route, it probably won't be in v0.1 initial release. Pushing to first open-source release is top priority at the moment

Sprint TODO

TODO weekend sprint:
repo structure:
pathml

utils.py
-> core
-> preprocessing
-> ml
-> datasets

refactor:

Adding HistoQC

Add the HistoQC module (https://github.com/choosehappy/HistoQC) to rigorously remove unwanted artifacts from the data.

Check image compatibility for transforms

Each transform should check that the input image is compatible. For example, colorspace conversions are not applicable to multiplex images, though blurring transforms that operate on each channel may still be. This should probably just be an assert statement at the beginning of the apply() method of each transform.
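One way this could look, as a sketch only: the base class below is a hypothetical minimal design, not PathML's actual Transform, and it places the check in a shared apply() so that each subclass only has to state its own requirement. The conversion itself is left out to keep the focus on the check.

```python
class Transform:
    """Hypothetical minimal base class, not PathML's actual implementation."""

    def check_input(self, image):
        # Subclasses override this to reject incompatible images.
        pass

    def apply(self, image):
        # Validate first, so incompatible inputs fail fast with a clear message.
        self.check_input(image)
        return self._apply(image)

    def _apply(self, image):
        raise NotImplementedError


class RGBToGreyscaleSketch(Transform):
    """Colorspace conversion: only meaningful for (H, W, 3) RGB input."""

    def check_input(self, image):
        shape = getattr(image, "shape", None)
        assert shape is not None and len(shape) == 3 and shape[2] == 3, (
            f"expected an RGB image of shape (H, W, 3), got shape {shape}"
        )

    def _apply(self, image):
        # Placeholder: the actual conversion is omitted in this sketch.
        return image
```

A per-channel blur would simply leave check_input() as the permissive default, which matches the point that such transforms can still apply to multiplex images.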

dimensions wrong

pannuke_dset = PanNukeDataset(
    data_dir = "../data/pannuke",
    fold_ix = None,
    hovernet_preprocess = True,
    nucleus_type_labels = True,
)

im, mask, hv, t = pannuke_dset[0]

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-c5513ce6bd51> in <module>
----> 1 im, mask, hv, t = pannuke_dset[0]

~/pathml/pathml/datasets/pannuke.py in __getitem__(self, ix)
    110                 # sum across mask channels to squash mask channel dim to size 1
    111                 # don't sum the last channel, which is background!
--> 112                 mask_1c = pannuke_multiclass_mask_to_nucleus_mask(mask)
    113             else:
    114                 mask_1c = mask

~/pathml/pathml/datasets/pannuke.py in pannuke_multiclass_mask_to_nucleus_mask(multiclass_mask)
    135     """
    136     # verify shape of input
--> 137     assert multiclass_mask.ndim == 3 and multiclass_mask.shape[0] == 6, \
    138         f"Expecting a batch of masks with dims (6, 256, 256). Got input of shape {multiclass_mask.shape}"
    139     assert multiclass_mask.shape[1] == 256 and multiclass_mask.shape[2] == 256, \

AssertionError: Expecting a batch of masks with dims (6, 256, 256). Got input of shape (256, 6, 256)
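The reported shape (256, 6, 256) has the channel axis in position 1 rather than position 0. As a diagnostic aid only, the sketch below computes the axis order that would move the channel axis to the front. The helper name channels_first_order is hypothetical, and this does not identify the root cause in pannuke.py.

```python
def channels_first_order(shape, n_channels=6):
    """Return the axis order that moves the channel axis to position 0.

    The channel axis is taken to be the only axis of length `n_channels`.
    """
    candidates = [axis for axis, size in enumerate(shape) if size == n_channels]
    if len(candidates) != 1:
        raise ValueError(f"cannot identify a unique channel axis in shape {shape}")
    channel_axis = candidates[0]
    others = [axis for axis in range(len(shape)) if axis != channel_axis]
    return (channel_axis, *others)
```

With a NumPy array, mask.transpose(channels_first_order(mask.shape)) would then have shape (6, 256, 256). This is only valid if the axes were permuted upstream; if the array was reshaped instead, transposing would silently scramble the masks, so the upstream step should be checked first.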

2. _clean_up_download_pannuke() problem

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/", 
    download=True,
    nucleus_type_labels=True, 
    batch_size=8, 
    hovernet_preprocess=True,
    split=1,
    transforms=None,
)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-426a1c4fd670> in <module>
----> 1 pannuke = PanNukeDataModule(
      2     data_dir="../data/pannuke/",
      3     download=True,
      4     nucleus_type_labels=True,
      5     batch_size=8,

~/pathml/pathml/datasets/pannuke.py in __init__(self, data_dir, download, shuffle, transforms, nucleus_type_labels, split, batch_size, hovernet_preprocess)
    198         self.download = download
    199         if download:
--> 200             self._download_pannuke(self.data_dir)
    201         else:
    202             # make sure that subdirectories exist

~/pathml/pathml/datasets/pannuke.py in _download_pannuke(self, download_dir)
    241 
    242         self._process_downloaded_pannuke(download_dir)
--> 243         self._clean_up_download_pannuke(download_dir)
    244 
    245     @staticmethod

~/pathml/pathml/datasets/pannuke.py in _clean_up_download_pannuke(pannuke_dir)
    306             downloaded_dir = p / f"Fold {fold_ix}"
    307             zip_file.unlink()
--> 308             downloaded_dir.rmdir()
    309 
    310 

/opt/conda/envs/pathml/lib/python3.8/pathlib.py in rmdir(self)
   1333         if self._closed:
   1334             self._raise_closed()
-> 1335         self._accessor.rmdir(self)
   1336 
   1337     def lstat(self):

OSError: [Errno 39] Directory not empty: '../data/pannuke/Fold 1'
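Path.rmdir() only removes empty directories, which is why the call fails when files are left in "Fold 1". A possible fix, sketched here with a hypothetical helper name (remove_dir), is to delete the directory recursively with shutil.rmtree. Whether the leftover files should be deleted or were meant to be moved elsewhere first is not clear from the traceback.

```python
import shutil
from pathlib import Path


def remove_dir(path):
    """Delete `path` and everything inside it; do nothing if it is absent."""
    path = Path(path)
    if path.exists():
        # Unlike Path.rmdir(), rmtree also removes non-empty directories.
        shutil.rmtree(path)
```

In _clean_up_download_pannuke(), the downloaded_dir.rmdir() call could be replaced with remove_dir(downloaded_dir), assuming nothing remaining in that folder is still needed.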
