pathml's Introduction

🤖🔬 PathML: Tools for computational pathology

⭐ PathML's objective is to lower the barrier to entry to digital pathology

Imaging datasets in cancer research are growing exponentially in both quantity and information density. These massive datasets may enable derivation of insights for cancer research and clinical care, but only if researchers are equipped with the tools to leverage advanced computational analysis approaches such as machine learning and artificial intelligence. In this work, we highlight three themes to guide development of such computational tools: scalability, standardization, and ease of use. We then apply these principles to develop PathML, a general-purpose research toolkit for computational pathology. We describe the design of the PathML framework and demonstrate applications in diverse use cases.

🚀 The fastest way to get started?

docker pull pathml/pathml && docker run -it -p 8888:8888 pathml/pathml

Done! What analyses can I write now? 👉

This AI will:

  • 🤖 write digital pathology analyses for you
  • 🔬 walk you through the code, step-by-step
  • 🎓 be your teacher, as you embark on your digital pathology journey ❤️

More usage examples here.

📖 Official PathML Documentation

View the official PathML Documentation on readthedocs

🔥 Examples! Examples! Examples!

↴ Jump to the gallery of examples below


1. Installation

There are several ways to install PathML:

  1. pip install from PyPI (recommended for users)
  2. Clone repo to local machine and install from source (recommended for developers/contributors)
  3. Use the PathML Docker container
  4. Install in Google Colab

Options (1), (2), and (4) require that you first install all external dependencies:

  • openslide
  • JDK 8

We recommend using conda for environment management. Download Miniconda here

1.1 Installation option 1: pip install

Create conda environment, this step is common to all platforms (Linux, Mac, Windows):

conda create --name pathml python=3.8
conda activate pathml

Install external dependencies (for Linux) with Apt:

sudo apt-get install openslide-tools g++ gcc libblas-dev liblapack-dev

Install external dependencies (for MacOS) with Brew:

brew install openslide

Install external dependencies (for Windows) with vcpkg:

vcpkg install openslide

Install OpenJDK 8, this step is common to all platforms (Linux, Mac, Windows):

conda install openjdk==8.0.152

Optionally install CUDA (instructions here)

Install PathML from PyPI:

pip install pathml

1.2 Installation option 2: clone repo and install from source

Clone repo:

git clone https://github.com/Dana-Farber-AIOS/pathml.git
cd pathml

Create conda environment:

conda env create -f environment.yml
conda activate pathml

Optionally install CUDA (instructions here)

Install PathML from source:

pip install -e .

1.3 Installation option 3: Docker

First, download or build the PathML Docker container:

  • Step 1: download PathML container from Docker Hub

    docker pull pathml/pathml:latest
    

    Optionally specify a tag for a particular version, e.g. docker pull pathml/pathml:2.0.2. To view possible tags, please refer to the PathML DockerHub page.

  • Alternative Step 1 if you have custom hardware: build docker container from source

    git clone https://github.com/Dana-Farber-AIOS/pathml.git
    cd pathml
    docker build -t pathml/pathml .
    
  • Step 2: Then connect to the container:

    docker run -it -p 8888:8888 pathml/pathml
    

The above command runs the container, which is configured to spin up a jupyter lab session and expose it on port 8888. The terminal should display a URL to the jupyter lab session starting with http://127.0.0.1:8888/lab?token=<.....>. Navigate to that page and you should connect to the jupyter lab session running on the container with the pathml environment fully configured. If a password is requested, copy the string of characters following the token= in the url.

Note that the docker container requires extra configurations to use with GPU.
Note that these instructions assume that there are no other processes using port 8888.

Please refer to the Docker run documentation for further instructions on accessing the container, e.g. for mounting volumes to access files on a local machine from within the container.

1.4 Installation option 4: Google Colab

To get PathML running in a Colab environment:

import os
!pip install openslide-python
!apt-get install openslide-tools
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version
!pip install pathml

PathML Tutorials we published in Google Colab

  1. PathML Tutorial Colab #1 - Load an SVS image in PathML and see the image descriptors
  2. Now that you have PathML installed, all our other examples will work too. Just make sure you select an appropriately sized backend or VM in Colab (i.e., RAM, CPU, disk, and GPU if necessary)

Thanks to all of our open-source collaborators for helping maintain these installation instructions!
Please open an issue for any bugs or other problems during the installation process.

1.5 CUDA (optional)

To use GPU acceleration for model training or other tasks, you must install CUDA. This guide should work, but for the most up-to-date instructions, refer to the official PyTorch installation instructions.

Check the version of CUDA:

nvidia-smi

Install correct version of cudatoolkit:

# update this command with your CUDA version number
conda install cudatoolkit=11.0

After installing PyTorch, optionally verify successful PyTorch installation with CUDA support:

python -c "import torch; print(torch.cuda.is_available())"

2. Using with Jupyter (optional)

Jupyter notebooks are a convenient way to work interactively. To use PathML in Jupyter notebooks:

2.1 Set JAVA_HOME environment variable

PathML relies on Java to enable support for reading a wide range of file formats. Before using PathML in Jupyter, you may need to manually set the JAVA_HOME environment variable specifying the path to Java. To do so:

  1. Get the path to Java by running echo $JAVA_HOME in the terminal in your pathml conda environment (outside of Jupyter)
  2. Set that path as the JAVA_HOME environment variable in Jupyter:
    import os
    os.environ["JAVA_HOME"] = "/opt/conda/envs/pathml" # change path as needed
    

2.2 Register environment as an IPython kernel

conda activate pathml
conda install ipykernel
python -m ipykernel install --user --name=pathml

This makes the pathml environment available as a kernel in jupyter lab or notebook.

3. Examples

Now that you are all set with PathML installation, let's get started with some analyses you can easily replicate:

  1. Load 160+ different types of pathology images using PathML
  2. H&E Stain Deconvolution and Color Normalization
  3. Brightfield imaging pipeline: load an image, preprocess it on a local cluster, and get it ready for machine learning analyses in PyTorch
  4. Multiparametric Imaging: Quickstart & single-cell quantification
  5. Multiparametric Imaging: CODEX & nuclei quantization
  6. Train HoVer-Net model to perform nucleus detection and classification, using data from PanNuke dataset
  7. Gallery of PathML preprocessing and transformations

4. Citing & known uses

If you use PathML please cite:

So far, PathML was referenced in 20+ manuscripts:

5. Users

This is where in the world our most enthusiastic supporters are located:

and this is where they work:

Source: https://ossinsight.io/analyze/Dana-Farber-AIOS/pathml#people

6. Contributing

PathML is an open source project. Consider contributing to benefit the entire community!

There are many ways to contribute to PathML, including:

  • Submitting bug reports
  • Submitting feature requests
  • Writing documentation and examples
  • Fixing bugs
  • Writing code for new features
  • Sharing workflows
  • Sharing trained model parameters
  • Sharing PathML with colleagues, students, etc.

See contributing for more details.

7. License

The GNU GPL v2 version of PathML is made available via Open Source licensing. The user is free to use, modify, and distribute under the terms of the GNU General Public License version 2.

Commercial license options are also available.

8. Contact

Questions? Comments? Suggestions? Get in touch!

[email protected]

pathml's People

Contributors

beegass, dana-farber, dependabot[bot], dmbrundage, ella-dfci, jacob-rosenthal, karenxzr, mohamedomar2020, ryanccarelli, sreekarreddydfci, surya-narayanan, tddough98, yu-anchen

pathml's Issues

Hosting Model Weights

We want to be able to share pre-trained models. The trained model weights can be saved to disk, e.g. in .pth files for PyTorch. However, these files can be quite big, too big to put in the GitHub repo itself.

We need to find a solution for hosting these large files of model parameters.
E.g. we could have a GCP bucket, or S3 bucket.
Need to evaluate the costs of different options.

Transition to Google-/numpy-style docstrings

Docstrings are currently written in basic Sphinx format. However, basic Sphinx doesn't support a References section so I had to start using the Napoleon extension. Since we are already using Napoleon, we may as well stick with Google or numpy docstring format moving forward, since it is more readable for humans.

Improve example notebooks

The example notebooks should be more comprehensive and better organized. For many people, the example notebooks will be their first experience with PathML, so we want them to be really good.

Some ideas:

  • Update the example pipeline notebooks to use chunk processing (much more efficient than the current basic example notebook which loads the whole slide into memory)
  • Create a notebook that shows examples of every transform
  • Create a notebook showing how to use DataModules with Pytorch

Add test for multiprocessing

Pipeline.run() uses concurrent.futures.ProcessPoolExecutor to implement multiprocessing when run on a dataset.
I'm not sure how to write a test for this though. Everything I've tried so far has caused Pytest to hang.

Publish package on PyPI

This will allow people to pip install pathml

We should look into using release tags: https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-releases

create class Mask

wrap dict of masks
each mask stores pixel-wise int8
repr method (keys, dimensions)
getitem method
len method

by default should be multiparametric single plane
subclass volumetric

Pannuke masks and images don't match

The masks in PanNuke dataloaders don't match the images.
This is obviously a big problem for training models...

import matplotlib.pyplot as plt
import numpy as np

from pathml.datasets.pannuke import PanNukeDataModule

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/", 
    download=False,
    nucleus_type_labels=True, 
    batch_size=8, 
    hovernet_preprocess=True,
    split=1
)

train = pannuke.train_dataloader

images, masks, hvs, types = next(iter(train))

fig, ax = plt.subplots(nrows=1, ncols=2)
im = np.moveaxis(images[0, ...].numpy(), 0, 2)
ax[0].imshow(im)
mask = masks.argmax(dim=1)[0, ...]
ax[1].imshow(mask)
plt.show()

I think this may be happening because the lists of filepaths for masks and images are created separately using pathlib.Path.glob(), but glob is unordered.
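
One way to rule this out is to sort both glob results and confirm that the file stems line up. The sketch below assumes that each image and its mask share a file stem; the directory layout and the `pair_images_and_masks` helper are hypothetical and are not PathML's actual code.

```python
from pathlib import Path


def pair_images_and_masks(image_dir, mask_dir, pattern="*.png"):
    """Return (image, mask) path pairs matched by sorted order and file stem."""
    # Path.glob() makes no ordering guarantee, so sort both listings explicitly.
    images = sorted(Path(image_dir).glob(pattern))
    masks = sorted(Path(mask_dir).glob(pattern))
    # Fail loudly instead of silently training on mismatched pairs.
    if [p.stem for p in images] != [p.stem for p in masks]:
        raise ValueError("image and mask filenames do not line up")
    return list(zip(images, masks))
```

Raising on a mismatch means a missing or extra file is caught when the dataset is built rather than showing up later as poor model performance.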

wsi.py: pad and black edge issue

When the stride is small, the last few tiles at the edge of a slide are smaller than (tile size, tile size), and OpenSlide zero-pads them automatically. With pad=False, this produced undesired tiles with black edges. This change recalculates the number of tiles to solve the issue.

Haven't had a chance to use this specific svs data yet. Randomly picked a tile size and a small stride for now. Will double check this.

Pseudocode:

example_image_path = "../data/CMU-1.svs"

class MySlideLoader(BaseSlideLoader):
    def apply(self, path):
        # stride is small relative to tile size, so edge tiles are common
        return HESlide(path).chunks(level=0, size=1024, stride=400)

data = MySlideLoader().apply(example_image_path)
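
The tile-count recalculation can be sketched for a single axis as follows. This assumes tiles sit on a regular grid starting at 0 and that, with padding off, any tile that would cross the slide edge is dropped. `tile_starts` is a hypothetical helper for illustration, not the function in wsi.py.

```python
def tile_starts(dim, size, stride, pad=False):
    """Return the start coordinates of tiles along one axis of length `dim`."""
    if pad:
        # Every stride position is kept; edge tiles extend past the slide.
        return list(range(0, dim, stride))
    if dim < size:
        # Not even one full tile fits.
        return []
    # Keep only tiles that fit entirely inside the slide.
    n_tiles = (dim - size) // stride + 1
    return [i * stride for i in range(n_tiles)]
```

For a 2000-pixel axis with size=1024 and stride=400, this keeps the starts 0, 400 and 800; the tiles starting at 1200 and 1600 would need padding and are only returned when pad=True.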

Add pip to environment.yml

Pip is currently not listed in environment.yml file.
Conda gives the following warning:

(base) jupyter@rosenthal-dxvm:~/pathml$ conda env create -f environment.yml

Warning: you have pip-installed dependencies in your environment file, but you do not list pip 
itself as one of your conda dependencies. Conda may not use the correct pip to install your 
packages, and they may end up in the wrong place.  Please add an explicit pip dependency. 
I'm adding one for you, but still nagging you.

MultiparametricSlide java

I am able to import bioformats and javabridge, although javabridge.start_vm(class_path=bioformats.JARS) fails.
This error should be caught so that we can give the user a message telling them how to resolve it.

On MacOS 10.15.4

>>> from pathml.preprocessing.multiparametricslide import MultiparametricSlide
Could not find Java JRE compatible with x86_64 architecture
>>> wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
Could not find Java JRE compatible with x86_64 architecture
Could not find Java JRE compatible with x86_64 architecture
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 114, in _find_mac_lib
    cmd = ["find", os.path.dirname(jvm_dir), "-name", library+extension]
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/posixpath.py", line 156, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 278, in start_thread
    library_path = _find_mac_lib("libjvm")
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 125, in _find_mac_lib
    (cmd, library), exc_info=1)
UnboundLocalError: local variable 'cmd' referenced before assignment
Failed to create Java VM
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-81b83c29b5ca>", line 1, in <module>
    wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
  File "/Users/jacobrosenthal/PycharmProjects/pathml/pathml/preprocessing/multiparametricslide.py", line 46, in __init__
    javabridge.start_vm(class_path=bioformats.JARS)
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 319, in start_vm
    raise RuntimeError("Failed to start Java VM")
RuntimeError: Failed to start Java VM
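
A minimal sketch of catching this failure and re-raising it with guidance is shown below. The wrapper takes the start function as an argument so that the example runs without javabridge installed; `start_vm_with_hint` and the wording of the message are assumptions, not existing PathML code.

```python
def start_vm_with_hint(start_vm, **kwargs):
    """Call `start_vm` and re-raise any RuntimeError with setup guidance."""
    try:
        return start_vm(**kwargs)
    except RuntimeError as err:
        # Chain the original error so the full traceback is still available.
        raise RuntimeError(
            f"{err}. A Java runtime could not be started. Check that a JDK is "
            "installed and that the JAVA_HOME environment variable points to it."
        ) from err
```

In the slide class, the call would then look like `start_vm_with_hint(javabridge.start_vm, class_path=bioformats.JARS)`.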

Call transforms on tiles by __getattr__

This will let us do things like tile.Blur(kernel_size = 7) for arbitrary transforms

Here's a code snippet that I was trying but couldn't get to work:

class Transform:
    def __init__(self, test):
        self.test = test

    def apply(self, target):
        print(f"applying on target of type {type(target)}. kwargs: {self.test}")


class Target:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        print(f"type of item: {type(item)}")
        print(str(item))
        t = item(**kwargs)
        t.apply(self)

target = Target(name = "testtarget")

target.Transform(test = "testitem")

See: https://rosettacode.org/wiki/Respond_to_an_unknown_method_call#Python
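
The snippet above fails because `__getattr__` receives the attribute name as a string rather than a class, and because `kwargs` is not defined inside it. A possible working variant, sketched here with a hypothetical `TRANSFORMS` registry and a toy `Blur` class, looks up the class by name and returns a function that accepts the keyword arguments:

```python
class Blur:
    """Toy transform standing in for a real one."""

    def __init__(self, kernel_size):
        self.kernel_size = kernel_size

    def apply(self, target):
        return f"Blur(kernel_size={self.kernel_size}) applied to {target.name}"


# Hypothetical registry mapping transform names to classes.
TRANSFORMS = {"Blur": Blur}


class Tile:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails; `item` is a string.
        try:
            transform_cls = TRANSFORMS[item]
        except KeyError:
            raise AttributeError(item) from None

        def call(**kwargs):
            # Build the transform with the caller's kwargs and apply it to this tile.
            return transform_cls(**kwargs).apply(self)

        return call
```

With this in place, `Tile("tile0").Blur(kernel_size=7)` builds the transform and applies it to the tile, while unknown names still raise `AttributeError` as usual.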

Create DataSet class

DataSet object for whole-slide images.

This should be:

  • Lightweight, i.e. only holding paths to the images rather than the entire images themselves.
  • Also link paths to corresponding tiles, after preprocessing is applied and the tiles are saved to disk.
  • Also support masks and other types of labels.

When a dataset is downloaded from the datasets module, it should return a DataSet object. Users should also be able to create a DataSet object from files that they have locally.

see: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class
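
As a rough sketch of the requirements listed above, a path-only dataset could look like the following. The `SlideRecord` fields and the `from_directory` constructor are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class SlideRecord:
    """Paths and label for one slide; no pixel data is loaded."""

    image_path: Path
    mask_path: Optional[Path] = None
    tile_dir: Optional[Path] = None  # filled in after tiles are written to disk
    label: Optional[str] = None


@dataclass
class DataSet:
    records: list = field(default_factory=list)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, ix):
        return self.records[ix]

    @classmethod
    def from_directory(cls, directory, pattern="*.svs"):
        """Build a dataset from local files matching `pattern`."""
        paths = sorted(Path(directory).glob(pattern))
        return cls([SlideRecord(image_path=p) for p in paths])
```

Because only paths are stored, creating the dataset is cheap even for thousands of whole-slide images; the images themselves are read only when a pipeline or dataloader asks for them.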

Multiple Instance Learning

We need to support multiple instance learning. For example, if we only have slide-level labels, we can treat each slide as a bag of tiles and use the slide label as the bag label.
Need to do more research on what the best way to implement this in pytorch is.
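
The bag construction itself is independent of the deep learning framework. A minimal sketch, assuming tiles arrive as (slide_id, tile) pairs and slide labels as a dict (both hypothetical input formats), is:

```python
from collections import defaultdict


def make_bags(tiles, slide_labels):
    """Group (slide_id, tile) pairs into one bag per slide.

    Returns a list of (bag_of_tiles, slide_label) tuples.
    """
    bags = defaultdict(list)
    for slide_id, tile in tiles:
        bags[slide_id].append(tile)
    # Each bag inherits the label of the slide it came from.
    return [(bag, slide_labels[slide_id]) for slide_id, bag in bags.items()]
```

Each returned pair could then be served as a single training example, with the model pooling over the tiles in the bag.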

Reorganize slide classes

Slide classes should be reorganized based on dimensions and slide type.
This hierarchical class structure is more logical and will also help with making sure that the transforms work properly (#18). For example, some transforms may work for all 2D images regardless of the number of channels, but others may only be applicable to RGB images, and others may be specific to certain types (e.g. H&E stain deconvolution).

Test ML models

Add some tests to make sure that the ML models are working correctly.
For example, this may involve overfitting on a toy dataset and verifying that performance is above some threshold.
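
The shape of such a test is sketched below. A tiny pure-Python logistic regression stands in for the real models so that the example runs without PyTorch; the data, learning rate, and threshold are all illustrative assumptions.

```python
import math


def train_logistic(xs, ys, lr=0.5, steps=200):
    """Fit a 1-D logistic regression with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        # Gradient of the mean cross-entropy loss with respect to w and b.
        w -= lr * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        b -= lr * sum(p - y for p, y in zip(preds, ys)) / n
    return w, b


def test_model_can_overfit_toy_data():
    # Linearly separable toy data: a working model should fit it perfectly.
    xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
    w, b = train_logistic(xs, ys)
    accuracy = sum((w * x + b > 0) == bool(y) for x, y in zip(xs, ys)) / len(xs)
    assert accuracy >= 0.99
```

For the real models, the same pattern would apply with a handful of images and a fixed random seed, so that the test stays fast and deterministic.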

Chunk generator

Slide objects should have a method that returns an iterator over "chunks" so that the image can be processed chunk-wise instead of loading the entire thing into memory.
An abstract method should be defined in BaseSlide, but each slide type (e.g. HESlide, MultiparametricSlide) may have to implement it differently depending on the backend (e.g. openslide or bioformats).

Pseudocode:

slide = HESlide("/path/to/image.svs")

for chunk in slide.generate_chunks(level=0, size=1024, ...):
  # operate on each 1024x1024 chunk
  preprocess(chunk)
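
To make the iteration pattern concrete, here is an in-memory stand-in that works on a plain list of rows. A real implementation would read each region lazily from the backend instead of slicing a loaded image; the function signature here is an assumption.

```python
def generate_chunks(image, size, stride=None):
    """Yield (row, col, chunk) for square chunks of a 2-D image.

    `image` is a list of equal-length rows. Chunks that would cross the
    image edge are skipped.
    """
    stride = stride or size
    n_rows, n_cols = len(image), len(image[0])
    for r in range(0, n_rows - size + 1, stride):
        for c in range(0, n_cols - size + 1, stride):
            # Slice out a size x size window starting at (r, c).
            yield r, c, [row[c:c + size] for row in image[r:r + size]]
```

Because this is a generator, only one chunk is materialized at a time, which is the property that matters for whole-slide images.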

TMA support

Add support for tissue microarray (TMA) images.
This probably means adding functionality to take an input image and identify the separate cores.

We may be able to use TMAJ software, either directly through javabridge or as inspiration:

Example of TMA slides (source here):

Trouble installing dependencies for multiparametricslide

I just tried installing PathML and running the tests on a new VM and ran into problems with multiparametricslide.

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ python -m pytest
============================================= test session starts =============================================
platform linux -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/pathml
collected 74 items / 1 error / 73 selected                                                                    

=================================================== ERRORS ====================================================
___________________ ERROR collecting tests/preprocessing_tests/test_multiparametricslide.py ___________________
ImportError while importing test module '/home/jupyter/pathml/tests/preprocessing_tests/test_multiparametricslide.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pathml/preprocessing/multiparametricslide.py:10: in <module>
    import bioformats
E   ModuleNotFoundError: No module named 'bioformats'

During handling of the above exception, another exception occurred:
/opt/conda/envs/pathml/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/preprocessing_tests/test_multiparametricslide.py:5: in <module>
    from pathml.preprocessing.multiparametricslide import MultiparametricSlide2d
pathml/preprocessing/multiparametricslide.py:25: in <module>
    raise ImportError("MultiparametricSlide2d requires javabridge and bioformats")
E   ImportError: MultiparametricSlide2d requires javabridge and bioformats
============================================== warnings summary ===============================================
pathml/preprocessing/multiparametricslide.py:16
  /home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
              See: https://pythonhosted.org/javabridge/installation.html. You can install using:
                  
                  sudo apt-get install openjdk-8-jdk
                  pip install javabridge
                  pip install python-bioformats
          
    warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================================== short test summary info ===========================================
ERROR tests/preprocessing_tests/test_multiparametricslide.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================= 1 warning, 1 error in 0.62s =========================================

So then I tried to install openjdk using the instructions, but that didn't work:

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ sudo apt-get install openjdk-8-jdk
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package openjdk-8-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'openjdk-8-jdk' has no installation candidate

OS info:

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ uname -a
Linux shared-dxvm-gpu 4.19.0-12-cloud-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux

After some searching online, it seems like openjdk-8-jdk is not supported anymore (see here for example). I think the issue is that python-javabridge is not really being actively developed (see here). We need to find a better solution - either give users very detailed instructions for how to install openjdk-8 (which doesn't seem like a great solution since it isn't officially supported anymore) or drop bioformats/javabridge dependency and use a different tool to support multiparametric slides.

I am also confused about why the tests pass successfully, since they also use sudo apt-get install openjdk-8-jdk.

Run pipelines on datasets

Implement a method to run a preprocessing pipeline on a dataset.
Should basically be a convenience function for running pipeline on each individual image.

Pseudocode:

mydataset = pathml.datasets.PESO.download()
mypipeline = pathml.preprocessing.default_pipeline()

mypipeline.run(mydataset)
### should be equivalent to:
for wsi in mydataset:
    mypipeline.run(wsi)

Specify output directory in Pipeline init

Currently, tiles are written to disk in the tile_level_preprocessor component of the Pipeline.
It would be better to pass a path to the output directory when running the Pipeline object, and then write all tiles to that directory. This would allow for better integration with the DataModule class, since the entire DataModule could be initialized pointing to one directory and can then:

  1. download images there
  2. pass the directory path as input to Pipeline.run() and write all the tiles there
  3. create dataset and dataloader objects, since the full filepath is known.

Pseudocode:

# initialize pipeline
my_pipeline = Pipeline(
    slide_loader       = MySlideLoader(),
    slide_preprocessor = MySlidePreprocessor(),
    tile_extractor     = SimpleTileExtractor(tile_size=224),
    tile_preprocessor  = MyTilePreprocessor()
)

# initialize slide
slide = HESlide("/path/to/image.svs")

# run pipeline on slide
my_pipeline.run(slide, out_dir = "./data/preprocessed")

Support for DICOM Integration

Digital Imaging and Communications in Medicine (DICOM) is the standard for the representation, storage, and communication of medical images and related information. A DICOM file format and communication protocol for pathology have been defined. Whole slide image data can be encoded together with relevant patient and specimen-related metadata as DICOM objects.

As DICOM becomes more widely adopted in digital pathology, PathML may need to support this file format. One approach is to create a class that inherits from BaseSlide and can ingest DICOM files. The class could also implement DICOM-specific methods, such as reading metadata.

Use abstract classes

Classes which aren't meant to be instantiated should be abstract classes (i.e. inherit from abc.ABC).
This is probably cleaner than current implementation of raising NotImplementedError, since abstract classes can't be initialized by users (they will get an error).
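
A minimal illustration with the standard library's `abc` module follows. `BaseSlide` is the class named in this issue, while the `load_data` method and the `ToySlide` subclass are placeholders for illustration.

```python
from abc import ABC, abstractmethod


class BaseSlide(ABC):
    """Interface only; cannot be instantiated directly."""

    @abstractmethod
    def load_data(self):
        """Return the data for this slide."""


class ToySlide(BaseSlide):
    # A concrete subclass must override every abstract method.
    def load_data(self):
        return "pixels"
```

Calling `BaseSlide()` raises a `TypeError` at construction time, whereas the NotImplementedError approach only fails later, when the missing method is actually called.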

Documentation doesn't compile

Need to fix bugs in documentation. Should also add tests to make sure that docs compile successfully.

(pathml) jupyter@shared-dxvm-gpu:~/pathml/docs$ make html
Running Sphinx v3.4.2
making output directory... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 17 source files that are out of date
updating environment: [new config] 17 added, 0 changed, 0 removed
/home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
            See: https://pythonhosted.org/javabridge/installation.html. You can install using:
                
                sudo apt-get install openjdk-8-jdk
                pip install javabridge
                pip install python-bioformats
        
  warn(
WARNING: autodoc: failed to import module 'multiparametricslide' from module 'pathml.preprocessing'; the following exception was raised:
MultiparametricSlide2d requires javabridge and bioformats

Notebook error:
Problems with linked notebook "examples/link_advanced_HE_chunks" path:
InputError: [Errno 2] No such file or directory: '../examples/advanced_HE_chunks.ipynb'.
make: *** [Makefile:20: html] Error 2

Note that this issue is about the error in compilation. The javabridge warning should be fixed when we fix #48

Add CONTRIBUTING file

The CONTRIBUTING file gives instructions and guidelines for contributors. We can start with something basic and expand as needed down the road.

There is a contributing.rst file in the documentation, but we should have it in the root directory so it is easily accessible. Also, if it is in the root directory, GitHub will automatically link to this file when a contributor creates an issue or opens a pull request.

Helpful resources:

Create Issue Templates

Create templates for certain types of issues (e.g. feature request, bug report, etc.)

Extend multiparametric support to large images

Currently bioformats limits file size to 2GB because of java array size limitations.

Two options:

  1. Instantiate multiple 2GB chunks and build numpy array piecewise
  2. Support common filetypes (like .tif) with python dependencies, revert to java only when user provides a rare proprietary microscope file format.

Pipeline save method

We need a way to share pipelines by writing them to a file

Pseudocode:

my_pipeline = Pipeline(**kwargs)
my_pipeline.save("/path/to/disk/pipeline.pickle")

## someone else can then load and use:

pipeline = load("/path/to/local/downloads/pipeline.pickle")
pipeline.run(local_slide)

Add save_tiles method to SlideData class

Writing tiles to disk should be a method of the SlideData class. This makes more sense than writing tiles as part of the tile-level preprocessor in a Pipeline. The directory to write tiles to should be specified as an argument.

Pseudocode:

data = HESlide("/path/to/image.svs").load_data()
data = my_pipeline.run(data)
data.write_tiles("/path/to/tiled/images/")

repr for notebooks

We can define a SlideData._repr_html_() method (or maybe SlideData._repr_jpeg_()) which would let us produce pretty outputs in Jupyter Notebook.
For example we could make this method display a thumbnail of the image by default, along with some text describing it.
This would be nice for users since you could see the slide without having to call any methods.

This is lower priority but seems straightforward to implement

see: https://ipython.readthedocs.io/en/stable/config/integrating.html#rich-display
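
A text-only sketch of the idea is shown below. The constructor arguments are placeholders, and a thumbnail could be added later by embedding an image tag in the returned HTML.

```python
import html


class SlideData:
    def __init__(self, name, shape):
        self.name = name
        self.shape = shape

    def _repr_html_(self):
        # Jupyter calls this method, if present, to render the object in an output cell.
        return (
            f"<b>{html.escape(self.name)}</b><br>"
            f"shape: {html.escape(str(self.shape))}"
        )
```

Evaluating a `SlideData` object as the last expression in a notebook cell would then show the bold name and shape instead of the default repr.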

Sprint SlideData Class

Rewrite SlideData Class.

name: str
shape <- dict of shapes of slide, masks, tiles
slide: Slide (if slide loaded using Bioformats)
masks: pathml.Masks
tiles: pathml.Tiles
labels: (Masks, str, int, floats)
history: list
slidetype: str (e.g. "HE" or "IHC"). Set when SlideData class is initialized

init - use appropriate backend (openslide or bioformats)
repr
read_region(level)
make_tiles(Pipeline, optional)
chunks(shape, stride) --> generator of chunk objects
plot() --> matplotlib (also handle masks in plot)
save()

Refactor

Make SlideData the core pathml object, combine pipeline and transforms into methods in SlideData

Preprocessing has become a catch-all directory, improve directory structure

Standardize Openslide pixel resolution level

Different slides may have different microns per pixel (MPP) depending on the physical parameters of the scanner.
This means that for any two slides, level 0 may be at different pixel resolution.
We should provide a way to standardize pixel resolution of slides, so that we know that all images in a dataset are the same resolution.

OpenSlide objects have slide.properties["openslide.mpp-x"] and slide.properties["openslide.mpp-y"], which we can use.
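
The arithmetic is simple: the resize factor is the slide's MPP divided by the target MPP. The sketch below takes a plain dict in place of `slide.properties` so it runs without OpenSlide. `resize_factors` is a hypothetical helper, and it does not handle slides where the MPP properties are missing.

```python
def resize_factors(properties, target_mpp):
    """Return (x, y) scale factors that bring a slide to `target_mpp`.

    `properties` maps OpenSlide property names to string values.
    A factor below 1 means the image must be downsampled.
    """
    mpp_x = float(properties["openslide.mpp-x"])
    mpp_y = float(properties["openslide.mpp-y"])
    return mpp_x / target_mpp, mpp_y / target_mpp
```

For example, a slide scanned at 0.25 microns per pixel needs a factor of 0.5 in each direction to match a dataset standardized at 0.5 microns per pixel.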

Should we use PyTorchLightning? [discussion]

Should we use PyTorch Lightning in PathML?

Pros:

  • More logical code organization structure
  • May be easier for less technical users
    • Don't need to write training loops
    • Automatically handles multiGPU
    • Automatically handles mixed precision training
  • Popular framework actively being developed

Cons:

  • Overhead to refactor code to be compatible
  • One more external dependency
  • Committing to a specific framework may make PathML less flexible, decreasing utility

Other thoughts:

  • Would it be easy to support both? i.e. have models in base PyTorch, but also have lightning-compatible versions?
  • I haven't used pytorch-lightning myself, but @jmnyman has made the switch and it sounds like he has really benefited from it
  • If we decide to go the pytorch-lightning route, it probably won't be in v0.1 initial release. Pushing to first open-source release is top priority at the moment

Sprint TODO

TODO weekend sprint:
repo structure:
pathml

utils.py
-> core
-> preprocessing
-> ml
-> datasets

refactor:

  • SlideData
    • tests
    • in run add multiprocessing pool for tiles
    • save method
  • Slide and Subclasses
    • tests
  • Masks and Tiles
    • tests
    • polish docstrings
    • documentation
  • Refactor transforms
    • tests
    • docs
  • Rewrite documentation
    • update examples
  • Integration tests

Check image compatibility for transforms

Each transform should check that the input image is compatible. For example, colorspace conversions are not applicable to multiplex images, though blurring transforms that operate on each channel may still be. This should probably just be an assert statement at the beginning of the apply() method of each transform.
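
One way this could look, as a sketch only: the class names, the `check_compatible` hook, and the channel-average "grayscale" conversion below are all hypothetical and are not PathML's actual transforms. The image is assumed to be any array-like object exposing `.shape` and `.mean(axis=...)`, such as a NumPy array.

```python
class Transform:
    """Hypothetical base class, not PathML's actual API."""

    def check_compatible(self, image):
        """Subclasses override this to assert that `image` is a valid input."""

    def apply(self, image):
        # Validate first so incompatible inputs fail with a clear message.
        self.check_compatible(image)
        return self.transform(image)

    def transform(self, image):
        raise NotImplementedError


class ChannelAverageGrayscale(Transform):
    """Colorspace-style transform: only meaningful for 3-channel RGB input."""

    def check_compatible(self, image):
        assert len(image.shape) == 3 and image.shape[2] == 3, (
            f"Expected an RGB image of shape (H, W, 3), got shape {tuple(image.shape)}"
        )

    def transform(self, image):
        # Placeholder conversion: unweighted mean over the channel axis.
        return image.mean(axis=2)
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so raising an explicit exception may be the safer choice if the check must always run.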

Pipeline.run() argument

The argument of Pipeline.run() should be an object inheriting from BaseSlide, rather than a file path.
This means that whenever we run a pipeline, we can trust that it implements everything from BaseSlide.

If we just pass a path, it may be ambiguous how to read it (is it an H&E slide, or a multiplex slide, or...?). All the work in reading the file, etc. should happen when creating the slide object, not in the pipeline object.
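
A minimal sketch of the proposed contract. The `BaseSlide`, `HESlide`, and `Pipeline` classes here are simplified stand-ins written for illustration. The `read_image` method and the string it returns are placeholders, not the real PathML interfaces.

```python
class BaseSlide:
    """Stand-in for the BaseSlide interface; the subclass knows how to read its file."""

    def __init__(self, path):
        self.path = path

    def read_image(self):
        raise NotImplementedError


class HESlide(BaseSlide):
    def read_image(self):
        # Placeholder for real H&E reading logic.
        return f"H&E image from {self.path}"


class Pipeline:
    def run(self, slide):
        # Reject bare paths: the pipeline cannot know how to read them.
        if not isinstance(slide, BaseSlide):
            raise TypeError(
                f"Pipeline.run() expects a BaseSlide, got {type(slide).__name__}"
            )
        return slide.read_image()
```

With this shape, `Pipeline().run(HESlide("slide.svs"))` works, while `Pipeline().run("slide.svs")` fails immediately with a `TypeError` instead of guessing the slide type.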

pannuke bugs

Some bugs I ran into with the PanNuke dataset implementation.
Putting them here to track fixes and to add new tests.

1. dimensions wrong

pannuke_dset = PanNukeDataset(
    data_dir = "../data/pannuke",
    fold_ix = None,
    hovernet_preprocess = True,
    nucleus_type_labels = True,
)

im, mask, hv, t = pannuke_dset[0]

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-c5513ce6bd51> in <module>
----> 1 im, mask, hv, t = pannuke_dset[0]

~/pathml/pathml/datasets/pannuke.py in __getitem__(self, ix)
    110                 # sum across mask channels to squash mask channel dim to size 1
    111                 # don't sum the last channel, which is background!
--> 112                 mask_1c = pannuke_multiclass_mask_to_nucleus_mask(mask)
    113             else:
    114                 mask_1c = mask

~/pathml/pathml/datasets/pannuke.py in pannuke_multiclass_mask_to_nucleus_mask(multiclass_mask)
    135     """
    136     # verify shape of input
--> 137     assert multiclass_mask.ndim == 3 and multiclass_mask.shape[0] == 6, \
    138         f"Expecting a batch of masks with dims (6, 256, 256). Got input of shape {multiclass_mask.shape}"
    139     assert multiclass_mask.shape[1] == 256 and multiclass_mask.shape[2] == 256, \

AssertionError: Expecting a batch of masks with dims (6, 256, 256). Got input of shape (256, 6, 256)

2. _clean_up_download_pannuke() problem

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/", 
    download=True,
    nucleus_type_labels=True, 
    batch_size=8, 
    hovernet_preprocess=True,
    split=1,
    transforms=None,
)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-426a1c4fd670> in <module>
----> 1 pannuke = PanNukeDataModule(
      2     data_dir="../data/pannuke/",
      3     download=True,
      4     nucleus_type_labels=True,
      5     batch_size=8,

~/pathml/pathml/datasets/pannuke.py in __init__(self, data_dir, download, shuffle, transforms, nucleus_type_labels, split, batch_size, hovernet_preprocess)
    198         self.download = download
    199         if download:
--> 200             self._download_pannuke(self.data_dir)
    201         else:
    202             # make sure that subdirectories exist

~/pathml/pathml/datasets/pannuke.py in _download_pannuke(self, download_dir)
    241 
    242         self._process_downloaded_pannuke(download_dir)
--> 243         self._clean_up_download_pannuke(download_dir)
    244 
    245     @staticmethod

~/pathml/pathml/datasets/pannuke.py in _clean_up_download_pannuke(pannuke_dir)
    306             downloaded_dir = p / f"Fold {fold_ix}"
    307             zip_file.unlink()
--> 308             downloaded_dir.rmdir()
    309 
    310 

/opt/conda/envs/pathml/lib/python3.8/pathlib.py in rmdir(self)
   1333         if self._closed:
   1334             self._raise_closed()
-> 1335         self._accessor.rmdir(self)
   1336 
   1337     def lstat(self):

OSError: [Errno 39] Directory not empty: '../data/pannuke/Fold 1'
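
The OSError above arises because `Path.rmdir()` only removes empty directories. One possible fix, assuming nothing left in the fold directory is still needed after processing, is to remove the directory tree recursively with `shutil.rmtree`. The helper name below is hypothetical.

```python
import shutil
from pathlib import Path


def remove_downloaded_dir(downloaded_dir):
    """Delete a directory and everything inside it.

    Hypothetical helper: unlike Path.rmdir(), shutil.rmtree() does not
    require the directory to be empty.
    """
    downloaded_dir = Path(downloaded_dir)
    if downloaded_dir.is_dir():
        shutil.rmtree(downloaded_dir)
```

Whether leftover files indicate an incomplete processing step, rather than something safe to delete, would need to be checked before adopting this.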
