
innereye-deeplearning's Introduction

This project is now archived

This project is no longer under active maintenance. It is read-only, but you can still clone or fork the repo. Check here for further info. Please contact [email protected] if you run into trouble with the "Archived" state of the repo.

InnerEye-DeepLearning

Build Status

InnerEye-DeepLearning (IE-DL) is a toolbox for easily training deep learning models on 3D medical images. Simple to run both locally and in the cloud with AzureML, it allows users to train and run inference on the following:

  • Segmentation models.
  • Classification and regression models.
  • Any PyTorch Lightning model, via a bring-your-own-model setup.

In addition, this toolbox supports:

  • Cross-validation using AzureML, where the models for individual folds are trained in parallel. This is particularly important for the long-running training jobs often seen with medical images.
  • Hyperparameter tuning using Hyperdrive.
  • Building ensemble models.
  • Easy creation of new models via a configuration-based approach, and inheritance from an existing architecture (a configuration sketch follows this list).
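To illustrate the configuration-based approach, below is a minimal sketch of a segmentation model configuration, loosely assembled from the HelloWorld parameters that appear in a log further down this page. The base class and constructor arguments are assumptions that should be checked against InnerEye/ML/config.py:

    # Hypothetical model config, sketched from the HelloWorld parameters logged below.
    # Field names are assumptions; check them against SegmentationModelBase in InnerEye/ML/config.py.
    from InnerEye.ML.config import SegmentationModelBase

    class MyNewModel(SegmentationModelBase):
        def __init__(self) -> None:
            super().__init__(
                architecture="UNet3D",                    # inherit an existing architecture
                feature_channels=[4],
                crop_size=(64, 64, 64),
                image_channels=["channel1", "channel2"],
                ground_truth_ids=["region", "region_1"],
                class_weights=[0.02, 0.49, 0.49],
                num_epochs=2,
                train_batch_size=2,
            )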

Documentation

For all documentation, including setup guides and APIs, please refer to the IE-DL Read the Docs site.

Quick Setup

This quick setup assumes you are using a machine running Ubuntu with Git, Git LFS, Conda and Python 3.7+ installed. Please refer to the setup guide for more detailed instructions on getting InnerEye set up with other operating systems and installing the above prerequisites.

  1. Clone the InnerEye-DeepLearning repo by running the following command:

    git clone --recursive https://github.com/microsoft/InnerEye-DeepLearning && cd InnerEye-DeepLearning
  2. Create and activate your conda environment:

    conda env create --file environment.yml && conda activate InnerEye
  3. Verify that your installation was successful by running the HelloWorld model (no GPU required):

    python InnerEye/ML/runner.py --model=HelloWorld

If the above runs with no errors: Congratulations! You have successfully built your first model using the InnerEye toolbox.

If it fails, please check the troubleshooting page on the Wiki.

Full InnerEye Deployment

We offer a companion set of open-source tools that help integrate trained CT segmentation models with clinical software systems:

  • The InnerEye-Gateway is a Windows service that runs in a DICOM network and routes anonymized DICOM images to an inference service.
  • The InnerEye-Inference component offers a REST API that integrates with the InnerEye-Gateway to run inference on InnerEye-DeepLearning models.

Details can be found here.

(Deployment overview diagram: docs/deployment.png)

Benefits of InnerEye-DeepLearning

In combination with the power of AzureML, InnerEye provides the following benefits (an example job submission follows the list):

  • Traceability: AzureML keeps a full record of all experiments that were executed, including a snapshot of the code. Tags are added to the experiments automatically and can later help filter and find old experiments.
  • Transparency: All team members have access to each other's experiments and results.
  • Reproducibility: Two model training runs using the same code and data will result in exactly the same metrics. All sources of randomness are controlled for.
  • Cost reduction: Using AzureML, all compute resources (virtual machines, VMs) are requested at the time of starting the training job and freed up at the end. Idle VMs will not incur costs. Azure low priority nodes can be used to further reduce costs (up to 80% cheaper).
  • Scalability: Large numbers of VMs can be requested easily to cope with a burst in jobs.
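As an example of how these benefits are accessed in practice, the same runner that runs models locally also submits them to AzureML; the flag names below are taken from the runner's configuration options and should be double-checked against the documentation:

    python InnerEye/ML/runner.py --model=HelloWorld --azureml=True

A cross-validation run, with the folds trained in parallel, would add one more flag:

    python InnerEye/ML/runner.py --model=HelloWorld --azureml=True --number_of_cross_validation_splits=5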

Despite the cloud focus, InnerEye is designed to run locally too, which is important for model prototyping, debugging, and cases where the cloud can't be used. If you already have GPU machines available, you can use them with the InnerEye toolbox.

Licensing

MIT License

You are responsible for the performance, the necessary testing, and if needed any regulatory clearance for any of the models produced by this toolbox.

Acknowledging usage of Project InnerEye OSS tools

When using Project InnerEye open-source software (OSS) tools, please acknowledge with the following wording:

This project used Microsoft Research's Project InnerEye open-source software tools (https://aka.ms/InnerEyeOSS).

Contact

If you have any feature requests, or find issues in the code, please create an issue on GitHub.

Please send an email to [email protected] if you would like further information about this project.

Publications

Oktay O., Nanavati J., Schwaighofer A., Carter D., Bristow M., Tanno R., Jena R., Barnett G., Noble D., Rimmer Y., Glocker B., O’Hara K., Bishop C., Alvarez-Valle J., Nori A.: Evaluation of Deep Learning to Augment Image-Guided Radiotherapy for Head and Neck and Prostate Cancers. JAMA Netw Open. 2020;3(11):e2027426. doi:10.1001/jamanetworkopen.2020.27426

Bannur S., Oktay O., Bernhardt M, Schwaighofer A., Jena R., Nushi B., Wadhwani S., Nori A., Natarajan K., Ashraf S., Alvarez-Valle J., Castro D. C.: Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs. ICML 2021 Workshop on Interpretable Machine Learning in Healthcare. https://arxiv.org/abs/2107.06618

Bernhardt M., Castro D. C., Tanno R., Schwaighofer A., Tezcan K. C., Monteiro M., Bannur S., Lungren M., Nori S., Glocker B., Alvarez-Valle J., Oktay O.: Active label cleaning for improved dataset quality under resource constraints. https://www.nature.com/articles/s41467-022-28818-3. Accompanying code: InnerEye-DataQuality

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Maintenance

This toolbox is maintained by the Microsoft Medical Image Analysis team.

innereye-deeplearning's People

Contributors

annaschroder, ant0nsc, arsenkhy, asantamariapang, csiebler, dccastro, dumbledad, erann1987, fepegar, harshita-s, ivantarapov, jacopoteneggi, javier-alvarez, jaysnanavati, jonathantripp, kh296, ktakeda1, maxilse, mebristo, melanibe, peterhessey, pre-commit-ci[bot], sarthakpati, sennendoko, shruthi42, stevehaigh, vale-salvatelli

innereye-deeplearning's Issues

Improve monitoring usability

Fix documentation around monitoring: There is no azure_runner.py. Ensure that monitor.py is well documented.
Enable monitor.py to pick up most_recent_run.txt
Add output to runner that explains how to run monitor
Add --monitor to runner
Rename to --tensorboard? tensorboard.py?
Check if monitor.py can be used on local runs - if it can, add that to the explanation when we build the small local model.

Run recovery of a typical PR model fails with a cuda/cpu error

Running training recovery on a BasicModel2Epochs fails with

2020-09-04T21:10:54Z ERROR    Model training/testing failed. Exception: expected device cpu but got device cuda:0
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/runner.py", line 313, in run_in_situ
    self.create_ml_runner().run()
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/run_ml.py", line 199, in run
    model_train(self.model_config, run_recovery)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 147, in model_train
    train_epoch_results = train_or_validate_epoch(training_steps)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 283, in train_or_validate_epoch
    sample, batch_index, train_val_params.epoch)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training_steps.py", line 610, in forward_and_backward_minibatch
    mask=mask)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 101, in forward_pass_patches
    result = self._forward_pass_with_anomaly_detection(patches=patches, mask=mask, labels=labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 119, in _forward_pass_with_anomaly_detection
    return self._forward_pass(patches, mask, labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 144, in _forward_pass
    single_optimizer_step(self.config, loss, self.optimizer)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 188, in single_optimizer_step
    optimizer.step(closure=None)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/adam.py", line 95, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: expected device cpu but got device cuda:0

Can we retire blobxfer and use AzureML APIs for download?

Using blobxfer means we need to have a lot of storage-key infrastructure in place. It would be better if we could use one identity both to access AML and to download datasets.
The big advantage would be that there is no longer a need to maintain storage accounts and access keys for datasets, simplifying the setup on the user side.
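A possible replacement, sketched with the AzureML SDK; the dataset name and target path are placeholders, and the exact call pattern would need checking against the SDK version pinned in the environment:

    # Sketch: download a registered AzureML dataset without blobxfer or storage account keys.
    from azureml.core import Dataset, Workspace

    workspace = Workspace.from_config()                             # authenticates via config.json / interactive login
    dataset = Dataset.get_by_name(workspace, name="my_dataset")     # placeholder dataset name
    dataset.download(target_path="datasets/my_dataset", overwrite=False)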

Coordinate documentation

Work item for managing documentation as a whole, as opposed to specific documents which should have their own work items.

Suggested outline:

  • README.md
  • Description of the project
  • Documentation
  • Setting up environment
  • Creating datasets
  • Building segmentation models on AML
      • Lung Challenge
  • Building classification models on AML
      • Glaucoma
  • Debugging and monitoring models
  • Testing
  • How to do pull requests
  • Roadmap
  • How to deploy a model on AML
  • How to deploy a model on ASH

Automate the creation of the Azure setup

Using Azure Resource Manager, automate the creation of a workspace and a storage account.
Can we also create the Service Principal and the training clusters?
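One possible starting point, sketched with the AzureML Python SDK (which drives Azure Resource Manager under the hood) rather than raw ARM templates; all names, the subscription ID and the VM size are placeholders, and the Service Principal part is not covered here:

    # Sketch: create a workspace (with its backing storage account) and a training cluster.
    from azureml.core import Workspace
    from azureml.core.compute import AmlCompute, ComputeTarget

    ws = Workspace.create(name="InnerEyeWorkspace",                 # placeholder names throughout
                          subscription_id="<subscription-id>",
                          resource_group="InnerEyeRG",
                          create_resource_group=True,
                          location="westeurope")

    cluster_config = AmlCompute.provisioning_configuration(vm_size="Standard_ND24s",
                                                           min_nodes=0,
                                                           max_nodes=4)
    ComputeTarget.create(ws, "training-cluster", cluster_config).wait_for_completion(show_output=True)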

Register models on the run, rather than in the workspace

At present, all models are registered in AzureML via Model.register. This means that, from the list of models, there is no reference back to the run that generated the model.
Instead, in AzureML runs, register the models via run.register_model. The caveat is that this API does not allow child paths, so the model files have to be copied into a separate folder first.
For commandline runs, continue to register via Model.register.
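A rough sketch of the proposed split; the model name, the paths, and the offline-run check are assumptions:

    # Sketch: register from the run when inside AzureML, fall back to Model.register otherwise.
    from azureml.core import Model, Run, Workspace

    run = Run.get_context()
    if run.id.startswith("OfflineRun"):                             # assumption: offline runs carry this id prefix
        # Commandline run: keep registering directly against the workspace.
        workspace = Workspace.from_config()
        Model.register(workspace=workspace, model_name="MyModel", model_path="outputs/final_model")
    else:
        # AzureML run: the registered model keeps a reference back to the run that produced it.
        # Caveat from above: run.register_model does not accept child paths, so the model files
        # must first be copied into a flat folder under outputs/.
        run.register_model(model_name="MyModel", model_path="outputs/final_model")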

Ensemble aggregation is using storage_account instead of Run context

2020-09-12T03:23:17Z INFO
Starting the daemon thread to refresh tokens in background for process with pid = 134

The experiment failed. Finalizing run...
2020-09-12T03:23:17Z INFO Exiting context: TrackUserError
2020-09-12T03:23:17Z INFO Exiting context: RunHistory
[2020-09-12T03:23:17.966397] TimeoutHandler init
[2020-09-12T03:23:17.966479] TimeoutHandler enter
Cleaning up all outstanding Run operations, waiting 300.0 seconds
9 items cleaning up...
Cleanup took 0.9337708950042725 seconds
[2020-09-12T03:23:19.147607] TimeoutHandler exit
2020-09-12T03:23:19Z INFO Exiting context: Dataset
Enter exit of DatasetContextManager
Exit exit of DatasetContextManager
2020-09-12T03:23:19Z INFO Exiting context: ProjectPythonPath
Traceback (most recent call last):
  File "InnerEye/ML/runner.py", line 391, in <module>
    main()
  File "InnerEye/ML/runner.py", line 387, in main
    post_cross_validation_hook=default_post_cross_validation_hook)
  File "InnerEye/ML/runner.py", line 381, in run
    return runner.run()
  File "InnerEye/ML/runner.py", line 249, in run
    self.run_in_situ()
  File "InnerEye/ML/runner.py", line 329, in run_in_situ
    self.wait_for_cross_val_runs_to_finish_and_aggregate()
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/stopit/utils.py", line 148, in wrapper
    return func(*args, **kwargs)
  File "InnerEye/ML/runner.py", line 124, in wait_for_cross_val_runs_to_finish_and_aggregate
    self.create_ensemble_model()
  File "InnerEye/ML/runner.py", line 158, in create_ensemble_model
    self.azure_config, self.model_config, PARENT_RUN_CONTEXT, output_subdir_name=OTHER_RUNS_SUBDIR_NAME)
  File "/mnt/batch/tasks/shared/LS_root/jobs/radnetinnereyedev/azureml/hd_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/mounts/workspaceblobstore/azureml/HD_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/InnerEye/ML/utils/run_recovery.py", line 92, in download_checkpoints_from_run
    run=run
  File "/mnt/batch/tasks/shared/LS_root/jobs/radnetinnereyedev/azureml/hd_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/mounts/workspaceblobstore/azureml/HD_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/InnerEye/Azure/azure_config.py", line 273, in download_outputs_from_run
    raise ValueError("self.storage_account cannot be None")
ValueError: self.storage_account cannot be None

2020/09/12 03:23:31 logger.go:293: Failed to run the wrapper cmd with err: exit status 1
2020/09/12 03:23:31 sysutils_linux.go:221: mpirun version string: {
mpirun (Open MPI) 3.1.2

Inconsistent model registration if you use the code as submodule

When you train a model with InnerEye used as a submodule, the registered model files look like this:

• Environment.yml
• model_inference_config.json
• innereye-deeplearning/score.py

However, when you train a model directly from the innereye-deeplearning repository, they look like this:

• Environment.yml
• model_inference_config.json
• score.py

VM sizes for clusters

Add the VM sizes required for the radiotherapy models to the docs for cluster creation.

Clean up spurious module loading errors

We see most builds showing repeated errors saying "Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception cannot import name '_DistributedTraining' from 'azureml.train._distributed_training' (/home/jaalvare/miniconda3/envs/InnerEye/lib/python3.7/site-packages/azureml/train/_distributed_training.py).". Can those be suppressed or avoided altogether?

AB#3927

Possibility to run without Azure

It would be useful for development purposes to be able to run the toolbox entirely on localhost, without needing Azure.

In addition, an organization that is not (yet) authorized to use a specific cloud provider, or any cloud provider, could still use InnerEye while that authorization is pending.
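For reference, the runner can already be pointed at a local dataset folder, which would be the natural starting point for a fully offline workflow. The local_dataset parameter appears in the HelloWorld log below; the exact invocation should be double-checked:

    python InnerEye/ML/runner.py --model=HelloWorld --local_dataset=/path/to/local/dataset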

Enable monitor.py on local runs?

Tensorboard monitoring is presently hooked up to AzureML runs. It should be possible to point tensorboard to a local folder, and start the monitoring script for local runs. Upon job start, instructions should be printed out.
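Assuming the event files end up in the local outputs folder, pointing TensorBoard at it by hand would look roughly like this:

    tensorboard --logdir outputs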

AB#3926

Data augmentation

  • GPU augmentations at training time should be configurable
  • Document the benefits of this feature

AB#3925

HelloWorld fails on WSL1/Ubuntu distribution

(InnerEye) maher@PC:~/InnerEye-DeepLearning$ python InnerEye/ML/runner.py --model=HelloWorld
Setting up logging to stdout.
Setting logging level to 20
2020-09-25T13:50:35Z INFO rpdb is handling traps. To debug: identify the main runner.py process, then as root: kill -TRAP <process_id>; nc 127.0.0.1 4444
2020-09-25T13:50:37Z INFO Found class HelloWorld in file /home/maher/InnerEye-DeepLearning/InnerEye/ML/configs/segmentation/HelloWorld.py
2020-09-25T13:50:37Z INFO Creating the default output folder structure.
2020-09-25T13:50:37Z INFO Running outside of AzureML.
2020-09-25T13:50:37Z INFO All results will be written to a subfolder of the project root folder.
2020-09-25T13:50:37Z INFO Run outputs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld
2020-09-25T13:50:37Z INFO Logs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/logs
2020-09-25T13:50:37Z INFO Creating the adjusted output folder structure.
2020-09-25T13:50:37Z INFO Running outside of AzureML.
2020-09-25T13:50:37Z INFO All results will be written to a subfolder of the project root folder.
2020-09-25T13:50:37Z INFO Run outputs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld
2020-09-25T13:50:37Z INFO Logs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/logs
2020-09-25T13:50:37Z INFO extra_code_directory is unset
Setting logging level to 20
Setting up logging with level 20 to file /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/logs/stdout.txt
2020-09-25T13:50:47Z INFO Setting multiprocessing start method to 'forkserver'
2020-09-25T13:50:47Z INFO Model training will use the local dataset provided in /home/maher/InnerEye-DeepLearning/Tests/ML/test_data
2020-09-25T13:50:47Z INFO
Arguments:
__center_size_param_value: None
__dataset_data_frame_param_value: None
__inference_stride_size_param_value: None
__largest_connected_component_foreground_classes_param_value: None
__min_l_rate_param_value: 0
__model_category_param_value: ModelCategory.Segmentation
__model_name_param_value: HelloWorld
__overrides_param_value: None
__use_gpu_param_value: False
_architecture_param_value: UNet3D
_class_weights_param_value: [0.02, 0.49, 0.49]
_colours_param_value: [(130, 183, 14), (238, 127, 26)]
_comparison_blob_storage_paths_param_value: None
_crop_size_param_value: (64, 64, 64)
_datasets_for_inference: None
_datasets_for_training: None
_feature_channels_param_value: [4]
_file_system_config_param_value:
_fill_holes_param_value: [True, True]
_ground_truth_ids_display_names_param_value: ['region', 'region_1']
_ground_truth_ids_param_value: ['region', 'region_1']
_image_channels_param_value: ['channel1', 'channel2']
_instance__params : {}
_l_rate_multi_step_milestones_param_value: None
_level_param_value: 50
_local_dataset_param_value: /home/maher/InnerEye-DeepLearning/Tests/ML/test_data
_mask_id_param_value: mask
_multiprocessing_start_method_param_value: MultiprocessingStartMethod.forkserver
_name_param_value : HelloWorld00008
_norm_method_param_value: PhotometricNormalizationMethod.CtWindow
_num_dataload_workers_param_value: 0
_num_epochs_param_value: 2
_param_watchers : {}
_save_start_epoch_param_value: 1
_save_step_epochs_param_value: 1
_slice_exclusion_rules_param_value: []
_start_epoch_param_value: 0
_summed_probability_rules_param_value: []
_tail_param_value : None
_test_crop_size_param_value: (64, 64, 64)
_test_diff_epochs_param_value: 1
_test_start_epoch_param_value: 2
_test_step_epochs_param_value: 1
_train_batch_size_param_value: 2
_use_mixed_precision_param_value: True
_window_param_value: 200
initialized : True
param : <param.parameterized.Parameters object at 0x7fceaeff2ef0>

2020-09-25T13:50:47Z INFO
2020-09-25T13:50:47Z INFO **** STARTING: Model training **********************************************************************
2020-09-25T13:50:47Z INFO
2020-09-25T13:50:47Z INFO Train: 3, Test: 1, and Val: 2. Total subjects: 6
2020-09-25T13:50:47Z INFO Model Training: Random seed set to: 42
2020-09-25T13:50:47Z INFO Starting to read and parse the datasets.
2020-09-25T13:50:47Z INFO Processing dataset (name=None)
2020-09-25T13:50:47Z INFO Processing dataset (name=None)
2020-09-25T13:50:47Z INFO Creating the data loader for the training set.
2020-09-25T13:50:47Z INFO Creating the data loader for the validation set.
2020-09-25T13:50:47Z INFO Finished creating the data loaders.
2020-09-25T13:50:48Z INFO Models are saved at /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints
2020-09-25T13:50:48Z INFO Writing model summary to: logs/model_summaries/model_log001.txt
Attempted to log scalar metric LoggingColumns.NumTrainableParameters:
506139
2020-09-25T13:50:48Z INFO Making no adjustments to the model because no GPU was found.
2020-09-25T13:50:50Z INFO Starting training
2020-09-25T13:50:50Z INFO Starting epoch 1
2020-09-25T13:50:50Z INFO Loaded the first minibatch of training data in 0.15 sec.
2020-09-25T13:50:57Z INFO Epoch 1 training took 6.51 sec of which data loading took 0.21 sec
2020-09-25T13:50:57Z INFO Model Training: Random seed set to: 42
2020-09-25T13:50:57Z INFO Loaded the first minibatch of validation data in 0.10 sec.
2020-09-25T13:50:58Z INFO Epoch 1 validation took 1.69 sec of which data loading took 0.10 sec
2020-09-25T13:50:58Z INFO Saved model checkpoint for epoch 1 to /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints/1_checkpoint.pth.tar
2020-09-25T13:50:58Z INFO Starting epoch 2
2020-09-25T13:50:59Z INFO Loaded the first minibatch of training data in 0.11 sec.
2020-09-25T13:51:04Z INFO Epoch 2 training took 5.88 sec of which data loading took 0.16 sec
2020-09-25T13:51:05Z INFO Model Training: Random seed set to: 42
2020-09-25T13:51:05Z INFO Loaded the first minibatch of validation data in 0.11 sec.
2020-09-25T13:51:06Z INFO Epoch 2 validation took 1.75 sec of which data loading took 0.11 sec
2020-09-25T13:51:06Z INFO Saved model checkpoint for epoch 2 to /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints/2_checkpoint.pth.tar
2020-09-25T13:51:06Z INFO Finished training
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** FINISHED: Model training after 19.01 seconds **************************************************
2020-09-25T13:51:06Z INFO
Attempted to log scalar metric Train epochs:
2
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** STARTING: Registering default model ***********************************************************
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z WARNING Not registering a model, because the run has no associated experiment
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** FINISHED: Registering default model after 0.00 seconds ****************************************
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** STARTING: Running default model on test set ***************************************************
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO Train: 3, Test: 1, and Val: 2. Total subjects: 6
2020-09-25T13:51:06Z INFO Model Training: Random seed set to: 42
2020-09-25T13:51:06Z INFO Results directory: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/epoch_002/Test
2020-09-25T13:51:06Z INFO Starting evaluation of model HelloWorld on epoch 2 Test set
2020-09-25T13:51:06Z INFO Processing dataset (name=None)
2020-09-25T13:51:06Z INFO Processing dataset (name=None)
2020-09-25T13:51:06Z INFO Processing dataset (name=None)
2020-09-25T13:51:06Z INFO Loading checkpoint /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints/2_checkpoint.pth.tar
2020-09-25T13:51:06Z INFO Loaded checkpoint (epoch: 2)
2020-09-25T13:51:06Z INFO Writing model summary to: logs/model_summaries/model_log001.txt
Setting up logging with level 20 to file logs/model_summaries/model_log001.txt
2020-09-25T13:51:07Z INFO Making no adjustments to the model because no GPU was found.
2020-09-25T13:51:09Z INFO Predicting for image 1 of 1...
2020-09-25T13:51:09Z INFO Inference pipeline (0), Predicting patient: 6
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/forkserver.py", line 186, in main
    with socket.socket(socket.AF_UNIX, fileno=listener_fd) as listener, \
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/socket.py", line 151, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 22] Invalid argument: 'protocol'
2020-09-25T13:51:26Z ERROR Model training/testing failed. Exception: unexpected EOF
Traceback (most recent call last):
  File "InnerEye/ML/runner.py", line 360, in run_in_situ
    self.create_ml_runner().run()
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 318, in run
    best_epoch = self.run_inference_and_register_model(run_recovery, ModelProcessing.DEFAULT)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 340, in run_inference_and_register_model
    test_metrics, val_metrics, _ = self.model_inference_train_and_test(RUN_CONTEXT, run_recovery, model_proc)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 663, in model_inference_train_and_test
    test_metrics = run_model_test(ModelExecutionMode.TEST)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 659, in run_model_test
    return model_test(config, data_split=data_split, run_recovery=run_recovery, model_proc=model_proc)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/model_testing.py", line 72, in model_test
    return segmentation_model_test(config, data_split, run_recovery, model_proc)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/model_testing.py", line 102, in segmentation_model_test
    run_recovery=run_recovery)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/model_testing.py", line 180, in segmentation_model_test_epoch
    with Pool(processes=num_workers) as pool:
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/popen_forkserver.py", line 55, in _launch
    self.pid = forkserver.read_signed(self.sentinel)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/forkserver.py", line 312, in read_signed
    raise EOFError('unexpected EOF')
EOFError: unexpected EOF

(InnerEye) maher@PC:~/InnerEye-DeepLearning$

Allow custom experiment names

We presently hardcode the AzureML experiment name to be the git branch name. We could add a switch to make that configurable on the commandline.

Store number of trainable parameters in config for later use

At the moment, the model summary code in generate_and_print_model_summary does not store any of its results. The number of trainable parameters is logged to AzureML, but not anywhere else.
It would be helpful to store the key output of the model summary, the number of trainable parameters, inside the model config or at some other place so that it can be accessed later. In particular, it is needed when we create a model dashboard in a post-cross-validation hook.
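Counting the parameters themselves is a one-liner in PyTorch, so the work here is mainly deciding where to store the result; the attribute name in the final comment is hypothetical:

    # Standard PyTorch count of trainable parameters; where to store the result is what this issue is about.
    import torch.nn as nn

    def count_trainable_parameters(model: nn.Module) -> int:
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # e.g. inside generate_and_print_model_summary (attribute name is hypothetical):
    # config.num_trainable_parameters = count_trainable_parameters(model)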

AB#3923

Training starts from epoch 0 when using a run recovery object.

When continuing to train from a recovered run, the training run starts again from epoch 0. This has the side effect of causing the inference run to use the run recovery object instead of the weights from the last epoch on the validation and test datasets.

Pick up git information from local repository, rather than via commandline args

Git commit ID, author, etc. are presently expected in commandline arguments, because we have been using the runner from DevOps pipelines. Going forward, we should expect that most people will use the code from their local boxes. To get better traceability, we should change the code to also pick up git information from there.
This will also make calling the runner in DevOps pipelines a lot easier.
Can we use something like gitpython to achieve that? https://github.com/gitpython-developers/GitPython
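A minimal GitPython sketch of what picking up the information locally could look like (assumes the gitpython package would be added to the environment):

    # Sketch: read commit ID, author, branch and dirty state from the local repository.
    from git import Repo

    repo = Repo(search_parent_directories=True)   # find the enclosing git repository
    commit = repo.head.commit
    git_info = {
        "commit_id": commit.hexsha,
        "commit_author": commit.author.name,
        "branch": repo.active_branch.name,        # raises on a detached HEAD
        "is_dirty": repo.is_dirty(),
    }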

"conda env remove --name InnerEye" fails because `tqdm` is overloaded

Once PYTHONPATH is set to the repository root, it is no longer possible to remove environments (or, quite possibly, do anything with environments). Conda uses the tqdm package, and we have a stub tqdm.py at the repository root (added to avoid further dependencies) that shadows it.

antonsc@MSR:/mnt/c/git$ conda env remove -n InnerEye
Traceback (most recent call last):
  File "/home/antonsc/miniconda2/bin/conda-env", line 6, in <module>
    from conda_env.cli.main import main
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda_env/cli/main.py", line 13, in <module>
    import conda.exports  # noqa
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/exports.py", line 25, in <module>
    from . import plan  # NOQA
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/plan.py", line 26, in <module>
    from .core.link import PrefixSetup, UnlinkLinkTransaction
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/core/link.py", line 43, in <module>
    from ..resolve import MatchSpec
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/resolve.py", line 9, in <module>
    from tqdm import tqdm
  File "/mnt/c/git/InnerEye-DeepLearning/tqdm.py", line 14
    def tqdm(arg: Any, *_rest: Any) -> Any:
                ^
SyntaxError: invalid syntax
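Until the stub is removed or renamed, a workaround consistent with the traceback above is to clear PYTHONPATH for the conda invocation (or run conda from outside the repository), for example:

    PYTHONPATH= conda env remove --name InnerEye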
