
innereye-deeplearning's Introduction

This project is now archived

This project is no longer under active maintenance. It is read-only, but you can still clone or fork the repo. Check here for further info. Please contact [email protected] if you run into trouble with the "Archived" state of the repo.

InnerEye-DeepLearning

Build Status

InnerEye-DeepLearning (IE-DL) is a toolbox for easily training deep learning models on 3D medical images. Simple to run both locally and in the cloud with AzureML, it allows users to train and run inference on the following:

  • Segmentation models.
  • Classification and regression models.
  • Any PyTorch Lightning model, via a bring-your-own-model setup.

In addition, this toolbox supports:

  • Cross-validation using AzureML, where the models for individual folds are trained in parallel. This is particularly important for the long-running training jobs often seen with medical images.
  • Hyperparameter tuning using Hyperdrive.
  • Building ensemble models.
  • Easy creation of new models via a configuration-based approach, and inheritance from an existing architecture (a configuration sketch follows this list).
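To illustrate the configuration-based approach, below is a minimal sketch of a segmentation model configuration, loosely assembled from the HelloWorld parameters that appear in a log further down this page. The base class and constructor arguments are assumptions that should be checked against InnerEye/ML/config.py:

    # Hypothetical model config, sketched from the HelloWorld parameters logged below.
    # Field names are assumptions; check them against SegmentationModelBase in InnerEye/ML/config.py.
    from InnerEye.ML.config import SegmentationModelBase

    class MyNewModel(SegmentationModelBase):
        def __init__(self) -> None:
            super().__init__(
                architecture="UNet3D",                    # inherit an existing architecture
                feature_channels=[4],
                crop_size=(64, 64, 64),
                image_channels=["channel1", "channel2"],
                ground_truth_ids=["region", "region_1"],
                class_weights=[0.02, 0.49, 0.49],
                num_epochs=2,
                train_batch_size=2,
            )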

Documentation

For all documentation, including setup guides and APIs, please refer to the IE-DL Read the Docs site.

Quick Setup

This quick setup assumes you are using a machine running Ubuntu with Git, Git LFS, Conda and Python 3.7+ installed. Please refer to the setup guide for more detailed instructions on getting InnerEye set up with other operating systems and installing the above prerequisites.

  1. Clone the InnerEye-DeepLearning repo by running the following command:

    git clone --recursive https://github.com/microsoft/InnerEye-DeepLearning && cd InnerEye-DeepLearning
  2. Create and activate your conda environment:

    conda env create --file environment.yml && conda activate InnerEye
  3. Verify that your installation was successful by running the HelloWorld model (no GPU required):

    python InnerEye/ML/runner.py --model=HelloWorld

If the above runs with no errors: Congratulations! You have successfully built your first model using the InnerEye toolbox.

If it fails, please check the troubleshooting page on the Wiki.

Full InnerEye Deployment

We offer a companion set of open-source tools that help integrate trained CT segmentation models with clinical software systems:

  • The InnerEye-Gateway is a Windows service that runs in a DICOM network and routes anonymized DICOM images to an inference service.
  • The InnerEye-Inference component offers a REST API that integrates with the InnerEye-Gateway to run inference on InnerEye-DeepLearning models.

Details can be found here.

(Deployment overview diagram: docs/deployment.png)

Benefits of InnerEye-DeepLearning

In combination with the power of AzureML, InnerEye provides the following benefits (an example job submission follows the list):

  • Traceability: AzureML keeps a full record of all experiments that were executed, including a snapshot of the code. Tags are added to the experiments automatically and can later help filter and find old experiments.
  • Transparency: All team members have access to each other's experiments and results.
  • Reproducibility: Two model training runs using the same code and data will result in exactly the same metrics. All sources of randomness are controlled for.
  • Cost reduction: Using AzureML, all compute resources (virtual machines, VMs) are requested at the time of starting the training job and freed up at the end. Idle VMs will not incur costs. Azure low priority nodes can be used to further reduce costs (up to 80% cheaper).
  • Scalability: Large numbers of VMs can be requested easily to cope with a burst in jobs.
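As an example of how these benefits are accessed in practice, the same runner that runs models locally also submits them to AzureML; the flag names below are taken from the runner's configuration options and should be double-checked against the documentation:

    python InnerEye/ML/runner.py --model=HelloWorld --azureml=True

A cross-validation run, with the folds trained in parallel, would add one more flag:

    python InnerEye/ML/runner.py --model=HelloWorld --azureml=True --number_of_cross_validation_splits=5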

Despite the cloud focus, InnerEye is designed to run locally too, which is important for model prototyping, debugging, and cases where the cloud can't be used. If you already have GPU machines available, you can use them with the InnerEye toolbox.

Licensing

MIT License

You are responsible for the performance, the necessary testing, and if needed any regulatory clearance for any of the models produced by this toolbox.

Acknowledging usage of Project InnerEye OSS tools

When using Project InnerEye open-source software (OSS) tools, please acknowledge with the following wording:

This project used Microsoft Research's Project InnerEye open-source software tools (https://aka.ms/InnerEyeOSS).

Contact

If you have any feature requests, or find issues in the code, please create an issue on GitHub.

Please send an email to [email protected] if you would like further information about this project.

Publications

Oktay O., Nanavati J., Schwaighofer A., Carter D., Bristow M., Tanno R., Jena R., Barnett G., Noble D., Rimmer Y., Glocker B., O’Hara K., Bishop C., Alvarez-Valle J., Nori A.: Evaluation of Deep Learning to Augment Image-Guided Radiotherapy for Head and Neck and Prostate Cancers. JAMA Netw Open. 2020;3(11):e2027426. doi:10.1001/jamanetworkopen.2020.27426

Bannur S., Oktay O., Bernhardt M, Schwaighofer A., Jena R., Nushi B., Wadhwani S., Nori A., Natarajan K., Ashraf S., Alvarez-Valle J., Castro D. C.: Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs. ICML 2021 Workshop on Interpretable Machine Learning in Healthcare. https://arxiv.org/abs/2107.06618

Bernhardt M., Castro D. C., Tanno R., Schwaighofer A., Tezcan K. C., Monteiro M., Bannur S., Lungren M., Nori S., Glocker B., Alvarez-Valle J., Oktay O.: Active label cleaning for improved dataset quality under resource constraints. https://www.nature.com/articles/s41467-022-28818-3. Accompanying code: InnerEye-DataQuality

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Maintenance

This toolbox is maintained by the Microsoft Medical Image Analysis team.

innereye-deeplearning's People

Contributors

annaschroder, ant0nsc, arsenkhy, asantamariapang, csiebler, dccastro, dumbledad, erann1987, fepegar, harshita-s, ivantarapov, jacopoteneggi, javier-alvarez, jaysnanavati, jonathantripp, kh296, ktakeda1, maxilse, mebristo, melanibe, peterhessey, pre-commit-ci[bot], sarthakpati, sennendoko, shruthi42, stevehaigh, vale-salvatelli

innereye-deeplearning's Issues

Improve monitoring usability

Fix documentation around monitoring: There is no azure_runner.py. Ensure that monitor.py is well documented.
Enable monitor.py to pick up most_recent_run.txt
Add output to runner that explains how to run monitor
Add --monitor to runner
Rename to --tensorboard? tensorboard.py?
Check if monitor.py can be used on local runs - if it can, add that to the explanation when we build the small local model.

Run recovery of a typical PR model fails with a cuda/cpu error

Running training recovery on a BasicModel2Epochs fails with

2020-09-04T21:10:54Z ERROR    Model training/testing failed. Exception: expected device cpu but got device cuda:0
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/runner.py", line 313, in run_in_situ
    self.create_ml_runner().run()
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/run_ml.py", line 199, in run
    model_train(self.model_config, run_recovery)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 147, in model_train
    train_epoch_results = train_or_validate_epoch(training_steps)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training.py", line 283, in train_or_validate_epoch
    sample, batch_index, train_val_params.epoch)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/model_training_steps.py", line 610, in forward_and_backward_minibatch
    mask=mask)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 101, in forward_pass_patches
    result = self._forward_pass_with_anomaly_detection(patches=patches, mask=mask, labels=labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 119, in _forward_pass_with_anomaly_detection
    return self._forward_pass(patches, mask, labels)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 144, in _forward_pass
    single_optimizer_step(self.config, loss, self.optimizer)
  File "/mnt/batch/tasks/shared/LS_root/jobs/innereye-deeplearning/azureml/refs_pull_197_merge_1599253709_09897648/mounts/workspaceblobstore/azureml/refs_pull_197_merge_1599253709_09897648/Submodule/InnerEye/ML/pipelines/forward_pass.py", line 188, in single_optimizer_step
    optimizer.step(closure=None)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/torch/optim/adam.py", line 95, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: expected device cpu but got device cuda:0

Can we retire blobxfer and use AzureML APIs for download?

Using blobxfer means we need to have a lot of storage-key infrastructure in place. It would be better if we could use one identity both to access AML and to download datasets.
The big advantage would be that there is no longer a need to maintain storage accounts and access keys for datasets, simplifying the setup on the user side.
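A possible replacement, sketched with the AzureML SDK; the dataset name and target path are placeholders, and the exact call pattern would need checking against the SDK version pinned in the environment:

    # Sketch: download a registered AzureML dataset without blobxfer or storage account keys.
    from azureml.core import Dataset, Workspace

    workspace = Workspace.from_config()                             # authenticates via config.json / interactive login
    dataset = Dataset.get_by_name(workspace, name="my_dataset")     # placeholder dataset name
    dataset.download(target_path="datasets/my_dataset", overwrite=False)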

Coordinate documentation

Work item for managing documentation as a whole, as opposed to specific documents which should have their own work items.

Suggested outline:

  • README.md
  • Description of the project
  • Documentation
  • Setting up environment
  • Creating datasets
  • Building segmentation models on AML
      • Lung Challenge
  • Building classification models on AML
      • Glaucoma
  • Debugging and monitoring models
  • Testing
  • How to do pull requests
  • Roadmap
  • How to deploy a model on AML
  • How to deploy a model on ASH

Automate the creation of the Azure setup

Using Azure Resource Manager, automate the creation of a workspace and a storage account.
Can we also create the Service Principal and the training clusters?
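One possible starting point, sketched with the AzureML Python SDK (which drives Azure Resource Manager under the hood) rather than raw ARM templates; all names, the subscription ID and the VM size are placeholders, and the Service Principal part is not covered here:

    # Sketch: create a workspace (with its backing storage account) and a training cluster.
    from azureml.core import Workspace
    from azureml.core.compute import AmlCompute, ComputeTarget

    ws = Workspace.create(name="InnerEyeWorkspace",                 # placeholder names throughout
                          subscription_id="<subscription-id>",
                          resource_group="InnerEyeRG",
                          create_resource_group=True,
                          location="westeurope")

    cluster_config = AmlCompute.provisioning_configuration(vm_size="Standard_ND24s",
                                                           min_nodes=0,
                                                           max_nodes=4)
    ComputeTarget.create(ws, "training-cluster", cluster_config).wait_for_completion(show_output=True)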

Register models on the run, rather than in the workspace

At present, all models are registered in AzureML via Model.register. This means that, from the list of models, there is no reference back to the run that generated the model.
Instead, in AzureML runs, register the models via run.register_model. The caveat is that this API does not allow child paths, so the model files have to be copied into a separate folder first.
For commandline runs, continue to register via Model.register.
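A rough sketch of the proposed split; the model name, the paths, and the offline-run check are assumptions:

    # Sketch: register from the run when inside AzureML, fall back to Model.register otherwise.
    from azureml.core import Model, Run, Workspace

    run = Run.get_context()
    if run.id.startswith("OfflineRun"):                             # assumption: offline runs carry this id prefix
        # Commandline run: keep registering directly against the workspace.
        workspace = Workspace.from_config()
        Model.register(workspace=workspace, model_name="MyModel", model_path="outputs/final_model")
    else:
        # AzureML run: the registered model keeps a reference back to the run that produced it.
        # Caveat from above: run.register_model does not accept child paths, so the model files
        # must first be copied into a flat folder under outputs/.
        run.register_model(model_name="MyModel", model_path="outputs/final_model")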

Ensemble aggregation is using storage_account instead of Run context

2020-09-12T03:23:17Z INFO
Starting the daemon thread to refresh tokens in background for process with pid = 134

The experiment failed. Finalizing run...
2020-09-12T03:23:17Z INFO Exiting context: TrackUserError
2020-09-12T03:23:17Z INFO Exiting context: RunHistory
[2020-09-12T03:23:17.966397] TimeoutHandler init
[2020-09-12T03:23:17.966479] TimeoutHandler enter
Cleaning up all outstanding Run operations, waiting 300.0 seconds
9 items cleaning up...
Cleanup took 0.9337708950042725 seconds
[2020-09-12T03:23:19.147607] TimeoutHandler exit
2020-09-12T03:23:19Z INFO Exiting context: Dataset
Enter exit of DatasetContextManager
Exit exit of DatasetContextManager
2020-09-12T03:23:19Z INFO Exiting context: ProjectPythonPath
Traceback (most recent call last):
  File "InnerEye/ML/runner.py", line 391, in <module>
    main()
  File "InnerEye/ML/runner.py", line 387, in main
    post_cross_validation_hook=default_post_cross_validation_hook)
  File "InnerEye/ML/runner.py", line 381, in run
    return runner.run()
  File "InnerEye/ML/runner.py", line 249, in run
    self.run_in_situ()
  File "InnerEye/ML/runner.py", line 329, in run_in_situ
    self.wait_for_cross_val_runs_to_finish_and_aggregate()
  File "/azureml-envs/azureml_25e24d323549738d2dc10e4c7fc4746d/lib/python3.7/site-packages/stopit/utils.py", line 148, in wrapper
    return func(*args, **kwargs)
  File "InnerEye/ML/runner.py", line 124, in wait_for_cross_val_runs_to_finish_and_aggregate
    self.create_ensemble_model()
  File "InnerEye/ML/runner.py", line 158, in create_ensemble_model
    self.azure_config, self.model_config, PARENT_RUN_CONTEXT, output_subdir_name=OTHER_RUNS_SUBDIR_NAME)
  File "/mnt/batch/tasks/shared/LS_root/jobs/radnetinnereyedev/azureml/hd_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/mounts/workspaceblobstore/azureml/HD_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/InnerEye/ML/utils/run_recovery.py", line 92, in download_checkpoints_from_run
    run=run
  File "/mnt/batch/tasks/shared/LS_root/jobs/radnetinnereyedev/azureml/hd_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/mounts/workspaceblobstore/azureml/HD_3d2f42e2-5087-42f2-91d7-eaa50c2d656a_0/InnerEye/Azure/azure_config.py", line 273, in download_outputs_from_run
    raise ValueError("self.storage_account cannot be None")
ValueError: self.storage_account cannot be None

2020/09/12 03:23:31 logger.go:293: Failed to run the wrapper cmd with err: exit status 1
2020/09/12 03:23:31 sysutils_linux.go:221: mpirun version string: {
mpirun (Open MPI) 3.1.2

Inconsistent model registration if you use the code as submodule

When you train a model with InnerEye used as a submodule, the registered model files look like this:

• Environment.yml
• model_inference_config.json
• innereye-deeplearning/score.py

However, when you train a model directly from the innereye-deeplearning repository, they look like this:

• Environment.yml
• model_inference_config.json
• score.py

VM sizes for clusters

Add the VM sizes required for the radiotherapy models to the docs for cluster creation.

Clean up spurious module loading errors

We see most builds showing repeated errors saying "Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception cannot import name '_DistributedTraining' from 'azureml.train._distributed_training' (/home/jaalvare/miniconda3/envs/InnerEye/lib/python3.7/site-packages/azureml/train/_distributed_training.py).". Can those be suppressed or avoided altogether?

AB#3927

Possibility to run without Azure

It would be useful for development purposes to be able to run the toolbox entirely on localhost, without needing Azure.

In addition, an organization that is not (yet) authorized to use a specific cloud provider, or any cloud provider, could still use InnerEye while that authorization is pending.
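For reference, the runner can already be pointed at a local dataset folder, which would be the natural starting point for a fully offline workflow. The local_dataset parameter appears in the HelloWorld log below; the exact invocation should be double-checked:

    python InnerEye/ML/runner.py --model=HelloWorld --local_dataset=/path/to/local/dataset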

Enable monitor.py on local runs?

Tensorboard monitoring is presently hooked up to AzureML runs. It should be possible to point tensorboard to a local folder, and start the monitoring script for local runs. Upon job start, instructions should be printed out.
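Assuming the event files end up in the local outputs folder, pointing TensorBoard at it by hand would look roughly like this:

    tensorboard --logdir outputs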

AB#3926

Data augmentation

  • GPU augmentations at training time should be configurable
  • Document the benefits of this feature

AB#3925

HelloWorld fails on WSL1/Ubuntu distribution

(InnerEye) maher@PC:~/InnerEye-DeepLearning$ python InnerEye/ML/runner.py --model=HelloWorld
Setting up logging to stdout.
Setting logging level to 20
2020-09-25T13:50:35Z INFO rpdb is handling traps. To debug: identify the main runner.py process, then as root: kill -TRAP <process_id>; nc 127.0.0.1 4444
2020-09-25T13:50:37Z INFO Found class HelloWorld in file /home/maher/InnerEye-DeepLearning/InnerEye/ML/configs/segmentation/HelloWorld.py
2020-09-25T13:50:37Z INFO Creating the default output folder structure.
2020-09-25T13:50:37Z INFO Running outside of AzureML.
2020-09-25T13:50:37Z INFO All results will be written to a subfolder of the project root folder.
2020-09-25T13:50:37Z INFO Run outputs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld
2020-09-25T13:50:37Z INFO Logs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/logs
2020-09-25T13:50:37Z INFO Creating the adjusted output folder structure.
2020-09-25T13:50:37Z INFO Running outside of AzureML.
2020-09-25T13:50:37Z INFO All results will be written to a subfolder of the project root folder.
2020-09-25T13:50:37Z INFO Run outputs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld
2020-09-25T13:50:37Z INFO Logs folder: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/logs
2020-09-25T13:50:37Z INFO extra_code_directory is unset
Setting logging level to 20
Setting up logging with level 20 to file /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/logs/stdout.txt
2020-09-25T13:50:47Z INFO Setting multiprocessing start method to 'forkserver'
2020-09-25T13:50:47Z INFO Model training will use the local dataset provided in /home/maher/InnerEye-DeepLearning/Tests/ML/test_data
2020-09-25T13:50:47Z INFO
Arguments:
__center_size_param_value: None
__dataset_data_frame_param_value: None
__inference_stride_size_param_value: None
__largest_connected_component_foreground_classes_param_value: None
__min_l_rate_param_value: 0
__model_category_param_value: ModelCategory.Segmentation
__model_name_param_value: HelloWorld
__overrides_param_value: None
__use_gpu_param_value: False
_architecture_param_value: UNet3D
_class_weights_param_value: [0.02, 0.49, 0.49]
_colours_param_value: [(130, 183, 14), (238, 127, 26)]
_comparison_blob_storage_paths_param_value: None
_crop_size_param_value: (64, 64, 64)
_datasets_for_inference: None
_datasets_for_training: None
_feature_channels_param_value: [4]
_file_system_config_param_value:
_fill_holes_param_value: [True, True]
_ground_truth_ids_display_names_param_value: ['region', 'region_1']
_ground_truth_ids_param_value: ['region', 'region_1']
_image_channels_param_value: ['channel1', 'channel2']
_instance__params : {}
_l_rate_multi_step_milestones_param_value: None
_level_param_value: 50
_local_dataset_param_value: /home/maher/InnerEye-DeepLearning/Tests/ML/test_data
_mask_id_param_value: mask
_multiprocessing_start_method_param_value: MultiprocessingStartMethod.forkserver
_name_param_value : HelloWorld00008
_norm_method_param_value: PhotometricNormalizationMethod.CtWindow
_num_dataload_workers_param_value: 0
_num_epochs_param_value: 2
_param_watchers : {}
_save_start_epoch_param_value: 1
_save_step_epochs_param_value: 1
_slice_exclusion_rules_param_value: []
_start_epoch_param_value: 0
_summed_probability_rules_param_value: []
_tail_param_value : None
_test_crop_size_param_value: (64, 64, 64)
_test_diff_epochs_param_value: 1
_test_start_epoch_param_value: 2
_test_step_epochs_param_value: 1
_train_batch_size_param_value: 2
_use_mixed_precision_param_value: True
_window_param_value: 200
initialized : True
param : <param.parameterized.Parameters object at 0x7fceaeff2ef0>

2020-09-25T13:50:47Z INFO
2020-09-25T13:50:47Z INFO **** STARTING: Model training **********************************************************************
2020-09-25T13:50:47Z INFO
2020-09-25T13:50:47Z INFO Train: 3, Test: 1, and Val: 2. Total subjects: 6
2020-09-25T13:50:47Z INFO Model Training: Random seed set to: 42
2020-09-25T13:50:47Z INFO Starting to read and parse the datasets.
2020-09-25T13:50:47Z INFO Processing dataset (name=None)
2020-09-25T13:50:47Z INFO Processing dataset (name=None)
2020-09-25T13:50:47Z INFO Creating the data loader for the training set.
2020-09-25T13:50:47Z INFO Creating the data loader for the validation set.
2020-09-25T13:50:47Z INFO Finished creating the data loaders.
2020-09-25T13:50:48Z INFO Models are saved at /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints
2020-09-25T13:50:48Z INFO Writing model summary to: logs/model_summaries/model_log001.txt
Attempted to log scalar metric LoggingColumns.NumTrainableParameters:
506139
2020-09-25T13:50:48Z INFO Making no adjustments to the model because no GPU was found.
2020-09-25T13:50:50Z INFO Starting training
2020-09-25T13:50:50Z INFO Starting epoch 1
2020-09-25T13:50:50Z INFO Loaded the first minibatch of training data in 0.15 sec.
2020-09-25T13:50:57Z INFO Epoch 1 training took 6.51 sec of which data loading took 0.21 sec
2020-09-25T13:50:57Z INFO Model Training: Random seed set to: 42
2020-09-25T13:50:57Z INFO Loaded the first minibatch of validation data in 0.10 sec.
2020-09-25T13:50:58Z INFO Epoch 1 validation took 1.69 sec of which data loading took 0.10 sec
2020-09-25T13:50:58Z INFO Saved model checkpoint for epoch 1 to /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints/1_checkpoint.pth.tar
2020-09-25T13:50:58Z INFO Starting epoch 2
2020-09-25T13:50:59Z INFO Loaded the first minibatch of training data in 0.11 sec.
2020-09-25T13:51:04Z INFO Epoch 2 training took 5.88 sec of which data loading took 0.16 sec
2020-09-25T13:51:05Z INFO Model Training: Random seed set to: 42
2020-09-25T13:51:05Z INFO Loaded the first minibatch of validation data in 0.11 sec.
2020-09-25T13:51:06Z INFO Epoch 2 validation took 1.75 sec of which data loading took 0.11 sec
2020-09-25T13:51:06Z INFO Saved model checkpoint for epoch 2 to /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints/2_checkpoint.pth.tar
2020-09-25T13:51:06Z INFO Finished training
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** FINISHED: Model training after 19.01 seconds **************************************************
2020-09-25T13:51:06Z INFO
Attempted to log scalar metric Train epochs:
2
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** STARTING: Registering default model ***********************************************************
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z WARNING Not registering a model, because the run has no associated experiment
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** FINISHED: Registering default model after 0.00 seconds ****************************************
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO **** STARTING: Running default model on test set ***************************************************
2020-09-25T13:51:06Z INFO
2020-09-25T13:51:06Z INFO Train: 3, Test: 1, and Val: 2. Total subjects: 6
2020-09-25T13:51:06Z INFO Model Training: Random seed set to: 42
2020-09-25T13:51:06Z INFO Results directory: /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/epoch_002/Test
2020-09-25T13:51:06Z INFO Starting evaluation of model HelloWorld on epoch 2 Test set
2020-09-25T13:51:06Z INFO Processing dataset (name=None)
2020-09-25T13:51:06Z INFO Processing dataset (name=None)
2020-09-25T13:51:06Z INFO Processing dataset (name=None)
2020-09-25T13:51:06Z INFO Loading checkpoint /home/maher/InnerEye-DeepLearning/outputs/2020-09-25T135037Z_HelloWorld/checkpoints/2_checkpoint.pth.tar
2020-09-25T13:51:06Z INFO Loaded checkpoint (epoch: 2)
2020-09-25T13:51:06Z INFO Writing model summary to: logs/model_summaries/model_log001.txt
Setting up logging with level 20 to file logs/model_summaries/model_log001.txt
2020-09-25T13:51:07Z INFO Making no adjustments to the model because no GPU was found.
2020-09-25T13:51:09Z INFO Predicting for image 1 of 1...
2020-09-25T13:51:09Z INFO Inference pipeline (0), Predicting patient: 6
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/forkserver.py", line 186, in main
    with socket.socket(socket.AF_UNIX, fileno=listener_fd) as listener, \
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/socket.py", line 151, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 22] Invalid argument: 'protocol'
2020-09-25T13:51:26Z ERROR Model training/testing failed. Exception: unexpected EOF
Traceback (most recent call last):
  File "InnerEye/ML/runner.py", line 360, in run_in_situ
    self.create_ml_runner().run()
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 318, in run
    best_epoch = self.run_inference_and_register_model(run_recovery, ModelProcessing.DEFAULT)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 340, in run_inference_and_register_model
    test_metrics, val_metrics, _ = self.model_inference_train_and_test(RUN_CONTEXT, run_recovery, model_proc)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 663, in model_inference_train_and_test
    test_metrics = run_model_test(ModelExecutionMode.TEST)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/run_ml.py", line 659, in run_model_test
    return model_test(config, data_split=data_split, run_recovery=run_recovery, model_proc=model_proc)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/model_testing.py", line 72, in model_test
    return segmentation_model_test(config, data_split, run_recovery, model_proc)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/model_testing.py", line 102, in segmentation_model_test
    run_recovery=run_recovery)
  File "/home/maher/InnerEye-DeepLearning/InnerEye/ML/model_testing.py", line 180, in segmentation_model_test_epoch
    with Pool(processes=num_workers) as pool:
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/popen_forkserver.py", line 55, in _launch
    self.pid = forkserver.read_signed(self.sentinel)
  File "/home/maher/anaconda3/envs/InnerEye/lib/python3.7/multiprocessing/forkserver.py", line 312, in read_signed
    raise EOFError('unexpected EOF')
EOFError: unexpected EOF

(InnerEye) maher@PC:~/InnerEye-DeepLearning$

Allow custom experiment names

We presently hardcode the AzureML experiment name to be the git branch name. We could add a switch to make that configurable on the commandline.

Store number of trainable parameters in config for later use

At the moment, the model summary code in generate_and_print_model_summary does not store any of its results. The number of trainable parameters is logged to AzureML, but not anywhere else.
It would be helpful to store the key output of the model summary, the number of trainable parameters, inside the model config or at some other place so that it can be accessed later. In particular, it is needed when we create a model dashboard in a post-cross-validation hook.
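Counting the parameters themselves is a one-liner in PyTorch, so the work here is mainly deciding where to store the result; the attribute name in the final comment is hypothetical:

    # Standard PyTorch count of trainable parameters; where to store the result is what this issue is about.
    import torch.nn as nn

    def count_trainable_parameters(model: nn.Module) -> int:
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # e.g. inside generate_and_print_model_summary (attribute name is hypothetical):
    # config.num_trainable_parameters = count_trainable_parameters(model)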

AB#3923

Training starts from epoch 0 when using a run recovery object.

When continuing to train from a recovered run, the training run starts again from epoch 0. This has the side effect of causing the inference run to use the run recovery object instead of the weights from the last epoch on the validation and test datasets.

Pick up git information from local repository, rather than via commandline args

Git commit ID, author, etc. are presently expected in commandline arguments, because we have been using the runner from DevOps pipelines. Going forward, we should expect that most people will use the code from their local boxes. To get better traceability, we should change the code to also pick up git information from there.
This will also make calling the runner in DevOps pipelines a lot easier.
Can we use something like gitpython to achieve that? https://github.com/gitpython-developers/GitPython
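A minimal GitPython sketch of what picking up the information locally could look like (assumes the gitpython package would be added to the environment):

    # Sketch: read commit ID, author, branch and dirty state from the local repository.
    from git import Repo

    repo = Repo(search_parent_directories=True)   # find the enclosing git repository
    commit = repo.head.commit
    git_info = {
        "commit_id": commit.hexsha,
        "commit_author": commit.author.name,
        "branch": repo.active_branch.name,        # raises on a detached HEAD
        "is_dirty": repo.is_dirty(),
    }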

"conda env remove --name InnerEye" fails because `tqdm` is overloaded

Once PYTHONPATH is set to the repository root, it is no longer possible to remove environments (or, quite possibly, do anything with environments). Conda uses the tqdm package, and we have a stub tqdm.py at the repository root (added to avoid further dependencies) that shadows it.

antonsc@MSR:/mnt/c/git$ conda env remove -n InnerEye
Traceback (most recent call last):
  File "/home/antonsc/miniconda2/bin/conda-env", line 6, in <module>
    from conda_env.cli.main import main
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda_env/cli/main.py", line 13, in <module>
    import conda.exports  # noqa
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/exports.py", line 25, in <module>
    from . import plan  # NOQA
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/plan.py", line 26, in <module>
    from .core.link import PrefixSetup, UnlinkLinkTransaction
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/core/link.py", line 43, in <module>
    from ..resolve import MatchSpec
  File "/home/antonsc/miniconda2/lib/python2.7/site-packages/conda/resolve.py", line 9, in <module>
    from tqdm import tqdm
  File "/mnt/c/git/InnerEye-DeepLearning/tqdm.py", line 14
    def tqdm(arg: Any, *_rest: Any) -> Any:
                ^
SyntaxError: invalid syntax
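Until the stub is removed or renamed, a workaround consistent with the traceback above is to clear PYTHONPATH for the conda invocation (or run conda from outside the repository), for example:

    PYTHONPATH= conda env remove --name InnerEye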
