microsoft / hi-ml
HI-ML toolbox for deep learning for medical imaging and Azure integration
Home Page: https://aka.ms/hi-ml
License: MIT License
(Separated out from Issue #2)
We are not consistent about prefixing private functions with one underscore (module-level functions) or two (instance methods). That matters little for an internal project, but since we hope to have external developers using these packages, we should signal clearly which functions are private (to be called only by code inside the package) and which are public and potentially useful to consumers of the package.
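A minimal sketch of the convention described above (all names here are illustrative, not taken from the package):

```python
def _choose_run_name(prefix: str) -> str:
    """Private module-level function: single leading underscore, internal to the package."""
    return f"{prefix}-run"


class AzureRunner:
    def submit(self, prefix: str) -> str:
        """Public API: no underscore, safe for package consumers to call."""
        return self.__validate(prefix)

    def __validate(self, prefix: str) -> str:
        """Private instance method: double leading underscore, name-mangled by Python."""
        if not prefix:
            raise ValueError("prefix must not be empty")
        return _choose_run_name(prefix)
```

The double underscore buys name mangling (the method is stored as `_AzureRunner__validate`), which makes accidental external calls harder but not impossible.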
sphinx/readthedocs?
To get things started, upload and run a script that just prints out a message handed in as an argument, using our new submit_to_azure_if_needed function.
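Such a script might look like the sketch below. The import location, the parameter name, and the cluster name are assumptions here, to be checked against the actual package:

```python
# hello_world.py -- a sketch of the requested script.
import sys


def format_message(arg: str) -> str:
    # The actual payload of the script: just echo the argument it was handed.
    return f"Message: {arg}"


def main() -> None:
    # Import inside main so the core logic stays testable without AzureML installed.
    from health.azure import submit_to_azure_if_needed  # assumed import path
    submit_to_azure_if_needed(compute_cluster_name="my-cluster")  # hypothetical parameter and cluster name
    print(format_message(sys.argv[1]))


if __name__ == "__main__":
    main()
```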
The GitHub action build-test-pr.yml and the Makefile both call mypy directly. Change both to call mypy_runner instead.
Also change the build dependencies so that publishing to test.pypi depends only on pytest.
Javier pointed out that our tagline, Microsoft Health Intelligence AzureML helpers, on https://pypi.org/manage/project/hi-ml/releases/ is too generic.
In the Makefile, we now have targets for mypy, flake, and environment building. Use those as building blocks in the PR build, so that the local dev environment is guaranteed to match the one used in the cloud.
Move this segment into the AzureML layer:
```python
# For PR builds where we wait for job completion, the job must have ended in a COMPLETED state.
if self.azure_config.wait_for_completion and not is_run_and_child_runs_completed(azure_run):
    raise ValueError(f"Run {azure_run.id} in experiment {azure_run.experiment.name} or one of its child "
                     "runs failed.")
```
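For reference, is_run_and_child_runs_completed might look like the sketch below, duck-typed here so it can be exercised without AzureML (the real helper would query actual Run objects):

```python
class FakeRun:
    """Stand-in for an AzureML Run, for illustration only."""

    def __init__(self, status: str, children=()):
        self._status = status
        self._children = list(children)

    def get_status(self) -> str:
        return self._status

    def get_children(self):
        return list(self._children)


def is_run_and_child_runs_completed(run) -> bool:
    # A run only counts as completed when it and every one of its child runs
    # finished with status "Completed".
    runs = [run, *run.get_children()]
    return all(r.get_status() == "Completed" for r in runs)
```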
Sketch basic unit tests & mocks for:
Uploading a single Python file to AzureML, running it, and returning the run_id; then wait for run completion, then download output files and stdout.
Uploading a single Python file and a data file, e.g. data.csv, rest as above.
Uploading two Python files, to check imports, rest as above.
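One possible shape for the first of these tests, with the AzureML interaction mocked out (the driver function and all names are placeholders, not the package's API):

```python
from unittest import mock


def submit_and_wait(submit_fn, wait_fn, download_fn, script_path: str) -> str:
    # Hypothetical driver under test: submit a script, wait for completion,
    # then download outputs. The real code would call into AzureML instead.
    run_id = submit_fn(script_path)
    wait_fn(run_id)
    download_fn(run_id)
    return run_id


def test_submit_single_file() -> None:
    submit_fn = mock.Mock(return_value="run_123")
    wait_fn = mock.Mock()
    download_fn = mock.Mock()
    assert submit_and_wait(submit_fn, wait_fn, download_fn, "hello.py") == "run_123"
    submit_fn.assert_called_once_with("hello.py")
    wait_fn.assert_called_once_with("run_123")
    download_fn.assert_called_once_with("run_123")
```

The data-file and two-file variants would follow the same pattern, adding the extra uploads to the mocked submit call.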
To publish a new package to PyPi (i.e. the public package repository, not the test one) we run these commands:

```shell
make clean
make build
twine upload dist/*
```
That runs setup.py. To get local testing working, we added code to setup.py that checks whether it is running in GitHub; if it is not, it changes the post-release number to a string of nine random digits. Unfortunately that code also runs when packaging for real, i.e. not as part of testing, and so instead of build numbers like 0.0.1post1, 0.0.1post2, 0.0.1post3 etc. we get ones like 0.0.1post5725449762.
Helper functions to use in code to download a file from a run (that work seamlessly in local runs and in AML) – used for downloading a checkpoint.
We have sporadic test pipeline failures when building a Docker image in the AzureML jobs (example here). The job claims that package version post282 does not exist, but the GitHub build agents successfully pulled that version a few minutes before. This probably comes from a package mirror that AML uses, which is not completely up to date with test.pypi.
As a workaround, we could publish to an MS internal package feed. AML would probably not have a cache of that, and hence always use the latest.
From old repo, copy over everything that makes sense. This would include:
To ease local development and testing, instead of requiring a package to be built, copy the contents of the "src" folder into the test folder.
Introduce a while loop, similar to the one used when publishing to test.pypi, to check that the version downloaded is the same as the version just uploaded.
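The loop could be as simple as the sketch below; get_version is a placeholder for however we query test.pypi for the currently visible version:

```python
import time
from typing import Callable


def wait_for_version(get_version: Callable[[], str], expected: str,
                     retries: int = 10, delay: float = 5.0) -> bool:
    # Poll until the version visible on the index matches the one just uploaded,
    # giving the index time to catch up; give up after a bounded number of tries.
    for _ in range(retries):
        if get_version() == expected:
            return True
        time.sleep(delay)
    return False
```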
Ensure that all files have copyright notices, and that editors are set up to automatically insert them (PyCharm does it correctly on InnerEye)
You must run the following source code analysis tools:
CredScan
CodeQL (Semmle)
Component Governance Detection
The easiest way to run these tools is to add them to your build pipeline in a Microsoft-managed Azure DevOps account.
For CodeQL, please ensure the following (detailed instructions for CodeQL can be found here):
Select the source code language in the CodeQL task.
If your application was developed using multiple languages, add multiple CodeQL tasks.
Define the build variable LGTM.UploadSnapshot=true.
Configure the build to allow scripts to access OAuth token.
If the code is hosted in Github, create Azure DevOps PAT token with code read scope for dev.azure.com/Microsoft (or ‘all’) organization and set the local task variable System_AccessToken with it. (Note: This only works for YAML-based pipelines.)
Review security issues by navigating to semmleportal.azurewebsites.net/lookup. It may take up to one day to process results.
Potential omission from #55 where it says
For CodeQL, please ensure the following (detailed instructions for CodeQL can be found here):
Select the source code language in the CodeQL task.
If your application was developed using multiple languages, add multiple CodeQL tasks.
Define the build variable LGTM.UploadSnapshot=true.
Configure the build to allow scripts to access OAuth token.
If the code is hosted in Github, create Azure DevOps PAT token with code read scope for dev.azure.com/Microsoft (or ‘all’) organization and set the local task variable System_AccessToken with it. (Note: This only works for YAML-based pipelines.)
Review security issues by navigating to semmleportal.azurewebsites.net/lookup. It may take up to one day to process results.
This is not done yet (unless it happens automatically) and I cannot find any mention of LGTM in InnerEye to crib from.
The CodeQL Portal says "Only Visual Studio Team System and Azure Dev Ops URLs are supported" and will not upload a snapshot from GitHub
Should not cause merge conflicts
Individual files that are collated into a changelog file
config.json and environment.yml in the current folder or any of the parent folders (stop going up when leaving a folder that is in PythonPath).
Add a new file, run_requirements.txt, with the package run requirements.
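The upward search could look like the sketch below; the "stop when leaving a folder that is in PythonPath" rule is omitted here for brevity:

```python
from pathlib import Path
from typing import Optional


def find_file_upwards(file_name: str, start: Path) -> Optional[Path]:
    # Walk from start up through its parents and return the first match.
    # The "stop when leaving a folder on PythonPath" rule is not implemented here.
    for folder in [start, *start.parents]:
        candidate = folder / file_name
        if candidate.is_file():
            return candidate
    return None
```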
Parse this and add it to the setup.py install_requires array.
Add a new shell script to pip install all the requirements for ease of development.
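Parsing the requirements file could be sketched like this, with the result passed straight to setuptools.setup(install_requires=...):

```python
from pathlib import Path
from typing import List


def parse_requirements(path: Path) -> List[str]:
    # Read run_requirements.txt, skipping blank lines and comments, so the
    # result can be handed to setuptools.setup(install_requires=...).
    lines = path.read_text().splitlines()
    return [line.strip() for line in lines
            if line.strip() and not line.strip().startswith("#")]
```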
At present, we use commandline arguments, and always tag runs with git commit information. We can do that in hi-ml too, but maybe have a switch to turn that off (no run tagging)
All arguments that point to files presently accept Path objects. However, many people still use strings for handling file names. Ensure that all such arguments happily accept PathOrString.
submit_to_azure_if_needed
into smaller functions@pytest.mark.fast
max_run_duration_seconds
, make that an argument of submit_if_neededwait_for_completion(self, ... raise_on_error=True):
Unfortunately the build code coverage is still pointing at "src" when evaluating the package. Change it to point at "health"
We should get flake, mypy, and unit test results all at once, rather than stopping as soon as mypy fails – if I submit a change, I should get a clear indication of ALL the things that are wrong.
amlignore_path is not assigned if ignored_folders is empty:
```python
if ignored_folders:
    amlignore_path = snapshot_root_directory or Path.cwd()
    amlignore_path = amlignore_path / ".amlignore"
lines_to_append = [str(path) for path in ignored_folders] if ignored_folders else []
with append_to_amlignore(
        amlignore=amlignore_path,
        lines_to_append=lines_to_append):
```
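One possible shape for a fix, sketched as a self-contained helper (the helper name is invented): compute the path unconditionally and let the list of lines be empty.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Tuple


def amlignore_path_and_lines(ignored_folders: Iterable[Path],
                             snapshot_root_directory: Optional[Path]) -> Tuple[Path, List[str]]:
    # Compute the .amlignore path unconditionally, so it is defined even when
    # ignored_folders is empty and the with-block can always use it.
    amlignore_path = (snapshot_root_directory or Path.cwd()) / ".amlignore"
    lines_to_append = [str(path) for path in ignored_folders]
    return amlignore_path, lines_to_append
```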
When running the unit tests, some folders are created using the pytest tmp_path fixture and the src folder is copied into them. This means that the coverage tool thinks they are different from the installed package. When running as a test in a GitHub action, the package is already installed, so this should not be necessary. When running locally, it should be possible to install the src folder as an editable package (with the -e option) and still have the tests run.
hi-ml as a package in environment.yml
hi-ml as a package:
test_register_and_score_model in the TrainEnsemble leg
test_submit_for_inference in the TrainInAzureMLViaSubmodule leg
Copy the essence of DICOM-RT package to here
(Separated out from Issue #2)
Building a docker image costs about 20min, that's too long for a PR build.
docker_shm_size is respected
Add unit tests for dataset downloading/mounting for input and output.
AML Dataset's get_by_name sometimes returns ServiceException 204 - 'unknown error'
Examples of failed runs:
submit_if_needed needs arguments for
Datasets can be specified either programmatically via a DatasetConfig, or as a string. DatasetConfig should contain
Datasets as strings:
Depending on whether the dataset is in the inputs or outputs list, we can create an input or output config from it.
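The string shorthand could be handled as in the sketch below; the fields of DatasetConfig are illustrative only, since the issue leaves its exact contents open:

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class DatasetConfig:
    # Illustrative fields only; the issue leaves the exact contents open.
    name: str
    datastore: str = ""
    use_mounting: bool = False


def to_dataset_config(dataset: Union[str, "DatasetConfig"]) -> "DatasetConfig":
    # A plain string is shorthand for a DatasetConfig with default settings.
    if isinstance(dataset, DatasetConfig):
        return dataset
    return DatasetConfig(name=dataset)
```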
Our code in setup.py will trigger with new tags. setuptools.setup will reject tags that are not release versions, but we could do more to make that explicit by checking for the leading "v". Also, when we tag releases as, say, "v0.1.1", the leading "v" is carried through setuptools.setup so it becomes part of the pip test download:
```
Successfully installed pip-21.2.4
Collecting hi-ml==v0.1.0
  Downloading hi_ml-0.1.0-py3-none-any.whl (25 kB)
```
(from here)
This works, but it would be cleaner to submit the version number using the public version identifier format mandated in PEP 440, i.e. without the leading "v"
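Stripping the prefix is a one-liner; a sketch:

```python
def normalize_tag(tag: str) -> str:
    # PEP 440 public version identifiers carry no leading "v", so strip it
    # from git tags like "v0.1.1" before handing the version to setuptools.
    return tag[1:] if tag.startswith("v") else tag
```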
Helps to delimit where the reordering of the PIP indexes applies.
from old code:
```python
# AzureML seems to sometimes expect the entry script path in Linux format, hence convert to posix path
entry_script_relative_path = source_config.entry_script.relative_to(source_config.root_folder).as_posix()
```
In the same washup:
```python
# Use blob storage for storing the source, rather than the FileShares section of the storage account.
run_config.source_directory_data_store = workspace.datastores.get(WORKSPACE_DEFAULT_BLOB_STORE_NAME).name
```