awslabs / mlmax Goto Github PK

Example templates for the delivery of custom ML solutions to production so you can get started quickly without having to make too many design choices.

Home Page: https://mlmax.readthedocs.io/en/latest/

License: Apache License 2.0

Python 33.91% Shell 3.23% Jupyter Notebook 62.03% Dockerfile 0.45% Vim Script 0.38%

machine-learning training inference

mlmax's Introduction

ML Max

ML Max is a set of example templates to accelerate the delivery of custom ML solutions to production so you can get started quickly without having to make too many design choices.

Quick Start

ML Training Pipeline: This is the process to set up standard training pipelines for machine learning models enabling both immediate experimentation, as well as tracking and retraining models over time.
ML Inference Pipeline: Deploys a model to be used by the business in production. Currently this is coupled quite closely to the ML training pipeline as there is a lot of overlap.
Development environment: This module manages the provisioning of resources and manages networking and security, providing the environment for data scientists and engineers to develop solutions.
Data Management and ETL: This module determines how the machine learning operations interacts with the data stores, both to ingest data for processing, managing feature stores, and for processing and use of output data. A common pattern is to take an extract, or mirror, of the data into S3 on a project basis.
CICD Pipeline: This module provides the guidance to setting up a continuous integration (CI) and continuous deployment (CD) pipeline, and automate the delivery of the ML pipelines (e.g., training and inference pipelines) to production using multiple AWS accounts (i.e., devops account, staging account, and production account.).

Help and Support

mlmax's People

Contributors

Stargazers

Watchers

Forkers

yapweiyih anneehuu7 kranthigy goyder zhangyi733 richardscottoz yinsong1986 wfclark5 satish5555555 eason-python-xu iyksam20 anjaniksharma prateekchandrajha alnickel trellixvulnteam asdyns ryankim3gilead softaitech kunacornejo

mlmax's Issues

Clarify metadata bucket in the environment module

Clarify step 2, section Setup Guide in modules/environment/README.md, that user needs to create a new bucket or re-use an existing bucket, but the deploy.sh (or is it the CloudFormation?) must be able to write to this bucket.

Run training pipeline from Windows machine causes exception

python training_pipeline_run.py on Windows machine throws the following exception:

====
...
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sourcedir.tar.gz'
...

On Windows, the temporary directory should be

“%systemdrive%\Windows\Temp or %userprofile%\AppData\Local\Temp“

and “/tmp” only works for Mac or Linux. So it has to throw “No such file …” exceptions since /tmp is not supported on Windows.

Pipeline package refresh

🚀 Feature request

Describe the feature you'd like

Upgrade the versions of the SageMaker and Step Functions SDKs.

What is the motivation for the feature?

Currently the pipeline module uses SageMaker SDK v2.46, however, we are using <2.0.0,>=1.71.0.
Currently the pipeline module uses Step Functions SDK v2.2, however, we are using >=1.1.0,<2.0.0.
There are many new features that we could support with upgrading the packages to the latest.
Any compatibility issues between the SageMaker SDK and Step Function SDK should be resolved by now.

Could you contribute? (Optional)

A description of how you can help, e.g. submitting a PR.

Support AWS Batch execution engine

Describe the feature you'd like

Execute user scripts on AWS Batch.

What is the motivation for the feature?

To support AWS Batch users.

Your contribution

Happy to help.

Add Support for scheduling inference pipeline.

🚀 Feature request

Describe the feature you'd like

Add Support for scheduling inference pipeline.

What is the motivation for the feature?

Add Support for scheduling inference pipeline.

Could you contribute? (Optional)

submitting a PR.

Launch EC2 and notebook workspaces via AWS Service Catalog

Describe the feature you'd like

I would like the ability to launch the development environment using the AWS Systems Catalog. I would like the ability to launch the ec2 instances, and notebook instances, with all the required security configurations, with only access to the Systems Catalog. This may simply be provided. by Amazon SageMaker Projects: [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html].

What is the motivation for the feature?

In many organisations, data scientists are not given full access to AWS CloudFormation, IAM, or even the ability to spin up EC2 resources. This allows them to still have self-service access to a Ml development environment.

Your contribution

I would like to help by further developing the requirements for the feature.

Screening notebooks for stop_* API.

🚀 Feature request

Describe the feature you'd like

Screening notebooks to check whether stop_* APIs are permitted by the SageMaker execution role.

What is the motivation for the feature?

To check whether stop_* APIs are permitted by the SageMaker execution role.

Could you contribute? (Optional)

To PR.

Add CodeCommit permission to EC2 instance profile

🐛 Bug report

I have checked that this issue has not already been reported.

Describe the bug

EC2 instance does not have code-commit access, however the SageMaker notebook does. Propose to make the EC2 instance on-par with SageMaker, with access to code-commit.

To reproduce

git clone from code commit will fail with 403 forbidden.

Expected behavior

git clone from code-commit should work.

System information

awscli version: aws-cli/1.18.179 Python/3.7.6 Linux/4.14.203-156.332.amzn2.x86_64 botocore/1.19.19
SageMaker Python SDK version: irrelevant
Docker image: N/A
Python version: aws-cli/1.18.179 Python/3.7.6 Linux/4.14.203-156.332.amzn2.x86_64 botocore/1.19.19

Add Support for CI/CD automation

Describe the feature you'd like

Add Support for CI/CD automation.

What is the motivation for the feature?

This will enable the automation of the pipelines from development to deployment.

IAM Role for Typical DS/MLE work

🚀 Feature request

Describe the feature you'd like

A started IAM role with a set of basic permissions:

Spin up an EC2 instance for development (including create Key Pair and Volume).
Run common SageMaker operations, such as processing, training, batch_transform, endpoint creation
Run CloudFormation templates

What is the motivation for the feature?

Resource-level control and administration.

Could you contribute? (Optional)

Yes.

Create a gif image for CLI commands

Training/Inference Pipeline
Environment Module

Screening notebook for support KMS keys

🚀 Feature request

Describe the feature you'd like

Screening to provide support for S3 bucket that is encrypted with customer KMS key and enforce encryption header.

Error:
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied

What is the motivation for the feature?

To support regulated environment that mandates encryption.

Could you contribute? (Optional)

A description of how you can help, e.g. submitting a PR.

Let user specify training and inference pipeline bucket

🚀 Feature request

Describe the feature you'd like

Training & inference pipeline use the default sagemaker bucket (i.e., sagemaker-<region>-<acc>). This won't work without create S3 bucket permission, which happens in some situations.

Propose to update the pipelines so that when they call the sagemaker APIs (train, processing), to clearly define the s3_prefix to use (code_path, output, etc.).

What is the motivation for the feature?

Pipeline can work without create S3 bucket policy.

Could you contribute? (Optional)

A description of how you can help, e.g. submitting a PR.

Create buttons for readme

📚 Documentation improvement

What did you find confusing? Please describe.

Create buttons to give positive first impression to the repository.

Could you contribute? (Optional)

Improve usage documentations for notebooks/screening/*.ipynb

📚 Documentation improvement

What did you find confusing? Please describe.

How to run the screening notebooks to verify readiness of an AWS account for running SageMaker workloads.
How to clean-up screening artifacts from S3 bucket.
DRY-pattern to define mandatory, account-specific kwargs to SageMaker SDK API.

Could you contribute? (Optional)

Yes.

Set up initial readthedocs structure

Training and Inference Pipelines:

Quick Start
Installation
Creating the Runtime Scripts
Customize the Pipeline
Customize the ML Engine (Placeholder)
API Reference (generated from strings, placeholder OK for this issue)

Environment

Similar as above

Use Sphinx

Realtime Inference

🚀 Feature request

Describe the feature you'd like

Creation of an inference endpoint in SageMaker.

What is the motivation for the feature?

Making near-real-time inference.

Private repo with code artifact

Describe the feature you'd like

Create a private repo for storing python libraries with code artifact.

What is the motivation for the feature?

This will allow administrators to decouple the approval of packages from the configuration and build of the environment itself.

Your contribution

(Optional) Is there any way that you could help, e.g. by submitting a PR?

Yes, I would be happy to help get this started!

Add docstrings to functions in src/**/*.py

Add `--data-dir` arguments to the src/mlmax/**.py files for easy local testing

Should be able to run python files locally with a single one data directory argument. This will make it relatively easy to run these files locally for development and testing. The convention we use in preprocessing.py is --data-dir. We can change it, but we should make all of the files in src/mlmax/**.py consistent with each other.

Self-contained notebook to screen SageMaker Tuning

🚀 Feature request

Describe the feature you'd like

A self-contained notebook to verify that an AWS account is ready to run SageMaker HPO jobs.

What is the motivation for the feature?

To quickly verify if an AWS account is ready to run SageMaker HPO jobs, and no misconfiguration on the AWS account. The notebook must work even on restricted account without public internet access.

Could you contribute? (Optional)

Yes.

Self-contained notebook to screen SageMaker processing

🚀 Feature request

Describe the feature you'd like

A self-contained notebook to verify that an AWS account is ready to run SageMaker processing jobs.

What is the motivation for the feature?

To quickly verify if an AWS account is ready to run SageMaker processing jobs, and no misconfiguration on the AWS account. The notebook must work even on restricted account without public internet access.

Could you contribute? (Optional)

Yes.

Cloudformation to setup permission for external account to a role

🚀 Feature request

Cloudformation to setup permission for external account to a role.

E.g.

Run cloudformation script with input <external_aws_acct> as parameter
Cloudformation will create local role arn:aws:iam::<internal_aws_acct>:role/CrossAccountRole.
External AWS account <external_aws_acct> can assume role CrossAccountRole on <internal_aws_acct>

This is tracked in branch cross_account_permission

Stop-gap solution to modular processing scripts.

🚀 Feature request

Describe the feature you'd like

What to do with this stop-gap FrameworkProcessor at https://github.com/aws-samples/smallmatter-package/tree/smproc-stopgap ?

Given the context & motivation for MLMax, I'm inclined to snapshot that class into this repo, rather than having an external dependencies. Thoughts?

What is the motivation for the feature?

Simplify modular processing scripts.

Could you contribute? (Optional)

Yes. Happy to snapshot that class into this repo.

Key rotation with Secrets Manager

Describe the feature you'd like

A clear and concise description of the functionality you want.

Would like to configure key rotation using Secrets manager for the EC2 key pair.

What is the motivation for the feature?

Preventative measure for securing the environment.

Your contribution

(Optional) Is there any way that you could help, e.g. by submitting a PR?

Yes, happy to help!

Self-contained notebook to screen SageMaker Autopilot

🚀 Feature request

Describe the feature you'd like

A self-contained notebook to verify that an AWS account is ready to run SageMaker Autopilot jobs.

What is the motivation for the feature?

To quickly verify if an AWS account is ready to run SageMaker Autopilot jobs, and no misconfiguration on the AWS account. The notebook must work even on restricted account without public internet access.

Could you contribute? (Optional)

Yes.

training and inference run scripts support STS endpoints.

🐛 Bug report

I have checked that this issue has not already been reported.

Describe the bug

Without public internet, but with STS endpoint setup, both inference_pipeline_run.py and training_pipeline_run.py will fail due to http timeout.

This can be fix by hardcoding the endpoint as follows:

inference_pipeline_run.py:136:    sts = boto3.client("sts", endpoint_url="https://sts.ap-southeast-2.amazonaws.com")
training_pipeline_run.py:132:    sts = boto3.client("sts", endpoint_url="https://sts.ap-southeast-2.amazonaws.com")

As a proper fix, I propose to add a new configurable parameter to define the VPC endpoint for STS.

To reproduce

Run {training,inference}_pipeline_run.py scripts from an EC2 instance running in private VPC with STS endpoint.

Expected behavior

Training or inference should complete.

System information

awscli version: aws-cli/1.18.179 Python/3.6.10 Linux/4.14.203-156.332.amzn2.x86_64 botocore/1.20.30
SageMaker Python SDK version: 1.72.1
Docker image: N/A
Python version: Python/3.6.10 Linux/4.14.203-156.332.amzn2.x86_64

Github Workflow Test Error

🐛 Bug report

[✅ ] I have checked that this issue has not already been reported.

Describe the bug

After pull request, github automated job workflow staled at test stage.

To reproduce

Create a pull request/push new code change to trigger github workflow.

Error message:

Run tox -e pytest
tox -e pytest
shell: /usr/bin/bash -e {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.7.11/x64
GLOB sdist-make: /home/runner/work/uc-mlmax/uc-mlmax/setup.py
pytest create: /home/runner/work/uc-mlmax/uc-mlmax/.tox/pytest
pytest installdeps: -rrequirements.txt

Expected behavior

Both lint and test jobs to complete successfully.

System information

awscli version: NA
SageMaker Python SDK version: NA
Docker image: NA
Python version: 3.7

Proposed Solution

Include specific python package version in top level directoryrequirements.txt.

sagemaker[local]>=2.22.0
boto3>=1.9.213
pyyaml==5.4.1
stepfunctions>=2.0.0
fsspec==2021.8.1
s3fs==2021.8.1
scikit-learn==0.20.0
matplotlib==3.4.3
pandas==1.3.2
pytest==6.2.5
datatest==0.11.1
pytest-cov==2.12.1
numpy==1.21.2
loguru==0.5.3

Disable public yum repos on new DLAMI EC2 instance

🐛 Bug report

I have checked that this issue has not already been reported.

Describe the bug

From the EC2 instance deployed by the environment module, sudo yum update will time-out on public yum repos. These commands were required to disable those repos:

  120  sudo yum-config-manager --disable libnvidia-container
  122  sudo yum-config-manager --disable neuron
  124  sudo yum-config-manager --disable libnvidia-container --disable neuron
  126  sudo yum-config-manager --disable nvidia-container-runtime
  128  sudo yum-config-manager --disable nvidia-docker

To reproduce

sudo yum update and watch the timeout message. E.g.,

ec2-user@ip-xx-xxx-xx-xxx pkgs]$ sudo yum update
Loaded plugins: dkms-build-requires, extras_suggestions, langpacks, priorities, update-motd, versionlock
amzn2-core                                                                                                | 3.7 kB  00:00:00     
amzn2extra-docker                                                                                         | 3.0 kB  00:00:00     
https://nvidia.github.io/nvidia-container-runtime/stable/amzn2/x86_64/repodata/repomd.xml: [Errno 14] curl#7 - "Failed to connect to nvidia.github.io port 443: Connection timed out"
Trying other mirror.
^C
 Current download cancelled, interrupt (ctrl-c) again within two seconds
to exit.



 One of the configured repositories failed (nvidia-container-runtime),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.

     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).

     3. Run the command with the repository temporarily disabled
            yum --disablerepo=nvidia-container-runtime ...

     4. Disable the repository permanently, so yum won't use it by default. Yum
        will then just ignore the repository until you permanently enable it
        again or use --enablerepo for temporary usage:

            yum-config-manager --disable nvidia-container-runtime
        or
            subscription-manager repos --disable=nvidia-container-runtime

     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus. yum will be be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=nvidia-container-runtime.skip_if_unavailable=true

failure: repodata/repomd.xml from nvidia-container-runtime: [Errno 256] No more mirrors to try.
https://nvidia.github.io/nvidia-container-runtime/stable/amzn2/x86_64/repodata/repomd.xml: [Errno 14] curl#7 - "Failed to connect to nvidia.github.io port 443: Connection timed out"
https://nvidia.github.io/nvidia-container-runtime/stable/amzn2/x86_64/repodata/repomd.xml: [Errno 15] user interrupt

Expected behavior

Skip public yum repos.

System information

awscli version: irrelevant
SageMaker Python SDK version: irrelevant
Docker image: irrelevant
Python version: irrelevant

Create a quick start guide for using the project as a scaffolding for the start of a new project

📚 Documentation improvement

What did you find confusing? Please describe.

There aren't any instructions for how to use this as the starting point for a new project.

Could you contribute? (Optional)

Running inference_pipeline_run.py failed

🐛 Bug report

I have checked that this issue has not already been reported.

Describe the bug

When running inference_pipeline_run.py, it might have issues like:
Traceback (most recent call last):
File "inference_pipeline_run.py", line 185, in
example_run_inference_pipeline(workflow_arn, region)
File "inference_pipeline_run.py", line 100, in example_run_inference_pipeline
proc_model_s3, model_s3 = get_latest_models()
File "inference_pipeline_run.py", line 39, in get_latest_models
processing_job_name = response["ProcessingJobSummaries"][0]["ProcessingJobName"]
IndexError: list index out of range

To reproduce

Not easily to reproduce, it depends on the Sagemaker API.

Expected behavior

A clear and concise description of what you expected to happen.

System information

awscli version:
SageMaker Python SDK version:
Docker image:
Python version:

Environment module deploy.sh to support region override

🐛 Bug report

I have checked that this issue has not already been reported.

Describe the bug

modules/environment/deploy.sh only picks the region defined in the runtime environment (env vars or ~/.aws/config entry). It does not allow user to deploy on a different region without changing their AWS CLI setup.

To reproduce

Run ./deploy.sh on a bucket in a different region that what's configured in AWS CLI.

Expected behavior

something like ./deploy.sh --region ap-southeast-2 even if ~/.aws/config defaults to ap-southeast-1. Otherwise, this can impede / introduce friction for using the same account to deploy to several regions.

System information

awscli version: aws-cli/2.1.30 Python/3.9.2 Darwin/19.6.0 source/x86_64 prompt/off
SageMaker Python SDK version: aws-cli/2.1.30 Python/3.9.2 Darwin/19.6.0 source/x86_64 prompt/off
Docker image: N/A
Python version: aws-cli/2.1.30 Python/3.9.2 Darwin/19.6.0 source/x86_64 prompt/off

Infrastructure testing for pipeline module

🚀 Feature request

Describe the feature you'd like

Tests for the infrastructure being created in the Training/Inference pipeline.

What is the motivation for the feature?

Unit testing is crucial anytime we want to make updates to the pipeline, to gain quick insight into whether there may be any issues with the code.

Some datatest suggestions.

Instead of decorating each function with @dt.working_directory(...), you could use a single pytest fixture with autouse=True and then set the scope to session, module, or function as appropriate:

@pytest.fixture(scope='session', autouse=True)
def set_working_directory():
    with dt.working_directory(__file__):
        yield  # Use directory for specified scope.

Also, if you need to validate and pd.DataFrame or pd.Series objects, there is now tighter Pandas integration via the dt.register_accessors() function: https://datatest.readthedocs.io/en/stable/reference/data-handling.html#pandas-accessors

data-module: The ability to specify the bucket where the data is read/written to at runtime.

🚀 Feature request

Describe the feature you'd like

The ability to specify the bucket where the data is read/written to at runtime.

What is the motivation for the feature?

The ability to specify the bucket where the data is read/written to at runtime.

Could you contribute? (Optional)

submitting a PR.

Update the region the Docker image is stored in

Perhaps this could be something we populate from the configuration file.

Decouple the creation of EC2 instance from creation of the rest of the environment

🚀 Feature request

Describe the feature you'd like

Decouple the creation of EC2 instance from creation of the rest of the environment. This means there are two distinct steps:

Create the environemnt, including S3 and Networking
Spin up the EC2 instance

What is the motivation for the feature?

Setting up the S3 Bucket and Networking requires less updating than creating the EC2 instance. Additionally, we may want to create multiple EC2 instance stacks, while only having single stack for the other parts of the environment.

Could you contribute? (Optional)

Yes.

Support the Airflow workflow engine

Describe the feature you'd like

To support pipeline definition that runs on the Airflow workflow eninge

What is the motivation for the feature?

The current pipeline definition is coupled with AWS StepFunction, but many users are already using Airflow to drive the execution of their existing ML workflow

Your contribution

Yes, happy to help!

Add unit testing to the .py files in src

[Security] Workflow branch.yaml is using vulnerable action actions/checkout

The workflow branch.yaml is referencing action actions/checkout using references v1. However this reference is missing the commit a6747255bd19d7a757dbdda8c654a9f84db19839 which may contain fix to the some vulnerability.
The vulnerability fix that is missing by actions version could be related to:
(1) CVE fix
(2) upgrade of vulnerable dependency
(3) fix to secret leak and others.
Please consider to update the reference to the action.

Add a scheduler in the data module for running the data pipeline.

🚀 Feature request

Describe the feature you'd like

Add a scheduler in the data module for running the data pipeline.

What is the motivation for the feature?

The data pipeline usually needs to be run hourly/daily/weekly/monthly.

Could you contribute? (Optional)

submitting a PR.

Create an Issue template

Support public internet access for SageMaker and EC2

Describe the feature you'd like

To have an option to support public internet.

What is the motivation for the feature?

To be able to use the environment setup for ec2/sagemaker on project that does not need to block public internet.

Your contribution

(Optional) Is there any way that you could help, e.g. by submitting a PR?

Self-contained notebook to screen SageMaker experiments

🚀 Feature request

Describe the feature you'd like

A self-contained notebook to verify that an AWS account is ready to run SageMaker experiments.

What is the motivation for the feature?

To quickly verify if an AWS account is ready to run SageMaker experiments, and no misconfiguration on the AWS account. The notebook must work even on restricted account without public internet access.

Could you contribute? (Optional)

Yes.

Edit on Github links from documentation 404

🐛 Bug report

[x ] I have checked that this issue has not already been reported.

Describe the bug

Edit on Github links from documentation 404 - is the branch name wrong?

Expected behavior

Take you to the relevant github page.

awslabs / mlmax Goto Github PK

mlmax's Introduction

ML Max

Quick Start

Help and Support

mlmax's People

Contributors

Stargazers

Watchers

Forkers

mlmax's Issues

==== ... fileobj = self.myfileobj = builtins.open(filename, mode or 'rb') FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sourcedir.tar.gz' ...

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

Describe the feature you'd like

What is the motivation for the feature?

Your contribution

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

Describe the feature you'd like

What is the motivation for the feature?

Your contribution

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

🐛 Bug report

Describe the bug

To reproduce

Expected behavior

System information

Describe the feature you'd like

What is the motivation for the feature?

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

📚 Documentation improvement

What did you find confusing? Please describe.

Suggested fix for documentation

Could you contribute? (Optional)

📚 Documentation improvement

What did you find confusing? Please describe.

Suggested fix for documentation

Could you contribute? (Optional)

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Describe the feature you'd like

What is the motivation for the feature?

Your contribution

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

🚀 Feature request

🚀 Feature request

Describe the feature you'd like

What is the motivation for the feature?

Could you contribute? (Optional)

Describe the feature you'd like

What is the motivation for the feature?

Your contribution

🚀 Feature request

====
...
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sourcedir.tar.gz'
...