aws / sagemaker-mxnet-training-toolkit

Toolkit for running MXNet training scripts on SageMaker. Dockerfiles used for building SageMaker MXNet Containers are at https://github.com/aws/deep-learning-containers.

License: Apache License 2.0

Python 100.00%

aws sagemaker mxnet docker

sagemaker-mxnet-training-toolkit's Introduction

SageMaker MXNet Training Toolkit

SageMaker MXNet Training Toolkit is an open-source library for using MXNet to train models on Amazon SageMaker. For inference, see SageMaker MXNet Inference Toolkit. For the Dockerfiles used for building SageMaker MXNet Containers, see AWS Deep Learning Containers. For information on running MXNet jobs on Amazon SageMaker, please refer to the SageMaker Python SDK documentation.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Testing

Set up a virtual environment for testing.

One way to set up a virtual environment:

# use the virtualenv package
# create a virtualenv
virtualenv -p python3 <name of env>
# activate the virtualenv
source <name of env>/bin/activate

Install requirements

pip install --upgrade .[test]

Local Test

To run a specific test

tox -- -k test/unit/test_training.py::test_train_for_distributed_scheduler

To run an entire file

tox -- test/unit/test_training.py

To run all tests within a folder (e.g. integration/local/)

Note: To run integration tests locally, you need to build an image. To trigger the image build, use the -B flag.

tox -- test/integration/local

You can also run them in parallel:

tox -- -n auto test/integration/local

To run against a specific interpreter (Python environment), use the -e flag

tox -e py37 -- test/unit/test_training.py

Remote Test

Make sure to provide the AWS account ID, Region, Docker base name, and tag. The Docker registry is composed of (aws_id, region). The image URI is composed of (docker_registry, docker_base_name, tag).

Resulting Image URI is composed as: {aws_id}.dkr.ecr.{region}.amazonaws.com/{docker_base_name}:{tag}

tox -- --aws-id <aws_id> --region <region> --docker-base-name <docker_base_name> --tag <tag> test/integration/sagemaker

For more details, refer to conftest.py.
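
The composition rule for the image URI can be sketched in Python. The function names `docker_registry` and `image_uri` and the sample account ID are illustrative assumptions, not part of the toolkit or its conftest.py.

```python
def docker_registry(aws_id: str, region: str) -> str:
    """Registry host, composed of (aws_id, region)."""
    return f"{aws_id}.dkr.ecr.{region}.amazonaws.com"


def image_uri(aws_id: str, region: str, docker_base_name: str, tag: str) -> str:
    """Image URI, composed of (docker_registry, docker_base_name, tag)."""
    return f"{docker_registry(aws_id, region)}/{docker_base_name}:{tag}"


# Placeholder values for illustration only (hypothetical account ID and Region).
print(image_uri("123456789012", "us-west-2", "preprod-mxnet", "1.6.0-cpu-py3"))
```

Printing the URI before launching the remote tests is a cheap way to confirm that the four options were passed in the intended order.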

License

SageMaker MXNet Training Toolkit is licensed under the Apache 2.0 License. It is copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at: http://aws.amazon.com/apache2.0/


sagemaker-mxnet-training-toolkit's Issues

Readme out of date?

The Readme says:

# All build instructions assume you're starting from the root directory of this repository

# MXNet 1.6.0, Python 3, CPU
$ cp dist/sagemaker_mxnet_container*.tar.gz docker/1.6.0/sagemaker_mxnet_container.tar.gz
$ cp -r docker/artifacts/* docker/1.6.0/py3
$ cd docker/1.6.0/py3
$ docker build -t preprod-mxnet:1.6.0-cpu-py3 -f Dockerfile.cpu .

However, there is no dist/sagemaker_mxnet_container*.tar.gz.
I only get one that is called sagemaker_mxnet_training-3.1.12.dev0.tar.gz.

I just executed:

# Create the binary
git clone https://github.com/aws/sagemaker-mxnet-container.git
cd sagemaker-mxnet-container
python setup.py sdist

That is what the beginning of the Readme says to do.

Improve Readme

The current Readme gives the wrong impression that building an image requires the Dockerfiles in this repo.
However, it looks like the Dockerfiles and tests folder have been migrated to https://github.com/aws/deep-learning-containers.
That needs to be documented.

It would also be great to note:

What is the exact purpose of this repository?
How is it different from the dlc repo?
How do changes made in the dlc repo (an update to a Docker image, for example) get reflected in this repo?

requirements.txt supported for train but not deploy?

I'm installing GluonCV via a requirements.txt file in my source_dir, which seems to work fine for my training job but is then ignored when the container starts up for inference, leading to ModuleNotFoundError.

I guess I could install the dependency inline in the script with something like the below, but then the requirements.txt functionality is kind of inconsistent, right? I could have just done inline installs in the first place...

subprocess.call([sys.executable, "-m", "pip", "install", "--upgrade", "gluoncv"])

Is processing requirements.txt on inference startup a potential enhancement? Or an idea that's already been considered and decided against for some reason? Would it be implemented here, or in the base sagemaker-containers?
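
A slightly more defensive version of the inline-install workaround is sketched below. It only calls pip when the import actually fails. `ensure_package`, `pip_install`, and the `installer` parameter are hypothetical helper names introduced for illustration; nothing here is provided by the container.

```python
import importlib
import subprocess
import sys
from typing import Callable, Optional


def pip_install(package: str) -> None:
    """Install a package with the interpreter that is running this script."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", package])


def ensure_package(module_name: str,
                   package: Optional[str] = None,
                   installer: Callable[[str], None] = pip_install):
    """Import module_name, installing it first only if the import fails.

    Hypothetical helper: `package` is the pip name when it differs from
    the importable module name.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        installer(package or module_name)
        # Make the freshly installed package visible to the import system.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Intended use at the top of an inference script (not executed here):
# gluoncv = ensure_package("gluoncv")
```

This keeps the script usable in both phases: the training container, where requirements.txt has already been applied, skips the install entirely.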

Utility to handle Pipe mode input like PipeModeDataset in tensorflow_extensions

Describe the feature you'd like
Utility to handle Pipe mode input like PipeModeDataset in https://github.com/aws/sagemaker-tensorflow-extensions/blob/master/src/sagemaker_tensorflow/pipemode.py.
Is there anything similar for the MXNet ecosystem?
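
Absent a dedicated utility, a Pipe mode channel can be read as a plain byte stream. The sketch below assumes the channel is exposed as a named pipe at /opt/ml/input/data/<channel>_<epoch>; treat that path convention as an assumption to verify against the SageMaker documentation. `read_pipe_channel` is a hypothetical name, and the function only yields raw bytes: it does not parse records the way PipeModeDataset does.

```python
import os
from typing import Iterator


def read_pipe_channel(channel: str,
                      epoch: int,
                      chunk_size: int = 1 << 20,
                      base_dir: str = "/opt/ml/input/data") -> Iterator[bytes]:
    """Yield raw byte chunks from <base_dir>/<channel>_<epoch> until end of stream."""
    path = os.path.join(base_dir, f"{channel}_{epoch}")
    with open(path, "rb") as stream:
        while True:
            # A pipe may return fewer bytes than requested; empty means EOF.
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Record framing (for example RecordIO) would still have to be layered on top of this stream by the training script.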

Container Support for Gluon Toolkits

Reference: 0413477381

The Gluon toolkits (GluonCV, GluonNLP, GluonTS) are essential constituents of the MXNet ecosystem. Since this is a container for MXNet, and given that even packages outside of the ecosystem such as scikit-learn and pandas are included, shall we make sure these toolkits are added in the next release?

MXNet 1.5 container

Will an MXNet 1.5 SageMaker container be released?

build args to specify additional python packages and cuda version

I'm no Docker wizard, but if this were possible then #31 could be achieved by modifying the build command rather than creating new Dockerfiles.

I'm unsure how to implement it, but I imagine something like:

Build the base image:
docker build -t mxnet-base:1.2.1-gpu-py3 --build-arg cuda=9.2 -f Dockerfile.gpu .

Build the final image:
docker build -t mxnet-gluonnlp-cu92:1.2.1-gpu-py3 --build-arg cuda=9.2 --build-arg additional_packages=pandas,gluonnlp -f Dockerfile.gpu .

boto can't find credentials

I'm trying to use boto3 from a sagemaker-mxnet container, but I get this error when I submit the training job:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

The script runs fine on a sagemaker notebook instance using the same IAM role.

EDIT:

This appears to happen because Network Isolation is turned on by default, never mind.

DEFAULT_FILENAMES hardcoded

I'm not sure why the default file names are hardcoded rather than being a member of ModuleTransformer. Wouldn't it make sense to have them be parameters that are set to those values by default?
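
One possible shape for this request is sketched below. `ConfigurableTransformer` is a hypothetical stand-in, not the toolkit's ModuleTransformer, and the file names in `DEFAULT_FILENAMES` are illustrative assumptions that may differ from the values actually hardcoded.

```python
import posixpath

# Illustrative defaults only; the toolkit's actual values may differ.
DEFAULT_FILENAMES = {
    "symbol": "model-symbol.json",
    "params": "model-0000.params",
    "shapes": "model-shapes.json",
}


class ConfigurableTransformer:
    """Hypothetical stand-in for ModuleTransformer with overridable file names."""

    def __init__(self, model_dir, filenames=None):
        self.model_dir = model_dir
        # Copy the defaults so per-instance overrides never mutate the constant.
        self.filenames = dict(DEFAULT_FILENAMES)
        self.filenames.update(filenames or {})

    def path(self, key):
        """Full path of the file registered under key (container paths are POSIX)."""
        return posixpath.join(self.model_dir, self.filenames[key])


# Override a single file name and keep the other defaults.
transformer = ConfigurableTransformer("/opt/ml/model", {"params": "model-0010.params"})
print(transformer.path("params"))
print(transformer.path("symbol"))
```

Callers that pass nothing get the current behavior, so the change would be backward compatible.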

Running MXNet 1.1.0 container

Hi there,
when running the line:

"docker build -t preprod-mxnet:1.1.0-cpu-py2 --build-arg py_version=2 --build-arg framework_installable=mxnet-1.1.0-py2.py3-none-manylinux1_x86_64.whl -f Dockerfile.cpu ."

I get the following error:

COPY failed: stat /var/lib/docker/tmp/docker-builder295259424/mxnet-1.1.0-py2.py3-none-manylinux1_x86_64.whl: no such file or directory

It seems like a file is missing, or am I missing a step to put the file in the right place?

Thanks

Feature request: support latest mxnet package via requirements.txt?

Hi,

I was wondering if the MXNet Docker image could support a requirements.txt file, just like the TensorFlow container does. The latest MXNet environments offer significant performance improvements. Additionally, the latest cu92 environment fixes several memory leak bugs. I would love to use these latest features in my work.

Thanks.

Add horovod/mpi support to generic container

Currently, the generic container is bare-bones and has no setup for MPI/SSH/Horovod.
Hence we are skipping the Horovod tests on mxnet.cpu [generic container].

The same is the case in TF:
https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/a22e3df0faf66b215c24c1bff6f334e14c39d5cf/test/integration/local/test_horovod.py#L26-L29

https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/a22e3df0faf66b215c24c1bff6f334e14c39d5cf/test/integration/local/test_horovod.py#L36-L42

How to support multiple model inputs?

https://github.com/aws/sagemaker-mxnet-container/blob/ee9098c8c2de6a635dcd9f4b0819dc5340061cde/src/sagemaker_mxnet_container/serving.py#L228

If my mxnet model takes 2 data inputs (both are float arrays), how should I make it work at this point?

I defined my model-shapes.json like this:
[{"shape": [1, 12], "name": "data0"}, {"shape": [1, 12], "name": "data1"}]
And one example input looks like:
data = {'data0':[1.0, 2904.0, 1452.0, 464.0, 3022.0, 2948.0, 2548.0, 2.0, 0.0, 0.0, 0.0, 0.0], 'data1':[1.0, 2204.0, 1552.0, 494.0, 3032.0, 298.0, 2568.0, 2.0, 0.0, 0.0, 0.0, 0.0]}

But I got errors on the server side:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_containers/_functions.py", line 84, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_mxnet_container/serving.py", line 229, in default_input_fn
    [data_shape] = self._model.data_shapes
ValueError: too many values to unpack (expected 1)
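
The traceback comes from unpacking a list of two data shapes into a single-element target. A minimal sketch of iterating over every declared input instead is shown below. `split_inputs` is a hypothetical helper, it assumes each shape is 2-D (rows, columns) as in the model-shapes.json above, and it uses plain Python lists rather than MXNet arrays.

```python
from typing import Dict, List, Sequence, Tuple


def split_inputs(payload: Dict[str, Sequence[float]],
                 data_shapes: List[Tuple[str, Tuple[int, int]]]) -> Dict[str, List[List[float]]]:
    """Return one 2-D batch per named input, checked against its declared shape."""
    batches = {}
    for name, (rows, cols) in data_shapes:  # iterate instead of unpacking one element
        values = list(payload[name])
        if len(values) != rows * cols:
            raise ValueError(
                f"input {name!r} has {len(values)} values, expected {rows * cols}")
        batches[name] = [values[r * cols:(r + 1) * cols] for r in range(rows)]
    return batches


data_shapes = [("data0", (1, 12)), ("data1", (1, 12))]
data = {
    "data0": [1.0, 2904.0, 1452.0, 464.0, 3022.0, 2948.0, 2548.0, 2.0, 0.0, 0.0, 0.0, 0.0],
    "data1": [1.0, 2204.0, 1552.0, 494.0, 3032.0, 298.0, 2568.0, 2.0, 0.0, 0.0, 0.0, 0.0],
}

# Reproduce the failure: two shapes cannot be unpacked into one target.
try:
    [data_shape] = data_shapes
except ValueError as err:
    print("single-input unpacking fails:", err)

batches = split_inputs(data, data_shapes)
print(sorted(batches))
```

In a real deployment this logic would presumably live in a custom input_fn that replaces the default one.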

Recycling old, unused files

Since the Dockerfiles and tests have been migrated to https://github.com/aws/deep-learning-containers, we should clean up the redundant (and hence confusing) files, such as:

Dockerfiles
test/

model.deploy fails

I built my custom sagemaker container with https://github.com/aws/sagemaker-mxnet-container/blob/master/docker/1.4.0/final/Dockerfile.cpu

I then used this notebook: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/mxnet_gluon_sentiment/mxnet_sentiment_analysis_with_gluon.ipynb

Training completes successfully, but m.deploy fails, complaining that nginx is missing:
"No such file or directory: 'nginx'"

Information about Building with Elastic Inference is removed from readme.rst

Hi team, can you check why the information provided in "Building the SageMaker Elastic Inference MXNet container" (https://github.com/aws/sagemaker-mxnet-container/blob/e8f4a6a7904541fb4a651bb7d86b5c85496abb67/README.rst#building-the-sagemaker-elastic-inference-mxnet-container) was removed in the latest versions?
This link (https://github.com/aws/sagemaker-mxnet-container#building-the-sagemaker-elastic-inference-mxnet-container) is still referenced in: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html
Thanks

Container support for gluon-nlp

I'm thinking of contributing a Docker image with support for gluonnlp and GPU training. It relies on the most recent version of MXNet and CUDA 9.2.

Bugfix to save mnist model in order to use default inference implementation

Hello,

To make the saved model work with the default inference implementation, I had to change sagemaker-mxnet-container/test/resources/keras/keras_mnist.py line 93 from

signature = [{'name': data_name, ...

to

signature = [{'name': data_name[0], ...

Reason: data_name is a list of strings, but the default inference code expects the name to be a str.

Without the change, the saving logic still works, but once the model is deployed, inference throws an exception.
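
The type mismatch can be illustrated without Keras or MXNet. In the sketch below, `make_signature` is a hypothetical helper, the 'shape' key mirrors the model-shapes.json format quoted in another issue, and the MNIST shape is an assumed placeholder for the elided part of the original line.

```python
import json


def make_signature(data_names, data_shapes):
    """Build one signature entry per input, with each name stored as a str."""
    signature = []
    for name, shape in zip(data_names, data_shapes):
        if not isinstance(name, str):
            raise TypeError(f"input name must be str, got {type(name).__name__}")
        signature.append({"name": name, "shape": list(shape)})
    return signature


data_name = ["data"]               # a list of strings, as described in the report
data_shape = [(1, 1, 28, 28)]      # assumed placeholder shape

# Passing the whole list as the name serializes fine but stores a list, not a str.
broken = [{"name": data_name, "shape": list(data_shape[0])}]
fixed = make_signature(data_name, data_shape)

print(json.dumps(broken))
print(json.dumps(fixed))
```

Both variants serialize to JSON without error, which matches the observation that saving still works and the failure only surfaces at inference time.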