sagemaker-mxnet-training-toolkit's Introduction

SageMaker MXNet Training Toolkit

SageMaker MXNet Training Toolkit is an open-source library for using MXNet to train models on Amazon SageMaker. For inference, see SageMaker MXNet Inference Toolkit. For the Dockerfiles used for building SageMaker MXNet Containers, see AWS Deep Learning Containers. For information on running MXNet jobs on Amazon SageMaker, please refer to the SageMaker Python SDK documentation.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Testing

Set up a virtual environment for testing.

One of several ways to set up a virtual environment:

# use the virtualenv package
# create a virtualenv
virtualenv -p python3 <name of env>
# activate the virtualenv
source <name of env>/bin/activate

Install requirements

pip install --upgrade .[test]

Local Test

To run a specific test

tox -- -k test/unit/test_training.py::test_train_for_distributed_scheduler

To run an entire file

tox -- test/unit/test_training.py

To run all tests within a folder [e.g. integration/local/]

Note: To run integration tests locally, an image must be built first. To trigger the image build, use the -B flag.

tox -- test/integration/local

You can also run them in parallel:

tox -- -n auto test/integration/local

To run against a specific interpreter (Python environment), use the -e flag

tox -e py37 -- test/unit/test_training.py

Remote Test

Make sure to provide the AWS account ID, Region, Docker base name, and tag. The Docker registry is composed of (aws_id, region). The image URI is composed of (docker_registry, docker_base_name, tag).

Resulting Image URI is composed as: {aws_id}.dkr.ecr.{region}.amazonaws.com/{docker_base_name}:{tag}

tox -- --aws-id <aws_id> --region <region> --docker-base-name <docker_base_name> --tag <tag> test/integration/sagemaker

For more details, refer to conftest.py.
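
As an illustration of the image URI composition described above, here is a minimal Python sketch. The function name build_image_uri and the sample values are hypothetical and are not taken from conftest.py; only the URI pattern itself comes from this README.

```python
def build_image_uri(aws_id: str, region: str, docker_base_name: str, tag: str) -> str:
    """Compose an ECR image URI from the four values passed to the remote tests."""
    # The registry depends only on the account ID and the region.
    docker_registry = f"{aws_id}.dkr.ecr.{region}.amazonaws.com"
    # The image URI appends the base name and tag to the registry.
    return f"{docker_registry}/{docker_base_name}:{tag}"


# Placeholder values for illustration only.
print(build_image_uri("123456789012", "us-west-2", "preprod-mxnet", "1.6.0-cpu-py3"))
```

The same four values are the ones supplied through --aws-id, --region, --docker-base-name and --tag in the tox command.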

License

SageMaker MXNet Training Toolkit is licensed under the Apache 2.0 License. It is copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at: http://aws.amazon.com/apache2.0/

sagemaker-mxnet-training-toolkit's People

Contributors

ajaykarpur, arjkesh, chaibapchya, choibyungwook, chuyang-deng, danabens, ddavydenko, icywang86rui, iquintero, jesterhazy, karan6181, laurenyu, leezu, lianyiding, mvsusp, nadiaya, nihalharish, owen-t, saimidu, saravsak, vandanavk, winstonaws, yangaws, yystreet, zhreshold

sagemaker-mxnet-training-toolkit's Issues

Readme out of date?

The Readme says:

# All build instructions assume you're starting from the root directory of this repository

# MXNet 1.6.0, Python 3, CPU
$ cp dist/sagemaker_mxnet_container*.tar.gz docker/1.6.0/sagemaker_mxnet_container.tar.gz
$ cp -r docker/artifacts/* docker/1.6.0/py3
$ cd docker/1.6.0/py3
$ docker build -t preprod-mxnet:1.6.0-cpu-py3 -f Dockerfile.cpu .

However, there is no dist/sagemaker_mxnet_container*.tar.gz.
I only get one that is called sagemaker_mxnet_training-3.1.12.dev0.tar.gz.

I just executed:

# Create the binary
git clone https://github.com/aws/sagemaker-mxnet-container.git
cd sagemaker-mxnet-container
python setup.py sdist

As the beginning of the Readme says to do.

Improve Readme

The current Readme gives the wrong impression that building an image requires the Dockerfiles present in this repo.
However, it looks like the Dockerfiles and the tests folder have been migrated to https://github.com/aws/deep-learning-containers
That needs to be documented.

It would also be great to note:

  • What's the exact purpose of this repository?
  • How is it different from the dlc repo?
  • How do changes made in the dlc repo [an update to the docker image, for example] get reflected in this repo?

requirements.txt supported for train but not deploy?

I'm installing GluonCV via a requirements.txt file in my source_dir, which seems to be working fine for my training job but then ignored when the container starts up for inference: leading to ModuleNotFoundError.

I guess I could install the dependency inline in the script with something like the below, but then the requirements.txt functionality is kind of inconsistent right? I could have just done inline installs in the first place...

subprocess.call([sys.executable, "-m", "pip", "install", "--upgrade", "gluoncv"])

Is processing requirements.txt on inference startup a potential enhancement? Or an idea that's already been considered and decided against for some reason? Would it be implemented here, or in the base sagemaker-containers?
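
For reference, the inline workaround mentioned above could be wrapped so that pip only runs when the module is actually missing. This is a minimal sketch, not part of the toolkit: the helper name ensure_package is hypothetical, and it assumes the inference container can reach a package index at startup.

```python
import importlib.util
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Install a package with pip only if its module cannot be found.

    Returns True if an install was attempted, False if the module was already present.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is not None:
        return False
    # Same pip invocation as the inline workaround, run through the current interpreter.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--upgrade", pip_name or module_name]
    )
    return True
```

Calling ensure_package("gluoncv") at the top of the inference script, before importing gluoncv, would then skip the install on containers that already have it.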

Container Support for Gluon Toolkits

Reference: 0413477381

The Gluon toolkits (GluonCV, GluonNLP, GluonTS) are the essential constituents of the MXNet ecosystem. Being a container for MXNet, and given that even packages outside of the ecosystem such as scikit-learn and pandas are included, shall we make sure we add these toolkits in the next release?

build args to specify additional python packages and cuda version

I'm no docker wizard but if this was possible then #31 could be achieved through modifying the build command, rather than creating new Dockerfiles.

Unsure how to implement but I imagine something like:

Build the base image
docker build -t mxnet-base:1.2.1-gpu-py3 --build-arg cuda=9.2 -f Dockerfile.gpu .

Build the final image
docker build -t mxnet-gluonnlp-cu92:1.2.1-gpu-py3 --build-arg cuda=9.2 --build-arg additional_packages=pandas,gluonnlp -f Dockerfile.gpu .

boto can't find credentials

I'm trying to use boto3 from a sagemaker-mxnet container, but I get this error when I submit the training job:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

The script runs fine on a sagemaker notebook instance using the same IAM role.

EDIT:

This appears to happen bc Network Isolation is turned on by default, never mind.

DEFAULT_FILENAMES hardcoded

I'm not sure why the default file names are hardcoded rather than being members of ModuleTransformer. Wouldn't it make sense to have them be parameters that are set to those values by default?

Running MXNet 1.1.0 container

Hi there,
when running the line:

"docker build -t preprod-mxnet:1.1.0-cpu-py2 --build-arg py_version=2 --build-arg framework_installable=mxnet-1.1.0-py2.py3-none-manylinux1_x86_64.whl -f Dockerfile.cpu ."

I get the following error:

COPY failed: stat /var/lib/docker/tmp/docker-builder295259424/mxnet-1.1.0-py2.py3-none-manylinux1_x86_64.whl: no such file or directory

Seems like a file is missing or I am missing a step to get the file at the right place?

Thanks

Feature request: support latest mxnet package via requirements.txt?

Hi,

I was wondering if the mxnet docker image could support a requirements.txt file, just like the tensorflow container does. The latest mxnet environments offer significant performance improvements. Additionally, the latest cu92 environment fixes several memory leak bugs. I would love to use these latest features in my work.

Thanks.

Add horovod/mpi support to generic container

How to support multiple model inputs?

https://github.com/aws/sagemaker-mxnet-container/blob/ee9098c8c2de6a635dcd9f4b0819dc5340061cde/src/sagemaker_mxnet_container/serving.py#L228

If my mxnet model takes 2 data inputs (both are float arrays), how should I make it work at this point?

  • I defined my model-shapes.json like
    [{"shape": [1, 12], "name": "data0"}, {"shape": [1, 12], "name": "data1"}]

  • And one example input looks like:
    data = {'data0':[1.0, 2904.0, 1452.0, 464.0, 3022.0, 2948.0, 2548.0, 2.0, 0.0, 0.0, 0.0, 0.0], 'data1':[1.0, 2204.0, 1552.0, 494.0, 3032.0, 298.0, 2568.0, 2.0, 0.0, 0.0, 0.0, 0.0]}

But I got errors on the server side:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_containers/_functions.py", line 84, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_mxnet_container/serving.py", line 229, in default_input_fn
    [data_shape] = self._model.data_shapes

ValueError: too many values to unpack (expected 1)
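
The error can be reproduced without MXNet. The sketch below assumes data_shapes is a list of (name, shape) pairs, one per entry in model-shapes.json; the helper split_inputs is hypothetical and only illustrates one way a custom input function might handle several named inputs. It is not the toolkit's implementation.

```python
# Shapes as declared in the model-shapes.json from this report (assumed layout).
data_shapes = [("data0", (1, 12)), ("data1", (1, 12))]

try:
    # Same single-element unpacking as in the traceback: fails with two inputs.
    [data_shape] = data_shapes
except ValueError as err:
    print(err)


def split_inputs(payload, shapes):
    """Hypothetical helper: reshape one flat list per declared input into 2-D rows."""
    result = {}
    for name, (rows, cols) in shapes:
        values = payload[name]
        if len(values) != rows * cols:
            raise ValueError(f"{name}: expected {rows * cols} values, got {len(values)}")
        # Slice the flat list into `rows` lists of `cols` values each.
        result[name] = [values[r * cols:(r + 1) * cols] for r in range(rows)]
    return result
```

With the example payload from this report, split_inputs(data, data_shapes) would return a dict with one 1 x 12 nested list per input name, which a custom input function could then convert to arrays.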

Information about Building with Elastic Inference is removed from readme.rst

Bugfix to save mnist model in order to use default inference implementation

Hello,

To make the saved model work with the default inference implementation, I had to change sagemaker-mxnet-container/test/resources/keras/keras_mnist.py line 93, from

signature = [{'name': data_name, ...

to

signature = [{'name': data_name[0], ...

Reason: data_name is a list of strings, but the default inference code expects name to be a str.

Without the change, the saving logic still works, but once deployed, inference will throw an exception.
