aws / deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.

Home Page: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

License: Other

Languages: Python 95.09%, Shell 4.35%, JavaScript 0.42%, C 0.08%, Dockerfile 0.03%, PureBasic 0.03%
Topics: aws, sagemaker, docker, tensorflow, tensorflow2, mxnet, pytorch

deep-learning-containers's Introduction

AWS Deep Learning Containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet. Deep Learning Containers provide optimized environments with TensorFlow and MXNet, Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries and are available in the Amazon Elastic Container Registry (Amazon ECR).

The AWS DLCs are used in Amazon SageMaker as the default vehicles for your SageMaker jobs such as training, inference, and transforms. They have also been tested for machine learning workloads on Amazon EC2, Amazon ECS, and Amazon EKS.

For the list of available DLC images, see Available Deep Learning Containers Images. You can find more information on the images available in SageMaker in the SageMaker documentation.

License

This project is licensed under the Apache-2.0 License.

smdistributed.dataparallel and smdistributed.modelparallel are released under the AWS Customer Agreement.

Table of Contents

Getting Started

Building your Image

Running Tests Locally

Getting started

This section describes the setup to build and test the DLCs on Amazon SageMaker, EC2, ECS, and EKS.

As an example, we walk through building an MXNet GPU Python 3 training container.

  1. Clone the repo and set the following environment variables:
    export ACCOUNT_ID=<YOUR_ACCOUNT_ID>
    export REGION=us-west-2
    export REPOSITORY_NAME=beta-mxnet-training
  2. Log in to ECR (a boto3 alternative is sketched after this list)
    aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com
  3. Assuming your working directory is the cloned repo, create a virtual environment to use the repo and install requirements
    python3 -m venv dlc
    source dlc/bin/activate
    pip install -r src/requirements.txt
  4. Perform the initial setup
    bash src/setup.sh mxnet
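
If you prefer to script the ECR login from step 2 instead of calling the AWS CLI, the sketch below is one way to do it with boto3 (assumed to be installed; it picks up the same credentials as the AWS CLI). It is illustrative only and not part of this repo's tooling.

    import base64
    import subprocess

    import boto3

    # Fetch a temporary ECR auth token and pipe the password to `docker login`,
    # equivalent to `aws ecr get-login-password | docker login ...` in step 2.
    ecr = boto3.client("ecr", region_name="us-west-2")
    auth = ecr.get_authorization_token()["authorizationData"][0]
    username, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
    registry = auth["proxyEndpoint"]  # e.g. https://<account-id>.dkr.ecr.us-west-2.amazonaws.com

    subprocess.run(
        ["docker", "login", "--username", username, "--password-stdin", registry],
        input=password.encode(),
        check=True,
    )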

Building your image

The paths to the Dockerfiles follow a specific pattern, e.g., mxnet/training/docker/<version>/<python_version>/Dockerfile.

These paths are specified by the buildspec.yml residing at mxnet/training/buildspec.yml, i.e., <framework>/<training|inference>/buildspec.yml. If you want to build the Dockerfile for a particular version, or introduce a new version of the framework, re-create the folder structure as above and modify the buildspec.yml file to specify the version of the Dockerfile you want to build.
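
As a purely illustrative aid (this helper is hypothetical and not part of the repo's build code), the path components compose as follows:

    from pathlib import Path

    def dockerfile_path(framework, job_type, version, py_version, device_type):
        # <framework>/<training|inference>/docker/<version>/<python_version>/Dockerfile.<device_type>
        return Path(framework) / job_type / "docker" / version / py_version / f"Dockerfile.{device_type}"

    print(dockerfile_path("mxnet", "training", "1.6.0", "py3", "gpu"))
    # -> mxnet/training/docker/1.6.0/py3/Dockerfile.gpu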

  1. To build all the dockerfiles specified in the buildspec.yml locally, use the command
    python src/main.py --buildspec mxnet/training/buildspec.yml --framework mxnet
    The above step should take a while to complete the first time you run it, since it has to download all base layers and create the intermediate layers. Subsequent runs should be much faster.
  2. If you would instead like to build only a single image
    python src/main.py --buildspec mxnet/training/buildspec.yml \
                       --framework mxnet \
                       --image_types training \
                       --device_types cpu \
                       --py_versions py3
  3. The arguments --image_types, --device_types and --py_versions are all comma-separated lists whose possible values are as follows:
    --image_types <training/inference>
    --device_types <cpu/gpu>
    --py_versions <py2/py3>
  4. For example, to build all GPU training containers, you could use the following command
    python src/main.py --buildspec mxnet/training/buildspec.yml \
                       --framework mxnet \
                       --image_types training \
                       --device_types gpu \
                       --py_versions py3

Upgrading the framework version

  1. Suppose there is a new framework version for MXNet (version 1.7.0); this needs to be changed in the buildspec.yml file for MXNet training.
    # mxnet/training/buildspec.yml
      1   account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
      2   region: &REGION <set-$REGION-in-environment>
      3   framework: &FRAMEWORK mxnet
      4   version: &VERSION 1.6.0 *<--- Change this to 1.7.0*
          ................
  2. The Dockerfile for this should exist at mxnet/training/docker/1.7.0/py3/Dockerfile.gpu. This path is dictated by the docker_file key for each repository.
    # mxnet/training/buildspec.yml
     41   images:
     42     BuildMXNetCPUTrainPy3DockerImage:
     43       <<: *TRAINING_REPOSITORY
              ...................
     49       docker_file: !join [ docker/, *VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
     
  3. Build the container as described above.
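
To sanity-check the version change before kicking off a build, you can resolve the docker_file paths yourself. The following is a minimal, hypothetical sketch (assuming PyYAML is installed and !join is the only custom tag in the file); the repo's build code under src/ performs this resolution for you during a real build.

    import yaml

    class BuildspecLoader(yaml.SafeLoader):
        """SafeLoader extended to understand the custom !join tag used in buildspec.yml."""

    def _join(loader, node):
        # !join concatenates the items of a YAML sequence (anchors included) into one string.
        return "".join(str(item) for item in loader.construct_sequence(node))

    BuildspecLoader.add_constructor("!join", _join)

    with open("mxnet/training/buildspec.yml") as f:
        spec = yaml.load(f, Loader=BuildspecLoader)

    # Print the Dockerfile path each image entry resolves to, so you can confirm
    # that the new version (e.g. 1.7.0) points at a file that actually exists.
    for name, image in spec.get("images", {}).items():
        print(name, "->", image.get("docker_file"))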

Adding artifacts to your build context

  1. If you are copying an artifact from your build context like this:
    # deep-learning-containers/mxnet/training/docker/1.6.0/py3
    COPY README-context.rst README.rst
    then README-context.rst needs to first be copied into the build context. You can do this by adding the artifact in the framework buildspec file under the context key:
    # mxnet/training/buildspec.yml
     19 context:
     20   README.xyz: *<---- Object name (Can be anything)*
     21     source: README-context.rst *<--- Path for the file to be copied*
     22     target: README.rst *<--- Name for the object in the build context*
  2. Adding it under context makes it available to all images. If you need to make it available only for training or inference images, add it under training_context or inference_context.
     19   context:
        .................
     23       training_context: &TRAINING_CONTEXT
     24         README.xyz:
     25           source: README-context.rst
     26           target: README.rst
        ...............
  3. If you need it for a single container add it under the context key for that particular image:
     41   images:
     42     BuildMXNetCPUTrainPy3DockerImage:
     43       <<: *TRAINING_REPOSITORY
              .......................
     50       context:
     51         <<: *TRAINING_CONTEXT
     52         README.xyz:
     53           source: README-context.rst
     54           target: README.rst
  4. Build the container as described above.

Adding a package

The following steps outline how to add a package to your image. For more information on customizing your container, see Building AWS Deep Learning Containers Custom Images.

  1. Suppose you want to add a package to the MXNet 1.6.0 py3 GPU docker image; change the dockerfile from:
    # mxnet/training/docker/1.6.0/py3/Dockerfile.gpu
    139 RUN ${PIP} install --no-cache --upgrade \
    140     keras-mxnet==2.2.4.2 \
    ...........................
    159     ${MX_URL} \
    160     awscli
    to
    139 RUN ${PIP} install --no-cache --upgrade \
    140     keras-mxnet==2.2.4.2 \
    ...........................
    160     awscli \
    161     octopush
  2. Build the container as described above.
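
After rebuilding, a quick way to confirm the new package made it into the image is to import it in a throwaway container. A minimal sketch follows; the image tag is an assumption, so substitute whatever tag your local build produced, and swap octopush for your package.

    import subprocess

    # Hypothetical local tag; replace with the tag produced by your build.
    image = "beta-mxnet-training:1.6.0-gpu-py3"

    # Run a short-lived container that simply imports the newly added package.
    result = subprocess.run(
        ["docker", "run", "--rm", image, "python", "-c", "import octopush; print('octopush OK')"],
        capture_output=True,
        text=True,
    )
    print(result.stdout or result.stderr)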

Running tests locally

While iterating on your PR, it is sometimes helpful to run your tests locally to avoid using too many extraneous resources or waiting for a build to complete. Testing is supported using pytest.

As with building locally, testing locally requires access to a personal/team AWS account. To test:

  1. Either on an EC2 instance with the deep-learning-containers repo cloned, or on your local machine, make sure you have the images you want to test available locally (you will likely need to pull them from ECR). Then change directory into the cloned folder and install the requirements for tests.

    cd deep-learning-containers/
    pip install -r src/requirements.txt
    pip install -r test/requirements.txt
  2. In a shell, export the environment variable DLC_IMAGES as a space-separated list of the ECR URIs to be tested. Set CODEBUILD_RESOLVED_SOURCE_VERSION to some unique identifier that you can use to identify the resources your test spins up. Set PYTHONPATH to the absolute path of the src/ folder. A sanity-check sketch appears after this list. Example: [Note: change the repository name to the one set up in your account]

    export DLC_IMAGES="$ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/pr-pytorch-training:training-gpu-py3 $ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/pr-mxnet-training:training-gpu-py3"
    export PYTHONPATH=$(pwd)/src
    export CODEBUILD_RESOLVED_SOURCE_VERSION="my-unique-test"
  3. Our pytest framework expects the root dir to be test/dlc_tests, so change directory in your shell to that location

    cd test/dlc_tests
  4. To run all tests (in series) associated with your image for a given platform, use the following command

    # EC2
    pytest -s -rA ec2/ -n=auto
    # ECS
    pytest -s -rA ecs/ -n=auto
    
    # EKS
    cd ../
    export TEST_TYPE=eks
    python test/testrunner.py

    Remove -n=auto to run the tests sequentially.

  5. To run a specific test file, provide the full path to the test file

    pytest -s ecs/mxnet/training/test_ecs_mxnet_training.py
  6. To run a specific test function (in this example we use the cpu dgl ecs test), modify the command to look like so:

    pytest -s ecs/mxnet/training/test_ecs_mxnet_training.py::test_ecs_mxnet_training_dgl_cpu
  7. To run SageMaker local mode tests, launch a CPU or GPU EC2 instance with the latest Deep Learning AMI.

    • Clone your GitHub branch with your changes and run the following commands
      git clone https://github.com/{github_account_id}/deep-learning-containers/
      cd deep-learning-containers && git checkout {branch_name}
    • Log in to the ECR repo where the newly built Docker images exist
      $(aws ecr get-login --no-include-email --registry-ids ${aws_id} --region ${aws_region})
    • Change to the appropriate directory (sagemaker_tests/{framework}/{job_type}) based on framework and job type of the image being tested. The example below refers to testing mxnet_training images
      cd test/sagemaker_tests/mxnet/training/
      pip3 install -r requirements.txt
    • To run the SageMaker local integration tests (aside from tensorflow_inference), use the pytest command below:
      python3 -m pytest -v integration/local --region us-west-2 \
      --docker-base-name ${aws_account_id}.dkr.ecr.us-west-2.amazonaws.com/mxnet-inference \
       --tag 1.6.0-cpu-py36-ubuntu18.04 --framework-version 1.6.0 --processor cpu \
       --py-version 3
    • To test tensorflow_inference py3 images, run the command below:
      python3 -m  pytest -v integration/local \
      --docker-base-name ${aws_account_id}.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference \
      --tag 1.15.2-cpu-py36-ubuntu16.04 --framework-version 1.15.2 --processor cpu
  8. To run SageMaker remote tests on your account, please set up the following prerequisites

    • Create an IAM role named “SageMakerRole” in the above account and attach the below AWS managed policies
      AmazonSageMakerFullAccess
      
    • Change to the appropriate directory (sagemaker_tests/{framework}/{job_type}) based on the framework and job type of the image being tested. The example below refers to testing mxnet_training images
      cd test/sagemaker_tests/mxnet/training/
      pip3 install -r requirements.txt
    • To run the SageMaker remote integration tests (aside from tensorflow_inference), use the pytest command below:
      pytest integration/sagemaker/test_mnist.py \
      --region us-west-2 --docker-base-name mxnet-training \
      --tag training-gpu-py3-1.6.0 --framework-version 1.6.0 --aws-id {aws_id} \
      --instance-type ml.p3.8xlarge
    • For tensorflow_inference py3 images run the below command
      python3 -m pytest test/integration/sagemaker/test_tfs.py --registry {aws_account_id} \
      --region us-west-2  --repo tensorflow-inference --instance-types ml.c5.18xlarge \
      --tag 1.15.2-py3-cpu-build --versions 1.15.2
  9. To run SageMaker benchmark tests on your account please perform the following steps:

    • Create a file named sm_benchmark_env_settings.config in the deep-learning-containers/ folder
    • Add the following to the file (commented lines are optional):
      export DLC_IMAGES="<image_uri_1-you-want-to-benchmark-test>"
      # export DLC_IMAGES="$DLC_IMAGES <image_uri_2-you-want-to-benchmark-test>"
      # export DLC_IMAGES="$DLC_IMAGES <image_uri_3-you-want-to-benchmark-test>"
      export BUILD_CONTEXT=PR
      export TEST_TYPE=benchmark-sagemaker
      export CODEBUILD_RESOLVED_SOURCE_VERSION=$USER
      export REGION=us-west-2
    • Run:
      source sm_benchmark_env_settings.config
    • To test all images for multiple frameworks, run:
      pip install -r requirements.txt
      python test/testrunner.py
    • To test one individual framework image type, run:
      # Assuming that the cwd is deep-learning-containers/
      cd test/dlc_tests
      pytest benchmark/sagemaker/<framework-name>/<image-type>/test_*.py
    • The scripts and model-resources used in these tests will be located at:
      deep-learning-containers/test/dlc_tests/benchmark/sagemaker/<framework-name>/<image-type>/resources/
      

Note: SageMaker does not support tensorflow_inference py2 images.
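
Before launching local tests, the small sketch below (purely illustrative, not part of the test suite) can sanity-check the environment variables described in step 2 above:

    import os
    import sys

    # Verify the variables exported in step 2 before invoking pytest.
    required = ["DLC_IMAGES", "PYTHONPATH", "CODEBUILD_RESOLVED_SOURCE_VERSION"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        sys.exit(f"Missing environment variables: {', '.join(missing)}")

    # Each DLC_IMAGES entry should be a full ECR URI with a tag, e.g.
    # <account-id>.dkr.ecr.us-west-2.amazonaws.com/pr-mxnet-training:training-gpu-py3
    for uri in os.environ["DLC_IMAGES"].split():
        if ".dkr.ecr." not in uri or ":" not in uri.rsplit("/", 1)[-1]:
            print(f"Warning: {uri} does not look like a tagged ECR image URI")

    if not os.environ["PYTHONPATH"].rstrip("/").endswith("/src"):
        print("Warning: PYTHONPATH is usually the absolute path to the repo's src/ folder")

    print("Environment looks ready; run pytest from test/dlc_tests as described above.")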

deep-learning-containers's People

Contributors

arjkesh, aws-vrnatham, catalin-manciu-aws, dkey-amazon, gradientsky, hballuru, jeet4320, jingyahuang, junpuf, kace, kenny-ezirim, kevinyang8, mabunday, nskool, ohadkatz, philschmid, qingzi-lan, radhikab-97, saimidu, sallyseok, satish615, sergtogul, shantanutrip, shiboxing, sirutbuasai, tejaschumbalkar, tosterberg, tusharkanekidey, ydaiming, yystreet


deep-learning-containers's Issues

[bug] can't build image mxnet-inference:1.7.0-cpu

Hi,
not sure if this is the correct GitHub repository to ask this:

I am following the readme to create a custom image from mxnet-inference:1.7.0-cpu.
My Dockerfile looks like this:

# Take the base mxnet container
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu

# Add your custom stack of code
RUN  pip install  EXTRA_LIBRARIES

While building the image I run into the following:

 => ERROR [internal] load metadata for 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu                                                                                                                     0.3s
 => ERROR [1/3] FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu                                                                                                                                       0.1s
 => => resolve 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu                                                                                                                                             0.1s
------
 > [internal] load metadata for 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu:
------
------
 > [1/3] FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to load cache key: 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-cpu not found

Is there something special with that image? I can successfully build a custom image from the 1.6.0 version!
Thank you

Checklist

Concise Description:

DLC image/dockerfile:

Current behavior:

Expected behavior:

Additional context:

Segmentation fault (core dumped) on torch.jit.optimize_function() with Amazon Elastic Inference and PyTorch (through amazonei_pytorch_p36 conda environment)

Hi,

I am getting a Segmentation fault (core dumped) error on ubuntu 16.04 when I execute the torch.jit.optimize_function(). This happens specifically in the amazonei_pytorch_p36 conda environment which uses torch==1.3.1 and a specialised torchei package (presumably for elastic inference stuff).

Upon running the code under gdb, I get a more verbose but confusing error:

#1  0x00007fffe5fb4eeb in c10::detail::LogAPIUsageFakeReturn(std::string const&) ()
   from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so
#2  0x00007fffe899ed82 in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so
#3  0x00007fffe8b97df0 in torch::jit::Function::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
   from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so
#4  0x00007fffeb31a544 in torch::jit::runAndInsertCall(torch::jit::Function&, torch::jit::tuple_slice, pybind11::kwargs, c10::optional<c10::IValue>, std::function<torch::jit::Value* (torch::jit::Graph&, torch::jit::script::MatchedSchema const&)>) ()
   from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007fffeb31a9c2 in torch::jit::invokeScriptMethodFromPython(torch::jit::script::Method&, torch::jit::tuple_slice, pybind11::kwargs) ()
   from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007fffeb2f1b01 in void pybind11::cpp_function::initialize<torch::jit::script::initJitScriptBindings(_object*)::{lambda(pybind11::args, pybind11::kwargs)#33}, pybind11::object, pybind11::args, pybind11::kwargs, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::jit::script::initJitScriptBindings(_object*)::{lambda(pybind11::args, pybind11::kwargs)#33}&&, pybind11::object (*)(pybind11::args, pybind11::kwargs), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) ()
   from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#7  0x00007fffeaf94264 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/ubuntu/anaconda3/envs/amazonei_pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so

To Reproduce

Steps to reproduce the behavior:

  1. Use amazonei_pytorch_p36 conda environment from the Ubuntu 16.04 Deep Learning AMI EC2 image
  2. Load a pretrained ".pt" jit pytorch model and make it predict on sample data
  3. Pass tensor through torch.jit.optimize_function()

My code sample:

import numpy as np
import torch

model = torch.jit.load('traced_bert.pt', map_location=torch.device('cpu'))

from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('tokenizer/',do_lower_case=True)

# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway.
# In the original paper, the authors used a length of 512.
MAX_LEN = 256

## Import BERT tokenizer, that is used to convert our text into tokens that corresponds to BERT library
input_ids = [tokenizer.encode(sent, add_special_tokens=True,max_length=MAX_LEN,pad_to_max_length=True) for sent in sentences]

print('input_ids done')

## Create attention mask
attention_masks = []
## Create a mask of 1 for all input tokens and 0 for all padding tokens
attention_masks = [[float(i>0) for i in seq] for seq in input_ids]

print('attention_masks done')

# convert all our data into torch tensors, required data type for our model
inputs = torch.tensor(input_ids)
masks = torch.tensor(attention_masks)

print('input and mask tensors done')
print('model ready')

input_id = inputs
input_mask = masks

print('inputs ready')

with torch.no_grad():
    # Forward pass, calculate logit predictions
    with torch.jit.optimized_execution(True, {'target_device': 'eia:0'}):
        print('creating logits')
        logits = model(input_id, attention_mask=input_mask)[0]
        print('logits done')

logits = logits.to('cpu').numpy()

pred_flat = np.argmax(logits, axis=1).flatten()

Sample data:

sentences = [
    'hello my name is bob',
    'hello my name is not bob'
]

[bug] Torch doesn't see CUDA in latest Pytorch1.6-CUDA101 container

Checklist

Concise Description:
pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04 seems to have the CUDA drivers misconfigured. I'm getting False when I'm running python -c "import torch; print(torch.cuda.is_available())" inside the container.

DLC image/dockerfile:
At least pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04, but can be other containers too.

Current behavior:
Torch doesn't see CUDA installed/configured. python -c "import torch; print(torch.cuda.is_available())" returns False

Expected behavior:
python -c "import torch; print(torch.cuda.is_available())" returns True

Additional context:

Sample dockerfile:

FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04
RUN python -c "import torch; print(torch.cuda.is_available())"
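
A slightly more detailed diagnostic (a minimal sketch, not part of the original report) can help separate an image problem from a host NVIDIA runtime problem when run inside the container, e.g. via docker run --gpus all:

import os
import shutil

import torch

# All of these are standard PyTorch/stdlib calls.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("visible GPU count:", torch.cuda.device_count())
print("NVIDIA_VISIBLE_DEVICES:", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
print("nvidia-smi on PATH:", shutil.which("nvidia-smi"))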

[bug] nvidia-smi not available on GPU inference images

Concise Description:
When building a custom container from the Pytorch-Inference 1.5.1 container image we utilize the Python GPUtil library (https://github.com/anderskm/gputil) to get information about the currently available GPUs. That library has an internal dependency on nvidia-smi. I was surprised to see that these images have CUDA installed but don't have nvidia-smi available.

$ echo $PATH
/opt/amazon/bin:/opt/amazon/sbin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

I see that /usr/local/nvidia/bin is included on the path but that folder doesn't actually exist

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.5.1-gpu-py36-cu101-ubuntu16.04

Current behavior:
root@dev-dsk-gldlb-2b-b0f9af72:/opt/amazon/bin# nvidia-smi
bash: nvidia-smi: command not found

[bug] ECR Get Login logs into the wrong account when run locally

Checklist

Concise Description:
In connection.run(f"$(aws ecr get-login --no-include-email --region {region})", hide=True), the aws ecr get-login command logs into the ECR registry owned by the AWS account ID running the test script, and not the ECR registry that contains the Docker image being tested.

This works in automated PRs on CI and CD, but local attempts to run tests on images held in other accounts/regions will cause the test to fail with a "no basic auth credentials" error.

DLC image/dockerfile:
N/A

Current behavior:
Docker Login credentials created are for registry owned by EC2 instance owner

Expected behavior:
Docker Login credentials created should be for registry owned by ECR docker image owner

Additional context:

[bug] skip_cpu pytest marker fails on local test

Checklist

Concise Description:

Existing pytest marker skip_cpu doesn't handle the case of testing locally.

DLC image/dockerfile:
Any DLC image
Current behavior:

@pytest.mark.skip_cpu
def test_of_your_choice():
    ...

Running this test locally

pytest -s -v -rP integration/local/test_<>.py::test_of_your_choice --docker-base-name <>.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training --tag <>

Result

SKIPPED

Expected behavior:
Instead of skipping the test altogether, run it locally [if the instance has a GPU]

Additional context:
If the instance doesn't have GPU support, then skipping is justified. But currently, skip_cpu ends up skipping the test even on a GPU-based instance.

As a result, for debugging locally, one has to omit or comment out the pytest marker

[bug] Horovod does not work with PyTorch

This issue is a follow-up to aws/sagemaker-pytorch-training-toolkit#217. Most of the work for addressing this issue has already been implemented in this draft PR.

Checklist

Concise Description

Horovod does not work with the PyTorch images.

DLC image/dockerfile:

The PyTorch GPU training Dockerfiles are most relevant to this issue.

The CPU Dockerfiles could also be updated. But, as far as I know, people don't usually use Horovod with CPU nodes.

Current behavior:

I ran into the errors while working with PyTorch 1.4.0 CUDA 10.1 Python 3 image:

763104351884.dkr.ecr.$region.amazonaws.com/pytorch-training:1.4.0-gpu-py3

In particular, I was running the tests in the PyTorch training toolkit (which used this Dockerfile). I haven't reproduced the errors with any of the other framework versions, but I assume the issues would arise with them as well.

Error 1: mpi4py is not installed
Using the PyTorch training image, run the Horovod integration tests found in the PyTorch training toolkit repository. You should receive the following error:

[1,0]<stderr>:/opt/conda/bin/python: No module named mpi4py
[1,1]<stderr>:/opt/conda/bin/python: No module named mpi4py

The error can be resolved by adding the following line to the Dockerfile:

RUN pip3 install mpi4py==3.0.3

Error 2: SSH login
In addition to the preceding error, I have also received the following error:

2020-08-27 01:45:04,050 sagemaker-training-toolkit ERROR    framework error: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_pytorch_container/training.py", line 108, in main
    train(environment.Environment())
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_pytorch_container/training.py", line 62, in train
    runner_type=runner_type)
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_training/process.py", line 150, in run
    self._setup()
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_training/mpi.py", line 158, in _setup
    _start_sshd_daemon()
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_training/mpi.py", line 280, in _start_sshd_daemon
    raise RuntimeError(_SSH_DAEMON_NOT_FOUND_ERROR_MESSAGE)
RuntimeError: 
SSH daemon not found, please install SSH to allow MPI to communicate different nodes in cluster.

You can install ssh by running following commands:
-------------------------------------------------

1. Install SSH via apt-get:

apt-get update && apt-get install -y --no-install-recommends openssh-server && mkdir -p /var/run/sshd

2. SSH login fix. Otherwise user is kicked off after login:
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd

3. Create SSH key to allow password less ssh between different docker instances:
mkdir -p /root/.ssh/ && ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa &&   cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys &&   printf "Host *
  StrictHostKeyChecking no
" >> /root/.ssh/config



SSH daemon not found, please install SSH to allow MPI to communicate different nodes in cluster.

You can install ssh by running following commands:
-------------------------------------------------

1. Install SSH via apt-get:

apt-get update && apt-get install -y --no-install-recommends openssh-server && mkdir -p /var/run/sshd

2. SSH login fix. Otherwise user is kicked off after login:
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd

3. Create SSH key to allow password less ssh between different docker instances:
mkdir -p /root/.ssh/ && ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa &&   cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys &&   printf "Host *
  StrictHostKeyChecking no
" >> /root/.ssh/config

It seems that adding

RUN mkdir -p /var/run/sshd && \
 sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd

RUN rm -rf /root/.ssh/ && \
 mkdir -p /root/.ssh/ && \
 ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
 cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
 && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

to the Dockerfile resolves the error, although I'm not entirely sure if it's necessary for all of the PyTorch images. The preceding code snippet was retrieved from this MXNet DLC PR.

Error 3: Mutually exclusive MCA settings
Another issue I have received is that the MCA settings passed in by the MPIRunnerType are mutually exclusive with the MCA settings in the config. The resulting error looks something like this:

Two mutually-exclusive MCA variables were specified.  This can result
in undefined behavior, such as ignoring the components that the MCA
variables are supposed to affect.

  1st MCA variable: btl_tcp_if_include
    Source of value: environment
  2nd MCA variable: btl_tcp_if_exclude
    Source of value: file (/home/.openmpi/etc/openmpi-mca-params.conf:62)

You can get around this error by commenting out the specific line in the MPI config that specifies the MCA setting:

RUN sed -i '62,62 s/^/#/' /home/.openmpi/etc/openmpi-mca-params.conf

It's pretty hack-ish, but it fixed the issue when I was implementing the MPIRunnerType PR. There's probably a better way to fix the error.

Expected behavior

The tests should pass without errors.

Additional context

Here's what still needs to be done before merging this PR:

  • Verify that each of the three proposed fixes described above is necessary for all of the PyTorch images.
  • If some of the changes are unnecessary, remove them from the appropriate Dockerfiles in the horovod-pt branch.
  • Fix any issues with the Horovod tests in the horovod-pt branch. The tests should logically be okay, but there may be some minor syntax-esque errors.
  • Remove to-do items from the Dockerfile.dlc.gpu file in the PyTorch training toolkit repository once the PyTorch images are updated.

[bug] 'BatchGetImage permission' error when deploying a SageMaker endpoint using a TensorFlow base image

I have a pre-trained TensorFlow model, and I'm trying to use the SageMaker client.create_endpoint() to create an endpoint so that I can call the API to get predictions; the doc is here

After creating the model by using client.create_model() I have a model stored on SageMaker, and the base image I'm using is xxxxxxx.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.15.2-gpu

After running this, I'm able to create the endpoint configuration, but creating the endpoint failed with the reason:

Failure reason
The role 'arn:aws:iam::xxxxxxxx:role/test-role' does not have BatchGetImage permission for the image: 'xxxx.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.15.2-gpu'.

In the policy of this role, I have:

{
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage"
            ],
            "Resource": [
                "arn:aws:ecr:us-east-1:xxxx:repository/*sagemaker*"
            ]
        },

        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchDeleteImage",
                "ecr:UploadLayerPart",
                "ecr:DeleteRepository",
                "ecr:PutImage",
                "ecr:SetRepositoryPolicy",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:DeleteRepositoryPolicy",
                "ecr:InitiateLayerUpload"
            ],
            "Resource": [
                "arn:aws:ecr:*:*:repository/*sagemaker*"
            ]
        }
....

Am I missing anything in the policy? Might someone be able to help please? Many thanks.

[bug] Signals don't get propagated to the training script

Checklist

Concise Description:
Images (I've tested the PyTorch one) aren't built as suggested here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html.
Specifically, the entrypoint is set to a bash script instead of directly to the Python code, hence the Python code isn't running as PID 1 and signals don't get propagated.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3

Current behavior:
A handler registered for SIGTERM using "signal.signal(sigName, handler)" in a training script doesn't get called, e.g. when setting max_run to 60 and waiting long enough.
Also, running "ps -elf" via subprocess.run("ps -elf", shell=True) from a training script shows the below:
4 S root 1 0 0 80 0 - 4941 - 07:57 ? 00:00:00 bash -m start_with_right_hostname.sh train
4 S root 15 1 2 80 0 - 56741 - 07:57 ? 00:00:00 /opt/conda/bin/python /opt/conda/bin/train
4 S root 26 15 0 80 0 - 7630 - 07:57 ? 00:00:00 /opt/conda/bin/python shell_launcher.py --SSM_CMD_LINE ps -elf

0 S root 27 26 0 80 0 - 1641 - 07:57 ? 00:00:00 /bin/sh -c ps -elf
0 R root 28 27 0 80 0 - 9041 - 07:57 ? 00:00:00 ps -elf
As you can see, the Python process isn't PID 1

Expected behavior:
Signals should get propagated and the Python script should be PID 1 (unless signals can be propagated some other way)

Additional context:
The script below, when run as a training script with a small value of max_run, or when the training job is stopped from the console, demonstrates the problem.

import signal
import sys
import time

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Waiting for a signal...")
while True:
    time.sleep(1)

[feature-request] Python 3.8 containers

Checklist

Concise Description:
Hi,

Python 3.8 has been out for 9 months. Is there any plan to support Python 3.8 in the Tensorflow containers?

Is your feature request related to a problem? Please describe.
Our company has upgraded to Python 3.8, so we'd like to have the same Python version in the Sagemaker containers.

Describe the solution you'd like
Have official Tensorflow 2.2.0 + py38 containers for CPU and GPU.

Describe alternatives you've considered
We are sending our own Python runtime and libraries into the current py37 containers so we don't run into version mismatch problem. However, as a result, every training job requires a lot more data being uploaded to AWS, which significantly slowed down our development processes.

[feature-request] GovCloud Support

Checklist

  • I've prepended issue tag with type of change: [feature]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented the tests I've run on the DLC image

Concise Description:

Many prebuilt containers are available on GovCloud and this one should be too.

Available on GovCloud 😀

  • "blazingtext"
  • "factorization-machines"
  • "forecasting-deepar"
  • "image-classification"
  • "ipinsights"
  • "kmeans"
  • "knn"
  • "lda"
  • "linear-learner"
  • "ntm"
  • "object-detection"
  • "object2vec"
  • "pca"
  • "randomcutforest"
  • "sagemaker-scikit-learn"
  • "sagemaker-sparkml-serving"
  • "sagemaker-xgboost"
  • "semantic-segmentation"
  • "seq2seq"

NOT Available on GovCloud 😢

  • "mxnet-inference-eia"
  • "mxnet-inference"
  • "mxnet-training"
  • "pytorch-inference-eia"
  • "pytorch-inference"
  • "pytorch-training"
  • "sagemaker-tensorflow-serving"
  • "tensorflow-inference-eia"
  • "tensorflow-inference"
  • "tensorflow-training"

DLC image/dockerfile:
Any relating to mxnet-inference-eia, mxnet-inference, mxnet-training, pytorch-inference-eia, pytorch-inference, pytorch-training, tensorflow-inference-eia, tensorflow-inference, and tensorflow-training.

Is your feature request related to a problem? Please describe.
Yes. These containers cannot be used in the GovCloud partition.

Describe the solution you'd like
Availability within the GovCloud partition.

Describe alternatives you've considered
None

Additional context
We would like the Terraform AWS provider to be able to provide information about these containers to users using GovCloud: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/sagemaker_prebuilt_ecr_image

[feature-request] update fastai package to v2

The fastai library has recently had a big overhaul with a major release number v2 (see https://pypi.org/project/fastai/#history & https://www.fast.ai/2020/08/21/fastai2-launch/). Would it be possible to update the PyTorch 1.6 container with the 2.0.x version of fastai?

Checklist

Concise Description:

DLC image/dockerfile:

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[Documentation] Document Sagemaker Performance tests reproducibility locally

  1. Make the SM Perf tests reproducible locally
  2. Document the steps

@saimidu

Concise Description:

DLC image/dockerfile:

Is your feature request related to a problem? Please describe.
Tough to reproduce SM Perf tests locally [requires non-trivial hacks and access]

Describe the solution you'd like
make the code locally reproducible

Describe alternatives you've considered
None

Additional context
Identified as part of #444

Add libsnappy-dev / python-snappy

Snappy is the default compression method used for Parquet files in Pandas.

With the image tensorflow-training:2.3.1-cpu-py37-ubuntu18.04, adding the Python package python-snappy in requirements.txt dependencies results in the following error during the pip install executed by a SageMaker training job:

 creating build/temp.linux-x86_64-3.7/snappy
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/usr/local/include/python3.7m -c snappy/snappymodule.cc -o build/temp.linux-x86_64-3.7/snappy/snappymodule.o
  snappy/snappymodule.cc:31:10: fatal error: snappy-c.h: No such file or directory
   #include <snappy-c.h>
            ^~~~~~~~~~~~
  compilation terminated.
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for python-snappy

libsnappy-dev is the library needed to get rid of the error: apt-get install libsnappy-dev

Note: I haven't tested other images but they would probably equally benefit from Snappy support

[bug] Sagemaker Remote Test reporting issues

Checklist

Concise Description:
SM Remote Test log doesn't get reported correctly.

Observed in 2 commits of the PR: #444

DLC image/dockerfile:
MX 1.6 DLC

Current behavior:
Github shows "pending" status.
CodeBuild logs show "Failed" status.
However, the actual CodeBuild logs don't contain a failure message; they terminate abruptly.


============================= test session starts ==============================
platform linux -- Python 3.8.0, pytest-5.3.5, py-1.9.0, pluggy-0.13.1
rootdir: /codebuild/output/src687836801/src/github.com/aws/deep-learning-containers/test/dlc_tests
plugins: rerunfailures-9.0, forked-1.3.0, xdist-1.31.0, timeout-1.4.2
gw0 I / gw1 I / gw2 I / gw3 I / gw4 I / gw5 I / gw6 I / gw7 I
gw0 [3] / gw1 [3] / gw2 [3] / gw3 [3] / gw4 [3] / gw5 [3] / gw6 [3] / gw7 [3]

SM-Cloudwatch log
Navigating to the appropriate SM training log shows that the job ran for 2 hours and ended successfully. It says:
mx-tr-bench-gpu-4-node-py3-867d394-2020-09-11-21-28-30/algo-1-1599859900

2020-09-11 23:31:37,755 sagemaker-training-toolkit INFO     Reporting training SUCCESS

Expected behavior:

  1. PR commit status should say Failed if CodeBuild log says Failed
  2. CodeBuild log should not abruptly hang. It should print out the error. Currently it just terminates after printing some logs post session start.

Additional context:

[bug] eksctl version upgrade v0.36

Concise Description:

An eksctl v0.35 issue causes the nodes to not join the cluster.

The fix is included in v0.36.0, which is in a pre-release state.

TODO: Upgrade eksctl version

[bug] Horovod installations in all framework Dockerfiles are not framework-specific

Checklist

Concise Description:
Question: Why do we not use

HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]

instead of

&& HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-10.1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod==${HOROVOD_VERSION} \

or

HOROVOD_WITH_MXNET=1 pip install horovod[mxnet]

instead of

HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir \

or

HOROVOD_WITH_TENSORFLOW=1 pip install horovod[tensorflow]

instead of

&& HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 ${PIP} install --no-cache-dir horovod==0.19.5 \

because, according to https://horovod.readthedocs.io/en/stable/install_include.html, the [] extra is required to force a particular framework, right?
DLC image/dockerfile:

Current behavior:

Expected behavior:

Additional context:

[pending-change] Unskip ECR Scan Test for TF 2.4.1 DLCs

Checklist

  • I've prepended issue tag with type of change: [feature]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the test files this relates to

Concise Description:

Unskip the ECR Scan test for the TF 2.4.1 DLC images after they were disabled due to UnsupportedImageError: The operating system and/or package manager are not supported. errors that need further exploration to resolve.

DLC image/dockerfile:

TensorFlow 2.4.1 Training CPU
TensorFlow 2.4.1 Training GPU CUDA 11.0
TensorFlow 2.4.1 Training GPU CUDA 11.0 Example
TensorFlow 2.4.1 Inference CPU
TensorFlow 2.4.1 Inference GPU CUDA 11.0

Additional context

PR with original change: #880

[Documentation] PyTorch Custom Images

Checklist

  • I've prepended issue tag with type of change: [feature]

Concise Description:
Add information on creating custom PyTorch images.

Describe the solution you'd like
Documentation outlining the steps needed to create custom PyTorch images.

[bug] sqlite support is missing from compiled Python

Checklist

  • [ x ] I've prepended issue tag with type of change: [bug]
  • (If applicable) I've attached the script to reproduce the bug
  • [ x ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • [ x ] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description:
sqlite support is a quasi-standard part of Python, with many packages relying on it, like pytest, pytest-cov, etc.
For this reason Python is usually compiled with sqlite support. Unfortunately that is NOT the case for the images in this repo, see this example:
https://github.com/aws/deep-learning-containers/blob/master/tensorflow/inference/docker/2.3/py3/Dockerfile.cpu#L72

See similar complaints about other times when people forgot to include sqlite support in Python compiled from source:
https://stackoverflow.com/questions/1210664/no-module-named-sqlite3
The solution is to install the libsqlite3-dev package first and then compile Python with sqlite support (with the --enable-loadable-sqlite-extensions flag).

Also see the same in the official Python docker images:
https://github.com/docker-library/python/blob/01b773accc5a2ccb7a4f0d83ec6eb195fe3be655/3.7/buster/Dockerfile#L43

DLC image/dockerfile:
tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04-v1.7

Current behavior:
ModuleNotFoundError: No module named '_sqlite3'

Expected behavior:
Python should be compiled with sqlite support.

Additional context:
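
A quick check (a minimal sketch, not part of the original report) to confirm whether a given image's interpreter was built with sqlite support:

# Run inside the container, e.g.: docker run --rm <image> python /path/to/this_snippet.py
try:
    import sqlite3
    print("sqlite3 available; SQLite version:", sqlite3.sqlite_version)
except ModuleNotFoundError as err:
    print("sqlite3 missing from this Python build:", err)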

Ensemble of models doesn't work correctly with Pytorch Elastic Inference container

Issue: In the case of an ensemble of models with similar architecture, all models produce the exact same output on the forward pass even though they have different numbers of output parameters

Linking similar issue Elastic Inference mixes up models and gives incorrect results in pytorch

I am using the container pytorch-inference-eia:1.3.1-cpu-py36-ubuntu16.04 for inference.

And I have an ensemble of 3 models and am doing a forward pass on each of them in my code.
Each model is a fine-tuned DistilBERT model and has different number of output parameters.

When I run this container without EI, everything works fine. All 3 models produce different outputs of different shapes.

But when I use EI i.e. add {'target_device': 'eia:0'} to torch.jit.optimized_execution, then all the models produce the same output as the first one that gets executed. Each of the models is used in its own separate torch.jit.optimized_execution context.

My model_fn and predict_fn functions look something like this:

def model_fn(model_dir):
    # load 3 models
    models = [model1, model2, model3]
    return models

def predict_fn(input_object, models):
    input_ids = input_object["input_ids"]
    attention_masks = input_object["attention_mask"]
   
    outputs = []
    for i, model in enumerate(models): # the models object is a list of 3 models 
        with torch.no_grad():
            if EI_ENABLED:
                with torch.jit.optimized_execution(True, {"target_device": "eia:0"}): 
                    print("Using EI")
                    output = model(input_ids, attention_masks)[0]
            else:
                with torch.jit.optimized_execution(True):
                    output = model(input_ids, attention_masks)[0]
        
        print(f"Output of {i}th model = {output}")
        print(f"Shape of output of {i}th model = {output.shape}")
        outputs.append(output)
    # remaining ensemble code...

Support default models on Multi-Model endpoints.

Checklist

DLC image/dockerfile: tensorflow 2.2.0

Hello,

A common scenario I've encountered is the need to combine different models to get a final result. There are two different ways to architect a solution like this:

  1. Deploying each model to a different endpoint
  2. Deploying all models to the same endpoint and write custom inference code to combine them.

In some cases, the first approach is not possible because of the added latency when making predictions. For example, when processing billions of records, having to ping-pong between different endpoints becomes cost-prohibitive.

The second scenario, unfortunately, is not straightforward to implement today. Here is a specific example:

I have several models that will be deployed in a Multi-Model endpoint. Each model is trained for a specific animal species. To make a prediction, I need to run the input through a common model (shared across all animals) and then run the result through the specific model corresponding to the species specified in the input data.

To achieve this, I needed to have a default model that's always loaded when the container starts. This default model is the one that's used across all animal species. Then, each species model will be dynamically loaded when requested.

The current version of the inference container doesn't allow this out of the box. It either loads a default model or completely defers loading when the endpoint is marked as multi-model.

I was able to work around this issue by doing the following:

  • Packaging the model assets with the container (because for multi-model endpoints, there's no way to specify a single model.tar.gz file)
  • Subclassing ServiceManager and overwriting the start() method in the following way:
class CustomServiceManager(ServiceManager):

    def start(self):
        log.info("Starting custom services...")
        self._start_tfs()
        super().start()
  • Overwriting the serve shell script so the custom implementation of ServiceManager is called instead of the original one.

This is not a great solution, especially because I need to call _start_tfs() directly, which is supposed to be a private method.

I think we could increase the flexibility of the existing containers in one of two ways:

  1. Support the ability to have both default and on-demand models in a multi-model endpoint.
  2. Provide a mechanism by which the container can be extended in a straightforward way to allow for this scenario (without resorting to accessing private implementation details.)

The first approach is probably the most cumbersome because there will have to be multiple changes, starting with the ability to specify which model(s) should run by default and have SageMaker automatically download them when starting the endpoint.

The second approach is probably easier to implement in the short term.

I'd be delighted to help in any way with contributions here.

[bug] test_batch_transform test failure on MXNET build

Concise Description:

Test test_batch_transform fails with the below error on MXNET-1.6 DLC image.

job_type = status_key_name.replace("JobStatus", " job")
>           raise exceptions.UnexpectedStatusException(
                message="Error for {job_type} {job_name}: {status}. Reason: {reason}".format(
                    job_type=job_type, job_name=job, status=status, reason=reason
                ),
                allowed_statuses=["Completed", "Stopped"],
                actual_status=status,
            )
E           sagemaker.exceptions.UnexpectedStatusException: Error for Transform job test-mxnet-serving-batch-1599870660-6902: Failed. Reason: AlgorithmError: See job logs for more information

Checking the SM job, the error Could not find model "arn:aws:sagemaker:us-west-2:<account-id>:model/pr-mxnet-inference-2020-09-12-00-30-53-511". is seen.

Expected behavior:

  • Test test_batch_transform to pass.

[bug] incorrect imports in test scripts

Checklist

  • I've prepended issue tag with type of change: [bug]
  • (If applicable) I've attached the script to reproduce the bug
  • I've built my own container based off DLC (and I've attached the code used to build my own image)
pytest integration/local/test_mnist_training.py --region us-west-2 --docker-base-name 968277166688.dkr.ecr.us-west-2.amazonaws.com/beta-mxnet-training-bapac --tag 1.6.0-gpu-py36-cu101-ubuntu16.04-example-2020-06-11-00-32-53 --aws-id 968277166688 --instance-type p2.xlarge
============================================================================================================================= test session starts ==============================================================================================================================
platform linux -- Python 3.6.9, pytest-4.5.0, py-1.8.1, pluggy-0.11.0
rootdir: /home/ubuntu/deep-learning-containers/test/sagemaker_tests/mxnet/training
plugins: requests-mock-1.7.0, xdist-1.31.0, timeout-1.3.4, rerunfailures-8.0, forked-1.1.3, cov-2.9.0
collected 0 items / 1 errors

==================================================================================================================================== ERRORS ====================================================================================================================================
__________________________________________________________________________________________________________ ERROR collecting integration/local/test_mnist_training.py ___________________________________________________________________________________________________________
ImportError while importing test module '/home/ubuntu/deep-learning-containers/test/sagemaker_tests/mxnet/training/integration/local/test_mnist_training.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
integration/local/test_mnist_training.py:20: in <module>
    import local_mode_utils
E   ModuleNotFoundError: No module named 'local_mode_utils'
------------------------------------------------------------------------------------------------------------------------------- Captured stderr --------------------------------------------------------------------------------------------------------------------------------
WARNING:root:pandas failed to import. Analytics features will be impaired or broken.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=========================================================================================================================== 1 error in 0.10 seconds ============================================================================================================================

[bug] tensorflow not detected by pip

Checklist

Concise Description:
@see my initial issue on StackOverflow and my workaround at https://stackoverflow.com/q/63650203/4112200

I'm trying to build my custom DLC image that includes the TF2 Object Detection API using the instructions in https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html#install-the-object-detection-api . However, I noticed that running the following part causes the already installed tensorflow to be replaced.

# From within TensorFlow/models/research/
cp object_detection/packages/tf2/setup.py .
python -m pip install .

Even when using --ignore-installed. After further investigation, I can see that the package is listed in /usr/local/lib/python3.7/site-packages/tensorflow/ but isn't detected by pip when running pip list or pip freeze (it only lists the tensorflow-gpu module).

Here's my Dockerfile (I tried to remove anything not related to the bug):

ARG REPO_LOCATION=763104351884.dkr.ecr.us-east-1.amazonaws.com
FROM ${REPO_LOCATION}/tensorflow-training:2.3.0-gpu-py37-cu102-ubuntu18.04

RUN apt-get update \
  && apt-get install -y \
    unzip

RUN python -m pip install --upgrade pip

RUN mkdir /tensorflow

RUN cd /tensorflow \
  && git clone --progress --verbose https://github.com/tensorflow/models

RUN wget -O /tmp/protoc-3.13.0-linux-x86_64.zip https://github.com/protocolbuffers/protobuf/releases/download/v3.13.0/protoc-3.13.0-linux-x86_64.zip \
  && echo 'fbebe5e32db9edbb1bf7988af5fed471d22730104bc6ebd5066c5b4646a0949e49139382cae2605c7abc188ea53f73b044f688c264726009fc68cbbab6a98819  /tmp/protoc-3.13.0-linux-x86_64.zip' \
    | sha512sum -c -

RUN mkdir -p /tensorflow/protobuf \
  && unzip /tmp/protoc-3.13.0-linux-x86_64.zip -d /tensorflow/protobuf \
  && rm /tmp/protoc-3.13.0-linux-x86_64.zip

ENV PATH="/tensorflow/protobuf/bin:${PATH}"

RUN cd /tensorflow/models/research/ \
  && protoc object_detection/protos/*.proto --python_out=.

RUN pip install cython \
  && pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI

RUN pip install --upgrade protobuf

RUN cd /tensorflow/models/research/ \
  && cp object_detection/packages/tf2/setup.py . \
  && python -m pip install .

ENTRYPOINT ["python", "-c", "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"]

Steps to reproduce:

  1. Running on p2.xlarge EC2 instance (Deep Learning AMI / Amazon Linux 2)
  2. docker build --tag tmp-img:latest .
  3. docker run --rm --gpus all -it tmp-img:latest
  4. You'll see warning messages that tensorflow failed to load GPU lib files and "Skipping registering GPU devices"

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.0-gpu-py37-cu102-ubuntu18.04

Current behavior:
pip list and pip freeze do not list tensorflow as an installed package/module.

Expected behavior:
pip list and pip freeze should list tensorflow as an installed package/module.

Additional context:
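For anyone hitting the same confusion, here is a minimal diagnostic sketch (plain Python, nothing DLC-specific) that shows the mismatch between the module that actually imports and the distributions that pip's metadata knows about:

import pkg_resources
import tensorflow as tf

# The module that actually imports inside the container...
print("imported from:", tf.__file__, "version:", tf.__version__)

# ...versus what pip's metadata reports via pip list / pip freeze.
dists = [d.project_name for d in pkg_resources.working_set
         if "tensorflow" in d.project_name.lower()]
print("pip-visible distributions:", dists)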

[bug] test flakiness for test_eks_mxnet_gluonnlp_inference on MX-1.7 Inference image

Concise Description:

Test failure on gluonnlp_inference test [1] on EKS infrastructure. Model used: bert_sst

The test succeeds on some runs but mostly fails with the error below.

2020-09-11 05:26:26,409 [INFO ] W-9000-bert_sst-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Downloading /root/.mxnet/models/9170812217274505269/9170812217274505269_book_corpus_wiki_en_uncased-a6607397.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/book_corpus_wiki_en_uncased-a6607397.zip...
2020-09-11 05:26:35,318 [INFO ] W-9000-bert_sst-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model bert_sst loaded io_fd=ee8c47fffe8df7f4-00000007-00000000-39808ad99f2105ce-cd6798ce
2020-09-11 05:26:35,326 [INFO ] W-9000-bert_sst com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 10761

Ref:
[1]

def test_eks_mxnet_gluonnlp_inference(mxnet_inference, py3_only):

Expected behavior:

  • test_eks_mxnet_gluonnlp_inference to succeed

[bug] Only 1 CPU is available in pytorch containers after deploy

Concise Description:
In the PyTorch containers, only 1 CPU is available after deploying on SageMaker.

DLC image/dockerfile:
Image: "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.6.0-gpu-py3"
Instance: "ml.g4dn.xlarge" (number of vCPU: 4)

Current behavior:
In the CloudWatch logs, only 1 CPU is reported as available:

main org.pytorch.serve.ModelServer
Torchserve version: 0.2.1
TS Home: /opt/conda/lib/python3.6/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 3806 M
Python executable: /opt/conda/bin/python
Config file: /etc/sagemaker-ts.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Metrics address: http://127.0.0.1:8082
Model Store: /.sagemaker/ts/models

Expected behavior:
All 4 CPUs should be available

Number of CPUs: 4

Additional context:
The issue is reproducible with other instance types.
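
A minimal check that can be run inside the container (a diagnostic sketch, not a fix) to tell whether the limit comes from CPU affinity/cgroup settings or from TorchServe's own detection:

import multiprocessing
import os

# Total logical CPUs the OS reports vs. the CPUs this process is actually allowed to
# run on; cgroup or affinity restrictions show up in the second number.
print("os reports:", multiprocessing.cpu_count())
print("usable by this process:", len(os.sched_getaffinity(0)))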

[feature-request] Include AWS ml-io in training containers

Checklist

Concise Description:

AWS Labs' ml-io library is a useful tool for model training but cannot be pip-installed via requirements.txt, so it would be beneficial to include it in the base DLC training images, in particular those for PyTorch.

DLC image/dockerfile:

Particularly relevant to all PyTorch training containers, because:

  • MXNet has native RecordIO support, which helps with loading some of the standard SageMaker data formats
  • TensorFlow has PipeModeDataset through the pip-installable sagemaker tensorflow extensions, which also helps tackle some of these use cases

...but may be useful for edge cases across all training containers.

Is your feature request related to a problem? Please describe.

ml-io provides a high-performance, low-code-complexity option for reading a range of data types into training jobs that might be non-trivial to load via other tools.

For example:

  • Quickly iterating through a CSV in numpy mini-batches, or images as numpy arrays
  • Reading RecordIO files such as SageMaker Augmented Manifests into numpy arrays, or with zero-copy conversion to PyTorch tensors

...but as noted on the project's GitHub README, ml-io can't be installed via pip because of its non-Python dependencies.

It's frustrating to have difficulties loading data from SageMaker-specific formats when a library also authored by AWS could solve the problem, but to have to go through additional non-trivial steps to derive a custom container from the provided DLCs and use that custom container with the SageMaker SDK.

Describe the solution you'd like

The ml-io core library and its Python binding pre-installed on the DLC training container images, particularly the PyTorch images, to provide an easier route to loading complex or non-PyTorch-native dataset formats without resorting to inefficient Python-based solutions that would likely bottleneck training.

Describe alternatives you've considered

  • Maybe framework containers could provide some user-friendly way to install conda packages as well as pip?
  • Perhaps ml-io is not the best tool for this and there is something else you'd recommend for these kinds of tasks?

Additional context

I have a couple of PyTorch scenarios in SageMaker where I'm considering ml-io-based solutions: one using Augmented Manifests for bounding-box detection in images (SageMaker Ground Truth format), and one investigating how to efficiently train on larger-than-memory CSV files, possibly with some vectorized transformations on batches as they load.
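
For context, a pure-Python sketch of the CSV mini-batch pattern described above (the file path and column names here are made up); this is the kind of loop ml-io is meant to replace with faster readers and zero-copy tensor conversion:

import pandas as pd
import torch

# Stream a larger-than-memory CSV in fixed-size chunks and hand each chunk to PyTorch.
for chunk in pd.read_csv("train.csv", chunksize=1024):
    features = torch.as_tensor(chunk.drop(columns=["label"]).to_numpy(dtype="float32"))
    labels = torch.as_tensor(chunk["label"].to_numpy(dtype="int64"))
    # ... run the training step on (features, labels) ...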

[feature-request] Support for EAI for framework 1.6.0/1.7.0

Hi,

Do you have an ETA for supporting Elastic Inference with the 1.6.0 framework?
Also, are you going to add the Dockerfiles for the old 1.4.1 framework to this repository?
Thank you


[feature-request] [design clarification] Is it possible to have a single Docker image for both training and inference from MXNet 1.8 onwards?

Checklist

Concise Description:
Until MXNet 1.6, the same image could be used for both training and inference. Starting with MXNet 1.7, separate Docker images are recommended for training and inference as per https://github.com/aws/deep-learning-containers/blob/master/available_images.md. Based on my discussions with Sandeep Krishnamurthy within Amazon, I learned that "Inference image for MXNet 1.7 is optimized with MKL BLAS. Intel merged the MKL's implementation of BLAS operation to OneDNN which is used by default by MXNet on CPU". I'd appreciate any help with the following questions:
1. Is the reason for having separate training and inference images for MXNet 1.7 primarily the MKL BLAS optimization, or are there other reasons?
2. Moving forward (MXNet 1.8 onwards), should we expect separate images or a single image for training and inference?

DLC image/dockerfile:
MXNet 1.7 (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-gpu-py36-cu101-ubuntu16.04)

Is your feature request related to a problem? Please describe.
When I tried to upgrade our pipelines to MXNet 1.7, our SageMaker training jobs stayed "In progress" and never ended. We call sys.exit(0) at the end of the training job to signal SageMaker that the job succeeded. Although our training job ran to completion and produced all required logs, the SageMaker training job stayed in "In progress" mode forever because we used the MXNet 1.7 GPU inference image (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-gpu-py36-cu101-ubuntu16.04) for training on a CPU instance (ml.m5.4xlarge). Our pipelines are built such that we can use only one image across training and inference. We will add support for different training/inference images on our end over time, but for now we plan to use MXNet 1.6, since that image supports both training and inference and worked smoothly for us in the setting above. It would be great if there were a single image that supports both training and inference from MXNet 1.8 onwards.

Describe the solution you'd like
If possible, a single image that supports both training and inference (like MXNet 1.6).

[bug] apt-get failure in sagemaker-local-test builds

Description:

An apt-get error is seen in sagemaker-local-test builds, as shown below. This happens because another apt-get process is already running and holds the lock.

E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
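
One possible mitigation (a minimal sketch, not the repo's actual fix) is to have the local test setup retry the apt-get call until the competing dpkg/apt process releases the lock; the package name below is just a placeholder:

import subprocess
import time

for attempt in range(30):
    result = subprocess.run(["apt-get", "install", "-y", "unzip"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        break
    if "lock" in result.stderr:
        time.sleep(10)  # another apt/dpkg process is active; wait and retry
    else:
        raise RuntimeError(result.stderr)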

Versioning of PyTorch and Torchvision

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.5.1-gpu-py36-cu101-ubuntu16.04

I am trying to use the PyTorch-1.5.1 image with detectron2-0.1.3 and am having import errors.

detectron2 has documentation for the errors I'm seeing under this common installation issue:
Undefined symbols that contains TH,aten,torch,caffe2; Missing torch dynamic libraries; Segmentation fault immediately when using detectron2. (https://detectron2.readthedocs.io/tutorials/install.html#install-pre-built-detectron2-linux-only)

The documentation suggests that the error may be due to the versioning of PyTorch and Torchvision and recommends re-installing PyTorch from their website.

Detectron2's "collect_env.py" script shows the following with this image:
torchvision 0.6.0 @/opt/conda/lib/python3.6/site-packages/torchvision
PyTorch 1.5.1 @/opt/conda/lib/python3.6/site-packages/torch
detectron2 0.1.3 @/opt/amazon/lib/python3.6/site-packages/detectron2
detectron2._C failed to import

The PyTorch documentation for installing suggests that PyTorch 1.5.1 should be installed with Torchvision-0.6.1
https://pytorch.org/get-started/previous-versions/

I see that the PyTorch-1.5.1 offering here specifically installs Torchvision-0.6.0
https://github.com/aws/deep-learning-containers/blob/master/pytorch/inference/docker/1.5.1/py3/cu101/Dockerfile.gpu#L86

I took my customized container built from this image, uninstalled PyTorch, Torchvision, and detectron2, then re-installed all of them via pip with PyTorch 1.5.1, Torchvision 0.6.1, and detectron2 0.1.3, and that resolved all the import issues I was having.

Considering that the PyTorch documentation suggests installing Torchvision 0.6.1 alongside PyTorch 1.5.1, is there a particular reason that this image specifically uses Torchvision 0.6.0? Would it be possible to update this image to use Torchvision 0.6.1?
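
For reference, a quick way to check the versions that detectron2's prebuilt wheels need to match (a minimal diagnostic sketch, nothing image-specific):

import torch
import torchvision

# detectron2 prebuilt wheels must match both the torch minor version and the CUDA build,
# so these two lines are a quick sanity check before (re)installing detectron2.
print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("torchvision:", torchvision.__version__)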

[feature-request] Unify aws-id & account-id tag for sagemaker test

Checklist

Concise Description:
The TensorFlow pytest setup for DLC needs --account-id as a parameter:

if framework == "tensorflow" and job_type == "training":
    aws_id_arg = "--account-id"

while MXNet and PyTorch use --aws-id.

DLC image/dockerfile:

Is your feature request related to a problem? Please describe.
I have to switch the arg passed to pytest [for testing on my devbox]:
for PT/MX: --aws-id
for TF: --account-id

Describe the solution you'd like
Switch TF to --aws-id, since the other two frameworks' SageMaker tests use that arg.
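
One possible shape for the unification (a hypothetical conftest.py sketch, not the repo's actual code): register both spellings for the same option so either --aws-id or --account-id keeps working during a transition period.

def pytest_addoption(parser):
    # Both option strings map to the same destination, so existing invocations keep working.
    parser.addoption("--aws-id", "--account-id", dest="aws_id", default=None,
                     help="AWS account ID that hosts the DLC images under test")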

Describe alternatives you've considered
While this isn't a bug, it would be a good dev experience to have unified args.

Additional context
NA

Pytorch Version Mismatch

Hi, I am using one of the available containers in my Dockerfile like this:

FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-inference:1.4.0-gpu-py36-cu101-ubuntu16.04

and running a training job using the SageMaker Estimator.

In my training job, when I print(torch.__version__) I get 0.4.0.

Can someone help me resolve this issue?

Thanks!

[bug] dlc not found when building mxnet image locally

Checklist

  • I've prepended issue tag with type of change: [bug]
  • (If applicable) I've attached the script to reproduce the bug
  • I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description:
dlc not found
DLC image/dockerfile:

Current behavior:

$ python src/main.py --buildspec mxnet/buildspec.yml \
>                      --framework mxnet \
>                      --image_types training \
>                      --device_types gpu \
>                      --py_versions py3
Traceback (most recent call last):
  File "src/main.py", line 4, in <module>
    import utils
  File "/home/ubuntu/deep-learning-containers/src/utils.py", line 22, in <module>
    from dlc.github_handler import GitHubHandler
ModuleNotFoundError: No module named 'dlc'
Detailed Steps
git clone https://github.com/ChaiBapchya/deep-learning-containers.git
cd deep-learning-containers/
pip install -r src/requirements.txt
bash src/setup.sh mxnet
python src/main.py --buildspec mxnet/buildspec.yml --framework mxnet --image_types training --device_types gpu --py_versions py3
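
A quick way to confirm whether the dlc package that src/utils.py imports is visible to the interpreter being used (a minimal diagnostic sketch, run from the repo root):

import importlib.util

# None here means the dlc package is not on sys.path for this interpreter, for example
# because setup ran against a different Python environment (e.g. outside the venv).
print(importlib.util.find_spec("dlc"))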

[feature-request] add tensorflow-datasets package

Checklist

Concise Description:
https://pypi.org/project/tensorflow-datasets/
This datasets package is really helpful to have, and we'll use it in example notebooks.

DLC image/dockerfile:
Any future TF release, and add it to patch releases of existing versions.
https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.3.0/py3/Dockerfile.cpu

[feature-request] Support for multiple CUDA versions in buildspec.yml file

Checklist

Concise Description:
The current buildspec design supports only one CUDA version. Can we get a provision for handling multiple CUDA versions so that framework teams have the option of creating one PR and releasing everything in it?
Now that TF 2.4 is coming up, it would be good to have this feature.
Tagging @saimidu as he knows some of the context here.

DLC image/dockerfile:


[bug] codebuild issue

Checklist

  • I've prepended issue tag with type of change: [bug]
  • (If applicable) I've attached the script to reproduce the bug
  • I've built my own container based off DLC (and I've attached the code used to build my own image)
python src/main.py --buildspec mxnet/buildspec.yml --framework mxnet --image_types training --device_types gpu --py_versions py3

Additional context:

=================================================================Uploading Metrics================================================================================================================================
[INFO] Found credentials in environment variables.
================================================================================================================================================================================================================================================================================

Traceback (most recent call last):
  File "src/main.py", line 45, in <module>
    image_builder(args.buildspec)
  File "/home/ubuntu/deep-learning-containers/src/image_builder.py", line 202, in image_builder
    test_trigger_job = utils.get_codebuild_project_name()
  File "/home/ubuntu/deep-learning-containers/src/utils.py", line 420, in get_codebuild_project_name
    return os.getenv("CODEBUILD_BUILD_ID").split(":")[0]
AttributeError: 'NoneType' object has no attribute 'split'
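
A defensive variant of the helper named in the traceback could avoid the crash on local runs (a minimal sketch under the assumption that falling back to a placeholder name is acceptable; it is not the repo's actual fix):

import os

def get_codebuild_project_name():
    # CODEBUILD_BUILD_ID is only set inside AWS CodeBuild; fall back to a placeholder
    # when running src/main.py on a local devbox.
    build_id = os.getenv("CODEBUILD_BUILD_ID")
    return build_id.split(":")[0] if build_id else "local-build"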

[feature] Mechanism for skipping tests for specific versions of a framework

Checklist

Concise Description: Please introduce something like a helper function to skip certain tests for framework versions where the functionality being tested does not exist.

DLC image/dockerfile: I am using the PyTorch 1.5.1 DLCs provided in the link above.

Is your feature request related to a problem? Please describe.
Certain tests will always lead to failures in a PR. For example, native AMP was introduced in PyTorch 1.6. Because of this, for a PR introducing changes to PyTorch 1.5.1, the EC2 test results are not accurate, as can be seen in PR #650.

Describe the solution you'd like
There seems to be no intuitive way to skip tests for specific version numbers of a framework. A helper function or flag that disables tests depending on the framework version (e.g. lower/higher than 1.6) would be helpful.
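
As an illustration, a hypothetical helper (not an existing utility in this repo) built on pytest.mark.skipif and packaging's version parsing:

import pytest
from packaging.version import Version

def requires_framework_version(actual, minimum):
    # Skip the decorated test when the image's framework version predates the feature,
    # e.g. @requires_framework_version(framework_version, "1.6.0") for native AMP tests.
    return pytest.mark.skipif(
        Version(actual) < Version(minimum),
        reason=f"feature requires framework >= {minimum}, image has {actual}",
    )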


[feature-request] publish all images with historical tags to public ECR (https://gallery.ecr.aws/)

Checklist

Concise Description:

Pulling the base deep learning images from ECR requires AWS authentication, and different accounts are used for different regions/partitions. It's really annoying to create a CloudFormation CfnMapping to look up the container for the corresponding region when building an ML solution that spans AWS regions/partitions.

Since re:Invent 2020, AWS offers the public ECR with an anonymous pull quota. Please also push these deep learning images to the public ECR (gallery.ecr.aws) to improve the user experience.


[bug] putmetricdata max retries

Command

python src/main.py --buildspec tensorflow/buildspec.yml --framework tensorflow

The Docker build finishes successfully. However, metric data isn't pushed to CloudWatch due to the following error:

Stacktrace

Traceback (most recent call last):
  File "/home/ubuntu/PRIVATE-deep-learning-containers/src/metrics.py", line 29, in push
    Namespace=self.namespace,
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InternalFailure) when calling the PutMetricData operation (reached max retries: 4): Unknown
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ubuntu/PRIVATE-deep-learning-containers/src/image_builder.py", line 221, in image_builder
    metrics.push_image_metrics(image)
  File "/home/ubuntu/PRIVATE-deep-learning-containers/src/metrics.py", line 49, in push_image_metrics
    self.push("build_time", "Seconds", build_time, info)
  File "/home/ubuntu/PRIVATE-deep-learning-containers/src/metrics.py", line 32, in push
    raise Exception(str(e))
Exception: An error occurred (InternalFailure) when calling the PutMetricData operation (reached max retries: 4): Unknown
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "src/main.py", line 59, in <module>
    main()
  File "src/main.py", line 55, in main
    image_builder(args.buildspec)
  File "/home/ubuntu/PRIVATE-deep-learning-containers/src/image_builder.py", line 226, in image_builder
    raise Exception(f"Build passed. {e}")
Exception: Build passed. An error occurred (InternalFailure) when calling the PutMetricData operation (reached max retries: 4): Unknown
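
One possible mitigation (a sketch based on an assumption, not the repo's actual fix) is to give the CloudWatch client a larger retry budget so transient InternalFailure responses from PutMetricData are retried more times before the build tooling gives up; this assumes a botocore version new enough to support the standard retry mode:

import boto3
from botocore.config import Config

# Raise the retry budget for the CloudWatch client used to push build metrics.
cloudwatch = boto3.client(
    "cloudwatch",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)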
