sagemaker-python-sdk's Introduction

SageMaker Python SDK

SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker.

With the SDK, you can train and deploy models using popular deep learning frameworks such as Apache MXNet and TensorFlow. You can also train and deploy models with Amazon algorithms, which are scalable implementations of core machine learning algorithms optimized for SageMaker and GPU training. If you have your own algorithms built into SageMaker-compatible Docker containers, you can train and host models using these as well.

For detailed documentation, including the API reference, see Read the Docs.

Table of Contents

  1. Installing SageMaker Python SDK
  2. Using the SageMaker Python SDK
  3. Using MXNet
  4. Using TensorFlow
  5. Using Chainer
  6. Using PyTorch
  7. Using Scikit-learn
  8. Using XGBoost
  9. SageMaker Reinforcement Learning Estimators
  10. SageMaker SparkML Serving
  11. Amazon SageMaker Built-in Algorithm Estimators
  12. Using SageMaker AlgorithmEstimators
  13. Consuming SageMaker Model Packages
  14. BYO Docker Containers with SageMaker Estimators
  15. SageMaker Automatic Model Tuning
  16. SageMaker Batch Transform
  17. Secure Training and Inference with VPC
  18. BYO Model
  19. Inference Pipelines
  20. Amazon SageMaker Operators in Apache Airflow
  21. SageMaker Autopilot
  22. Model Monitoring
  23. SageMaker Debugger
  24. SageMaker Processing

Installing the SageMaker Python SDK

The SageMaker Python SDK is published to PyPI, and the latest version can be installed with pip as follows:

pip install sagemaker==<latest version from PyPI at https://pypi.org/project/sagemaker/>

You can install from source by cloning this repository and running a pip install command in the root directory of the repository:

git clone https://github.com/aws/sagemaker-python-sdk.git
cd sagemaker-python-sdk
pip install .

Supported Operating Systems

SageMaker Python SDK supports Unix/Linux and macOS.

Supported Python Versions

SageMaker Python SDK is tested on:

  • Python 3.8
  • Python 3.9
  • Python 3.10
  • Python 3.11

AWS Permissions

As a managed service, Amazon SageMaker performs operations on your behalf on the AWS hardware that is managed by Amazon SageMaker. Amazon SageMaker can perform only operations that the user permits. You can read more about which permissions are necessary in the AWS Documentation.

The SageMaker Python SDK should not require any additional permissions aside from what is required for using SageMaker. However, if you are using an IAM role with a path in it, you should grant permission for iam:GetRole.

Licensing

SageMaker Python SDK is licensed under the Apache 2.0 License. It is copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at: http://aws.amazon.com/apache2.0/

Running tests

SageMaker Python SDK has unit tests and integration tests.

You can install the libraries needed to run the tests by running pip install --upgrade .[test] or, for Zsh users: pip install --upgrade .\[test\]

Unit tests

We run unit tests with tox, a program that lets you run unit tests for multiple Python versions and also makes sure the code fits our style guidelines. We run tox with all of our supported Python versions, so to run unit tests with the same configuration we do, you need to have interpreters for those Python versions installed.

To run the unit tests with tox, run:

tox tests/unit

Integration tests

To run the integration tests, the following prerequisites must be met:

  1. AWS account credentials are available in the environment for the boto3 client to use.
  2. The AWS account has an IAM role named SageMakerRole. It should have the AmazonSageMakerFullAccess policy attached, as well as a policy with the necessary permissions to use Elastic Inference.
  3. To run the remote_function tests, a dummy ECR repo must exist. It can be created by running: aws ecr create-repository --repository-name remote-function-dummy-container

We recommend selectively running just those integration tests you'd like to run. You can filter by individual test function names with:

tox -- -k 'test_i_care_about'

You can also run all of the integration tests with the following command, which runs them in sequence and may take a while:

tox -- tests/integ

You can also run them in parallel:

tox -- -n auto tests/integ

Git Hooks

To enable all Git hooks in the .githooks directory, run these commands in the repository directory:

find .git/hooks -type l -exec rm {} \;
find .githooks -type f -exec ln -sf ../../{} .git/hooks/ \;

To enable an individual git hook, simply move it from the .githooks/ directory to the .git/hooks/ directory.

Building Sphinx docs

Setup a Python environment, and install the dependencies listed in doc/requirements.txt:

# conda
conda create -n sagemaker python=3.7
conda activate sagemaker
conda install sphinx=3.1.1 sphinx_rtd_theme=0.5.0

# pip
pip install -r doc/requirements.txt

Clone/fork the repo, and install your local version:

pip install --upgrade .

Then cd into the sagemaker-python-sdk/doc directory and run:

make html

You can edit the templates for any of the pages in the docs by editing the .rst files in the doc directory and then running make html again.

Preview the site with a Python web server:

cd _build/html
python -m http.server 8000

View the website by visiting http://localhost:8000

SageMaker SparkML Serving

With SageMaker SparkML Serving, you can perform predictions against a SparkML model in SageMaker. In order to host a SparkML model in SageMaker, it must be serialized with the MLeap library.

For more information on MLeap, see https://github.com/combust/mleap .

Supported major version of Spark: 3.3 (MLeap version - 0.20.0)

Here is an example of how to create an instance of the SparkMLModel class and use the deploy() method to create an endpoint that can be used to perform predictions against your trained SparkML model.

from sagemaker.sparkml.model import SparkMLModel

sparkml_model = SparkMLModel(model_data='s3://path/to/model.tar.gz', env={'SAGEMAKER_SPARKML_SCHEMA': schema})
model_name = 'sparkml-model'
endpoint_name = 'sparkml-endpoint'
predictor = sparkml_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
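
The schema variable in the snippet above is left undefined. As a rough sketch, SAGEMAKER_SPARKML_SCHEMA takes a JSON document describing the input columns and the output column; the field names below are illustrative only:

import json

schema = json.dumps({
    'input': [
        {'name': 'field_1', 'type': 'double'},
        {'name': 'field_2', 'type': 'double'},
        {'name': 'field_3', 'type': 'double'},
        {'name': 'field_4', 'type': 'double'},
        {'name': 'field_5', 'type': 'double'},
    ],
    'output': {'name': 'prediction', 'type': 'double'},
})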

Once the model is deployed, we can invoke the endpoint with a CSV payload like this:

payload = 'field_1,field_2,field_3,field_4,field_5'
predictor.predict(payload)

For more information about the different content-type and Accept formats as well as the structure of the schema that SageMaker SparkML Serving recognizes, please see SageMaker SparkML Serving Container.

sagemaker-python-sdk's Issues

InternalServerError: We encountered an internal error. Please try again.

My jobs keep failing when trying to run a custom Grid Search over hyperparameters for a custom model, built in Scikit-learn (Sklearn).

I keep getting an Internal Server Error when trying to train my own Sklearn model using SageMaker. I am running the Scikit-learn grid search on an ml.m4.10xlarge and it keeps failing and throwing this very non-descript error. I am running the grid search in one of SageMaker's hosted notebooks. Any help would be greatly appreciated, as this is holding me up right now.

ValueError                                Traceback (most recent call last)
<ipython-input-4-11313d0f3d2c> in <module>()
      9                        sagemaker_session=sess)
     10 
---> 11 tree.fit("s3://tickr-machine-learning-data/financial_data")

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    152         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    153         if wait:
--> 154             self.latest_training_job.wait(logs=logs)
    155         else:
    156             raise NotImplemented('Asynchronous fit not available')

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
    321     def wait(self, logs=True):
    322         if logs:
--> 323             self.sagemaker_session.logs_for_job(self.job_name, wait=True)
    324         else:
    325             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
    645 
    646         if wait:
--> 647             self._check_job_status(job_name, description)
    648             if dot:
    649                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc)
    388         if status != 'Completed':
    389             reason = desc.get('FailureReason', '(No reason provided)')
--> 390             raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
    391 
    392     def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training financial-model-train-njobs-4-predispat-2018-02-05-17-08-19-428: Failed Reason: InternalServerError: We encountered an internal error. Please try again.

Asynchronous fit

My first use for SageMaker involved async submission of KMeans jobs. EstimatorBase.fit raises NotImplemented('Asynchronous fit not available') if the wait parameter is set to False.

Is fit() necessarily synchronous, or is it just not implemented yet?
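
For context: in later versions of the SDK, fit(wait=False) returns without blocking, and the job can then be polled directly with boto3. A minimal sketch, assuming the training job name is known (the name below is hypothetical):

import boto3

sm = boto3.client('sagemaker')
desc = sm.describe_training_job(TrainingJobName='kmeans-2018-01-01-00-00-00-000')  # hypothetical job name
print(desc['TrainingJobStatus'])  # InProgress, Completed, Failed, etc.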

creating estimator with role name fails if role is default sagemaker service role

If you pass a role name (instead of the full ARN) to the estimator constructor, and you are using the default role created by SageMaker, the training job will fail with an "unable to assume the role" ClientError:

sagemaker-mxnet-py2-cpu-2018-01-14-15-35-12-859: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::<ACCOUNT>:role/AmazonSageMaker-ExecutionRole-20180114T073202'

The error message shows that the SDK converted the role name to an ARN without the 'service-role' prefix. Running the same job with the corrected full ARN works:

arn:aws:iam::<ACCOUNT>:role/service-role/AmazonSageMaker-ExecutionRole-20180114T075260

I encountered this error starting a training job, but I assume it would impact endpoint creation in the same way.
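
A minimal sketch of the workaround, passing the full ARN including the service-role path to an era-appropriate estimator (the account ID, role name, and entry point below are placeholders):

from sagemaker.mxnet import MXNet

estimator = MXNet(
    entry_point='train.py',
    # Full ARN, including the 'service-role' path, instead of the bare role name:
    role='arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20180114T075260',
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
)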

Invoke TensorFlow Endpoint

Hi,

I have used the sample code for the TensorFlow model that predicts MNIST digits, and I have hosted this model behind an endpoint using the sample code.
mnist_predictor = mnist_estimator.deploy(initial_instance_count=1,endpoint_name="tensor-endpoint", instance_type='ml.m4.xlarge')

When I try to invoke the endpoint from outside AWS using a boto3 client object, how should the data be sent?
runtime_client.invoke_endpoint(EndpointName="TensorFlowEndpoint-2018-02-24-10-55-14", ContentType='data', Body=data_array)
How should the image array be serialised?

What should be the ContentType and Body?

I have tried lists, numpy array, tensorflow.core.framework.tensor_pb2.TensorProto

The image has been converted to a list like:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.03137254901960784, 0.0, 0.0, 0.03529411764705882, 0.03529411764705882, 0.03529411764705882, 0.08627450980392157, 0.12941176470588237, 0.3137254901960784, 0.6196078431372549, 0.8941176470588236, 0.7725490196078432, 0.12549019607843137, 0.00392156862745098, 0.00784313725490196, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.8470588235294118, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7647058823529411, 0.0, 0.01568627450980392, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.027450980392156862, 0.0, 0.6627450980392157, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9686274509803922, 1.0, 1.0, 1.0, 0.6352941176470588, 0.0, 0.01568627450980392, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.027450980392156862, 0.0, 0.49019607843137253, 1.0, 0.9647058823529412, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.596078431372549, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 1.0, 1.0, 0.8549019607843137, 0.5882352941176471, 0.6627450980392157, 0.6588235294117647, 0.6078431372549019, 0.45098039215686275, 0.29411764705882354, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 0.9803921568627451, 1.0, 0.1803921568627451, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.027450980392156862, 0.00392156862745098, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 0.9803921568627451, 1.0, 0.1607843137254902, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.011764705882352941, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 0.996078431372549, 1.0, 0.6705882352941176, 0.6549019607843137, 1.0, 1.0, 1.0, 1.0, 0.5215686274509804, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 1.0, 1.0, 0.9686274509803922, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9764705882352941, 1.0, 1.0, 1.0, 0.25098039215686274, 0.0, 0.011764705882352941, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.0, 1.0, 0.9803921568627451, 1.0, 1.0, 1.0, 0.9294117647058824, 0.8549019607843137, 0.8274509803921568, 1.0, 1.0, 1.0, 0.984313725490196, 1.0, 1.0, 0.09019607843137255, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 1.0, 1.0, 1.0, 0.6627450980392157, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23529411764705882, 0.9882352941176471, 1.0, 1.0, 1.0, 1.0, 0.01568627450980392, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.07058823529411765, 0.7019607843137254, 0.4235294117647059, 0.0, 0.0, 0.0, 0.01568627450980392, 0.011764705882352941, 0.0, 0.0, 0.0, 0.6941176470588235, 1.0, 0.996078431372549, 
1.0, 0.26666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.027450980392156862, 0.01568627450980392, 0.023529411764705882, 0.023529411764705882, 0.027450980392156862, 0.0, 0.0, 0.9176470588235294, 1.0, 0.996078431372549, 1.0, 0.2901960784313726, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.027450980392156862, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.29411764705882354, 0.0, 0.0, 0.0, 0.0, 0.08235294117647059, 0.2901960784313726, 0.5176470588235295, 0.6980392156862745, 1.0, 1.0, 0.9921568627450981, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9803921568627451, 0.9607843137254902, 1.0, 1.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011764705882352941, 0.0, 0.03529411764705882, 1.0, 1.0, 1.0, 1.0, 1.0, 0.996078431372549, 0.9725490196078431, 0.9764705882352941, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9450980392156862, 0.0, 0.00784313725490196, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.9254901960784314, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8196078431372549, 0.615686274509804, 0.4196078431372549, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.00784313725490196, 0.3686274509803922, 0.7450980392156863, 0.7647058823529411, 0.6274509803921569, 0.4980392156862745, 0.3686274509803922, 0.24313725490196078, 0.13725490196078433, 0.050980392156862744, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
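
For illustration, the TensorFlow serving container generally accepts a JSON-encoded array with ContentType set to application/json. A sketch, reusing the endpoint name from above; the stand-in list mimics the flattened image:

import json
import boto3

image_as_list = [0.0] * 784  # stand-in for the flattened MNIST image above

runtime = boto3.client('runtime.sagemaker')
response = runtime.invoke_endpoint(
    EndpointName='tensor-endpoint',
    ContentType='application/json',
    Body=json.dumps(image_as_list),
)
result = json.loads(response['Body'].read())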

TypeError: __init__() got an unexpected keyword argument 'file'

Hello!

Excited to try sage-maker but getting the error below after pip install sagemaker in new virtualenv. Python 2.7.10, Sierra 10.12.6. New to this product but would love to get some direction on this.

Traceback (most recent call last):
  File "python.py", line 29, in <module>
    import sagemaker
  File "****/lib/python2.7/site-packages/sagemaker/__init__.py", line 16, in <module>
    from sagemaker.amazon.kmeans import KMeans, KMeansModel, KMeansPredictor
  File "***/lib/python2.7/site-packages/sagemaker/amazon/kmeans.py", line 13, in <module>
    from sagemaker.amazon.amazon_estimator import AmazonAlgorithmEstimatorBase, registry
  File "***/lib/python2.7/site-packages/sagemaker/amazon/amazon_estimator.py", line 19, in <module>
    from sagemaker.amazon.common import write_numpy_to_dense_tensor
  File "****/lib/python2.7/site-packages/sagemaker/amazon/common.py", line 19, in <module>
    from sagemaker.amazon.record_pb2 import Record
  File "***/lib/python2.7/site-packages/sagemaker/amazon/record_pb2.py", line 41, in <module>
    options=_descriptor._ParseOptions(descriptor_pb2.FieldOptions(), _b('\020\001')), file=DESCRIPTOR)
TypeError: __init__() got an unexpected keyword argument 'file'

Any direction welcome.

Multiple files not supported if not executed through notebook?

Hello.

I am currently facing the following problem:

I have built my own service for building, deploying and serving ML models using Sagemaker in production, and I am very much trying to make this work without using notebooks.

I just encountered the following problem, and it seems like I am not able to simply use entry_point=modelclass.py in order to train a model that imports from other local files.

I was trying to recreate the Cifar10 with Tensorboard example in amazon-sagemaker-examples, but I got the following error, when using "resnet_cifar_10" as the entry_point for the Tensorflow Estimator:

ValueError: Error training cifar10-tensorflow-train-2018-04-01-23-14: Failed Reason: AlgorithmError: uncaught exception during training: No module named resnet_model
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 25, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 87, in train
    customer_script = env.import_user_module()
  File "/usr/local/lib/python2.7/dist-packages/container_support/environment.py", line 88, in import_user_module
    user_module = importlib.import_module(script)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/opt/ml/code/resnet_cifar_10.py", line 6, in <module>
    import resnet_model
ImportError: No module named resnet_model

Looking at this error, it seems that all the necessary code has to be in this one file, which then gets nicely packaged into a Docker container on SageMaker and works out of the box, if I understand correctly. Is there a way to 'send more' files so the TF model doesn't have to be defined in a single .py file conforming to the five must-implement protocol methods?
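
For reference, the Framework estimators accept a source_dir argument that uploads an entire directory instead of a single script, so sibling modules can be imported. A sketch, assuming a hypothetical local directory containing both resnet_cifar_10.py and resnet_model.py:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='resnet_cifar_10.py',
    source_dir='cifar10',  # hypothetical directory with resnet_cifar_10.py and resnet_model.py
    role='SageMakerRole',
    training_steps=1000,
    evaluation_steps=100,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
)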

How does a custom algorithm access the "train" channel's S3 URI and the hyperparameters specified in the "create training job" console?

I'm trying to use scikit_bring_your_own example to test own algorithm
What I'm trying to do:

  1. run the sample sagemaker jupyter notenook scikit_bring_your_own.ipynb upto docker push ${fullname}
  2. from SageMaker console, Jobs -> "Create New Job"
  3. from algorithm dropdown, select Custom, and enter ECR uri created on step 1
  4. Specify "train" channel with S3 path
    Error was : No such file or directory: '/opt/ml/input/data/training'

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/container/decision_trees/train#L21 has a hard-coded path. In my "train" script, how do I access the "train" channel's S3 URI? Thanks
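
For reference, inside the training container SageMaker mounts each channel under /opt/ml/input/data/<channel name> and writes hyperparameters (as JSON strings) to a fixed path. A minimal sketch of reading both, assuming the channel was named "train":

import json
import os

channel_dir = '/opt/ml/input/data/train'  # matches the channel name configured in the console
training_files = [os.path.join(channel_dir, f) for f in os.listdir(channel_dir)]

with open('/opt/ml/input/config/hyperparameters.json') as f:
    hyperparameters = json.load(f)  # all values arrive as strings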

Installing Dependencies

Hi! I am using DeepMind's Sonnet library to make my architecture code reusable. Sonnet is not included in the py2 container, and I have not found a way to pass additional requirements to SageMaker. Could you implement a way to pass a requirements.txt file to the TensorFlow constructor so SageMaker installs the requirements in the Docker container before training? FloydHub lets you define a floyd_requirements.txt file that does exactly this and works very well.
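
For reference, later versions of the TensorFlow estimator accept a requirements_file argument (relative to source_dir) whose contents are pip-installed in the container before training. A sketch with hypothetical paths:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',
    source_dir='src',                      # hypothetical directory with train.py and requirements.txt
    requirements_file='requirements.txt',  # e.g. contains dm-sonnet
    role='SageMakerRole',
    training_steps=1000,
    evaluation_steps=100,
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
)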

documentation error

Under section "Creating a serving_input_fn", the documentation claims:

"At the end of training, similarly, serving_input_fn is used to create the model that is exported for TensorFlow Serving."

Is this accurate? Based on my understanding, I don't think serving_input_fn() is used to create the model. This is also inconsistent with its role described a few lines afterwards.

TypeError: __new__() got an unexpected keyword argument 'file'

import sagemaker gives the following error on an Ubuntu machine:
TypeError: __new__() got an unexpected keyword argument 'file'

Installation was tried both ways:

pip install sagemaker

or

git clone https://github.com/aws/sagemaker-python-sdk.git
python setup.py sdist
pip install dist/sagemaker-1.0.3.tar.gz

Unable to update existing endpoint with newly trained model

Hello!

I am investigating the Sagemaker API for use in production (without notebooks). I am able to train a model, create an endpoint and delete the endpoint without any problems with the API.

However, in a very common situation where I have a newly trained model on new data, I would like to be able to update/change the model that is currently serving in the specified endpoint and not have to update other services. In production, I would like to update the model serving without any downtime.

Currently when I try to do this operation, simply train a new model and deploy to an endpoint using deploy with:

    def deploy(self):
        self.estimator.deploy(
            initial_instance_count=1000,
            instance_type='ml.c4.xlarge',
            endpoint_name="iris"
        )

I get the following error:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: Cannot create already existing endpoint "arn:aws:sagemaker:eu-west-1:166488713907:endpoint/iris".

Am I missing something here? Do I have to / can I do this operation manually with the boto3 api instead?

Thank you
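
For reference, the low-level boto3 API can swap the model behind a live endpoint without deleting it. A sketch, assuming a new endpoint config pointing at the retrained model has already been created (names below are hypothetical):

import boto3

sm = boto3.client('sagemaker')
sm.update_endpoint(
    EndpointName='iris',
    EndpointConfigName='iris-config-v2',  # hypothetical config for the newly trained model
)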

Using MXNet LSTM with SageMaker

Going off of the MNIST Example for MXNet, I need to have a function like,

def save(net, model_dir):
    # save the model
    y = net(mx.sym.var('data'))
    y.save('%s/model.json' % model_dir)
    net.collect_params().save('%s/model.params' % model_dir)

But, if I try this with an LSTM I get an error. The minimal code to reproduce the bug is,

import mxnet as mx
from mxnet import gluon

net = gluon.rnn.LSTM(100)
net(mx.sym.var('dats'))

which gives,

$ python shape_error.py 
Traceback (most recent call last):
  File "shape_error.py", line 5, in <module>
    net(mx.sym.var('dats'))
  File "/usr/local/lib/python3.6/site-packages/mxnet/gluon/block.py", line 304, in __call__
    return self.forward(*args)
  File "/usr/local/lib/python3.6/site-packages/mxnet/gluon/rnn/rnn_layer.py", line 173, in forward
    batch_size = inputs.shape[self._layout.find('N')]
AttributeError: 'Symbol' object has no attribute 'shape'

How do I solve this?

how to pass multiple S3 keys to Estimator.fit?

I have created my own algorithm container very similar to the scikit_bring_your_own example. I also have an S3 directory full of about 50 files, some of which I'd like to use for training and some for prediction. Right now, in order to use some subset of those files for training I actually have to create a new S3 directory and copy the training files into that directory, so I can pass the directory to Estimator.fit. Is there a better way of doing this that doesn't require me to touch my files in S3? Ideally I would like to be able to pass a list of S3 key prefixes to Estimator.fit, so that I can train on multiple files that do not share the same prefix. Shuffling the files around in S3 becomes especially impractical if I want to do something like k-folds cross validation. Thanks!
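
For reference, one way to train on an arbitrary subset of keys without copying them is a manifest file. A sketch using the current SDK's TrainingInput (s3_input in older versions), with a hypothetical manifest path:

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/manifests/train.manifest',  # hypothetical manifest listing the exact keys
    s3_data_type='ManifestFile',
)
estimator.fit({'train': train_input})  # estimator as defined elsewhere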

How do I use MXNet's distributed key-value store in this framework?

Looking at your MXNet training script documentation, I see,

hosts (list[str]): The list of host names running in the SageMaker Training Job cluster.

The only way I have seen to do distributed training in MXNet is with Distributed Key-Value Stores which run on DMLC via MPI/SSH like,

$ mxnet_path/tools/launch.py -H hostfile -n 2 python myprog.py

This launch script is not something that could easily be changed.

So how am I supposed to use the hosts list you pass into my SageMaker training function? (see this too).

Durability or github integration of jupyter notebook

I was very impressed with quality and rich functionalities of the first release of sagemaker. Great job!

I might have missed those features, but I would like to see improved durability of the sagemaker notebooks, i.e. the notebooks can be stored on S3, or even better github integration of the notebooks.

Thanks!

Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape

In running,

pred = predictor.predict(np.nan_to_num(data[column_name].values))

I get an error,

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message

Looking in the CloudWatch logs, I only see errors like,

[2018-02-12 21:02:50,957] ERROR in serving: [21:02:50] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
MXNetError: [21:02:50] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
2018-02-12 21:02:50,957 ERROR - model server - [21:02:50] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:02:50] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
[2018-02-12 21:02:50,958] ERROR in serving: [21:02:50] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
2018-02-12 21:02:50,958 ERROR - model server - [21:02:50] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
10.32.0.1 - - [12/Feb/2018:21:02:50 +0000] "POST /invocations HTTP/1.1" 500 0 "-" "AHC/2.0"
[21:02:51] /tmp/mxnet/dmlc-core/include/dmlc/./logging.h:308: [21:02:51] src/core/pass.cc:43: Graph attr dependency shape is required by pass PlanMemory but is not available The attribute is provided by pass InferShape
...
10.32.0.1 - - [12/Feb/2018:21:02:51 +0000] "POST /invocations HTTP/1.1" 500 0 "-" "AHC/2.0"

full output

What's going on?

Cannot pull algorithm container

Hey there,

I'm attempting to run a custom SageMaker job that I'm creating programmatically through this library using the Session.train method. All validation passes, the job is created, and my channel data is pulled from S3 (judging by how long it takes ~ 40 minutes). At this point I get the following error that I've been unable to diagnose:

Failed Reason: ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

Addressing each of these issues

  1. I've verified that my image URI leads to a proper ECR repo and looks like "1234567890.dkr.ecr.us-east-1.amazonaws.com/sage".
  2. The execution role I created to run the job has the policies AmazonSageMakerFullAccess, AmazonEC2ContainerRegistryFullAccess, and an additional custom policy to limit the S3 access to my training bucket.

What am I missing here? Is there another policy I need to add to the IAM role? Happy to provide any other relevant details.

PyTorch support

Wondering if there are any plans for supporting the PyTorch framework in the near future. It's clear how to configure it myself, but if it's on the roadmap, I would rather wait for it. Thanks

JSON examples for SageMaker / TF serving

cols = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.categorical_column_with_vocabulary_list("gender", ["m", "f", "other"]),
    tf.feature_column.categorical_column_with_hash_bucket("city", hash_bucket_size=15000)
]

example_spec = tf.feature_column.make_parse_example_spec(cols)

print(example_spec)
# {'gender': VarLenFeature(dtype=tf.string), 'age': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None), 'city': VarLenFeature(dtype=tf.string)}

srv_fun = tf.estimator.export.build_parsing_serving_input_receiver_fn(example_spec)()

print(str(srv_fun))
#ServingInputReceiver(features={'city': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f83c7509ad0>, 'age': <tf.Tensor 'ParseExample_7/ParseExample:6' shape=(?, 1) dtype=float32>, 'gender': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f83c75093d0>}, receiver_tensors={'examples': <tf.Tensor 'input_example_tensor_8:0' shape=(?,) dtype=string>}, receiver_tensors_alternatives=None)

What format can we use to send predict requests using the Sagemaker SDK for input functions like the above?

The JSON serializer only handles arrays, so it seems like tf_estimator.predict({"city":"Paris", "gender":"m", "age":22}) is out. I tried variations of array input and get cryptic errors from the TF serving proxy client (that source code is not available, to my knowledge).

Looking at the TF Iris DNN example notebook: it uses a syntax like iris_predictor.predict([6.4, 3.2, 4.5, 1.5]) though the FeatureSpec is like {'input': IrisArrayData}. So perhaps the feature spec needs a top level?

Support for slim based models?

Is there any way to use sagemaker for training/testing the (imagenet pretrained) models that come with TF models (these are based on TF slim)? Is there any documentation that you can point me towards for this?

Thanks!

Question: Adding Cost allocation tags

Hi,
Can you add Tags to an estimator / predictor using the High level API?

I saw in the boto3 docs that there is an AddTags function, but I can't find any way of doing it using the high-level API in a Jupyter notebook...

Cheers,
Dan
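
For reference, later SDK versions let you pass tags directly to an estimator constructor, in the same list-of-dicts format that boto3's AddTags uses. A sketch with hypothetical values:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest',  # hypothetical image
    role='SageMakerRole',
    instance_count=1,
    instance_type='ml.m4.xlarge',
    tags=[{'Key': 'team', 'Value': 'ml-platform'}],  # cost allocation tags
)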

[MXNet example] how to replace the existing endpoint from the notebook

When I ran the sample MXNet notebook, every time it created a new endpoint (with a timestamp), e.g. "sagemaker-mxnet-py2-cpu-2018-01-25-18-04-31-174".

Let's say I have a use case where the inference endpoint will be consumed by an individual application outside SageMaker. How do I specify the model name, and replace it if it exists, when creating the model and endpoint in the notebook?
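
For reference, deploy() accepts an endpoint_name argument to pin a stable name, and some SDK versions of this era also accepted update_endpoint to replace the model behind an existing endpoint. A sketch with a hypothetical name:

predictor = mnist_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name='mnist-inference',  # hypothetical stable name consumed by outside applications
    update_endpoint=True,             # only in SDK versions that support in-place endpoint updates
)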

Unable to invoke SageMaker API endpoint

Hello!

The last few days I've been trying to deploy my trained model to SageMaker so I can query it from an AWS Lambda using this SDK.
So far I've been able to train my POC model and deploy an endpoint, but I've not been able to query it after several tries. On the other hand, I've been able to query the generated SavedModel on my own computer.

The API endpoint has been giving me errors I've not been able to decipher. Would you be so kind as to give me a hand with this?

All my code can be found on this repository.
I'll break down the sources for you:

The datasets are as follows:

  • Train: AWS-Ecommerce-Train.csv
  • Test: AWS-Ecommerce-Test.csv

Thank you for your time.
Yours,
Daniel.

** EDIT ** : This is the issue I based my second client on.

Also adding the log trace I retrieved from CloudWatch:

[2018-03-15 20:33:55,234] ERROR in serving: u'tensorflow/serving/regress'
Traceback (most recent call last):
  File "/opt/amazon/lib/python2.7/site-packages/container_support/serving.py", line 165, in _invoke
    self.transformer.transform(content, input_content_type, requested_output_content_type)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 254, in transform
    return self.transform_fn(data, content_type, accepts), accepts
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 178, in f
    input = input_fn(serialized_data, content_type)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 211, in _default_input_fn
    data = self.proxy_client.parse_request(serialized_data)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/proxy_client.py", line 47, in parse_request
    request = request_fn_map[self.prediction_type]()
KeyError: u'tensorflow/serving/regress'
(the same traceback is repeated by the model server logger)

Tensorflow used in SageMaker is not optimized for CPU

It makes sense to compile TF to let it use the AVX2 instruction set; this should boost training on CPU-based c4 instances.

Here is log output taken from training job which I ran on SageMaker:

2018-01-27 19:18:03.954899: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

Tensorboard not displaying scalars

When the flag run_tensorboard_locally is set to True, for example
estimator.fit(inputs, run_tensorboard_locally=True), where estimator = TensorFlow(...),
TensorBoard only displays the graph and projector, but not any scalars or images.

If one run is terminated and a new one is started by running again:
estimator.fit(inputs, run_tensorboard_locally=True)
then the scalars and images of the previous run are displayed on Tensorboard but they are not updated as training continues.
It seems that, when training is restarted, TensorBoard loads the previously saved logs from /tmp/<temp_folder>/, which was created by tempfile.mkdtemp(), but the new logs are then saved to a newly created folder.

Any way to get Tensorboard working properly?
Would it make sense to add the ability to define logdir for Tensorboard when calling TensorFlow?

Load endpoint?

Hi,

How can you load an existing endpoint for predictions after you've deployed it?
The notebook examples I've seen show end-to-end training through deployment, but what if you want to reuse a previous model just to make predictions?
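
For reference, in current SDK versions you can attach a Predictor to an existing endpoint by name (RealTimePredictor in older versions). A sketch with a hypothetical endpoint name and payload:

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name='my-existing-endpoint',  # hypothetical name of the already-deployed endpoint
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
result = predictor.predict([6.4, 3.2, 4.5, 1.5])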

Where and how may I preprocess input data before making predictions?

I'd like to preprocess my data by sending in a string input through predictor.predict(data) and turning it into numerical embeddings, just as my train_input_fn is doing with vocab_processor.fit_transform before going through my model_fn:

def train_input_fn(training_dir, hyperparameters):
    return _input_fn(training_dir, 'meta_data_train.csv')

def _input_fn(training_dir, training_filename):
    training_set = pd.read_csv(os.path.join(training_dir, training_filename), dtype={'Classification class name': object}, encoding='cp1252')
    global n_words
  # Prepare training and testing data
    data = training_set['Features']
    target = pd.Series(training_set['Labels'])

    if training_filename == 'meta_data_test.csv':
        vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor.restore("s3://sagemaker-blah/vocab")
        data = np.array(list(vocab_processor.transform(data)))
        return tf.estimator.inputs.numpy_input_fn(
            x={INPUT_TENSOR_NAME: data},
            y=target,
            num_epochs=100,
            shuffle=False)()
    elif training_filename == 'meta_data_train.csv':
        vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
        data = np.array(list(vocab_processor.fit_transform(data)))
        vocab_processor.save("s3://sagemaker-blah/vocab")
        n_words = len(vocab_processor.vocabulary_)
        return tf.estimator.inputs.numpy_input_fn(
            x={INPUT_TENSOR_NAME: data},
            y=target,
            batch_size=len(data),
            num_epochs=None,
            shuffle=True)()

The documentation says to do it through serving_input_fn but I'm not sure how I can access and manipulate the data from my tensor using vocab_processor.transform. Here's my serving_input_fn for context:

def serving_input_fn(hyperparameters):
    tensor = tf.placeholder(tf.int64, shape=[None, MAX_DOCUMENT_LENGTH])
    return build_raw_serving_input_receiver_fn({INPUT_TENSOR_NAME: tensor})()

I tried doing so through a input_fn instead:

def input_fn(serialized_input, content_type):
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
    deserialized_input = pickle.loads(serialized_input)
    deserialized_input = np.array(list(vocab_processor.fit_transform(deserialized_input)))
    return deserialized_input

Here, I had an error deserializing:

KeyError: '['

What would be the best method to preprocess the data?

Low Level Python Boto End to End

This is definitely an "enhancement" note, but why don't your docs have an end-to-end low-level Python example? I have to jump between boto and the notebook code when I really just want the low-level code. I think people would definitely appreciate an example script with TensorFlow or something, with maybe the best way to get the Docker image working with it. Just a thought - thanks!
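
For reference, a bare-bones sketch of starting a training job with boto3 alone; every name, URI, and ARN below is a placeholder:

import boto3

sm = boto3.client('sagemaker')
sm.create_training_job(
    TrainingJobName='low-level-example',
    AlgorithmSpecification={
        'TrainingImage': '123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest',
        'TrainingInputMode': 'File',
    },
    RoleArn='arn:aws:iam::123456789012:role/SageMakerRole',
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://my-bucket/train/',
            'S3DataDistributionType': 'FullyReplicated',
        }},
    }],
    OutputDataConfig={'S3OutputPath': 's3://my-bucket/output/'},
    ResourceConfig={'InstanceType': 'ml.m4.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 10},
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
)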

Sparse matrix to recordIO?

How can I write a sparse matrix to the binary format (for use with the factorization machines algorithm)? I found only the write_numpy_to_dense_tensor function.
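
For reference, newer SDK versions ship write_spmatrix_to_sparse_tensor alongside write_numpy_to_dense_tensor. A sketch serializing a scipy sparse matrix to the RecordIO protobuf format expected by factorization machines:

import io

import numpy as np
import scipy.sparse
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

X = scipy.sparse.random(100, 50, density=0.1, format='csr', dtype=np.float32)  # dummy features
labels = np.random.randint(2, size=100).astype(np.float32)                     # dummy labels

buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, X, labels)
buf.seek(0)
# upload buf to S3 and point the factorization machines estimator at it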

Support processing datasets from S3 directly which have already been processed in the desired format

Hi,

We are trying to support SM algorithms where the user has multiple channels of already-processed data in S3. The plan is to create an AmazonS3BaseEstimator class that will inherit from the AmazonBaseEstimator class and override the record_set method to return a list of RecordSet objects. The _TrainingJob class already works with a dictionary of multiple channels of data. Therefore, we will also override the fit method in our class, with the data element being a dictionary of multiple channels of record objects.

Users will have to supply a list of S3 URIs in their fit call.

Thanks,
R

Pushing Hosted SageMaker Jupter Notebook Directory to GitHub

Hello everyone,

I'm not sure if this is the right place or not, but here goes. I am using the hosted SageMaker notebook on AWS. How would I go about pushing the directory I made to GitHub? What I would like to be able to do is commit and push my changes to GitHub whenever I make changes to any code in the directory.

Right now, I don't see a way to download the whole hosted Jupyter directory (not the individual notebook). If there is a way to do that, I can simply download the whole directory and push it to GitHub.

I would highly appreciate it if someone could help me with this.

Thanks.

training a sequential keras model fails

I'm trying to run the model training for a Keras sequential model on SageMaker and get this error message. Am I doing something wrong?

Traceback (most recent call last):
  File "./train_and_deploy.py", line 21, in <module>
    if __name__ == "__main__": main()
  File "./train_and_deploy.py", line 18, in main
    estimator.fit(TRAING_DATA_BUCKET)
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py", line 166, in fit
    fit_super()
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py", line 154, in fit_super
    super(TensorFlow, self).fit(inputs, wait, logs, job_name)
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/estimator.py", line 517, in fit
    super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/estimator.py", line 154, in fit
    self.latest_training_job.wait(logs=logs)
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/estimator.py", line 323, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True)
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/session.py", line 647, in logs_for_job
    self._check_job_status(job_name, description)
  File "/ANONYMIZED/venv/lib/python2.7/site-packages/sagemaker/session.py", line 390, in _check_job_status
    raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
ValueError: Error training binary-classification-sample-2018-01-10-10-24-07-871: Failed Reason: AlgorithmError: 
Exception during training:
Unsuccessful TensorSliceReader constructor: Failed to find any matching files for s3://sagemaker-eu-west-1-ANONYMIZED/binary-classification-sample-2018-01-10-10-24-07-871/checkpoints/.
	 [[Node: save/RestoreV2_2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_2/tensor_names, save/RestoreV2_2/shape_and_slices)]]

Caused by op u'save/RestoreV2_2', defined at:
  File "/opt/amazon/bin/entry.py", line 32, in <module>
    modes[mode]()
  File "/opt/amazon/lib/python2.7/site-packages/container_support/training.py", line 15, in start
    fw.train()
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/train.py", line 104, in train
    run.train_and_log_exceptions(train_wrapper, env.output_dir)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/run.py", line 20, in train_and_log_exceptions
    test_wrapper.train()
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/trainer.

This is the model code, using the binary classification keras sample from here: https://keras.io/getting-started/sequential-model-guide/

import numpy as np
import tensorflow as tf

INPUT_TENSOR_NAME = 'inputs_input' # needs to match the name of the first layer + "_input"

def keras_model_fn(hyperparameters):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(64, input_dim=20, activation='relu', name='inputs'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(1, activation='sigmoid', name='output'))

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

def train_input_fn(training_dir = None, hyperparameters = None):
    return _input_fn()

def eval_input_fn(training_dir = None, hyperparameters = None):
    return _input_fn()


def _load_data():
    # Generate dummy data
    X = np.random.random((1000, 20))
    y = np.random.randint(2, size=(1000, 1))

    return X.astype(np.float32), y.astype(np.float32)

def _input_fn():
    X, y = _load_data()

    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: X},
        y=y,
        num_epochs=None,
        shuffle=True)()

I start the training with this script:

#!/usr/bin/env python
import sagemaker
from sagemaker.tensorflow import TensorFlow

TRAING_DATA_BUCKET = 's3://ANONYMIZED'

def main():
    estimator = TensorFlow(
        entry_point='binary_classification_sample.py',
        role='SageMakeFullAccess',
        training_steps=1000,
        evaluation_steps=1000,
        hyperparameters={'learning_rate': 1e-04},
        train_instance_count=1,
        train_instance_type='ml.m4.xlarge',
        base_job_name='binary-classification-sample')

    estimator.fit(TRAING_DATA_BUCKET)


if __name__ == "__main__": main()

Local training works:

#!/usr/bin/env python
import tensorflow as tf
from binary_classification_sample import keras_model_fn, _load_data

EPOCHS = 10000

def main():
    estimator = keras_model_fn({})
    X, y = _load_data()
    tensor_board_callback = tf.keras.callbacks.TensorBoard(log_dir='./graph_logs', histogram_freq=0,
          write_graph=True, write_images=True)

    estimator.fit(X,y, epochs=EPOCHS, callbacks=[tensor_board_callback])


if __name__ == "__main__": main()

How to save model?

How can I save the model after model.fit(), so that next time I can load and deploy it using model.deploy()?
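
For reference, a completed training job can later be re-attached by name and its model deployed without retraining. A sketch with a hypothetical job name:

from sagemaker.estimator import Estimator

estimator = Estimator.attach('my-training-job-name')  # hypothetical completed job
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')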

api call error : input error for model using tensorflow custom estimators

Hi. I'm new to AWS SageMaker and built my custom TensorFlow estimator from your TensorFlow iris sample code.

I created my own estimator, like this:

  if mode == tf.estimator.ModeKeys.PREDICT:
        export_outputs = {
            "recommend": tf.estimator.export.PredictOutput(predictions),
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY :
            tf.estimator.export.PredictOutput(predictions),
        }
        return tf.estimator.EstimatorSpec(mode,predictions=predictions,
                                         export_outputs = export_outputs)

(without export_outputs, classifier.export_savedmodel cannot export saved model)

I exported the trained model using this:

INPUT_TENSOR_NAME = 'items'
def serving_input_fn():
    feature_spec = {INPUT_TENSOR_NAME : tf.FixedLenFeature(dtype=tf.int64, shape=[100])}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()
exported_model = classifier.export_savedmodel(export_dir_base = 'export/Servo/', 
                               serving_input_receiver_fn = serving_input_fn)

Then I saved my model, created a checkpoint, and sent a query to it:

sample = np.arange(100).astype(np.int64).tolist()
predictor.predict(sample)

I got the following error:

Error on Jupyter Notebook Console:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-py2-cpu-2018-02-01-17-06-45-306 in account 561830960602 for more information.

Error found on CloudWatch Management Console

[2018-02-01 17:21:08,384] ERROR in serving: Unsupported request data format: [1].
Valid formats: tensor_pb2.TensorProto, dict<string, tensor_pb2.TensorProto> and predict_pb2.PredictRequest
Traceback (most recent call last):
  File "/opt/amazon/lib/python2.7/site-packages/container_support/serving.py", line 161, in _invoke
    self.transformer.transform(content, input_content_type, requested_output_content_type)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 255, in transform
    return self.transform_fn(data, content_type, accepts), accepts
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 180, in f
    prediction = self.predict_fn(input)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/serve.py", line 195, in predict_fn
    return self.proxy_client.request(data)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/proxy_client.py", line 51, in request
    return request_fn(data)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/proxy_client.py", line 77, in predict
    request = self._create_predict_request(data)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/proxy_client.py", line 94, in _create_predict_request
    input_map = self._create_input_map(data)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/proxy_client.py", line 199, in _create_input_map
    raise ValueError(msg.format(data))
ValueError: Unsupported request data format: [1].
Valid formats: tensor_pb2.TensorProto, dict<string, tensor_pb2.TensorProto> and predict_pb2.PredictRequest
(the same traceback is repeated by the model server logger)
10.32.0.2 - - [01/Feb/2018:17:21:08 +0000] "POST /invocations HTTP/1.1" 500 0 "-" "AHC/2.0"

I tried to send a predict_pb2 object to the model, but it failed.

Nothing in Tensorboard after Eval steps

I recently upgraded to the most recent release of the Python SDK through a pip upgrade. I followed a couple of other GitHub threads, which said that with the most recent version of sagemaker-python-sdk this was solved (temporarily). But I am not seeing anything updating in my local TensorBoard instance. Am I missing something?
This is the model and helper functions I am deploying with SageMaker:

import pandas as pd
import numpy as np
import os
import json
import pickle
import sys
import traceback
import tensorflow as tf
from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn

from tensorflow.python.keras._impl.keras.layers import Dense
from tensorflow.python.keras._impl.keras.layers import Dropout
from tensorflow.python.keras._impl.keras.layers import LSTM
from tensorflow.python.keras._impl.keras.layers.embeddings import Embedding
from tensorflow.python.keras._impl.keras.optimizers import Adam
from tensorflow.python.keras._impl.keras.callbacks import ModelCheckpoint
from tensorflow.python.keras._impl.keras.callbacks import CSVLogger
from tensorflow.python.keras._impl.keras.callbacks import EarlyStopping
from tensorflow.python.keras._impl.keras.callbacks import LambdaCallback
from tensorflow.python.keras._impl.keras import metrics
from tensorflow.python.keras._impl.keras.models import Model
from tensorflow.python.keras._impl.keras import layers
from tensorflow.python.keras._impl.keras import Input

NUM_CLASSES = 2
NUM_DATA_BATCHES = 5
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
BATCH_SIZE = 256
INPUT_TENSOR_NAME_1 = 'text1' # needs to match the name of the first layer + "_input"
INPUT_TENSOR_NAME_2 = 'text2' # needs to match the name of the first layer + "_input"
INPUT_TENSOR_NAME_3 = 'title1' # needs to match the name of the first layer + "_input"
INPUT_TENSOR_NAME_4 = 'title2' # needs to match the name of the first layer + "_input"



def keras_model_fn(training_dir):
    """keras_model_fn receives hyperparameters from the training job and returns a compiled keras model.
    The model will transformed in a TensorFlow Estimator before training and it will saved in a TensorFlow Serving
    SavedModel in the end of training.

    Args:
        hyperparameters: The hyperparameters passed to SageMaker TrainingJob that runs your TensorFlow training
                         script.
    Returns: A compiled Keras model
    """

    text_input_1 = Input(shape=(None,), dtype='int32', name='text1')
    embedded_text_1 = layers.Embedding(50000,300)(text_input_1)
    embed_drop_1=Dropout(.5)(embedded_text_1)

    text_input_2 = Input(shape=(None,), dtype='int32', name='text2')
    embedded_text_2 = layers.Embedding(50000,300,)(text_input_2)
    embed_drop_2=Dropout(.5)(embedded_text_2)


    shared_lstm_text = LSTM(256)
    left_output_text = shared_lstm_text(embed_drop_1)
    right_output_text = shared_lstm_text(embed_drop_2)

    title_input_1 = Input(shape=(None,), dtype='int32', name='title1')
    embedded_title_1 = layers.Embedding(50000,300)(title_input_1)
    embed_drop_3=Dropout(.5)(embedded_title_1)

    title_input_2 = Input(shape=(None,), dtype='int32', name='title2')
    embedded_title_2 = layers.Embedding(50000,300)(title_input_2)
    embed_drop_4=Dropout(.5)(embedded_title_2)

    shared_lstm_title = LSTM(128)
    left_output_title = shared_lstm_title(embed_drop_3)
    right_output_title = shared_lstm_title(embed_drop_4)
    # Calculates the distance as defined by the MaLSTM model
    # malstm_distance = Merge(mode=lambda x: exponent_neg_manhattan_distance(x[0], x[1]), output_shape=lambda x: (x[0][0], 1))([left_output, right_output])
    merged = layers.concatenate([left_output_text, right_output_text,left_output_title, right_output_title], axis=-1)
    drop_1 = Dropout(.3)(merged)
    dense_1 = layers.Dense(256, activation='sigmoid')(drop_1)
    drop_2 = Dropout(.3)(dense_1)

    dense_2 = layers.Dense(128, activation='sigmoid')(drop_2)


    predictions = layers.Dense(1, activation='sigmoid')(dense_2)

    # Pack it all up into a model
    shared_layer_model = Model([text_input_1, text_input_2,title_input_1,title_input_2], [predictions])
    shared_layer_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return shared_layer_model


def train_input_fn(training_dir , hyperparameters = None):

    return _input_fn(training_dir,"train")

def eval_input_fn(training_dir , hyperparameters = None):

    return _input_fn(training_dir,"dev")

def serving_input_fn(hyperparameters = None):

    text_ph_1 = tf.placeholder(tf.int32, shape=[None,501])
    text_ph_2 = tf.placeholder(tf.int32, shape=[None,501])
    title_ph_1 = tf.placeholder(tf.int32, shape=[None,51])
    title_ph_2 = tf.placeholder(tf.int32, shape=[None,51])

    #label is not required since serving is only used for inference
    feature_placeholders = {"text1":text_ph_1,"text2":text_ph_2,"title1":title_ph_1,"title2":title_ph_2}
    return build_raw_serving_input_receiver_fn(feature_placeholders)()

def _input_fn(training_dir,mode):
    train_text_1=np.load(training_dir+"/"+mode+"_text_1.npy")
    train_text_2=np.load(training_dir+"/"+mode+"_text_2.npy")
    train_title_1=np.load(training_dir+"/"+mode+"_title_1.npy")
    train_title_2=np.load(training_dir+"/"+mode+"_title_2.npy")

    y=np.load(training_dir+"/targets_"+mode+".npy")
    y=y.reshape((y.shape[0],1)).astype(np.float32)
    # y=tf.cast(y, tf.float32)



    x={INPUT_TENSOR_NAME_1: train_text_1, 
       INPUT_TENSOR_NAME_2: train_text_2,
       INPUT_TENSOR_NAME_3: train_title_1, 
       INPUT_TENSOR_NAME_4: train_title_2}
    dataset=tf.estimator.inputs.numpy_input_fn(x=x,y=y,batch_size=BATCH_SIZE,num_epochs=10,shuffle=False)()


    return dataset

Here is my training script that actually launches this model:

import sagemaker
from sagemaker.tensorflow import TensorFlow

TRAINING_DATA_BUCKET = "s3://some_bucket"

def main():
    estimator = TensorFlow(
        entry_point='model.py',
        role="some_role",
        training_steps=100000,
        evaluation_steps=100,
        train_instance_count=1,
        train_instance_type='ml.p2.xlarge',
        base_job_name='model')

    estimator.fit(TRAINING_DATA_BUCKET, run_tensorboard_locally=True)


if __name__ == "__main__": main()

and here is a sample of the logs:

2018-04-02 22:48:34,865 INFO - root - running container entrypoint
2018-04-02 22:48:34,866 INFO - root - starting train task
2018-04-02 22:48:34,884 INFO - container_support.training - Training starting
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-04-02 22:48:37,298 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-04-02 22:48:37,572 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-304913402249.s3.amazonaws.com
2018-04-02 22:48:37,623 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-304913402249.s3.amazonaws.com
2018-04-02 22:48:37,640 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-304913402249.s3.us-west-2.amazonaws.com
2018-04-02 22:48:37,694 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-304913402249.s3.us-west-2.amazonaws.com
2018-04-02 22:48:37,800 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-04-02 22:48:38,133 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-04-02 22:48:38,134 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-04-02 22:48:38,134 INFO - tf_container - ---------------------------------------------------------
2018-04-02 22:48:38,134 INFO - tf_container - creating RunConfig:
2018-04-02 22:48:38,134 INFO - tf_container - {'save_checkpoints_secs': 300}
2018-04-02 22:48:38,134 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1:2222']}, u'task': {u'index': 0, u'type': u'master'}}
2018-04-02 22:48:38,134 INFO - tf_container - invoking keras_model_fn
2018-04-02 22:48:39,319 INFO - tensorflow - Using the Keras model from memory.
2018-04-02 22:48:39.476033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-02 22:48:39.476412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-04-02 22:48:39.476442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:48:40.256068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-02 22:48:40.671208: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5ab2560
2018-04-02 22:48:41,776 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fca8794a650>, '_model_dir': u's3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
[... aws_logging S3 filesystem setup noise (credential/config loader messages, 404 retry, repeated "Connection has been released. Continuing.") omitted ...]
2018-04-02 22:48:43.164915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:48:43.165103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 298 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
[... aws_logging S3 noise (connection released, temp-file deletion, 404 retry) omitted ...]
2018-04-02 22:49:33,848 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
[... 23 identical aws_logging lines ("Connection has been released. Continuing.") omitted ...]
2018-04-02 22:50:56,086 INFO - tensorflow - Calling model_fn.
2018-04-02 22:51:00,260 INFO - tensorflow - Done calling model_fn.
2018-04-02 22:51:00,262 INFO - tensorflow - Create CheckpointSaverHook.
[... aws_logging S3 noise (connection released, 404 retries) omitted ...]
2018-04-02 22:51:01,888 INFO - tensorflow - Graph was finalized.
2018-04-02 22:51:01.889441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:51:01.889626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 294 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
[... 5 identical aws_logging lines ("Connection has been released. Continuing.") omitted ...]
2018-04-02 22:51:01,939 INFO - tensorflow - Restoring parameters from s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints/keras_model.ckpt
[... 43 identical aws_logging lines ("Connection has been released. Continuing.") omitted ...]
2018-04-02 22:51:13,170 INFO - tensorflow - Running local_init_op.
2018-04-02 22:51:13,209 INFO - tensorflow - Done running local_init_op.
[... aws_logging S3 noise (connection released, 404 retry, temp-file deletion) omitted ...]
2018-04-02 22:51:20,436 INFO - tensorflow - Saving checkpoints for 1 into s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints/model.ckpt.
[... aws_logging S3 noise (connection released, temp-file deletion, 404 retry) omitted ...]
2018-04-02 22:52:23,944 INFO - tensorflow - Calling model_fn.
2018-04-02 22:52:28,194 INFO - tensorflow - Done calling model_fn.
2018-04-02 22:52:28,216 INFO - tensorflow - Starting evaluation at 2018-04-02-22:52:28
2018-04-02 22:52:28,412 INFO - tensorflow - Graph was finalized.
2018-04-02 22:52:28.413193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:52:28.413394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 294 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-02 22:52:28,413 INFO - tensorflow - Restoring parameters from s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints/model.ckpt-1
[... 36 identical aws_logging lines ("Connection has been released. Continuing.") omitted ...]
2018-04-02 22:52:38,038 INFO - tensorflow - Running local_init_op.
2018-04-02 22:52:38,075 INFO - tensorflow - Done running local_init_op.
2018-04-02 22:52:45,575 INFO - tensorflow - Evaluation [10/100]
2018-04-02 22:52:52,843 INFO - tensorflow - Evaluation [20/100]
2018-04-02 22:53:00,118 INFO - tensorflow - Evaluation [30/100]
2018-04-02 22:53:07,396 INFO - tensorflow - Evaluation [40/100]
2018-04-02 22:53:14,678 INFO - tensorflow - Evaluation [50/100]
2018-04-02 22:53:21,963 INFO - tensorflow - Evaluation [60/100]
2018-04-02 22:53:29,247 INFO - tensorflow - Evaluation [70/100]
2018-04-02 22:53:36,525 INFO - tensorflow - Evaluation [80/100]
2018-04-02 22:53:43,806 INFO - tensorflow - Evaluation [90/100]
2018-04-02 22:53:51,092 INFO - tensorflow - Evaluation [100/100]
2018-04-02 22:53:51,298 INFO - tensorflow - Finished evaluation at 2018-04-02-22:53:51
2018-04-02 22:53:51,299 INFO - tensorflow - Saving dict for global step 1: accuracy = 0.49929687, global_step = 1, loss = 0.703844

Any tips or clues would be greatly appreciated.
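
In the meantime, one workaround I am considering (an untested assumption on my part, not a confirmed fix) is to skip run_tensorboard_locally and point a local TensorBoard straight at the checkpoint path from the log above; this only works if the local TensorFlow build includes S3 filesystem support:

tensorboard --logdir s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints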

Unable to install sagemaker using pip3 and python3 (3.5.2) on Ubuntu 16.04

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/sagemaker/__init__.py", line 15, in <module>
    from sagemaker import estimator
  File "/usr/local/lib/python3.5/dist-packages/sagemaker/estimator.py", line 24, in <module>
    from sagemaker.model import Model
  File "/usr/local/lib/python3.5/dist-packages/sagemaker/model.py", line 18, in <module>
    from sagemaker.session import Session
  File "/usr/local/lib/python3.5/dist-packages/sagemaker/session.py", line 32, in <module>
    SDK_VERSION = pkg_resources.require('sagemaker')[0].version
  File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 984, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 875, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (botocore 1.7.48 (/home/karan/.local/lib/python3.5/site-packages), Requirement.parse('botocore<1.9.0,>=1.8.0'), {'boto3'})
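
A hedged workaround, inferred from the conflict itself (boto3 requires botocore<1.9.0,>=1.8.0 while botocore 1.7.48 is installed under ~/.local): upgrade boto3 and botocore together, then reinstall sagemaker:

pip3 install --upgrade boto3 botocore
pip3 install --upgrade sagemaker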

TensorFlow logs not showing up on CloudWatch or console

A recent change in sagemaker causes TensorFlow logs not to show up in the console or CloudWatch. This makes a lot of tasks very difficult, including hyperparameter tuning. I've tested with my own code and with the abalone example: no TF logs are shown.
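
One hedged thing to rule out (an assumption on my part, since I don't know what the change was): whether the training script itself now needs to request INFO-level TF logging explicitly, e.g.

import tensorflow as tf

# ask TF 1.x to emit INFO-level log lines from the training loop
tf.logging.set_verbosity(tf.logging.INFO)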

Security features for Jupyter notebooks

How granular are the Jupyter notebook security settings? For example, how can we set up different user groups with different permissions (such as read/write/run and others)?

Tensorflow datatype changing while training and writing

For context, here's the first few lines of my model_fn:

def model_fn(features, labels, mode, hyperparameters):
  word_vectors = tf.contrib.layers.embed_sequence(features[WORDS_FEATURE], vocab_size=n_words, embed_dim=EMBEDDING_SIZE)
  word_vectors = tf.expand_dims(word_vectors, 3)

Here's my input_fn:

def _input_fn(training_dir, training_filename):
    training_set = pd.read_csv(os.path.join(training_dir, training_filename), dtype={'Classification class name': object}, encoding='cp1252')
    global n_words
    # Prepare training and testing data
    data = training_set['Features']
    target = pd.Series(training_set['Labels'])
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)

    if training_filename == 'meta_data_test.csv':
        data = np.array(list(vocab_processor.transform(data)))
        return tf.estimator.inputs.numpy_input_fn(
            x={WORDS_FEATURE: data},
            y=target,
            num_epochs=100,
            shuffle=False)()
    else:
        data = np.array(list(vocab_processor.fit_transform(data)))
        n_words = len(vocab_processor.vocabulary_)
        return tf.estimator.inputs.numpy_input_fn(
            x={WORDS_FEATURE: data},
            y=target,
            batch_size=len(data),
            num_epochs=None,
            shuffle=True)()

My WORDS_FEATURE is defined as:
WORDS_FEATURE = 'words'

Now, I ran a modified version of this file locally which just trains and evaluates, so I could check that the model works, and everything runs fine. But when I try to use tf_estimator.fit from a notebook session, I get this error:

executing startup script (first run)
2018-03-13 19:28:21,590 INFO - root - running container entrypoint
2018-03-13 19:28:21,590 INFO - root - starting train task
2018-03-13 19:28:24,270 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-03-13 19:28:25,536 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-03-13 19:28:25,611 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.us-east-2.amazonaws.com
2018-03-13 19:28:25,709 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
INFO:tensorflow:----------------------TF_CONFIG--------------------------
INFO:tensorflow:{"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
INFO:tensorflow:---------------------------------------------------------
INFO:tensorflow:going to training
2018-03-13 19:28:25,788 INFO - root - creating RunConfig:
2018-03-13 19:28:25,788 INFO - root - {'save_checkpoints_secs': 300}
2018-03-13 19:28:25,788 INFO - root - creating the estimator
INFO:tensorflow:Using config: {'_model_dir': u's3://sagemaker-tcclassification/sagemaker-tensorflow-py2-gpu-2018-03-13-19-20-41-949/checkpoints', '_save_checkpoints_secs': 300, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_session_config': None, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f525c99fd10>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_log_step_count_steps': 100}
2018-03-13 19:28:25,789 INFO - root - creating Experiment:
2018-03-13 19:28:25,789 INFO - root - {'min_eval_frequency': 1000}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
INFO:tensorflow:Create CheckpointSaverHook.
2018-03-13 19:28:28.627772: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-13 19:28:28.760402: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-13 19:28:28.760766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-03-13 19:28:28.760798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Saving checkpoints for 1 into s3://sagemaker-tcclassification/sagemaker-tensorflow-py2-gpu-2018-03-13-19-20-41-949/checkpoints/model.ckpt.
INFO:tensorflow:loss = 4.8201203, step = 1
INFO:tensorflow:Saving checkpoints for 100 into s3://sagemaker-tcclassification/sagemaker-tensorflow-py2-gpu-2018-03-13-19-20-41-949/checkpoints/model.ckpt.
INFO:tensorflow:Loss for final step: 0.045926746.
INFO:tensorflow:Starting evaluation at 2018-03-13-19:28:37
2018-03-13 19:28:37.508276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from s3://sagemaker-tcclassification/sagemaker-tensorflow-py2-gpu-2018-03-13-19-20-41-949/checkpoints/model.ckpt-100
INFO:tensorflow:Evaluation [1/100]
[... Evaluation [2/100] through [35/100] omitted ...]
INFO:tensorflow:Evaluation [36/100]
INFO:tensorflow:Finished evaluation at 2018-03-13-19:28:38
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.02173913, global_step = 100, loss = 29.5561
ERROR:tensorflow:writing error
ERROR:tensorflow:error file is
ERROR:tensorflow:
Exception during training:
Value passed to parameter 'indices' has DataType string not in list of allowed values: int32, int64
Traceback (most recent call last):
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/run.py", line 20, in train_and_log_exceptions
    test_wrapper.train()
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/trainer.py", line 113, in train
    learn_runner.run(experiment_fn, self.training_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 218, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 46, in _execute_schedule
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 641, in train_and_evaluate
    export_results = self._maybe_export(eval_result)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 744, in _maybe_export
    eval_result=eval_result))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/export_strategy.py", line 87, in export
    return self.export_fn(estimator, export_path, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py", line 442, in export_fn
    checkpoint_path=checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 511, in export_savedmodel
    config=self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 694, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/opt/amazon/lib/python2.7/site-packages/tf_container/trainer.py", line 204, in _model_fn
    return self.customer_script.model_fn(features, labels, mode, params)
  File "/opt/ml/code/text_classification_cnn.py", line 48, in model_fn
    word_vectors = tf.contrib.layers.embed_sequence(features[WORDS_FEATURE], vocab_size=n_words, embed_dim=EMBEDDING_SIZE)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/encoders.py", line 142, in embed_sequence
    return embedding_ops.embedding_lookup(embeddings, ids)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 328, in embedding_lookup
    transform_fn=None)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 150, in _embedding_lookup_and_transform
    result = _clip(_gather(params[0], ids, name=name), ids, max_norm)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 54, in _gather
    return array_ops.gather(params, ids, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 2486, in gather
    params, indices, validate_indices=validate_indices, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1834, in gather
    validate_indices=validate_indices, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 609, in _apply_op_helper
    param_name=input_name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
    ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'indices' has DataType string not in list of allowed values: int32, int64

It says tf.contrib.layers.embed_sequence(features[WORDS_FEATURE], vocab_size=n_words, embed_dim=EMBEDDING_SIZE) is being given a string parameter, but that shouldn't be possible, unless features[WORDS_FEATURE] somehow turns into a string when the checkpoint is restored? I've been trying to tackle this error for hours. And again, a version of this works fine locally (at least for training and evaluation), so I'm assuming this problem only arises when the checkpoint is restored.

I checked the datatype locally and I can confirm features[WORDS_FEATURE] is an int64:
Tensor("fifo_queue_DequeueUpTo:1", shape=(?, 100), dtype=int64, device=/device:CPU:0)

Does anyone have any idea why this is happening?
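
One detail from the traceback that may matter: the failure happens inside export_savedmodel (via _maybe_export), not during training or evaluation, so the features reaching model_fn at that point come from the serving input function rather than from _input_fn. A hedged sketch of an explicit serving_input_fn with an integer placeholder, reusing WORDS_FEATURE and MAX_DOCUMENT_LENGTH from the snippets above (a guess at the cause, not a confirmed fix):

import tensorflow as tf
from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn

def serving_input_fn(hyperparameters=None):
    # int64 placeholder so the exported graph receives integer word ids,
    # the dtype that embed_sequence's embedding lookup expects
    words_ph = tf.placeholder(tf.int64, shape=[None, MAX_DOCUMENT_LENGTH])
    return build_raw_serving_input_receiver_fn({WORDS_FEATURE: words_ph})()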

Tensorflow: fail to download saved model

I tried to save checkpoints and summaries every N steps by specifying them in the run_config of tf.estimator.DNNClassifier.
Everything went fine, but an error occurred at the end: tf_container.serve - Failed to download saved model. File does not exist in s3://xxxxx
The output folder and checkpoint folder are specified, but the checkpoints still get written to the tmp folder.
Also, no output folder or checkpoints folder is generated in the path I specified.
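
For reference, a minimal sketch of the kind of run_config described above (the path, step counts, and architecture are placeholders; this only shows where the settings go, it does not reproduce the problem):

import tensorflow as tf

config = tf.estimator.RunConfig(
    model_dir='s3://my-bucket/my-job/checkpoints',  # hypothetical output path
    save_checkpoints_steps=1000,
    save_summary_steps=1000)
classifier = tf.estimator.DNNClassifier(
    hidden_units=[64, 32],            # hypothetical architecture
    feature_columns=feature_columns,  # assumed to be defined elsewhere
    config=config)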

Supporting custom frameworks in SageMaker??

Thank you for shipping this great product!!!

After reading the documents, I understand that SageMaker is fully extensible via Docker containers, and the scikit_bring_your_own example clearly shows users how custom Python code can be applied to SageMaker.

However, when I want to support a custom framework (like Torch, Chainer, etc.), how can I do that effectively? To achieve this, one would likely need to write both notebook-side and container-side implementations.

On the notebook side, I think the MXNet support in sagemaker-python-sdk can be a really helpful reference.

However, on the container side, I couldn't find such a reference. A lot of common code would be needed, for example (a minimal sketch of the container-side contract follows the list):

  • downloading user assets (training data, user code) from S3 to the local filesystem
  • loading custom user code to train and save models
  • spawning a server via WSGI or similar libraries
  • detecting CPU/GPU features
  • etc.
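
To make the list concrete, here is a minimal sketch of the container-side training contract that the scikit_bring_your_own example follows (SageMaker mounts inputs under /opt/ml, runs the container's train entrypoint, and uploads whatever is written to /opt/ml/model; the framework-specific steps are left as comments, since this is a sketch rather than working framework support):

import os
import sys
import traceback

PREFIX = '/opt/ml'
TRAIN_CHANNEL = os.path.join(PREFIX, 'input/data/training')  # data SageMaker downloaded from S3
MODEL_DIR = os.path.join(PREFIX, 'model')                    # uploaded back to S3 after training
FAILURE_FILE = os.path.join(PREFIX, 'output/failure')        # surfaced as the job's FailureReason

def train():
    # 1. read the training data SageMaker placed in TRAIN_CHANNEL
    # 2. load the custom user code and hand the data to the framework (torch, chainer, ...)
    # 3. serialize the trained model into MODEL_DIR
    pass

if __name__ == '__main__':
    try:
        train()
        sys.exit(0)
    except Exception:
        # writing the traceback to the failure file surfaces it in the SageMaker console
        with open(FAILURE_FILE, 'w') as f:
            f.write(traceback.format_exc())
        sys.exit(255)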

I imagined this code might exist in the MXNet and TensorFlow support, too. Then I found the SageMakerContainerSupport package in the sagemaker-mxnet-py2-cpu:1.0 container. This package defines very useful abstract classes to support custom frameworks, for everybody who wants to do this.

Of course, I could write similar code myself; however, in the spirit of open source, I would be really happy if you could open this package to the public. Do you have such a plan?

tests/integ failed

I am trying to follow the README. While the unit tests work fine, I got the following errors in the integration tests. Any suggestions?
$ tox tests/integ
GLOB sdist-make: /Users/andyfeng/dev/sagemaker-python-sdk/setup.py
py27 inst-nodeps: /Users/andyfeng/dev/sagemaker-python-sdk/.tox/dist/sagemaker-1.0.1.zip
py27 installed: apipkg==1.4,attrs==17.4.0,backports.weakref==1.0.post1,bleach==1.5.0,boto3==1.5.7,botocore==1.8.21,contextlib2==0.5.5,coverage==4.4.2,docutils==0.14,enum34==1.1.6,execnet==1.5.0,funcsigs==1.0.2,futures==3.2.0,html5lib==0.9999999,jmespath==0.9.3,Markdown==2.6.10,mock==2.0.0,numpy==1.13.3,pbr==3.1.1,pluggy==0.6.0,protobuf==3.5.1,py==1.5.2,pytest==3.3.1,pytest-cov==2.5.1,pytest-forked==0.2,pytest-xdist==1.21.0,python-dateutil==2.6.1,s3transfer==0.1.12,sagemaker==1.0.1,scipy==1.0.0,six==1.11.0,teamcity-messages==1.21,tensorflow==1.4.1,tensorflow-tensorboard==0.4.0rc3,Werkzeug==0.14
py27 runtests: PYTHONHASHSEED='3746448766'
py27 runtests: commands[0] | pytest tests/integ
================================================================ test session starts =================================================================
platform darwin -- Python 2.7.14, pytest-3.3.1, py-1.5.2, pluggy-0.6.0 -- /Users/andyfeng/dev/sagemaker-python-sdk/.tox/py27/bin/python2.7
cachedir: .cache
rootdir: /Users/andyfeng/dev/sagemaker-python-sdk, inifile: setup.cfg
plugins: teamcity-messages-1.21, xdist-1.21.0, forked-0.2, cov-2.5.1
collected 7 items

tests/integ/test_kmeans.py::test_kmeans FAILED [ 14%]
tests/integ/test_linear_learner.py::test_linear_learner FAILED [ 28%]
tests/integ/test_mxnet_train.py::test_attach_deploy ERROR [ 42%]
tests/integ/test_mxnet_train.py::test_deploy_model ERROR [ 57%]
tests/integ/test_pca.py::test_pca FAILED [ 71%]
tests/integ/test_tf.py::test_tf FAILED [ 85%]
tests/integ/test_tf_cifar.py::test_cifar FAILED [100%]

=================================================================================== ERRORS ===================================================================================
____________________________________________________________________ ERROR at setup of test_attach_deploy ____________________________________________________________________

sagemaker_session = <sagemaker.session.Session object at 0x10f20b890>

@pytest.fixture(scope='module')
def mxnet_training_job(sagemaker_session):
    with timeout(minutes=15):
        script_path = os.path.join(DATA_DIR, 'mxnet_mnist', 'mnist.py')
        data_path = os.path.join(DATA_DIR, 'mxnet_mnist')

        mx = MXNet(entry_point=script_path, role='SageMakerRole',
                   train_instance_count=1, train_instance_type='ml.c4.xlarge',
                   sagemaker_session=sagemaker_session)

        train_input = mx.sagemaker_session.upload_data(path=os.path.join(data_path, 'train'),
                                                       key_prefix='integ-test-data/mxnet_mnist/train')
        test_input = mx.sagemaker_session.upload_data(path=os.path.join(data_path, 'test'),
                                                      key_prefix='integ-test-data/mxnet_mnist/test')
>       mx.fit({'train': train_input, 'test': test_input})

tests/integ/test_mxnet_train.py:47:


.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:517: in fit
super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:154: in fit
self.latest_training_job.wait(logs=logs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:323: in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:647: in logs_for_job
self._check_job_status(job_name, description)


self = <sagemaker.session.Session object at 0x10f20b890>, job = 'sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859'
desc = {'AlgorithmSpecification': {'TrainingImage': '520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-mxnet-py2-cpu:1.0...sagemaker_job_name': '"sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859"', 'sagemaker_program': '"mnist.py"', ...}, ...}

def _check_job_status(self, job, desc):
    """Check to see if the job completed successfully and, if not, construct and
        raise a ValueError.

        Args:
            job (str): The name of the job to check.
            desc (dict[str, str]): The result of ``describe_training_job()``.

        Raises:
            ValueError: If the training job fails.
        """
    status = desc['TrainingJobStatus']

    if status != 'Completed':
        reason = desc.get('FailureReason', '(No reason provided)')
>       raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))

E ValueError: Error training sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::379899735384:role/SageMakerRole'

.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:390: ValueError
--------------------------------------------------------------------------- Captured stdout setup ----------------------------------------------------------------------------
..........................
--------------------------------------------------------------------------- Captured stderr setup ----------------------------------------------------------------------------
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
----------------------------------------------------------------------------- Captured log setup -----------------------------------------------------------------------------
credentials.py 1031 INFO Found credentials in shared credentials file: ~/.aws/credentials
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
session.py 237 INFO Creating training-job with name: sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
connectionpool.py 238 INFO Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
____________________________________________________________________ ERROR at setup of test_deploy_model _____________________________________________________________________

sagemaker_session = <sagemaker.session.Session object at 0x10f20b890>

@pytest.fixture(scope='module')
def mxnet_training_job(sagemaker_session):
    with timeout(minutes=15):
        script_path = os.path.join(DATA_DIR, 'mxnet_mnist', 'mnist.py')
        data_path = os.path.join(DATA_DIR, 'mxnet_mnist')

        mx = MXNet(entry_point=script_path, role='SageMakerRole',
                   train_instance_count=1, train_instance_type='ml.c4.xlarge',
                   sagemaker_session=sagemaker_session)

        train_input = mx.sagemaker_session.upload_data(path=os.path.join(data_path, 'train'),
                                                       key_prefix='integ-test-data/mxnet_mnist/train')
        test_input = mx.sagemaker_session.upload_data(path=os.path.join(data_path, 'test'),
                                                      key_prefix='integ-test-data/mxnet_mnist/test')
>       mx.fit({'train': train_input, 'test': test_input})

tests/integ/test_mxnet_train.py:47:


.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:517: in fit
super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:154: in fit
self.latest_training_job.wait(logs=logs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:323: in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:647: in logs_for_job
self._check_job_status(job_name, description)


self = <sagemaker.session.Session object at 0x10f20b890>, job = 'sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859'
desc = {'AlgorithmSpecification': {'TrainingImage': '520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-mxnet-py2-cpu:1.0...sagemaker_job_name': '"sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859"', 'sagemaker_program': '"mnist.py"', ...}, ...}

def _check_job_status(self, job, desc):
    """Check to see if the job completed successfully and, if not, construct and
        raise a ValueError.

        Args:
            job (str): The name of the job to check.
            desc (dict[str, str]): The result of ``describe_training_job()``.

        Raises:
            ValueError: If the training job fails.
        """
    status = desc['TrainingJobStatus']

    if status != 'Completed':
        reason = desc.get('FailureReason', '(No reason provided)')
>       raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))

E ValueError: Error training sagemaker-mxnet-py2-cpu-2018-01-01-03-36-55-859: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::379899735384:role/SageMakerRole'

.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:390: ValueError
================================================================================== FAILURES ==================================================================================
________________________________________________________________________________ test_kmeans _________________________________________________________________________________

def test_kmeans():

    with timeout(minutes=15):
        sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=REGION))
        data_path = os.path.join(DATA_DIR, 'one_p_mnist', 'mnist.pkl.gz')
        pickle_args = {} if sys.version_info.major == 2 else {'encoding': 'latin1'}

        # Load the data into memory as numpy arrays
        with gzip.open(data_path, 'rb') as f:
            train_set, _, _ = pickle.load(f, **pickle_args)

        kmeans = KMeans(role='SageMakerRole', train_instance_count=1,
                        train_instance_type='ml.c4.xlarge',
                        k=10, sagemaker_session=sagemaker_session, base_job_name='test-kmeans')

        kmeans.init_method = 'random'
        kmeans.max_iterators = 1
        kmeans.tol = 1
        kmeans.num_trials = 1
        kmeans.local_init_method = 'kmeans++'
        kmeans.half_life_time_size = 1
        kmeans.epochs = 1
        kmeans.center_factor = 1
>       kmeans.fit(kmeans.record_set(train_set[0][:100]))

tests/integ/test_kmeans.py:51:


.tox/py27/lib/python2.7/site-packages/sagemaker/amazon/amazon_estimator.py:96: in fit
super(AmazonAlgorithmEstimatorBase, self).fit(data, **kwargs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:154: in fit
self.latest_training_job.wait(logs=logs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:323: in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:647: in logs_for_job
self._check_job_status(job_name, description)


self = <sagemaker.session.Session object at 0x10f2c4e50>, job = 'test-kmeans-2018-01-01-03-32-56-860'
desc = {'AlgorithmSpecification': {'TrainingImage': '174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:1', 'TrainingInputMo... 'HyperParameters': {'epochs': '1', 'extra_center_factor': '1', 'feature_dim': '784', 'force_dense': 'True', ...}, ...}

def _check_job_status(self, job, desc):
    """Check to see if the job completed successfully and, if not, construct and
        raise a ValueError.

        Args:
            job (str): The name of the job to check.
            desc (dict[str, str]): The result of ``describe_training_job()``.

        Raises:
            ValueError: If the training job fails.
        """
    status = desc['TrainingJobStatus']

    if status != 'Completed':
        reason = desc.get('FailureReason', '(No reason provided)')
>       raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))

E ValueError: Error training test-kmeans-2018-01-01-03-32-56-860: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::379899735384:role/SageMakerRole'

.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:390: ValueError
---------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------
....................
---------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:sagemaker:Created S3 bucket: sagemaker-us-west-2-379899735384
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:sagemaker:Creating training-job with name: test-kmeans-2018-01-01-03-32-56-860
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
----------------------------------------------------------------------------- Captured log call ------------------------------------------------------------------------------
credentials.py 1031 INFO Found credentials in shared credentials file: ~/.aws/credentials
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
session.py 163 INFO Created S3 bucket: sagemaker-us-west-2-379899735384
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
session.py 237 INFO Creating training-job with name: test-kmeans-2018-01-01-03-32-56-860
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
connectionpool.py 238 INFO Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
____________________________________________________________________________ test_linear_learner _____________________________________________________________________________

def test_linear_learner():
    with timeout(minutes=15):
        sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=REGION))
        data_path = os.path.join(DATA_DIR, 'one_p_mnist', 'mnist.pkl.gz')
        pickle_args = {} if sys.version_info.major == 2 else {'encoding': 'latin1'}

        # Load the data into memory as numpy arrays
        with gzip.open(data_path, 'rb') as f:
            train_set, _, _ = pickle.load(f, **pickle_args)

        train_set[1][:100] = 1
        train_set[1][100:200] = 0
        train_set = train_set[0], train_set[1].astype(np.dtype('float32'))

        ll = LinearLearner('SageMakerRole', 1, 'ml.c4.2xlarge', base_job_name='test-linear-learner',
                           sagemaker_session=sagemaker_session)
        ll.binary_classifier_model_selection_criteria = 'accuracy'
        ll.target_reacall = 0.5
        ll.target_precision = 0.5
        ll.positive_example_weight_mult = 0.1
        ll.epochs = 1
        ll.predictor_type = 'binary_classifier'
        ll.use_bias = True
        ll.num_models = 1
        ll.num_calibration_samples = 1
        ll.init_method = 'uniform'
        ll.init_scale = 0.5
        ll.init_sigma = 0.2
        ll.init_bias = 5
        ll.optimizer = 'adam'
        ll.loss = 'logistic'
        ll.wd = 0.5
        ll.l1 = 0.5
        ll.momentum = 0.5
        ll.learning_rate = 0.1
        ll.beta_1 = 0.1
        ll.beta_2 = 0.1
        ll.use_lr_scheduler = True
        ll.lr_scheduler_step = 2
        ll.lr_scheduler_factor = 0.5
        ll.lr_scheduler_minimum_lr = 0.1
        ll.normalize_data = False
        ll.normalize_label = False
        ll.unbias_data = True
        ll.unbias_label = False
        ll.num_point_for_scala = 10000
>       ll.fit(ll.record_set(train_set[0][:200], train_set[1][:200]))

tests/integ/test_linear_learner.py:74:


.tox/py27/lib/python2.7/site-packages/sagemaker/amazon/amazon_estimator.py:96: in fit
super(AmazonAlgorithmEstimatorBase, self).fit(data, **kwargs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:154: in fit
self.latest_training_job.wait(logs=logs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:323: in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:647: in logs_for_job
self._check_job_status(job_name, description)


self = <sagemaker.session.Session object at 0x113a8c450>, job = 'test-linear-learner-2018-01-01-03-34-54-936'
desc = {'AlgorithmSpecification': {'TrainingImage': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:1', 'Trainin...ta_1': '0.1', 'binary_classifier_model_selection_criteria': 'accuracy', 'epochs': '1', 'feature_dim': '784', ...}, ...}

def _check_job_status(self, job, desc):
    """Check to see if the job completed successfully and, if not, construct and
        raise a ValueError.

        Args:
            job (str): The name of the job to check.
            desc (dict[str, str]): The result of ``describe_training_job()``.

        Raises:
            ValueError: If the training job fails.
        """
    status = desc['TrainingJobStatus']

    if status != 'Completed':
        reason = desc.get('FailureReason', '(No reason provided)')
>       raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))

E ValueError: Error training test-linear-learner-2018-01-01-03-34-54-936: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::379899735384:role/SageMakerRole'

.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:390: ValueError
---------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------
....................
---------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:sagemaker:Creating training-job with name: test-linear-learner-2018-01-01-03-34-54-936
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
----------------------------------------------------------------------------- Captured log call ------------------------------------------------------------------------------
credentials.py 1031 INFO Found credentials in shared credentials file: ~/.aws/credentials
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
session.py 237 INFO Creating training-job with name: test-linear-learner-2018-01-01-03-34-54-936
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
connectionpool.py 238 INFO Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
__________________________________________________________________________________ test_pca __________________________________________________________________________________

def test_pca():
    with timeout(minutes=15):
        sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=REGION))
        data_path = os.path.join(DATA_DIR, 'one_p_mnist', 'mnist.pkl.gz')
        pickle_args = {} if sys.version_info.major == 2 else {'encoding': 'latin1'}

        # Load the data into memory as numpy arrays
        with gzip.open(data_path, 'rb') as f:
            train_set, _, _ = pickle.load(f, **pickle_args)

        pca = sagemaker.amazon.pca.PCA(role='SageMakerRole', train_instance_count=1,
                                       train_instance_type='ml.m4.xlarge',
                                       num_components=48, sagemaker_session=sagemaker_session, base_job_name='test-pca')

        pca.algorithm_mode = 'randomized'
        pca.subtract_mean = True
        pca.extra_components = 5
>       pca.fit(pca.record_set(train_set[0][:100]))

tests/integ/test_pca.py:44:


.tox/py27/lib/python2.7/site-packages/sagemaker/amazon/amazon_estimator.py:96: in fit
super(AmazonAlgorithmEstimatorBase, self).fit(data, **kwargs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:154: in fit
self.latest_training_job.wait(logs=logs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:323: in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:647: in logs_for_job
self._check_job_status(job_name, description)


self = <sagemaker.session.Session object at 0x113c51ed0>, job = 'test-pca-2018-01-01-03-39-15-456'
desc = {'AlgorithmSpecification': {'TrainingImage': '174872318107.dkr.ecr.us-west-2.amazonaws.com/pca:1', 'TrainingInputMode'...': {'algorithm_mode': 'randomized', 'extra_components': '5', 'feature_dim': '784', 'mini_batch_size': '100', ...}, ...}

def _check_job_status(self, job, desc):
    """Check to see if the job completed successfully and, if not, construct and
        raise a ValueError.

        Args:
            job (str): The name of the job to check.
            desc (dict[str, str]): The result of ``describe_training_job()``.

        Raises:
            ValueError: If the training job fails.
        """
    status = desc['TrainingJobStatus']

    if status != 'Completed':
        reason = desc.get('FailureReason', '(No reason provided)')
>       raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))

E ValueError: Error training test-pca-2018-01-01-03-39-15-456: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::379899735384:role/SageMakerRole'

.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:390: ValueError
---------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------
....................
---------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:sagemaker:Creating training-job with name: test-pca-2018-01-01-03-39-15-456
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
----------------------------------------------------------------------------- Captured log call ------------------------------------------------------------------------------
credentials.py 1031 INFO Found credentials in shared credentials file: ~/.aws/credentials
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
session.py 237 INFO Creating training-job with name: test-pca-2018-01-01-03-39-15-456
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
connectionpool.py 238 INFO Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
__________________________________________________________________________________ test_tf ___________________________________________________________________________________

sagemaker_session = <sagemaker.session.Session object at 0x1134ce350>

def test_tf(sagemaker_session):
    with timeout(minutes=15):
        script_path = os.path.join(DATA_DIR, 'iris', 'iris-dnn-classifier.py')
        data_path = os.path.join(DATA_DIR, 'iris', 'data')

        estimator = TensorFlow(entry_point=script_path,
                               role='SageMakerRole',
                               training_steps=1,
                               evaluation_steps=1,
                               hyperparameters={'input_tensor_name': 'inputs'},
                               train_instance_count=1,
                               train_instance_type='ml.c4.xlarge',
                               sagemaker_session=sagemaker_session,
                               base_job_name='test-tf')

        inputs = estimator.sagemaker_session.upload_data(path=data_path, key_prefix='integ-test-data/tf_iris')
>       estimator.fit(inputs)

tests/integ/test_tf.py:44:


.tox/py27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py:166: in fit
fit_super()
.tox/py27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py:154: in fit_super
super(TensorFlow, self).fit(inputs, wait, logs, job_name)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:517: in fit
super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:154: in fit
self.latest_training_job.wait(logs=logs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:323: in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:647: in logs_for_job
self._check_job_status(job_name, description)


self = <sagemaker.session.Session object at 0x1134ce350>, job = 'test-tf-2018-01-01-03-41-00-415'
desc = {'AlgorithmSpecification': {'TrainingImage': '520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow-py2-cp...ckpoints"', 'evaluation_steps': '1', 'input_tensor_name': '"inputs"', 'sagemaker_container_log_level': '20', ...}, ...}

def _check_job_status(self, job, desc):
    """Check to see if the job completed successfully and, if not, construct and
        raise a ValueError.

        Args:
            job (str): The name of the job to check.
            desc (dict[str, str]): The result of ``describe_training_job()``.

        Raises:
            ValueError: If the training job fails.
        """
    status = desc['TrainingJobStatus']

    if status != 'Completed':
        reason = desc.get('FailureReason', '(No reason provided)')
>       raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))

E ValueError: Error training test-tf-2018-01-01-03-41-00-415: Failed Reason: ClientError: SageMaker was unable to assume the role 'arn:aws:iam::379899735384:role/SageMakerRole'

.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:390: ValueError
--------------------------------------------------------------------------- Captured stderr setup ----------------------------------------------------------------------------
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
----------------------------------------------------------------------------- Captured log setup -----------------------------------------------------------------------------
credentials.py 1031 INFO Found credentials in shared credentials file: ~/.aws/credentials
---------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------
....................
---------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:sagemaker:Creating training-job with name: test-tf-2018-01-01-03-41-00-415
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
----------------------------------------------------------------------------- Captured log call ------------------------------------------------------------------------------
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
session.py 237 INFO Creating training-job with name: test-tf-2018-01-01-03-41-00-415
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): logs.us-west-2.amazonaws.com
connectionpool.py 238 INFO Resetting dropped connection: logs.us-west-2.amazonaws.com
[... identical 'Resetting dropped connection' lines trimmed ...]
_________________________________________________________________________________ test_cifar _________________________________________________________________________________

sagemaker_session = <sagemaker.session.Session object at 0x1150fd850>

def test_cifar(sagemaker_session):
    with timeout(minutes=15):
        script_path = os.path.join(DATA_DIR, 'cifar_10', 'source')

        dataset_path = os.path.join(DATA_DIR, 'cifar_10', 'data')

        estimator = TensorFlow(entry_point='resnet_cifar_10.py', source_dir=script_path, role='SageMakerRole',
                               training_steps=20, evaluation_steps=5,
                               train_instance_count=2, train_instance_type='ml.p2.xlarge',
                               sagemaker_session=sagemaker_session,
                               base_job_name='test-cifar')

        inputs = estimator.sagemaker_session.upload_data(path=dataset_path, key_prefix='data/cifar10')
>       estimator.fit(inputs)

tests/integ/test_tf_cifar.py:54:


.tox/py27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py:166: in fit
fit_super()
.tox/py27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.py:154: in fit_super
super(TensorFlow, self).fit(inputs, wait, logs, job_name)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:517: in fit
super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:152: in fit
self.latest_training_job = _TrainingJob.start_new(self, inputs)
.tox/py27/lib/python2.7/site-packages/sagemaker/estimator.py:263: in start_new
hyperparameters=hyperparameters, stop_condition=stop_condition)
.tox/py27/lib/python2.7/site-packages/sagemaker/session.py:239: in train
self.sagemaker_client.create_training_job(**train_request)
.tox/py27/lib/python2.7/site-packages/botocore/client.py:317: in _api_call
return self._make_api_call(operation_name, kwargs)


self = <botocore.client.SageMaker object at 0x1147f4210>, operation_name = 'CreateTrainingJob'
api_params = {'AlgorithmSpecification': {'TrainingImage': '520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow-py2-gp...-2-379899735384/data/cifar10'}}}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-379899735384/'}, ...}

def _make_api_call(self, operation_name, api_params):
    operation_model = self._service_model.operation_model(operation_name)
    service_name = self._service_model.service_name
    history_recorder.record('API_CALL', {
        'service': service_name,
        'operation': operation_name,
        'params': api_params,
    })
    if operation_model.deprecated:
        logger.debug('Warning: %s.%s() is deprecated',
                     service_name, operation_name)
    request_context = {
        'client_region': self.meta.region_name,
        'client_config': self.meta.config,
        'has_streaming_input': operation_model.has_streaming_input,
        'auth_type': operation_model.auth_type,
    }
    request_dict = self._convert_to_request_dict(
        api_params, operation_model, context=request_context)

    handler, event_response = self.meta.events.emit_until_response(
        'before-call.{endpoint_prefix}.{operation_name}'.format(
            endpoint_prefix=self._service_model.endpoint_prefix,
            operation_name=operation_name),
        model=operation_model, params=request_dict,
        request_signer=self._request_signer, context=request_context)

    if event_response is not None:
        http, parsed_response = event_response
    else:
        http, parsed_response = self._endpoint.make_request(
            operation_model, request_dict)

    self.meta.events.emit(
        'after-call.{endpoint_prefix}.{operation_name}'.format(
            endpoint_prefix=self._service_model.endpoint_prefix,
            operation_name=operation_name),
        http_response=http, parsed=parsed_response,
        model=operation_model, context=request_context
    )

    if http.status_code >= 300:
        error_code = parsed_response.get("Error", {}).get("Code")
        error_class = self.exceptions.from_code(error_code)
>       raise error_class(parsed_response, operation_name)

E ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit for training-job/ml.p2.xlarge is 0 Instances, with current utilization of 0 Instances and a request delta of 2 Instances. Please contact AWS support to request an increase for this limit.

.tox/py27/lib/python2.7/site-packages/botocore/client.py:615: ResourceLimitExceeded
--------------------------------------------------------------------------- Captured stderr setup ----------------------------------------------------------------------------
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
----------------------------------------------------------------------------- Captured log setup -----------------------------------------------------------------------------
credentials.py 1031 INFO Found credentials in shared credentials file: ~/.aws/credentials
---------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sts.amazonaws.com
INFO:sagemaker:Creating training-job with name: test-cifar-2018-01-01-03-42-43-090
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
----------------------------------------------------------------------------- Captured log call ------------------------------------------------------------------------------
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): s3.us-west-2.amazonaws.com
connectionpool.py 735 INFO Starting new HTTPS connection (1): sts.amazonaws.com
session.py 237 INFO Creating training-job with name: test-cifar-2018-01-01-03-42-43-090
connectionpool.py 735 INFO Starting new HTTPS connection (1): sagemaker.us-west-2.amazonaws.com
==================================================================== 5 failed, 2 error in 599.21 seconds =====================================================================
ERROR: InvocationError: '/Users/andyfeng/dev/sagemaker-python-sdk/.tox/py27/bin/pytest tests/integ'
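
Two different things are failing in this log. The repeated "SageMaker was unable to assume the role" errors usually mean the SageMakerRole IAM role exists but its trust policy does not allow the sagemaker.amazonaws.com service principal to assume it. As a minimal sketch (assuming you have IAM permissions; the role name is a placeholder and you would still attach whatever permission policies your jobs need), such a role can be created with a suitable trust policy via boto3:

import json

import boto3

# Trust policy that lets the SageMaker service assume the role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'sagemaker.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

iam = boto3.client('iam')
iam.create_role(RoleName='SageMakerRole',  # placeholder role name
                AssumeRolePolicyDocument=json.dumps(trust_policy))

The test_cifar failure is separate: ResourceLimitExceeded means the account's service limit for ml.p2.xlarge training instances is 0, so that test needs a limit increase from AWS support rather than an IAM change.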
