
Predictive Maintenance using Machine Learning

Companies need to monitor their industrial assets to ensure sustained performance, but typical manual routine checkups are time-consuming and reactive. With the advent of cheap sensors, companies can collect metrics from industrial assets at regular intervals, and with this trove of data they can use machine learning models to predict when assets might fail.

This project shows how to use Amazon SageMaker to train a deep learning model that uses historical sensor readings to predict how much longer an asset is likely to keep working before it reaches a critical state. As a demonstration, the project trains an MXNet model on the NASA turbofan engine dataset, but it can be easily customized to work with other sensor-based data.

Getting Started

You will need an AWS account to use this solution. Sign up for an account here.

To run this JumpStart 1P Solution and have the infrastructure deployed to your AWS account, you will need to create an active SageMaker Studio instance (see Onboard to Amazon SageMaker Studio). When your Studio instance is ready, use the instructions in SageMaker JumpStart to launch the solution with one click.

The solution artifacts are included in this GitHub repository for reference.

Note: Solutions are available in most regions, including us-west-2 and us-east-1.

Caution: Cloning this GitHub repository and running the code manually could lead to unexpected issues. Use the AWS CloudFormation template instead: you'll get an Amazon SageMaker notebook instance that has been correctly set up and configured to access the other resources in the solution.

Architecture

The project architecture deployed by the CloudFormation template is shown here.

Project Description

The project uses Amazon SageMaker to train a deep learning model with the MXNet deep learning framework. The model is a stacked bidirectional LSTM neural network that can learn from sequential or time series data. It is robust to noisy input and does not expect the sensor readings to be smoothed beforehand, because its 1D convolutional layers have trainable parameters that can learn to smooth the time series and perform feature transformations on it. The deep learning model is trained to predict the remaining useful life (RUL) for each sensor.
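To illustrate how a 1D convolution can smooth a noisy sensor series, here is a minimal pure-Python sketch. It is not the project's MXNet code: the kernel below is a fixed moving average, whereas the model's convolutional layers learn their kernel weights during training. The helper name `conv1d_valid` and the sample readings are hypothetical.

```python
from typing import List, Sequence


def conv1d_valid(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide `kernel` over `signal` without padding ("valid" mode).

    Like most deep learning frameworks, this computes a cross-correlation
    (the kernel is not flipped), so the output has len(signal) - len(kernel) + 1 values.
    """
    k = len(kernel)
    if k == 0 or k > len(signal):
        raise ValueError("kernel must be non-empty and no longer than the signal")
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(len(signal) - k + 1)
    ]


# A fixed 3-tap moving-average kernel. A trained convolutional layer learns its
# own weights instead, so it may end up smoothing, differencing, or something in between.
moving_average = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical sensor readings with one spike at index 4.
noisy = [10.0, 12.0, 11.0, 13.0, 30.0, 12.0, 11.0]
smoothed = conv1d_valid(noisy, moving_average)
print([round(v, 2) for v in smoothed])  # → [11.0, 12.0, 18.0, 18.33, 17.67]
```

The spike of 30 is spread across neighbouring outputs instead of appearing as a single outlier, which is the kind of smoothing the convolutional layers can learn on their own.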

Model training is orchestrated by a Jupyter notebook running on a SageMaker notebook instance. When you go through the project demonstration, the NASA turbofan engine dataset is automatically downloaded to an S3 bucket that the quick launch template above created in your account.

To demonstrate how the project can be used to perform batch inference on new time series data from sensor readings, an AWS Lambda function (https://github.com/awslabs/predictive-maintenance-using-machine-learning/blob/master/source/predictive_maintenance/index.py) is included. The Lambda function can be invoked by an Amazon CloudWatch Event, so that it runs on a schedule, or by an Amazon S3 put event, so that it runs as soon as new sensor readings are stored in S3. When invoked, the Lambda function creates a SageMaker Batch Transform job, which uses the SageMaker model saved during training to obtain predictions for the new sensor data. The results of the batch transform job are stored back in S3 and can be fed into a dashboard or visualization module for monitoring.
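The Lambda flow described above can be sketched as follows. This is not the project's index.py: it is a minimal sketch assuming an S3 put trigger (a scheduled CloudWatch Event carries no `Records`, so it would need its own input location), hypothetical `MODEL_NAME` and `OUTPUT_S3_URI` environment variables, and CSV input. The request keys follow the boto3 SageMaker client's `create_transform_job` API; the helper names are hypothetical.

```python
import os
import time
from typing import Any, Dict, Optional
from urllib.parse import unquote_plus


def s3_uri_from_put_event(event: Dict[str, Any]) -> str:
    """Build an s3:// URI for the first object in an S3 event notification."""
    s3 = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 notifications (e.g. spaces as '+').
    key = unquote_plus(s3["object"]["key"])
    return "s3://{}/{}".format(s3["bucket"]["name"], key)


def build_transform_job_request(model_name: str, input_uri: str, output_uri: str,
                                instance_type: str = "ml.m5.xlarge",
                                job_name: Optional[str] = None) -> Dict[str, Any]:
    """Keyword arguments for the boto3 SageMaker client's create_transform_job."""
    if job_name is None:
        stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
        job_name = "{}-{}".format(model_name, stamp)
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,  # the SageMaker model saved during training
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}},
            "ContentType": "text/csv",  # assumption: adjust to the model's input format
            "SplitType": "None",
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS access

    # Hypothetical environment variables; the project's index.py may configure this differently.
    request = build_transform_job_request(
        model_name=os.environ["MODEL_NAME"],
        input_uri=s3_uri_from_put_event(event),
        output_uri=os.environ["OUTPUT_S3_URI"],
    )
    boto3.client("sagemaker").create_transform_job(**request)
    return {"TransformJobName": request["TransformJobName"]}
```

Keeping the request construction in a pure function makes it easy to unit-test the payload without calling AWS. Note that the generated job name must stay within SageMaker's length limit, so very long model names would need truncating.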

Contents

  • deployment/
    • predictive-maintenance-using-machine-learning.yaml: Creates AWS CloudFormation Stack for solution
  • source/
    • predictive_maintenance/
      • index.py: Lambda function script for creating SageMaker Batch Transform jobs for batch inference
    • notebooks/
      • sagemaker_predictive_maintenance
        • sagemaker_predictive_maintenance_entry_point
          • requirements.txt: specifies requirements that need to be present in the SageMaker training container
          • sagemaker_predictive_maintenance_entry_point.py: Entry point script containing MXNet implementation for training the model
        • config.py: Python config file to read CloudFormation stack outputs and parameterize the solution
        • preproces.py: data preprocessing script
        • setup.py: sets up the directory as a local Python package
        • utils.py: utility functions for preparing batch transform input and output
      • sagemaker_predictive_maintenance.ipynb: Orchestrates the solution: trains the model and saves the trained model

License

This project is licensed under the Apache-2.0 License.

Contributors

ehsanmok, jamesiri, sojiadeshina, vishaalkapoor

Issues

Can't create transform job in Lambda function

Lambda execution creates this CloudWatch log:

[ERROR] ClientError: An error occurred (ValidationException) when calling the CreateTransformJob operation: Could not assume role arn:aws:iam::XXX:role/PredictiveMaintenance-NotebookInstanceExecutionRol-YVBYTBKPP92H. Please ensure that the role exists and allows principal 'sagemaker.amazonaws.com' to assume the role.
Traceback (most recent call last):
  File "/var/task/index.py", line 30, in lambda_handler
    batch_transform_response = run_batch_transform(transform_input)
  File "/var/task/index.py", line 76, in run_batch_transform
    sm.create_transform_job(**payload)
  File "/var/runtime/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)

This role does not exist, but another one (PredMaint-SolutionBuilder-NotebookInstanceExecutio-EP4DFXR5ZUO6) was created by the CloudFormation template.
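A sketch for checking this kind of failure, assuming the intended fix is to point the Lambda function at the role the template actually created and to confirm that role trusts SageMaker. The helper names are hypothetical, and the trust-policy check is deliberately simplified: it ignores `Condition` blocks and does not expand action wildcards such as `sts:*`.

```python
import json
from typing import Any, Dict, List, Union

SAGEMAKER_PRINCIPAL = "sagemaker.amazonaws.com"


def role_name_from_arn(role_arn: str) -> str:
    """arn:aws:iam::123456789012:role/path/Name -> Name (the segment after the last '/')."""
    return role_arn.rsplit("/", 1)[-1]


def _as_list(value: Any) -> List[Any]:
    # IAM policy fields may hold a single string or a list of strings.
    return value if isinstance(value, list) else [value]


def trust_policy_allows(policy: Union[str, Dict[str, Any]],
                        service: str = SAGEMAKER_PRINCIPAL) -> bool:
    """True if some Allow statement lets `service` call exactly "sts:AssumeRole"."""
    if isinstance(policy, str):
        policy = json.loads(policy)
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action", [])):
            continue
        principal = statement.get("Principal", {})
        if principal == "*":
            return True
        if isinstance(principal, dict) and service in _as_list(principal.get("Service", [])):
            return True
    return False
```

With boto3, `boto3.client("iam").get_role(RoleName=role_name_from_arn(arn))["Role"]["AssumeRolePolicyDocument"]` returns the role's trust policy for this check; a `NoSuchEntity` error from that call means the role does not exist, as in this issue.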

Slash in S3 bucket name

When I try to create the solution with the default values, it complains that there is a slash at the end of pred-maintenance-artifacts/ (remove slash).
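The error suggests the bucket-name parameter expects a bare bucket name. Below is a small validation sketch that strips trailing slashes (and an optional `s3://` scheme) and then checks a subset of the S3 bucket naming rules; the helper name is hypothetical and the rule list is not exhaustive.

```python
import ipaddress
import re

# 3-63 characters of lowercase letters, digits, dots, and hyphens,
# starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def normalize_bucket_name(value: str) -> str:
    """Return a bare bucket name or raise ValueError.

    Also rejects adjacent dots and names formatted like an IPv4 address.
    """
    name = value.strip()
    if name.startswith("s3://"):
        name = name[len("s3://"):]
    name = name.rstrip("/")
    if "/" in name:
        raise ValueError("{!r} looks like a bucket plus a key prefix, not a bucket name".format(value))
    if not _BUCKET_NAME.match(name) or ".." in name:
        raise ValueError("{!r} is not a valid S3 bucket name".format(value))
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return name  # not an IP address, so the name passes
    raise ValueError("{!r} is formatted like an IP address".format(value))
```

For example, `normalize_bucket_name("pred-maintenance-artifacts/")` returns `"pred-maintenance-artifacts"`.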

Template file not found

$ ./build-s3-dist.sh pred-maint v1.0
Staring to build distribution
export deployment_dir=/Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment
mkdir -p /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist
cp -f predictive-maintenance-using-machine-learning.template /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist
cp: predictive-maintenance-using-machine-learning.template: No such file or directory
Updating code source bucket in template with pred-maint
sed -i '' -e s/%%BUCKET_NAME%%/pred-maint/g /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist/predictive-maintenance-using-machine-learning.template
sed: /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist/predictive-maintenance-using-machine-learning.template: No such file or directory
Updating code source bucket in template with v1.0
sed -i '' -e s/%%VERSION%%/v1.0/g /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist/predictive-maintenance-using-machine-learning.template
sed: /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist/predictive-maintenance-using-machine-learning.template: No such file or directory
Copying notebooks to /Users/aldemim/ws_Learning/predictive-maintenance-using-machine-learning/deployment/dist
Packaging predictive_maintenance lambda
Completed building distribution
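The log shows `cp` looking for `predictive-maintenance-using-machine-learning.template`, while the Contents list above names `deployment/predictive-maintenance-using-machine-learning.yaml`, so the build script and the repository appear to disagree on the file extension. The script also continues after `cp` fails and still reports success. Below is a Python sketch of the same copy-and-substitute steps that accepts either extension and stops on a missing file. `build_template` is a hypothetical name, and the `%%BUCKET_NAME%%` / `%%VERSION%%` placeholders are taken from the `sed` commands in the log.

```python
from pathlib import Path
from typing import Union

STEM = "predictive-maintenance-using-machine-learning"


def build_template(deployment_dir: Union[str, Path], bucket_name: str, version: str) -> Path:
    """Copy the CloudFormation template into dist/ and fill in its placeholders."""
    deployment_dir = Path(deployment_dir)
    # Look for either extension instead of hard-coding ".template".
    candidates = [deployment_dir / (STEM + ext) for ext in (".template", ".yaml")]
    source = next((p for p in candidates if p.is_file()), None)
    if source is None:
        # Fail loudly; the shell script in the log kept going after cp failed.
        raise FileNotFoundError(
            "no template found; looked for " + ", ".join(str(p) for p in candidates))
    dist_dir = deployment_dir / "dist"
    dist_dir.mkdir(parents=True, exist_ok=True)
    text = source.read_text()
    text = text.replace("%%BUCKET_NAME%%", bucket_name).replace("%%VERSION%%", version)
    target = dist_dir / (STEM + ".template")
    target.write_text(text)
    return target
```

Failing fast on the missing file surfaces the problem at the first step rather than after a misleading "Completed building distribution".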

Training job failed

Hi team,

I created a CloudFormation stack from this page:
https://aws.amazon.com/solutions/predictive-maintenance-using-machine-learning/

then ran the notebook and started the training job, but the following error occurred:

Invoking script with the following command:

/usr/bin/python -m sagemaker_predictive_maintenance_entry_point --batch-size 1 --epochs 500 --log-interval 100 --num-datasets 4 --num-gpus 1 --optimizer adam


Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/sagemaker_predictive_maintenance_entry_point.py", line 10, in <module>
    import gluonnlp
  File "/usr/local/lib/python3.5/dist-packages/gluonnlp/__init__.py", line 25, in <module>
    from . import data
  File "/usr/local/lib/python3.5/dist-packages/gluonnlp/data/__init__.py", line 23, in <module>
    from . import (batchify, candidate_sampler, conll, corpora, dataloader,
  File "/usr/local/lib/python3.5/dist-packages/gluonnlp/data/question_answering.py", line 31, in <module>
    from mxnet.gluon.utils import download, check_sha1, _get_repo_file_url, replace_file
ImportError: cannot import name 'replace_file'
2020-02-27 09:04:17,944 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python -m sagemaker_predictive_maintenance_entry_point --batch-size 1 --epochs 500 --log-interval 100 --num-datasets 4 --num-gpus 1 --optimizer adam"

2020-02-27 09:04:31 Failed - Training job failed

Could you please help check this error? Thanks!
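The traceback shows gluonnlp importing `replace_file` from `mxnet.gluon.utils`, which the MXNet in the training container does not provide. That points to a version mismatch between the gluonnlp installed from requirements.txt and the container's MXNet; pinning gluonnlp to a release compatible with that MXNet version is the likely fix (the compatible versions are not stated here). A hypothetical pre-flight check that reports the missing names before the import fails:

```python
import importlib
from typing import Iterable, List

# Names that gluonnlp imports from mxnet.gluon.utils, per the traceback above.
GLUONNLP_NEEDS = ["download", "check_sha1", "_get_repo_file_url", "replace_file"]


def missing_symbols(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the names `module_name` does not provide (all of them if it cannot be imported)."""
    names = list(names)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return names
    return [name for name in names if not hasattr(module, name)]


def check_gluonnlp_compat() -> None:
    """Hypothetical check to call at the top of the entry point, before `import gluonnlp`."""
    missing = missing_symbols("mxnet.gluon.utils", GLUONNLP_NEEDS)
    if missing:
        raise RuntimeError(
            "mxnet.gluon.utils is missing {}, which gluonnlp needs; pin a gluonnlp "
            "version compatible with the container's MXNet in requirements.txt".format(
                ", ".join(missing)))
```

The code sticks to `str.format` rather than f-strings because the container in the traceback runs Python 3.5.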

Wrong explanation of the inference result

In the notebook sagemaker_predictive_maintenance.ipynb, one passage explains how to interpret the inference result: "The predictions are a fraction of MAX_RUL which is 130.0, therefore the Remaining Useful Life predictions can be obtained by multiplying the output with 130". This is wrong.

Checking the code in entry_point.py shows that the labels are divided by 300 during training, so the output must be multiplied by 300, not 130, to recover the RUL.
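Assuming the entry point divides labels by 300, as this issue reports, converting predictions back to RUL is a single multiplication. For example, an output of 0.5 corresponds to 150 cycles, not the 65 cycles that multiplying by 130 would give. `LABEL_SCALE` and `to_rul` are hypothetical names; confirm the constant in entry_point.py before relying on it.

```python
from typing import Iterable, List

# Assumption from the issue above: training labels were divided by 300.
LABEL_SCALE = 300.0


def to_rul(predictions: Iterable[float], scale: float = LABEL_SCALE) -> List[float]:
    """Convert normalized model outputs back to remaining useful life (in cycles)."""
    return [float(p) * scale for p in predictions]


print(to_rul([0.5, 0.25]))  # → [150.0, 75.0]
```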

IAM execution role for the SageMaker notebook instance (from the template) lacks access

While running the provided notebook, I get the following exception:

Couldn't call 'describe_notebook_instance' to get the Role ARN of the instance PredictiveMaintenanceNotebookInstance.

It is related to the following policy definition:

"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:CreateModel",
"sagemaker:DescribeModel",
"sagemaker:DeleteModel",
"sagemaker:DescribeTransformJob",
"sagemaker:CreateTransformJob"

After adding the "sagemaker:DescribeNotebookInstance" action to the policy above, the issue is fixed.
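The fix above can be sketched as a policy document that includes the missing action. This is an illustrative sketch: the issue shows only the action list, so the `"*"` resource, the policy name, and the `attach_policy` helper are assumptions. `put_role_policy` is the boto3 IAM call that writes an inline policy to a role.

```python
import json
from typing import Any, Dict

# Actions listed in the issue, plus the one whose absence caused the error.
NOTEBOOK_ACTIONS = (
    "sagemaker:CreateTrainingJob",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:CreateModel",
    "sagemaker:DescribeModel",
    "sagemaker:DeleteModel",
    "sagemaker:DescribeTransformJob",
    "sagemaker:CreateTransformJob",
    "sagemaker:DescribeNotebookInstance",
)


def notebook_policy(resource: str = "*") -> Dict[str, Any]:
    """Inline IAM policy for the notebook execution role.

    "*" is a placeholder; the template's real Resource is not shown in the issue,
    and narrowing it is advisable.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": list(NOTEBOOK_ACTIONS), "Resource": resource}],
    }


def attach_policy(role_name: str, policy_name: str = "PredictiveMaintenanceSageMakerAccess") -> None:
    import boto3  # requires credentials allowed to call iam:PutRolePolicy

    boto3.client("iam").put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(notebook_policy()),
    )
```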

Feature suggestion: add SageMaker endpoint support for an end-to-end sample

I tried to work out an end-to-end example with the AWS IoT Events sample:
https://github.com/aws-samples/aws-iot-events-accelerators

What I found is that the existing predictive-maintenance sample only supports batch prediction; when I tried to add an endpoint to the trained model, it failed.

Digging into the root cause, it looks like the transform_fn function in this sample hard-codes how the payload is decoded, which breaks JSON payloads coming from Lambda -> API Gateway. When the API endpoint is called, a JSON string is passed through, but the transform_fn implemented here only supports the binary Python JSON object it is hard-coded to expect.

By the way, I don't know why the sample deployed by CloudFormation does not allow me to modify the script. I tried to modify transform_fn but failed, even though I gave the notebook execution role sufficient access.
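One way to let an endpoint accept JSON text from API Gateway is to decode the request body defensively before handing it to the model. The sketch below is a hypothetical helper (`decode_json_payload`), not the project's transform_fn. It assumes the body is JSON, arriving either as bytes or as text, and possibly double-encoded as a JSON string, which can happen when a body that is already JSON is serialized again on its way through Lambda.

```python
import json
from typing import Any, Union


def decode_json_payload(data: Union[bytes, bytearray, str], content_type: str) -> Any:
    """Parse an endpoint request body sent as application/json."""
    media_type = content_type.split(";")[0].strip().lower()  # drop e.g. "; charset=utf-8"
    if media_type != "application/json":
        raise ValueError("unsupported content type: {}".format(content_type))
    if isinstance(data, (bytes, bytearray)):
        data = data.decode("utf-8")
    payload = json.loads(data)
    # A double-encoded body parses to a str that itself contains JSON.
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload
```

Inside a transform_fn, the parsed result could then be converted to an MXNet NDArray before calling the model. The exact transform_fn signature depends on the SageMaker MXNet serving container version, so check its documentation.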

NameError: name 'config' is not defined

from sagemaker.mxnet import MXNet

from time import gmtime, strftime
timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# NOTE: `config` is never imported in this cell; if no earlier cell has
# imported it, the next line raises the NameError shown below.
#training_job_name = "{}-{}".format(config.model_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
training_job_name = config.training_job_name + '-' + timestamp

train_instance_type = 'ml.p3.2xlarge'

# pass in the location of the training script (a local path; the SDK uploads it to S3)
# 1 training instance
# using 8 LSTM units,
# using Adam optimizer. These are just hyperparameters
m = MXNet(entry_point='sagemaker_predictive_maintenance_entry_point.py',
          source_dir='sagemaker_predictive_maintenance/sagemaker_predictive_maintenance_entry_point',
          py_version='py3',
          role=role,
          train_instance_count=1,
          train_instance_type=train_instance_type,
          output_path=output_location,
          hyperparameters={'num-datasets': len(train_df),
                           'num-gpus': 1,
                           'num-units': 8,
                           'num-layers': 2,
                           'epochs': 200,
                           'optimizer': 'adam',
                           'batch-size': 1,
                           'log-interval': 100},
          input_mode='File',
          # use_spot_instances=True,
          max_run=3600,
          max_wait=3600,
          # train_max_run=7200,
          framework_version='1.6.0')

m.fit({'train': s3_data_prefix}, job_name=training_job_name)

NameError Traceback (most recent call last)
in
5
6 #training_job_name = "{}-{}".format(config.model_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
----> 7 training_job_name = config.training_job_name+'-'+timestamp
8
9 train_instance_type = 'ml.p3.2xlarge'

NameError: name 'config' is not defined
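The NameError means `config` was not defined in the notebook kernel when this cell ran, for example because the cell that imports it was skipped or the kernel was restarted; re-running the notebook from the top is likely the simplest remedy. If `config` is unavailable, a job name can still be built locally. The sketch below uses a hypothetical base name and helper, and enforces SageMaker's training job name constraints: at most 63 characters of letters, digits, and hyphens, starting and ending with a letter or digit.

```python
import re
import time
from typing import Optional

MAX_JOB_NAME_LEN = 63
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9-]+")


def timestamped_job_name(base: str, now: Optional[float] = None) -> str:
    """Return "<base>-<UTC timestamp>", cleaned and truncated to fit SageMaker's limit."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))  # now=None -> current time
    base = _INVALID_CHARS.sub("-", base).strip("-") or "job"
    # Leave room for the separator and the 19-character timestamp.
    base = base[: MAX_JOB_NAME_LEN - len(stamp) - 1].rstrip("-")
    return "{}-{}".format(base, stamp)


# Hypothetical base name standing in for config.training_job_name.
training_job_name = timestamped_job_name("predictive-maintenance")
```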
