Code Monkey home page Code Monkey logo

amazon-textract-comprehend-stepfunctions-example's Introduction

AWS Sample: Textract and Comprehend Integration via Step Functions and the CDK

This repository is a sample guide to building a serverless document processing application that can make intelligent flow-control decisions after classifying the input document type.

A video presentation of the architecture and a demo is available in my AWS Innovate AI/ML Edition 2021 talk.

Getting Started

These prequisites should be installed first:

  1. Install and configure Apache Maven
  2. Install and configure the AWS Cloud Development Kit
  3. Install and configure git
  4. Install the JDK

Once those are installed, clone this repository:

git clone https://github.com/aws-samples/amazon-textract-comprehend-stepfunctions-example

Then import the project into your IDE, and review the sample code. There are two models, cdk and lambda. The cdk module contains the application and infrastructure CDK code. It is responsible for building and deploying the serverless Lambda functions. It defines the Step Functions flow.

The lambda module contains the application runtime code as Lambda functions. These functions show examples of calling extracting a single page from a PDF and calling Textract synchronously, classifying its content using a Comprehend custom classifier, and an asynchronous Textract call with an AWS SNS ping on completion. The initial flow is triggered by an upload to S3 which starts a Step Functions execution.

Train a sample classifier

Before deploying the sample, you will need to train a Comprehend classifier. A very simple reference example is located in /samples/trainer.csv. (It contains the plain text of three sample documents, one per row; in practice you will train with many more samples each.) To train the classifier:

  1. Visit the Comprehend console (switch to the region you wish to use)
  2. Click Train classifier
  3. Give it a name and check the other details (the defaults are fine to start - use a Multi-class classifier)
  4. Specify the S3 location of the training file (upload it first)
  5. Click Train classifier

Once it's trained (it will take a few minutes), start an endpoint by clicking Create endpoint from the classifier's console page. (You can also use batch mode classification which does not require a running endpoint, but is not available for real-time processing.) Note the ARN of the classifier.

Build the project and build the CDK

From the project root directory, run the command: mvn install to compile the project and create the deployable Lambda artefacts. This will download all of the project dependencies first, so the initial run may take a few minutes. You should re-run this command each time you make an application change.

Once that is successful, you will need to bootstrap the CDK environment:

cdk bootstrap --profile YOUR_AWS_CLI_PROFILE_HERE

Make sure to update the command with the relevant Comprehend ARN and AWS CLI profile.

Permissions

The CDK application needs some permissions to be able to create the stack. Visit the IAM console to set these up. You can run the CDK deploy step iteratively and correct any missing permissions. I defined a Group with these policies and set that to a permission boundary restricted User on an account (to allow it only to self-created resources). You may restrict these further as appropriate.

  1. AmazonComprehendServicePolicy-InnovateClassifier
  2. AWSLambdaFullAccess
  3. IAMFullAccess
  4. AmazonS3FullAccess
  5. AmazonTextractFullAccess
  6. AmazonSNSFullAccess
  7. AWSStepFunctionsFullAccess
  8. AWSCloudFormationFullAccess

Note that these are only build-time permissions, required when running the CDK. The run-time permissions of the application are configured (and more restricted) in the CDK stack.

Deploy the application

Once that's done, run the command:

cdk --context ComprehendArn=YOUR_COMPREHEND_ARN_HERE --profile YOUR_AWS_CLI_PROFILE_HERE deploy

Once completed, you'll see a resource stack like:

textract-comprehend-sample: deploying...                                                                                                  
[0%] start: Publishing 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current                                           
[100%] success: Published 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current                                        
textract-comprehend-sample: creating CloudFormation changeset...                                                                          
  0/11 | 12:53:56 pm | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack       | textract-comprehend-sample User Initiated                
 0/11 Currently in progress: textract-comprehend-sample                                                                                   
  2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | FirstPageFunction (FirstPageFunctionBEE9EA11)            
  2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | StartTextractFunction (StartTextractFunctionCD0185FF)    
  2/11 | 12:54:36 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | FirstPageFunction (FirstPageFunctionBEE9EA11)            
  2/11 | 12:54:36 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | StartTextractFunction (StartTextractFunctionCD0185FF)    
  3/11 | 12:54:39 pm | UPDATE_IN_PROGRESS   | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)                          
  3/11 | 12:54:41 pm | UPDATE_COMPLETE      | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)                          
  5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | S3UploadListener (S3UploadListener4E242122)              
  5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | TextractCompletion (TextractCompletion5AE7AEDD)          
  5/11 | 12:54:46 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | S3UploadListener (S3UploadListener4E242122)              
  5/11 | 12:54:47 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | TextractCompletion (TextractCompletion5AE7AEDD)          
  6/11 | 12:54:49 pm | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack       | textract-comprehend-sample                               
  6/11 | 12:54:50 pm | UPDATE_COMPLETE      | AWS::CloudFormation::Stack       | textract-comprehend-sample                               
                                                                                                                                          
 โœ…  textract-comprehend-sample                                                                                                            

Running the samples

After deployment, visit the deployed stack in the CloudFormation console. Check the different tabs for details on the stack including the different resources created.

When you upload a PDF (example are in the samples directory) to the source S3 bucket that was created, if everything was set up correctly, a Step Functions flow will be started. View that to see the flow of the document and the Lambda logs.

Congratulations! You have successfully created a scalable, serverless application stack using the CDK to intelligenly process documents on demand.

Resources and pricing

Note that for as long as you have the stack deployed, charges may apply to your account. You should delete the resources (using cdk destroy) when you are done with the sample. You will need to empty the buckets prior to deletion, and also you will need to terminate the Comprehend endpoint.

amazon-textract-comprehend-stepfunctions-example's People

Contributors

amazon-auto avatar jhy-aws avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.