
aws-document-classifier-and-splitter's Introduction

Intelligently Split Multi-Form Document Packages with Amazon Textract and Amazon Comprehend

Blog: https://aws.amazon.com/blogs/machine-learning/intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend/

Authors: Aditi Rajnish, Raj Pathak

Workflow 1: Endpoint Builder

Description

Workflow 1 takes documents stored in Amazon S3 and sends them through a series of steps that extract their data with Amazon Textract. The extracted data is then used to create an Amazon Comprehend custom classification endpoint.

Input:

The Amazon S3 URI of the folder containing the training dataset files (these can be images, single-page PDFs, and/or multi-page PDFs). The structure of the folder must be:

root dataset directory
---- class directory
-------- files

or

root dataset directory
---- class directory
-------- nested subdirectories
------------ files

Note that the names of the class subdirectories (second directory level) become the names of the classes used in the Amazon Comprehend custom classification model. For example, in the file structure below, the class for "form123.pdf" would be "tax_forms".

training_dataset
---- tax_forms
-------- page_1
------------ form123.pdf
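
If your training files start on your local machine, a minimal boto3 sketch for uploading them to Amazon S3 while preserving this class-directory structure might look like the following (the bucket name, prefix, and local path are placeholders, not values from this repository):

# upload_dataset.py -- illustrative only; bucket, prefix, and paths are placeholders
import os
import boto3

s3 = boto3.client("s3")
local_root = "/path/to/training_dataset"   # root dataset directory
bucket = "your-bucket-name"
prefix = "training_dataset"                # top-level key prefix in S3

for dirpath, _, filenames in os.walk(local_root):
    for filename in filenames:
        local_file = os.path.join(dirpath, filename)
        # Preserve the class-directory structure, e.g. training_dataset/tax_forms/page_1/form123.pdf
        relative_key = os.path.relpath(local_file, local_root).replace(os.sep, "/")
        s3.upload_file(local_file, bucket, f"{prefix}/{relative_key}")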

Output:

The ARN (Amazon Resource Name) of the model's endpoint, which can be used to classify documents in Workflow 2.
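
Once you have this ARN (see Usage below), you can confirm that the endpoint is ready with a quick boto3 check. A minimal sketch, using the same placeholder ARN as elsewhere in this guide:

import boto3

comprehend = boto3.client("comprehend")
# Replace the placeholder with the endpoint ARN returned by the state machine
response = comprehend.describe_endpoint(EndpointArn="{your_end_point_ARN}")
print(response["EndpointProperties"]["Status"])  # should eventually report IN_SERVICE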

Usage:

You must have Docker installed, configured, and running. You must also have the AWS CLI and AWS SAM CLI installed and configured with your AWS credentials. Navigate to the workflow1_endpointbuilder/sam-app directory. Run 'sam build' and 'sam deploy --guided' in your CLI, providing the Stack Name and AWS Region (ensure that this is the same Region as your Amazon S3 dataset folder). You must respond with "y" to the "Allow SAM CLI IAM role creation" and "Create managed ECR repositories for all functions?" prompts. This deploys CloudFormation stacks that create the infrastructure for the Endpoint Builder state machine in AWS Step Functions. For example, see the CLI input below:

(venv) Lambda_Topic_Classifier % cd workflow1_endpointbuilder/sam-app 
(venv) sam-app % sam build
Build Succeeded

(venv) sam-app % sam deploy --guided
Configuring SAM deploy
=========================================
Stack Name [sam-app]: endpointbuilder
AWS Region []: us-east-1
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [y/N]: n
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]: y
Save arguments to configuration file [Y/n]: n

Looking for resources needed for deployment:
Creating the required resources...
Successfully created!
Managed S3 bucket: {your_bucket}
#Managed repositories will be deleted when their functions are removed from the template and deployed
Create managed ECR repositories for all functions? [Y/n]: y

and CLI output:

CloudFormation outputs from deployed stack
-------------------------------------------------------------------------------------------------
Outputs                                                                                         
-------------------------------------------------------------------------------------------------
Key                 StateMachineArn                                                             
Description         Arn of the EndpointBuilder State Machine                                    
Value              {state_machine_arn}                      
-------------------------------------------------------------------------------------------------

Find the corresponding state machine in AWS Step Functions and start a new execution, providing your Amazon S3 folder URI as a string in the folder_uri key:

{
  "folder_uri": "s3://{your_bucket_name}"
}

The state machine returns the Amazon Comprehend classifier's endpoint ARN as a string in the endpoint_arn key:

{
  "endpoint_arn": "{your_end_point_ARN}"
}
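
If you prefer to start the execution programmatically rather than from the Step Functions console, a minimal boto3 sketch (the state machine ARN and bucket name are placeholders) is:

import json
import time
import boto3

sfn = boto3.client("stepfunctions")

start = sfn.start_execution(
    stateMachineArn="{state_machine_arn}",
    input=json.dumps({"folder_uri": "s3://{your_bucket_name}"}),
)

# Classifier training and endpoint creation can take a while; poll until the execution finishes
while True:
    execution = sfn.describe_execution(executionArn=start["executionArn"])
    if execution["status"] != "RUNNING":
        break
    time.sleep(60)

if execution["status"] == "SUCCEEDED":
    print(json.loads(execution["output"])["endpoint_arn"])
else:
    print("Execution ended with status:", execution["status"])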

Workflow 2: Document Splitter API

Description

Workflow 2 takes the endpoint you created in Workflow 1 and splits documents based on the classes on which the model was trained.

Input:

The endpoint ARN of the Amazon Comprehend custom classifier, the name of an existing Amazon S3 bucket you have access to, and one of the following:

  • Amazon S3 URI of the multi-page PDF document that needs to be split
  • local multi-page PDF file that needs to be split

Output:

The Amazon S3 URI of a ZIP file containing one PDF per class, each named after its class and containing the pages from the input PDF that belong to that class.

Usage:

You must have Docker installed, configured, and running. You must also have the AWS CLI and AWS SAM CLI installed and configured with your AWS credentials. Navigate to the workflow2_docsplitter/sam-app directory. Run 'sam build' and 'sam deploy --guided' in your CLI, providing the Stack Name and AWS Region (ensure that this is the same Region as your Amazon S3 bucket). You must respond with "y" to the "Allow SAM CLI IAM role creation" and "Create managed ECR repositories for all functions?" prompts. This deploys CloudFormation stacks that create the infrastructure for the Application Load Balancer and its Lambda function target, to which you can send document-splitting requests. For example, see the CLI input below:

(venv) Lambda_Topic_Classifier % cd workflow2_docsplitter/sam-app 
(venv) sam-app % sam build
Build Succeeded

(venv) sam-app % sam deploy --guided
Configuring SAM deploy
=========================================
Stack Name [sam-app]: docsplitter
AWS Region []: us-east-1
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [y/N]: n
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]: y
Save arguments to configuration file [Y/n]: n

Looking for resources needed for deployment:
Managed S3 bucket: {bucket_name}
#Managed repositories will be deleted when their functions are removed from the template and deployed
Create managed ECR repositories for all functions? [Y/n]: y

and CLI output:

CloudFormation outputs from deployed stack
-------------------------------------------------------------------------------------------------
Outputs                                                                                         
-------------------------------------------------------------------------------------------------
Key                 LoadBalancerDnsName                                                                                         
Description         DNS Name of the DocSplitter application load balancer with the Lambda target group.                         
Value               {ALB_DNS}                     
-------------------------------------------------------------------------------------------------

Send a POST request to the output LoadBalancerDnsName, providing your Amazon Comprehend endpoint ARN, your Amazon S3 bucket name, and one of the following:

  • Amazon S3 URI of the multi-page PDF document that needs to be split
  • local multi-page PDF file that needs to be split

See below for the POST request input format.

# sample_s3_request.py
dns_name = "{ALB_DNS}"
bucket_name = "{Bucket_name}"
endpoint_arn = "{Comprehend ARN}"
input_pdf_uri = "{document_s3_uri}"

# sample_local_request.py
dns_name = "{ALB_DNS}"
bucket_name = "{Bucket_name}"
endpoint_arn = "{Comprehend ARN}"
input_pdf_path = "{local_doc_path}"

The Load Balancer's response is the output ZIP file's Amazon S3 URI as a string in the output_zip_file_s3_uri key:

{
    "output_zip_file_s3_uri": "s3://{file_of_split_docs}.zip"
}
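
As a rough illustration, the S3-based request could be sent with the requests library as sketched below. The form-field names here are assumptions for illustration only; consult sample_s3_request.py and sample_local_request.py in workflow2_docsplitter for the exact request format.

# Illustrative only -- field names are assumptions; see sample_s3_request.py for the real format
import requests

dns_name = "{ALB_DNS}"
payload = {
    "bucket_name": "{Bucket_name}",
    "endpoint_arn": "{Comprehend ARN}",
    "input_pdf_uri": "{document_s3_uri}",
}

response = requests.post(f"http://{dns_name}/", data=payload)
print(response.json()["output_zip_file_s3_uri"])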

Workflow 3: Local Endpoint Builder and Doc Splitter

Description

Workflow 3 serves a similar purpose to Workflows 1 and 2 in generating an Amazon Comprehend endpoint; however, all processing is done on the user's local machine to generate an Amazon Comprehend-compatible CSV file. This workflow was created for customers in highly regulated industries where persisting PDF documents in Amazon S3 may not be possible.
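
For reference, Amazon Comprehend expects multi-class custom-classification training data as a two-column CSV with the class label in the first column, the document text in the second, and no header row. A minimal sketch of writing such a file (the class names and text here are placeholders; in the real workflow the text is extracted from your documents):

import csv

# (label, extracted_text) pairs -- labels and text below are illustrative placeholders
rows = [
    ("tax_forms", "Extracted text of form123.pdf ..."),
    ("other_class", "Extracted text of another document ..."),
]

with open("comprehend_training_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, text in rows:
        writer.writerow([label, text])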

Local Endpoint Builder

Input:

The absolute path of the local folder containing the training dataset files (these can be images, single-page PDFs, and/or multi-page PDFs). The structure of the folder must be the same as that shown in the "Input" section of "Workflow 1: Endpoint Builder". The name of an existing Amazon S3 bucket you have access to must also be provided, as this is where the training-dataset CSV file is uploaded for Amazon Comprehend to access.

Output:

The ARN (Amazon Resource Name) of the model's endpoint, which can be used to classify documents in Workflow 2.

Usage:

Navigate to the local_endpointbuilder.py file in the workflow3_local directory. Run the file and follow the prompts to input your local dataset folder's absolute path and your Amazon S3 bucket's name.

When the program has finished running, the model's endpoint ARN is printed to the console. See the final lines in local_endpointbuilder.py:

print("Welcome to the local endpoint builder!\n")
existing_bucket_name = input("Please enter the name of an existing S3 bucket you are able to access: ")
local_dataset_path = input("Please enter the absolute local file path of the dataset folder: ")
print("\nComprehend model endpoint ARN: " + main(existing_bucket_name, local_dataset_path))

Local Doc Splitter

Input:

The endpoint ARN of the Amazon Comprehend custom classifier and one of the following:

  • Amazon S3 URI of the multi-page PDF document that needs to be split
  • local multi-page PDF file that needs to be split

Output:

A local output folder is created. It contains one PDF per class, each named after its class and containing the pages from the input PDF that belong to that class.

Usage:

Navigate to the local_docsplitter.py file in the workflow3_local directory. Run the file and follow the prompts to input your endpoint ARN and either your input PDF's S3 URI or absolute local path.

When the program has finished running, the console displays the path of the output folder containing the multi-class PDFs. See the final lines in local_docsplitter.py:

print("Welcome to the local document splitter!")
endpoint_arn = input("Please enter the ARN of the Comprehend classification endpoint: ")

print("Would you like to split a local file or a file stored in S3?")
choice = input("Please enter 'local' or 's3': ")
if choice == 'local':
    input_pdf_info = input("Please enter the absolute local file path of the input PDF: ")
elif choice == 's3':
    input_pdf_info = input("Please enter the S3 URI of the input PDF: ")

main(endpoint_arn, choice, input_pdf_info)

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.


aws-document-classifier-and-splitter's Issues

MemorySize Limits

When building out Workflow 1, the build is failing due to MemorySize limits needing to be 3008, according to the returned build errors. I have edited the YAML to use the maximum allowed by AWS upon creation, but the result is the same: a build error due to the 3008 MemorySize requirement.

I'll continue to track this down and update here, but out of the box it will seemingly fail to build.

If anyone can offer additional insight here, please do!

State machine looping between Check Table Status, Is Table Complete, and Wait for Object Processing

Hi, I believe I followed the instructions for this state machine to a T, but I seem to be stuck in an infinite loop. When I open the state machine, "Wait for Object Processing" is briefly green before returning to blue, and it then continues to Check Table Status -> Is Table Complete -> Wait for Object Processing. I have checked the DynamoDB table and nothing has been added for over 24 hours, so I believe it is complete. Is there a way to force it to begin creating the CSV and continue with the flow of the state machine?

Possible to add individual multi-page PDFs in a folder for classification?

Hi, I have the endpoint set up already, but I would like to avoid having to combine all our mail into a single PDF document in order to classify it. Is there a way to change the function so that I could upload a folder of individual multi-page PDFs for classification? For example, we store our documents locally in individual folders (i.e., all class1 PDFs have their own folder, class2 has its own folder, etc.). Would it be possible to combine these different folders in a ZIP file so that they could be classified separately, instead of combining them into one document and relying on the endpoint to accurately split them up? Would this be possible with a code change? If so, how?

Unable to upload artifact

Hi,
I did all the steps for Workflow 1, but I got:
Error: Unable to upload artifact endpointbuilderfunction3processobjectsandupdaterows:v1 referenced by ImageUri parameter of EndpointBuilderFunction3ProcessObjectsAndUpdateRows resource.

The endpoint seems to be created, but I don't know the impact of this error on the functionality.
Can you please advise?
Thank you!
