aws-samples / amazon-textract-code-samples

Amazon Textract Code Samples

License: MIT No Attribution

Python 0.19% C# 0.65% Jupyter Notebook 99.16%

amazon-textract-code-samples's Issues

Multipage pdf breaks if there is one blank page in between

Hello!
Thank you for such a wonderful library; we use it extensively. We have one issue at hand: when we process a multipage PDF of, say, 200 pages and any page in between is blank, the complete PDF conversion breaks.
Please suggest a way to avoid this, so that the PDF is still converted and the blank page (or any page with an error) is skipped.

Please guide.

Build Error on the .Net Version

Hi, I am getting the build errors below on the .NET version.

no region found when running 01-detect-text-local.py

I've been trying to set up my textract with python, and I'm using PyCharm IDE for this.

So, I wanted to test the first file on my local machine, and I ran "01-detect-text-local.py", but for some reason, it's yelling at me that "you must specify a region", that stopped at a package from botocore\region.py. But this is supposed to be running local, why is it trying to find an aws region at this point? Am I missing something here?

What does it mean to run on a local machine?
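On the question above: the sample reads the document from the local disk but, as the botocore error suggests, it still creates an AWS client to call the Textract service, and that client needs a region. Below is a minimal sketch of one way to supply it, assuming the `AWS_DEFAULT_REGION` environment variable or a hard-coded fallback is acceptable. `resolve_region`, `make_textract_client`, and the `us-east-1` fallback are hypothetical and not part of the sample.

```python
import os


def resolve_region(explicit=None, default="us-east-1"):
    """Pick a region: explicit argument, then AWS_DEFAULT_REGION, then a fallback."""
    return explicit or os.environ.get("AWS_DEFAULT_REGION") or default


def make_textract_client(region=None):
    # boto3 is imported lazily so resolve_region() can be used without it.
    import boto3

    # Passing region_name explicitly avoids relying on a configured default.
    return boto3.client("textract", region_name=resolve_region(region))
```

Setting a default region in the AWS configuration file should have the same effect without touching the code.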

Grouping lines together

I'm using this example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/01-detect-text-local.py

For this image:

I'm getting:

this is one
line
second line
more here

How can we group 'LINES' tokens in a single line?
The output I'm looking for is:

this is one     line
second line                more here
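One possible approach, sketched under the assumption that each `LINE` block carries a `Geometry.BoundingBox` with `Top` and `Left` values normalized to the page: treat lines whose top edges lie within a small tolerance as one visual row, then join them left to right. `group_lines`, `tolerance`, and `separator` are illustrative names and not part of the sample.

```python
def group_lines(blocks, tolerance=0.01, separator="    "):
    """Join LINE blocks that share (roughly) the same vertical position."""
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    lines.sort(key=lambda b: b["Geometry"]["BoundingBox"]["Top"])

    rows = []  # each row: {"top": float, "items": [block, ...]}
    for line in lines:
        top = line["Geometry"]["BoundingBox"]["Top"]
        if rows and abs(top - rows[-1]["top"]) <= tolerance:
            rows[-1]["items"].append(line)
        else:
            rows.append({"top": top, "items": [line]})

    result = []
    for row in rows:
        # Within a row, order the pieces from left to right.
        row["items"].sort(key=lambda b: b["Geometry"]["BoundingBox"]["Left"])
        result.append(separator.join(b["Text"] for b in row["items"]))
    return result
```

The tolerance will likely need tuning per document, since skewed scans or differing font sizes shift the top edges.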

ImportError convert_table_to_kv_dict

While executing "cms1500-parser.ipynb", getting this import error for 'convert_table_to_kv_dict' and even i am not able to find the function in the t_pretty_print file.

Can you please help why i am getting this Import error. If any mistake from my end, please correct me

Traceback (most recent call last):
File "c:\textract\amazon-textract-code-samples\python\pavan\form.py", line 10, in
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_forms_string, convert_table_to_kv_dict, convert_table_to_list
ImportError: cannot import name 'convert_table_to_kv_dict' from 'textractprettyprinter.t_pretty_print'

Integration Test

How can we write integration tests for them?

Broken blocks relations

When getting more blocks than the "MaxResults" parameter, the parsing of the document fails.

Therefore the parser must be able to handle broken references.

The failure is reproduced in the following repository:

https://github.com/Eitol/test_problematic_file
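A sketch of what tolerating broken references could look like, assuming the block map is a plain dict keyed by block `Id`. `child_blocks` and `line_text` are hypothetical helpers and not part of the existing parser. Children whose ids are missing from the map are skipped instead of raising `KeyError`:

```python
def child_blocks(block, block_map):
    """Yield the CHILD blocks of `block`, skipping ids absent from `block_map`."""
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = block_map.get(child_id)  # None when the reference is broken
            if child is not None:
                yield child


def line_text(line, block_map):
    """Join the text of all resolvable WORD children of a LINE block."""
    return " ".join(
        child["Text"]
        for child in child_blocks(line, block_map)
        if child["BlockType"] == "WORD"
    )
```

Skipping silently trades completeness for robustness; logging the missing ids may be preferable when the loss matters.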

Textract queries in other languages?

Textract Queries seems very helpful and interesting but I would like to know if it can handle other languages?

E.g. in Swedish invoice number is "fakturanummer" so can I get a response for "What is the fakturanummer?"?

Thanks!

Unable to parse Document result in Python

using textract-trp 0.1.3

When parsing "get_document_analysis" response the following output is generated:

Traceback (most recent call last):
  File "G:\dev\OCR\main.py", line 17, in <module>
    result = (textract.receive_document_result('52c4a450c667a18d89f4e26a1cf4b56859ad239f1a63279bec8f60458ae2284e'))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\textract.py", line 62, in receive_document_result
    return Document(response)
           ^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
        ^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
       ~~~~~~~~^^^^^
KeyError: '9e2f5e38-f865-4b79-a37b-ac8ed7a19f02'
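A `KeyError` on a child id is consistent with the parser seeing only part of a paginated result, although that is an assumption about this particular traceback. Below is a minimal sketch of collecting every page of blocks before parsing. `collect_blocks` is a hypothetical helper, and `fetch_page` is a hypothetical callable that would wrap `get_document_analysis` and pass along the `NextToken` it receives.

```python
def collect_blocks(fetch_page):
    """Call fetch_page(next_token) until no NextToken remains; return all blocks."""
    blocks = []
    token = None
    while True:
        page = fetch_page(token)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


# Illustrative wiring with boto3 (not executed here):
# def fetch_page(token):
#     kwargs = {"JobId": job_id}
#     if token:
#         kwargs["NextToken"] = token
#     return client.get_document_analysis(**kwargs)
```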

Is it possible to make async calls without uploading the document to S3 bucket?

I am reading the PDF as a bytearray and passing it to my analyser as follows:

response = self.textract.start_document_analysis(DocumentLocation={'Bytes': {docBytes}}, FeatureTypes=["TABLES", "FORMS"], NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
But this gives the following error:
TypeError: unhashable type: 'bytearray'
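The `TypeError` comes from Python itself, before any request is sent: the inner braces in `{docBytes}` build a set, and a `bytearray` cannot be a set member. Separately, as far as the boto3 documentation describes it, `DocumentLocation` for the asynchronous `start_document_analysis` call only takes an `S3Object`, while raw bytes are accepted by the synchronous `analyze_document` call. A small sketch reproducing the error, with `doc_bytes` standing in for the real file content:

```python
doc_bytes = bytearray(b"%PDF-1.4 example")  # placeholder content, not a real PDF


def build_set():
    # Braces around a single value create a set; bytearray is unhashable.
    return {doc_bytes}


try:
    build_set()
    error = None
except TypeError as exc:
    error = str(exc)

print(error)

# For the synchronous API the bytes are passed directly as the dict value.
sync_document = {"Bytes": bytes(doc_bytes)}
```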

Build Error on the .Net Version

I get the following error when trying to view an invoice as a table (dotnet run --tables).

Error in Table.cs

[Read ME Request]: Can you please consider adding necessary packages to install?

The repo seems great. No doubt on that.

What would make it even better is a list of the libraries/modules that need to be installed for the files to run smoothly.

Something like a requirements.txt would help.

textract-textractor-tools.ipynb fails for overlay

The overlayer is missing, which causes the notebook execution to fail.

Chain of Errors

Python 3 and later do not support the `print " "` statement syntax.
After fixing that, it threw a scraper error; I installed scraper and still get the error.

Textract form type - not getting data in sequential order

Hello,
Currently I am performing OCR on a one-page document that contains multiple entities with the same name, each with a checkbox in front of it. Using the FORMS feature in AWS Textract I can detect all the values and whether each checkbox is selected, but the data does not come back in sequence.
Below I have attached 2 files with the same data; all entities are detected in both files, but in random order.
Here is the code I am using:

import boto3
import sys
import re
import json
from collections import defaultdict


def get_kv_map(file_name):
    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    client = boto3.client('textract')
    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])

    # Get the text blocks
    blocks = response['Blocks']

    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map


def get_kv_relationship(key_map, value_map, block_map):
    kvs = defaultdict(list)
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)

        kvs[key].append(val)
    return kvs


def find_value_block(key_block, value_map):
    for relationship in key_block['Relationships']:
        if relationship['Type'] == 'VALUE':
            for value_id in relationship['Ids']:
                value_block = value_map[value_id]
    return value_block


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text


def print_kvs(kvs):
    for key, value in kvs.items():
        print(key, ":", value)

        

def search_value(kvs, search_key):
    for key, value in kvs.items():
        if re.search(search_key, key, re.IGNORECASE):
            return value


def main(file_name):
    key_map, value_map, block_map = get_kv_map(file_name)

    # Get Key Value relationship
    kvs = get_kv_relationship(key_map, value_map, block_map)
    print("\n\n== FOUND KEY : VALUE pairs ===\n")
    print_kvs(kvs)
    return kvs


if __name__ == "__main__":
    file_name = sys.argv[1]
    d = main(file_name)

file1.pdf
file2.pdf

So how can I get the details in sequence rather than in random order:

For buyer entity this is data from 1 file:['', '', '', '', '', '', '', '', '', '', '', '', 'X ', '', 'X ', '', '']
For the same data this is response of buyer for other file: ['', '', '', '', '', '', '', '', '', 'X ', '', '', '', 'X ', '', '', '']
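One way to get a stable sequence, sketched under the assumption that the order of blocks in the response is not guaranteed to follow the page layout: sort the key blocks by their bounding box before pairing them with values. `reading_order` and `sorted_key_blocks` are hypothetical helpers. In the code above, `get_kv_relationship` would iterate over `sorted_key_blocks(key_map)` instead of `key_map.items()`.

```python
def reading_order(block):
    """Sort key: top to bottom, then left to right."""
    box = block["Geometry"]["BoundingBox"]
    # Rounding Top lets blocks on the same visual row compare as equal,
    # so that Left decides their order.
    return (round(box["Top"], 2), box["Left"])


def sorted_key_blocks(key_map):
    """Return the KEY blocks of key_map in approximate reading order."""
    return sorted(key_map.values(), key=reading_order)
```

Two decimals of rounding is an arbitrary choice; rows that sit very close together may need a different precision.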

Bug with Detecting Merged Cells And Headers on fictitious bank statement

When running the notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb you get the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 5, in
    from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string
  File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/__init__.py", line 3, in
    from .t_pretty_print import Pretty_Print_Table_Format as Pretty_Print_Table_Format
  File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/t_pretty_print.py", line 1, in
    import trp
  File "/opt/conda/lib/python3.7/site-packages/trp/__init__.py", line 31
    print ip
    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(ip)?

This is fixed by adding
!pip install textract-trp

Once that is fixed you get the following error:

AttributeError: 'Table' object has no attribute 'get_header_field_names'

InvalidParameterException when starting a Textract job using SNS Notification channel.

When I start a Textract StartDocumentTextDetection job and set the NotificationChannel as follows:

response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        },
    },
    ClientRequestToken='string',
    JobTag='string',
    NotificationChannel={
        'RoleArn': snsRoleArn,
        'SNSTopicArn': snsTopicArn
    }
)

I am getting an InvalidParameterException

An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

python 3.7
boto3 1.12.35

Textract-Caller importing package for call_textract()

I'm having trouble using the call_textract() function for paginated outputs in the recent package you created.

Referring to https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
After successfully installing the dependencies, what am I supposed to import?

`TABLES` and `FORMS` type data extraction from PDF file

Dear Team,
I want to extract TABLES and FORMS type data from a PDF file.
I did not find any API where I can pass these parameters (FORMS and TABLES) to extract that data from a PDF file.

Could you please help me out?

trp example fails with empty key or value

You will receive AttributeError: 'NoneType' object has no attribute 'text'

in case of an empty key or value for a KEY_VALUE_SET.
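A defensive sketch for this case, assuming the key and value objects expose a `text` attribute and that either object may be `None` when Textract finds nothing for it. `safe_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for parsed form fields, not the trp classes themselves.

```python
from types import SimpleNamespace


def safe_text(part):
    """Return part.text, or "" when the key/value object itself is None."""
    if part is None:
        return ""
    return part.text or ""


# Stand-in for a form field whose value was not detected.
field = SimpleNamespace(key=SimpleNamespace(text="Name:"), value=None)
print(f"Key: {safe_text(field.key)}, Value: {safe_text(field.value)!r}")
```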

The job never completes

I ran this code on a sample PDF document, but it never completes. The PDF is valid, since it works from the console.

When I terminate the job, this is what I get

Started job with id: eb64ef4b16fd56ae8756387b9aff71b1f661a55a6128e5c4de8a2b43c8a2e397
Job status: IN_PROGRESS
^CTraceback (most recent call last):
File "run.py", line 50, in
if(isJobComplete(jobId)):
File "run.py", line 23, in isJobComplete
time.sleep(5)
KeyboardInterrupt
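This does not explain why the job stays `IN_PROGRESS`, but a bounded polling loop at least turns an endless wait into a clear error. Below is a minimal sketch, where `wait_for_job` is a hypothetical helper and `get_status` is a hypothetical callable that would return the `JobStatus` field from a `get_document_text_detection` or `get_document_analysis` response.

```python
import time


def wait_for_job(get_status, timeout=300, interval=5,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it leaves IN_PROGRESS or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            # SUCCEEDED, FAILED or PARTIAL_SUCCESS: let the caller decide.
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still IN_PROGRESS after {timeout} seconds")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.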

Multi Columns Variables

In this example the number of columns is defined manually (2 columns). In my case the pages of the PDF file vary from 1 to 5 columns; how can I detect this? Could you please give me an example with this code?
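A rough heuristic rather than a definitive answer: assuming the page's `LINE` blocks are available with normalized `BoundingBox` values, the left edges of the lines tend to cluster once per column, so counting clusters gives an estimate. `estimate_columns` and the `gap` threshold are illustrative, and indented paragraphs or centered headings can throw the count off.

```python
def estimate_columns(blocks, gap=0.05):
    """Estimate the column count by clustering the left edges of LINE blocks."""
    lefts = sorted(
        b["Geometry"]["BoundingBox"]["Left"]
        for b in blocks
        if b.get("BlockType") == "LINE"
    )
    columns = 0
    previous = None
    for left in lefts:
        # A jump larger than `gap` between neighbouring left edges starts a new cluster.
        if previous is None or left - previous > gap:
            columns += 1
        previous = left
    return columns
```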

Samples for table extraction in java.

It would be nice to have samples for extraction of table data in Java

Textract fails to get a file that has spaces in the filename

bucket = event['Records'][0]['s3']['bucket']['name']
#bucket = 'textract-input-files'
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

try:
    textract = boto3.client('textract')

    textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },

The above code fails when I pass a file that has spaces in the file name:

An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
The same code works fine if I remove the spaces from the filename.

Error getting object Arkilo and Pierce.pdf from bucket textract-input-files. Make sure they exist and your bucket is in the same region as this function.

[ERROR] InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 37, in lambda_handler
    raise e
  File "/var/task/lambda_function.py", line 30, in lambda_handler
    'SNSTopicArn': 'arn:aws:sns:us-east-**************:SNStopicTextract'
  File "/opt/python/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
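One way to rule out key encoding as the cause: S3 event notifications deliver object keys URL-encoded, with spaces as `+`, and `unquote_plus` reverses that. The sketch below uses a hand-written event with the file name from the log above, and `object_key_from_event` is a hypothetical helper. Logging the decoded key with `repr` makes stray or encoded characters visible before the key is handed to Textract.

```python
from urllib.parse import unquote_plus


def object_key_from_event(event):
    """Decode the object key of the first record in an S3 event."""
    raw_key = event["Records"][0]["s3"]["object"]["key"]
    return unquote_plus(raw_key, encoding="utf-8")


# Hand-written event for illustration; a real event carries more fields.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "textract-input-files"},
                "object": {"key": "Arkilo+and+Pierce.pdf"}}}
    ]
}
print(repr(object_key_from_event(event)))
```

If the decoded key already matches the object name, as the error message above suggests, the cause probably lies elsewhere in the request.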

InvalidParameterException: Request has invalid parameters when using startDocumentAnalysis

Description
I'm encountering an InvalidParameterException: Request has invalid parameters error when attempting to use the startDocumentAnalysis method with AWS Textract in a Node.js application. The error occurs despite ensuring that all parameters are correctly specified.

Environment

Node.js Version: 18.20.3
AWS SDK Version: ^2.1548.0
AWS Region: ap-south-1

Here's the code I'm using

public async startAnalyiseDocument(key: string) {
    try {
      const response = await this.textreact
        .startDocumentAnalysis({
          DocumentLocation: {
            S3Object: {
              Bucket: env.AWS_BUCKET_NAME as string,
              Name: key as string,
            },
          },
          FeatureTypes: [
            "TABLES", "FORMS", "SIGNATURES"
          ],
          NotificationChannel:{
            RoleArn:env.AWS_ROLE_ARN as string,
            SNSTopicArn: env.AWS_SNS_TOPIC_ARN as string,
          },
        })
        .promise();
        console.log('ERROR:',response.$response.error)
      return response;
    } catch (error) {
      console.log("AWS_TXT:", error);
      throw new Error(error);
    }
  }

Error Details

InvalidParameterException: Request has invalid parameters
    at Request.extractError (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/protocol/json.js:80:27)
    at Request.callListeners (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/Users/arunroot/Documents/mark93r/node_modules/aws-sdk/lib/request.js:686:14)
    at Request.transition (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/arunroot/Documents/mark93r/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/request.js:688:12)
    at Request.callListeners (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'InvalidParameterException',
  '[__type]': 'See error.__type for details.',
  '[Message]': 'See error.Message for details.',
  time: 2024-07-04T18:31:17.960Z,
  requestId: 'd05e66d4-2b18-446e-938e-de9b56864963',
  statusCode: 400,
  retryable: false,
  retryDelay: 79.76554747215391
}

Additional Information

The RoleArn and SNSTopicArn are correctly specified and the IAM role has the necessary permissions.
Before implementing NotificationChannel it was working fine.

Bug Code sample not working Amazon Textract Queries FeatureType

Description

I am trying to use the analyze_document method from the amazon-textract-code-samples repository with a document stored in S3. I followed the example from the paystub.ipynb notebook, but I changed the Document parameter to use S3Object instead of Bytes. However, when I run the code, I get a ParamValidationError saying that QueriesConfig is an unknown parameter.

Code

This is the code block from the https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/queries/paystub.ipynb notebook that I used as a reference:

response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [{
            "Text": "What is the year to date gross pay",
            "Alias": "PAYSTUB_YTD_GROSS"
        },
        {
            "Text": "What is the current gross pay?",
            "Alias": "PAYSTUB_CURRENT_GROSS"
        },
        {
            "Text": "What is the current net pay?",
            "Alias": "PAYSTUB_CURRENT_NET"
        },
        {
            "Text": "What is the current social security tax?",
            "Alias": "PAYSTUB_CURRENT_SOCIAL_SECURITY_TAX"
        },
        {
            "Text": "How much is the current medicare?",
            "Alias": "PAYSTUB_MEDICARE_TAX"
        },
        {
            "Text": "What is the vacation hours balance?",
            "Alias": "PAYSTUB_VACATION_HOURS"
        },
        {
            "Text": "What is the sick hours balance?",
            "Alias": "PAYSTUB_SICK_HOURS"
        },
        {
            "Text": "What is the employee name?",
            "Alias": "PAYSTUB_EMPLOYEE_NAME"
        }]
    })

This is the code block that I am using:

response = client.analyze_document(
    Document={
        'S3Object': {'Bucket': bucket, 'Name': document}
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            { 'Text': 'What is the Name ?', 'Alias': 'PATIENT_NAME' },
            { 'Text': 'What is the Test Name ?', 'Alias': 'TEST_NAME' },
        ]
    }
)

The only difference is that I am using S3Object instead of Bytes for the Document parameter.

Error

This is the error message that I get:

ParamValidationError: Parameter validation failed:
Unknown parameter in input: "QueriesConfig", must be one of: Document, FeatureTypes, HumanLoopConfig

This is the full traceback of the error:

Expected behavior

I expected the code to run without errors and return the results of the queries for the document in S3.

Actual behavior

The code raises a ParamValidationError and does not return any results.

Environment

Python version: 3.9.7
Textract version: 1.18.0
OS: Windows 10

Additional context

I searched for similar issues on GitHub, but I could not find any. I also checked the documentation for the analyze_document method, but I did not see any mention of QueriesConfig being incompatible with S3Object. I wonder if this is a bug or a limitation of the API.
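The error lists only `Document`, `FeatureTypes`, and `HumanLoopConfig` as known parameters, and `ParamValidationError` is raised on the client side by botocore. That suggests the installed SDK predates Queries support, rather than an incompatibility with `S3Object`, although this is an inference from the message and not a confirmed diagnosis. Below is a sketch for checking what the installed SDK knows. `unsupported_params` and `supported_params` are hypothetical helpers, and the second one reads botocore's bundled service model.

```python
def unsupported_params(supported, requested):
    """Return the requested parameter names that the installed SDK does not know."""
    return sorted(set(requested) - set(supported))


def supported_params(client, operation="AnalyzeDocument"):
    """List the input parameter names botocore's service model defines for an operation."""
    model = client.meta.service_model.operation_model(operation)
    return list(model.input_shape.members)


# Parameter names taken from the error message above.
old_sdk = ["Document", "FeatureTypes", "HumanLoopConfig"]
print(unsupported_params(old_sdk, ["Document", "FeatureTypes", "QueriesConfig"]))
```

If `QueriesConfig` is reported as unsupported for a real client, upgrading boto3 and botocore is the likely next step.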

textract-trp issue in python 3.8

Version: 0.13

Using merged cell example:

headers = table.get_header_field_names()

'Table' object has no attribute 'get_header_field_names'

Bug in parsing for Document

When parsing some multi-page outputs, the trp.py file hits a bug:
at line 119 a key is not found in the block map and a KeyError exception is thrown.

Issue with Generating Key-Value Pairs from CMS 1500 Form

Hi,
I am working on reading the form and processing data from CMS 1500 using the provided code. However, I have noticed several issues with the key-value pair generation. The current implementation fails to fetch many details from the form accurately and incorrectly processes some elements. REPO LINK
Problems Encountered:

Many key-value pairs are not being extracted correctly.

Some values are not being associated correctly with their keys.

The code seems unable to handle certain sections of the form, such as tables.

In your test sample form, I do not see any of the details that belong to the table.
Please do respond asap :(

Textract Analyze document for Tables issue

Hi,

I found an issue while extracting tables from a document using Analyze. Textract OCR identified the table properly, with the correct bounding box. However, when I use that same bounding box to crop the table and extract its text, some information is missing. Here are the samples.

Below Image is the cropped image which I got using Bbox info from textract ocr output.

Analyze Document output (after some postprocessing like including markdowns) : "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"

If you compare the image and the output closely, "Internal Carbo" is missing from the first row, and "Incorporation" and "Remuneration" are missing from the first cell of the second row.

To address this, I placed the crop on a canvas with the size of the page from which I fetched the table, which produced the image below. It still gives the same output.

Next, I tried adding thresholding and got this image as output.

Interestingly, this provided proper output - "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"

Here is how I created the thresholded sample:
_, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

But there will be a problem with colored images: the solution I proposed won't work there, as it makes things worse.

This is the issue I found, along with a hack for it. If there is a better approach, please feel free to post the solution.
I ask the AWS team to have a look and fix this issue.

merged cells not working as expected

I followed this example (same code, same image/pdf):
https://aws.amazon.com/blogs/machine-learning/merge-cells-and-column-headers-in-amazon-textract-tables/

print (df) shows:

        Date              Description Details  Credits  Debits   Balance
0   2/4/2022  Life Insurance Payments  Credit              445   9500.45
1   2/4/2022      Property Management  Credit              300   9945.45
2   2/4/2022           Retail Store 4  Credit            65.75  10245.45
3   2/3/2022         Electricity Bill  Credit           245.45   10311.2
4   2/3/2022               Water Bill  Credit           312.85  10556.65
5   2/3/2022           Rental Deposit  Credit     3000           10869.5
6   2/2/2022           Retail Store 3  Credit              125    7869.5
7   2/2/2022    Retail Store 2 Refund   Debit      5.5            7994.5
8   2/2/2022           Retail Store 1  Credit             45.5      8000
9   2/1/2022        Shoe Store Refund  Credit       33            8045.5
10  2/1/2022    Snack Vending Machine   Debit                4    8012.5

Note the lack of "Amount".
Can anyone shed some light on this?

Broken s3 bucket Detecting Merged Cells And Headers on fictitious bank statement

In this notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb the provided S3 bucket and object "s3://textract-table-merged-cells-data-sample/Textract-MergeCell-Statement.pdf" throw an error even when you have proper access. The workaround is to keep the sample PDF in your own bucket or local to the notebook.

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.