
amazon-textract-code-samples's Issues

Multipage pdf breaks if there is one blank page in between

Hello!
Thank you for such a wonderful library; we use it extensively. We have one issue at hand: when we run a multipage PDF of, say, 200 pages and any page in between is blank, the entire PDF conversion breaks.
Please suggest a way to avoid this, so that the PDF is still converted while skipping the blank page or the page with the error.

Please guide.
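
One possible workaround is to process pages independently and record failures instead of letting one page abort the run. This is a minimal sketch, not something taken from the samples: it assumes the PDF has already been split into per-page bytes, and `process_page` is a hypothetical callable standing in for whatever Textract call you make per page.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_pages_safely(
    pages: Iterable[bytes],
    process_page: Callable[[bytes], dict],
) -> Tuple[Dict[int, dict], List[Tuple[int, str]]]:
    """Run process_page on every page and collect errors instead of stopping.

    `pages` and `process_page` are placeholders for this sketch; they are
    not part of amazon-textract-code-samples.
    """
    results: Dict[int, dict] = {}
    skipped: List[Tuple[int, str]] = []
    for page_number, page_bytes in enumerate(pages, start=1):
        if not page_bytes:
            # Nothing to send for this page, so note it and move on.
            skipped.append((page_number, "empty page data"))
            continue
        try:
            results[page_number] = process_page(page_bytes)
        except Exception as exc:  # deliberately broad: any page error is skipped
            skipped.append((page_number, f"{type(exc).__name__}: {exc}"))
    return results, skipped
```

The `skipped` list keeps the page numbers and reasons, so blank or failing pages can be reviewed afterwards rather than silently lost.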

no region found when running 01-detect-text-local.py

I've been trying to set up Textract with Python, and I'm using the PyCharm IDE for this.

I wanted to test the first sample on my local machine, so I ran "01-detect-text-local.py", but it fails with "you must specify a region", raised from a package at botocore\region.py. This is supposed to be running locally, so why is it trying to find an AWS region at this point? Am I missing something here?

What does it mean to run on a local machine?
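
As far as the sample name suggests, "local" refers to the document being read from local disk; the text detection itself is still a call to the Textract service, so boto3 needs a region. A minimal sketch of setting one explicitly follows. The helper names and the `us-east-1` fallback are assumptions made for illustration; use whichever region your account uses.

```python
import os
from typing import Optional


def resolve_region(explicit: Optional[str] = None, default: str = "us-east-1") -> str:
    """Pick a region: explicit argument, then common AWS env vars, then a default.

    The default value is an assumption for this sketch only.
    """
    return (
        explicit
        or os.environ.get("AWS_REGION")
        or os.environ.get("AWS_DEFAULT_REGION")
        or default
    )


def make_textract_client(region: Optional[str] = None):
    """Create a Textract client with the region passed explicitly."""
    import boto3  # imported here so resolve_region can be used without boto3

    return boto3.client("textract", region_name=resolve_region(region))
```

Setting a default region with `aws configure` should have the same effect without changing the script.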

ImportError convert_table_to_kv_dict

While executing "cms1500-parser.ipynb", I get this import error for 'convert_table_to_kv_dict', and I am not able to find the function in the t_pretty_print file either.

Can you please help me understand why I am getting this import error? If the mistake is on my end, please correct me.

Traceback (most recent call last):
File "c:\textract\amazon-textract-code-samples\python\pavan\form.py", line 10, in
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_forms_string, convert_table_to_kv_dict, convert_table_to_list
ImportError: cannot import name 'convert_table_to_kv_dict' from 'textractprettyprinter.t_pretty_print'

Textract queries in other languages?

Textract Queries seems very helpful and interesting, but I would like to know whether it can handle other languages.

E.g., in Swedish, invoice number is "fakturanummer", so can I get a response for "What is the fakturanummer?"?

Thanks!

Unable to parse Document result in Python

using textract-trp 0.1.3

When parsing "get_document_analysis" response the following output is generated:

Traceback (most recent call last):
  File "G:\dev\OCR\main.py", line 17, in <module>
    result = (textract.receive_document_result('52c4a450c667a18d89f4e26a1cf4b56859ad239f1a63279bec8f60458ae2284e'))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\textract.py", line 62, in receive_document_result
    return Document(response)
           ^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
        ^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
       ~~~~~~~~^^^^^
KeyError: '9e2f5e38-f865-4b79-a37b-ac8ed7a19f02'
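
A KeyError on a child block ID can happen when only part of a paginated result is parsed, because a LINE block on one response page may reference WORD blocks returned on another. Whether that is the cause here is an assumption. A minimal sketch that follows `NextToken` until all pages are collected (`client` is any boto3 Textract client; the function name is made up for this example):

```python
from typing import Any, Dict, List


def get_all_analysis_pages(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Fetch every response page of get_document_analysis by following NextToken."""
    pages: List[Dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        pages.append(response)
        next_token = response.get("NextToken")
        if not next_token:
            # No token means this was the last page of results.
            return pages
```

To my understanding, `Document` in textract-trp also accepts a list of response pages, so `Document(get_all_analysis_pages(client, job_id))` would build the block map from all of them.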

Is it possible to make async calls without uploading the document to S3 bucket?

I am reading the PDF as a bytearray and passing it to my analyser as follows:

response = self.textract.start_document_analysis(DocumentLocation={'Bytes': {docBytes}}, FeatureTypes=["TABLES", "FORMS"], NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
But this gives the following error:
TypeError: unhashable type: 'bytearray'
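
Two things seem to be going on. The immediate TypeError comes from `{docBytes}`, which is a Python set literal, and a bytearray cannot be a set member. Beyond that, the `DocumentLocation` parameter of the asynchronous operations only takes an `S3Object`, so the document has to be in S3 for `start_document_analysis`; raw bytes are accepted by the synchronous `analyze_document` instead. A minimal sketch of building the async request (the function name and argument names are placeholders):

```python
from typing import Any, Dict, Sequence


def build_start_analysis_request(
    bucket: str,
    key: str,
    role_arn: str,
    sns_topic_arn: str,
    feature_types: Sequence[str] = ("TABLES", "FORMS"),
) -> Dict[str, Any]:
    """Build keyword arguments for start_document_analysis using an S3 location."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(feature_types),
        "NotificationChannel": {"RoleArn": role_arn, "SNSTopicArn": sns_topic_arn},
    }
```

Usage would look like `textract.start_document_analysis(**build_start_analysis_request(bucket, key, role_arn, topic_arn))`.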

Chain of Errors

Python 3 and later don't support the print " " statement syntax.
After fixing that, it threw a scraper error; I installed scraper and still get the error.

Textract form type - not getting data in sequential order

Hello,
I am currently performing OCR on a one-page document that contains multiple entities with the same name, each with a checkbox in front of it. Using FORMS in AWS Textract, I am able to detect all the values and whether each checkbox is selected, but the data does not come back in sequence.
I have attached two files with the same data below; all entities are detected in both files, but in random order.
Here is the code I am using:

import boto3
import sys
import re
import json
from collections import defaultdict


def get_kv_map(file_name):
    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    client = boto3.client('textract')
    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])

    # Get the text blocks
    blocks = response['Blocks']

    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map


def get_kv_relationship(key_map, value_map, block_map):
    kvs = defaultdict(list)
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)

        kvs[key].append(val)
    return kvs


def find_value_block(key_block, value_map):
    for relationship in key_block['Relationships']:
        if relationship['Type'] == 'VALUE':
            for value_id in relationship['Ids']:
                value_block = value_map[value_id]
    return value_block


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text


def print_kvs(kvs):
    for key, value in kvs.items():
        print(key, ":", value)

        

def search_value(kvs, search_key):
    for key, value in kvs.items():
        if re.search(search_key, key, re.IGNORECASE):
            return value


def main(file_name):
    key_map, value_map, block_map = get_kv_map(file_name)

    # Get Key Value relationship
    kvs = get_kv_relationship(key_map, value_map, block_map)
    print("\n\n== FOUND KEY : VALUE pairs ===\n")
    print_kvs(kvs)
    return kvs


if __name__ == "__main__":
    # Use the path given on the command line instead of a hard-coded file
    file_name = sys.argv[1]
    d = main(file_name)

file1.pdf
file2.pdf

So how can I get the details in sequence rather than in random order?

For the buyer entity, this is the data from one file: ['', '', '', '', '', '', '', '', '', '', '', '', 'X ', '', 'X ', '', '']
For the same data, this is the buyer response for the other file: ['', '', '', '', '', '', '', '', '', 'X ', '', '', '', 'X ', '', '', '']
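
The code above appends values in the order the key blocks appear in the response, which is not necessarily the order on the page. One option is to sort the key blocks by their position before pairing them with values. A rough sketch using each block's `Geometry.BoundingBox`; the function name and the `row_tolerance` bucketing are assumptions, and the tolerance would need tuning for a real form:

```python
from typing import Any, Dict, Iterable, List


def sort_blocks_reading_order(
    blocks: Iterable[Dict[str, Any]],
    row_tolerance: float = 0.01,
) -> List[Dict[str, Any]]:
    """Sort blocks top-to-bottom, then left-to-right.

    Top coordinates are bucketed by row_tolerance (a fraction of page height)
    so that blocks on roughly the same line are ordered by their Left edge.
    """
    def sort_key(block: Dict[str, Any]):
        box = block["Geometry"]["BoundingBox"]
        row = round(box["Top"] / row_tolerance)
        return (row, box["Left"])

    return sorted(blocks, key=sort_key)
```

In `get_kv_relationship`, iterating over `sort_blocks_reading_order(key_map.values())` instead of `key_map.items()` should make the list for each key follow the page layout.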

Bug with Detecting Merged Cells And Headers on fictitious bank statement

When running the notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb you get the following error:

Traceback (most recent call last):

File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 5, in
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string

File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/__init__.py", line 3, in
from .t_pretty_print import Pretty_Print_Table_Format as Pretty_Print_Table_Format

File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/t_pretty_print.py", line 1, in
import trp

File "/opt/conda/lib/python3.7/site-packages/trp/__init__.py", line 31
print ip
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(ip)?

This is fixed by adding:
!pip install textract-trp

Once that is fixed, you get the following error:

AttributeError: 'Table' object has no attribute 'get_header_field_names'

InvalidParameterException when starting a Textract job using SNS Notification channel.

When I start a Textract StartDocumentTextDetection job and try to set the NotificationChannel as follows:

response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        },
    },
    ClientRequestToken='string',
    JobTag='string',
    NotificationChannel={
        'RoleArn': snsRoleArn,
        'SNSTopicArn': snsTopicArn
    }
)

I am getting an InvalidParameterException:

An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

python 3.7
boto3 1.12.35

The job never completes

I ran this code on a sample PDF document, but the job never completes. The PDF is valid, since it works from the console.

When I terminate the job, this is what I get:

Started job with id: eb64ef4b16fd56ae8756387b9aff71b1f661a55a6128e5c4de8a2b43c8a2e397
Job status: IN_PROGRESS
^CTraceback (most recent call last):
File "run.py", line 50, in
if(isJobComplete(jobId)):
File "run.py", line 23, in isJobComplete
time.sleep(5)
KeyboardInterrupt
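
This does not explain why the job stays IN_PROGRESS, but bounding the wait makes the behaviour easier to diagnose than an open-ended loop. A minimal sketch, assuming a boto3 Textract client is passed in; the function name and default timings are placeholders:

```python
import time
from typing import Any, Callable


def wait_for_text_detection(
    client: Any,
    job_id: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    Returns the final JobStatus, or raises TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_document_text_detection(JobId=job_id)["JobStatus"]
        if status != "IN_PROGRESS":
            # SUCCEEDED, FAILED or PARTIAL_SUCCESS; the caller decides what to do.
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_id} still IN_PROGRESS after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```

Returning the status rather than a boolean also surfaces a FAILED job, which a loop that only checks for success would keep waiting on.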

Multi Columns Variables

In this example the number of columns is defined manually (2 columns). I have a case where the pages of the PDF file vary from 1 to 5 columns; how can I detect this? Could you please give me an example with this code?

Textract fails to get a file that has spaces in the filename

bucket = event['Records'][0]['s3']['bucket']['name']
#bucket = 'textract-input-files'
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

try:
    textract = boto3.client('textract')

    textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },

The above code fails when I pass a file that has spaces in its name:

An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
The same code works fine if I remove the spaces from the filename.

Error getting object Arkilo and Pierce.pdf from bucket textract-input-files. Make sure they exist and your bucket is in the same region as this function.

[ERROR] InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 37, in lambda_handler
    raise e
  File "/var/task/lambda_function.py", line 30, in lambda_handler
    'SNSTopicArn': 'arn:aws:sns:us-east-**************:SNStopicTextract'
  File "/opt/python/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)

InvalidParameterException: Request has invalid parameters when using startDocumentAnalysis

Description
I'm encountering an InvalidParameterException: Request has invalid parameters error when attempting to use the startDocumentAnalysis method with AWS Textract in a Node.js application. The error occurs despite ensuring that all parameters are correctly specified.

Environment

  • Node.js Version: 18.20.3
  • AWS SDK Version: ^2.1548.0
  • AWS Region: ap-south-1

Here's the code I'm using

public async startAnalyiseDocument(key: string) {
    try {
      const response = await this.textreact
        .startDocumentAnalysis({
          DocumentLocation: {
            S3Object: {
              Bucket: env.AWS_BUCKET_NAME as string,
              Name: key as string,
            },
          },
          FeatureTypes: [
            "TABLES", "FORMS", "SIGNATURES"
          ],
          NotificationChannel:{
            RoleArn:env.AWS_ROLE_ARN as string,
            SNSTopicArn: env.AWS_SNS_TOPIC_ARN as string,
          },
        })
        .promise();
        console.log('ERROR:',response.$response.error)
      return response;
    } catch (error) {
      console.log("AWS_TXT:", error);
      throw new Error(error);
    }
  }

Error Details

InvalidParameterException: Request has invalid parameters
    at Request.extractError (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/protocol/json.js:80:27)
    at Request.callListeners (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/Users/arunroot/Documents/mark93r/node_modules/aws-sdk/lib/request.js:686:14)
    at Request.transition (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/arunroot/Documents/mark93r/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/request.js:688:12)
    at Request.callListeners (/Users/arunroot/Documents/mark93/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'InvalidParameterException',
  '[__type]': 'See error.__type for details.',
  '[Message]': 'See error.Message for details.',
  time: 2024-07-04T18:31:17.960Z,
  requestId: 'd05e66d4-2b18-446e-938e-de9b56864963',
  statusCode: 400,
  retryable: false,
  retryDelay: 79.76554747215391
}

Additional Information

  • The RoleArn and SNSTopicArn are correctly specified and the IAM role has the necessary permissions.
  • Before implementing NotificationChannel it was working fine.

Bug Code sample not working Amazon Textract Queries FeatureType

Description

I am trying to use the analyze_document method from the amazon-textract-code-samples repository with a document stored in S3. I followed the example from the paystub.ipynb notebook, but I changed the Document parameter to use S3Object instead of Bytes. However, when I run the code, I get a ParamValidationError saying that QueriesConfig is an unknown parameter.

Code

This is the code block from the https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/queries/paystub.ipynb notebook that I used as a reference:

response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [{
            "Text": "What is the year to date gross pay",
            "Alias": "PAYSTUB_YTD_GROSS"
        },
        {
            "Text": "What is the current gross pay?",
            "Alias": "PAYSTUB_CURRENT_GROSS"
        },
        {
            "Text": "What is the current net pay?",
            "Alias": "PAYSTUB_CURRENT_NET"
        },
        {
            "Text": "What is the current social security tax?",
            "Alias": "PAYSTUB_CURRENT_SOCIAL_SECURITY_TAX"
        },
        {
            "Text": "How much is the current medicare?",
            "Alias": "PAYSTUB_MEDICARE_TAX"
        },
        {
            "Text": "What is the vacation hours balance?",
            "Alias": "PAYSTUB_VACATION_HOURS"
        },
        {
            "Text": "What is the sick hours balance?",
            "Alias": "PAYSTUB_SICK_HOURS"
        },
        {
            "Text": "What is the employee name?",
            "Alias": "PAYSTUB_EMPLOYEE_NAME"
        }]
    })

This is the code block that I am using:

response = client.analyze_document(
    Document={
        'S3Object': {'Bucket': bucket, 'Name': document}
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            { 'Text': 'What is the Name ?', 'Alias': 'PATIENT_NAME' },
            { 'Text': 'What is the Test Name ?', 'Alias': 'TEST_NAME' },
        ]
    }
)

The only difference is that I am using S3Object instead of Bytes for the Document parameter.

Error

This is the error message that I get:

ParamValidationError: Parameter validation failed:
Unknown parameter in input: "QueriesConfig", must be one of: Document, FeatureTypes, HumanLoopConfig

[image: full traceback of the error]

Expected behavior

I expected the code to run without errors and return the results of the queries for the document in S3.

Actual behavior

The code raises a ParamValidationError and does not return any results.

Environment

  • Python version: 3.9.7
  • Textract version: 1.18.0
  • OS: Windows 10

Additional context

I searched for similar issues on GitHub, but I could not find any. I also checked the documentation for the analyze_document method, but I did not see any mention of QueriesConfig being incompatible with S3Object. I wonder if this is a bug or a limitation of the API.
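
A ParamValidationError is raised by botocore on the client side, before any request is sent, and the allowed list in the message (Document, FeatureTypes, HumanLoopConfig) suggests that the installed SDK's service model does not know about QueriesConfig at all. That points to the boto3/botocore version rather than to S3Object versus Bytes, though this is an inference from the message. A small sketch for checking what the installed model accepts (the helper name is made up):

```python
from typing import Any


def operation_accepts(client: Any, operation_name: str, parameter: str) -> bool:
    """Return True if the client's service model lists `parameter` as an input
    member of `operation_name` (for example "AnalyzeDocument")."""
    operation = client.meta.service_model.operation_model(operation_name)
    return parameter in operation.input_shape.members
```

With a real client this would be called as `operation_accepts(boto3.client("textract"), "AnalyzeDocument", "QueriesConfig")`; if it returns False, upgrading boto3 and botocore is the likely fix.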

textract-trp issue in python 3.8

Version: 0.13

Using merged cell example:

headers = table.get_header_field_names()

'Table' object has no attribute 'get_header_field_names'

Bug in parsing for Document

When parsing some multi-page outputs, there is a bug in the trp.py file:
keys looked up in the block map at line 119 are not found, and a KeyError exception is thrown.

Issue with Generating Key-Value Pairs from CMS 1500 Form

Hi,
I am working on reading the form and processing data from CMS 1500 using the provided code. However, I have noticed several issues with the key-value pair generation. The current implementation fails to fetch many details from the form accurately and incorrectly processes some elements. REPO LINK
Problems Encountered:

  • Many key-value pairs are not being extracted correctly.
    
  • Some values are not being associated correctly with their keys.
    
  • The code seems unable to handle certain sections of the form, such as tables.
    

[image]
In your test sample form, I am not seeing any of the details that belong to the table.
Please do respond asap :(

Textract Analyze document for Tables issue

Hi,

I found an issue while extracting tables from a document using Analyze. Textract OCR identified the table properly, with the correct Bbox. However, when I use the same info to extract the text, some information is missing. Here are the samples.

The image below is the cropped image that I got using the Bbox info from the Textract OCR output.
[image: temp_crop]

Analyze Document output (after some postprocessing like including markdowns) : "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"

If you compare the image and the output closely, "Internal Carbo" is missing in the first row, and "Incorporation" and "Remuneration" are missing in the first cell of the second row.

To work around this, I placed the crop on a canvas with the size of the page from which I fetched the table and created the image below. It still gives me the same output.

[image: temp_crop (1)]

Next, I tried adding thresholding and got this image as output.
[image: temp_crop (2)]

Interestingly, this provided proper output - "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"

Here is how I created the sample threshold:
_, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

But there will be a problem with colored images: the solution I proposed won't work there, as it makes things worse.

This is the issue I found and a hack for it. If there is a better approach, please feel free to post the solution.
I ask the AWS team to take a look and fix this issue.

merged cells not working as expected

I followed this example (same code, same image/pdf):
https://aws.amazon.com/blogs/machine-learning/merge-cells-and-column-headers-in-amazon-textract-tables/

image

print (df) shows:

        Date              Description Details  Credits  Debits   Balance
0   2/4/2022  Life Insurance Payments  Credit              445   9500.45
1   2/4/2022      Property Management  Credit              300   9945.45
2   2/4/2022           Retail Store 4  Credit            65.75  10245.45
3   2/3/2022         Electricity Bill  Credit           245.45   10311.2
4   2/3/2022               Water Bill  Credit           312.85  10556.65
5   2/3/2022           Rental Deposit  Credit     3000           10869.5
6   2/2/2022           Retail Store 3  Credit              125    7869.5
7   2/2/2022    Retail Store 2 Refund   Debit      5.5            7994.5
8   2/2/2022           Retail Store 1  Credit             45.5      8000
9   2/1/2022        Shoe Store Refund  Credit       33            8045.5
10  2/1/2022    Snack Vending Machine   Debit                4    8012.5

Note the lack of "Amount".
Can anyone shed some light on this?

Broken s3 bucket Detecting Merged Cells And Headers on fictitious bank statement

On this notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb the provided S3 bucket and object "s3://textract-table-merged-cells-data-sample/Textract-MergeCell-Statement.pdf" throws an error even when you have proper access. The workaround is to keep the sample PDF in your own bucket or local to the notebook.

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
