amazon-textract-code-samples's People

Contributors

darwaishx, ezzeddin, kmascar, maran-cs, mludvig, po-aleksandar-vucetic, rbcaixeta, sahays, schadem, udaynarayanan, vicrojo, vucetica

amazon-textract-code-samples's Issues

textract-trp issue in Python 3.8

Version: 0.13

Using the merged-cells example:

headers = table.get_header_field_names()

'Table' object has no attribute 'get_header_field_names'

Bug: code sample not working with the Amazon Textract Queries FeatureType

Description

I am trying to use the analyze_document method from the amazon-textract-code-samples repository with a document stored in S3. I followed the example from the paystub.ipynb notebook, but I changed the Document parameter to use S3Object instead of Bytes. However, when I run the code, I get a ParamValidationError saying that QueriesConfig is an unknown parameter.

Code

This is the code block from the https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/queries/paystub.ipynb notebook that I used as a reference:

response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [{
            "Text": "What is the year to date gross pay",
            "Alias": "PAYSTUB_YTD_GROSS"
        },
        {
            "Text": "What is the current gross pay?",
            "Alias": "PAYSTUB_CURRENT_GROSS"
        },
        {
            "Text": "What is the current net pay?",
            "Alias": "PAYSTUB_CURRENT_NET"
        },
        {
            "Text": "What is the current social security tax?",
            "Alias": "PAYSTUB_CURRENT_SOCIAL_SECURITY_TAX"
        },
        {
            "Text": "How much is the current medicare?",
            "Alias": "PAYSTUB_MEDICARE_TAX"
        },
        {
            "Text": "What is the vacation hours balance?",
            "Alias": "PAYSTUB_VACATION_HOURS"
        },
        {
            "Text": "What is the sick hours balance?",
            "Alias": "PAYSTUB_SICK_HOURS"
        },
        {
            "Text": "What is the employee name?",
            "Alias": "PAYSTUB_EMPLOYEE_NAME"
        }]
    })

This is the code block that I am using:

response = client.analyze_document(
    Document={
        'S3Object': {'Bucket': bucket, 'Name': document}
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            { 'Text': 'What is the Name ?', 'Alias': 'PATIENT_NAME' },
            { 'Text': 'What is the Test Name ?', 'Alias': 'TEST_NAME' },
        ]
    }
)

The only difference is that I am using S3Object instead of Bytes for the Document parameter.

Error

This is the error message that I get:

ParamValidationError: Parameter validation failed:
Unknown parameter in input: "QueriesConfig", must be one of: Document, FeatureTypes, HumanLoopConfig

The full traceback was attached as a screenshot (not reproduced here).

Expected behavior

I expected the code to run without errors and return the results of the queries for the document in S3.

Actual behavior

The code raises a ParamValidationError and does not return any results.

Environment

  • Python version: 3.9.7
  • Textract version: 1.18.0
  • OS: Windows 10

Additional context

I searched for similar issues on GitHub, but I could not find any. I also checked the documentation for the analyze_document method, but I did not see any mention of QueriesConfig being incompatible with S3Object. I wonder if this is a bug or a limitation of the API.
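The error above is raised client-side: botocore validates request parameters against the service model bundled with the installed release, before anything is sent over the network. A botocore that predates the Textract Queries launch therefore rejects QueriesConfig whether the document is passed as Bytes or S3Object, so upgrading boto3/botocore is usually the fix. A minimal sketch to check what is installed (installed_version is a hypothetical helper):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# botocore validates parameters against its bundled service model before
# any network call, so an old install rejects QueriesConfig no matter how
# the Document parameter is supplied.
for pkg in ("boto3", "botocore"):
    print(pkg, installed_version(pkg))
```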

Unable to parse Document result in Python

using textract-trp 0.1.3

When parsing the get_document_analysis response, the following output is generated:

Traceback (most recent call last):
  File "G:\dev\OCR\main.py", line 17, in <module>
    result = (textract.receive_document_result('52c4a450c667a18d89f4e26a1cf4b56859ad239f1a63279bec8f60458ae2284e'))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\textract.py", line 62, in receive_document_result
    return Document(response)
           ^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
        ^^^^^^^^^^^^^^^^^^^^
  File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
       ~~~~~~~~^^^^^
KeyError: '9e2f5e38-f865-4b79-a37b-ac8ed7a19f02'

Textract queries in other languages?

Textract Queries seems very helpful and interesting, but can it handle other languages?

E.g. in Swedish the invoice number is "fakturanummer", so can I get a response for "What is the fakturanummer?"?

Thanks!

Textract fails to get a file that has spaces in the filename.

bucket = event['Records'][0]['s3']['bucket']['name']
#bucket = 'textract-input-files'
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

try:
    textract = boto3.client('textract')

    textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },

The above code fails when I pass a file that has spaces in the file name:

An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

The same code works fine if I remove the spaces from the filename.

Error getting object Arkilo and Pierce.pdf from bucket textract-input-files. Make sure they exist and your bucket is in the same region as this function.

[ERROR] InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 37, in lambda_handler
    raise e
  File "/var/task/lambda_function.py", line 30, in lambda_handler
    'SNSTopicArn': 'arn:aws:sns:us-east-**************:SNStopicTextract'
  File "/opt/python/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
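For what it's worth, a common culprit with S3-triggered Lambdas is that event keys arrive URL-encoded (spaces become '+'), so any code path that skips the unquote_plus step hands Textract a key that does not match a real object. A minimal sketch of the decode, using a sample key modeled on the error message above:

```python
import urllib.parse

# Keys in S3 event notifications are URL-encoded: a space in the object
# name arrives as '+' (or '%20'). Passing the still-encoded key to Textract
# makes it look up a non-existent object, which surfaces as an error like
# the one above.
raw_key = "Arkilo+and+Pierce.pdf"  # as delivered in the S3 event
decoded_key = urllib.parse.unquote_plus(raw_key)
print(decoded_key)  # Arkilo and Pierce.pdf
```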

Bug with Detecting Merged Cells And Headers on fictitious bank statement

When running the notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb you get the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 5, in <module>
    from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string
  File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/__init__.py", line 3, in <module>
    from .t_pretty_print import Pretty_Print_Table_Format as Pretty_Print_Table_Format
  File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/t_pretty_print.py", line 1, in <module>
    import trp
  File "/opt/conda/lib/python3.7/site-packages/trp/__init__.py", line 31
    print ip
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(ip)?

This is fixed by adding
!pip install textract-trp

Once that is fixed, you get the following error:

AttributeError: 'Table' object has no attribute 'get_header_field_names'

Bug in parsing for Document

When parsing some multi-page outputs, there is a bug in the trp.py file: a key in the block map (looked up at line 119) is not found, and a KeyError exception is thrown.

The job never completes

I ran this code on a sample PDF document, but it never completes. The PDF is valid, since it works from the console.

When I terminate the job, this is what I get

Started job with id: eb64ef4b16fd56ae8756387b9aff71b1f661a55a6128e5c4de8a2b43c8a2e397
Job status: IN_PROGRESS
^CTraceback (most recent call last):
  File "run.py", line 50, in <module>
    if(isJobComplete(jobId)):
  File "run.py", line 23, in isJobComplete
    time.sleep(5)
KeyboardInterrupt
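Not a fix for the underlying hang, but a bounded polling loop avoids having to Ctrl-C the script by hand. This is a sketch; wait_for_job is a hypothetical helper, and client is any object exposing get_document_text_detection (such as a boto3 Textract client):

```python
import time

def wait_for_job(client, job_id, timeout=600, poll_seconds=5):
    """Poll GetDocumentTextDetection until the job leaves IN_PROGRESS,
    giving up after `timeout` seconds instead of spinning forever."""
    deadline = time.time() + timeout
    while True:
        status = client.get_document_text_detection(JobId=job_id)["JobStatus"]
        if status != "IN_PROGRESS":
            return status  # e.g. SUCCEEDED, FAILED or PARTIAL_SUCCESS
        if time.time() >= deadline:
            raise TimeoutError(
                f"Textract job {job_id} still IN_PROGRESS after {timeout}s")
        time.sleep(poll_seconds)
```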

Is it possible to make async calls without uploading the document to S3 bucket?

I am reading the PDF as a bytearray and passing it to my analyser as follows:

response = self.textract.start_document_analysis(
    DocumentLocation={'Bytes': {docBytes}},
    FeatureTypes=["TABLES", "FORMS"],
    NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn}
)

But this gives the following error:

TypeError: unhashable type: 'bytearray'
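Two observations, hedged rather than a definitive diagnosis. First, the TypeError is pure Python: {docBytes} is a set literal, and bytearray is unhashable, so the call never reaches Textract. Second, per the Textract API, the asynchronous StartDocumentAnalysis only accepts an S3Object in DocumentLocation, so even with the braces removed the document would have to go through S3 (the synchronous analyze_document is the one that accepts inline Bytes). A small sketch of the set-literal pitfall:

```python
doc_bytes = bytearray(b"%PDF- example bytes")

# {doc_bytes} is a set literal, not a nested value; set members must be
# hashable and bytearray is not, hence the TypeError.
try:
    params = {"Bytes": {doc_bytes}}
except TypeError as err:
    print(err)  # unhashable type: 'bytearray'

params = {"Bytes": bytes(doc_bytes)}  # what was presumably intended
```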

Broken s3 bucket Detecting Merged Cells And Headers on fictitious bank statement

On this notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb, the provided S3 object "s3://textract-table-merged-cells-data-sample/Textract-MergeCell-Statement.pdf" throws an error even when you have proper access. The workaround is to put the sample PDF in your own bucket or local to the notebook.

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

Chain of Errors

Python 3 and up doesn't support print " ". After fixing that, it threw a scraper error; I installed scraper, but the error persists.

merged cells not working as expected

I followed this example (same code, same image/pdf):
https://aws.amazon.com/blogs/machine-learning/merge-cells-and-column-headers-in-amazon-textract-tables/

print(df) shows:

        Date              Description Details  Credits  Debits   Balance
0   2/4/2022  Life Insurance Payments  Credit              445   9500.45
1   2/4/2022      Property Management  Credit              300   9945.45
2   2/4/2022           Retail Store 4  Credit            65.75  10245.45
3   2/3/2022         Electricity Bill  Credit           245.45   10311.2
4   2/3/2022               Water Bill  Credit           312.85  10556.65
5   2/3/2022           Rental Deposit  Credit     3000           10869.5
6   2/2/2022           Retail Store 3  Credit              125    7869.5
7   2/2/2022    Retail Store 2 Refund   Debit      5.5            7994.5
8   2/2/2022           Retail Store 1  Credit             45.5      8000
9   2/1/2022        Shoe Store Refund  Credit       33            8045.5
10  2/1/2022    Snack Vending Machine   Debit                4    8012.5

Note the lack of "Amount".
Can anyone shed some light on this?

Multipage pdf breaks if there is one blank page in between

Hello!
Thank you for such a wonderful library; we are using it extensively. We have one issue at hand: if we run a multipage PDF, say of 200 pages, and any page in between is blank, the conversion of the whole PDF breaks. Please suggest a way to avoid this, so the PDF gets converted by skipping the blank page (or the page with the error).

InvalidParameterException when starting a Textract job using SNS Notification channel.

When I start a Textract StartDocumentTextDetection job and try to set the NotificationChannel as follows:

response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        },
    },
    ClientRequestToken='string',
    JobTag='string',
    NotificationChannel={
        'RoleArn': snsRoleArn,
        'SNSTopicArn': snsTopicArn
    }
)

I am getting an InvalidParameterException

An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

python 3.7
boto3 1.12.35

Multi-column variables

In this example the number of columns is defined manually (2 columns). I have a case where the pages of the PDF file vary from 1 to 5 columns; how can I detect this? Could you please give me an example with this code?
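The samples do not ship a column detector, but the LINE blocks Textract returns carry normalised bounding boxes, and the column count can often be inferred by clustering their left edges. A rough sketch under that assumption (group_into_columns and the gap threshold are hypothetical; tune them for your layout):

```python
def group_into_columns(line_blocks, gap=0.1):
    """Infer the number of text columns on a page from Textract LINE blocks
    by clustering the left edges of their bounding boxes (0-1 coordinates).
    A new column starts wherever consecutive sorted left edges differ by
    more than `gap`. Assumes vertical, non-overlapping columns."""
    if not line_blocks:
        return 0
    lefts = sorted({round(b["Geometry"]["BoundingBox"]["Left"], 3)
                    for b in line_blocks})
    columns = 1
    for prev, cur in zip(lefts, lefts[1:]):
        if cur - prev > gap:
            columns += 1
    return columns
```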

Textract Analyze document for Tables issue

Hi,

I found an issue while extracting tables from a document using AnalyzeDocument. My Textract OCR identified the table properly, with the correct bounding box. But when I use that same info to extract the text, I am missing some information. Here are the samples.

Below is the cropped image I got using the bounding-box info from the Textract OCR output:
[image: temp_crop]

Analyze Document output (after some postprocessing like including markdowns) : "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"

If you compare the image and the output, the first cell is missing "Internal Carbo" in the first row, and "Incorporation" and "Remuneration" in the second row.

To address this, I tried placing the crop on a canvas of the page size from which I fetched the table and created the image below. It still gives the same output.

[image: temp_crop (1)]

Next, I went ahead and tried adding thresholding, and got this image as the output:
[image: temp_crop (2)]

Interestingly, this provided proper output - "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"

Here is how I created the sample threshold:
_, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

But there will be a problem with colored images; the solution I proposed won't work there, as it makes things worse.

This is the issue I found and a hack around it. If there is a better approach, please feel free to post the solution. I ask the AWS team to have a look and fix this issue.

Textract form type - not getting data in sequential order

Hello,
I am performing OCR on a one-page document that has multiple entities with the same name, each with a checkbox in front of it. Using FORMS in AWS Textract, I can detect all the values and whether each checkbox is selected, but the data does not come back in any sequence.
Below I have attached two files with the same data; in both files all entities are detected, but in random order.
Here is the code I am using:

import boto3
import sys
import re
import json
from collections import defaultdict


def get_kv_map(file_name):
    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    client = boto3.client('textract')
    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])

    # Get the text blocks
    blocks = response['Blocks']

    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map


def get_kv_relationship(key_map, value_map, block_map):
    kvs = defaultdict(list)
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)

        kvs[key].append(val)
    return kvs


def find_value_block(key_block, value_map):
    for relationship in key_block['Relationships']:
        if relationship['Type'] == 'VALUE':
            for value_id in relationship['Ids']:
                value_block = value_map[value_id]
    return value_block


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text


def print_kvs(kvs):
    for key, value in kvs.items():
        print(key, ":", value)

        

def search_value(kvs, search_key):
    for key, value in kvs.items():
        if re.search(search_key, key, re.IGNORECASE):
            return value


def main(file_name):
    key_map, value_map, block_map = get_kv_map(file_name)

    # Get Key Value relationship
    kvs = get_kv_relationship(key_map, value_map, block_map)
    print("\n\n== FOUND KEY : VALUE pairs ===\n")
    print_kvs(kvs)
    return kvs


if __name__ == "__main__":
    file_name = sys.argv[1]
    d = main(file_name)

file1.pdf
file2.pdf

So how can I get the details in sequence rather than in random order?

For the buyer entity, this is the data from one file: ['', '', '', '', '', '', '', '', '', '', '', '', 'X ', '', 'X ', '', '']
For the same data, this is the buyer response from the other file: ['', '', '', '', '', '', '', '', '', 'X ', '', '', '', 'X ', '', '', '']
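Textract makes no ordering promise for KEY_VALUE_SET blocks, but every block carries Geometry.BoundingBox, so the key blocks can be sorted by position before being paired with their values (for instance, by iterating the key_map entries in sorted order inside get_kv_relationship). A sketch, where sort_blocks_reading_order and the row tolerance are assumptions to tune:

```python
def sort_blocks_reading_order(blocks, row_tolerance=0.01):
    """Sort Textract blocks top-to-bottom, then left-to-right, using each
    block's normalised bounding box. Blocks whose Top values fall within
    `row_tolerance` of each other are treated as the same row."""
    def position(block):
        box = block["Geometry"]["BoundingBox"]
        return (round(box["Top"] / row_tolerance), box["Left"])
    return sorted(blocks, key=position)
```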

no region found when running 01-detect-text-local.py

I've been trying to set up Textract with Python, using the PyCharm IDE.

I wanted to test the first file on my local machine, so I ran "01-detect-text-local.py", but for some reason it fails with "you must specify a region", raised from botocore's region resolution. But this is supposed to run locally; why is it trying to find an AWS region at this point? Am I missing something here?

What does it mean to run on a local machine?
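"Local" in 01-detect-text-local.py only means the image is read from disk; the bytes are still sent to the Textract service, which is why botocore insists on a region (and credentials). One way to supply it, sketched with an example region:

```python
import os

# Set a default region for boto3 if none is configured elsewhere
# (~/.aws/config, an existing AWS_DEFAULT_REGION, etc.).
# "us-east-1" is only an example; use the region your account works in.
os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1")

# Alternatively, pass it explicitly when creating the client:
#   textract = boto3.client("textract", region_name="us-east-1")
print(os.environ["AWS_DEFAULT_REGION"])
```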
