aws-samples / amazon-textract-code-samples
Amazon Textract Code Samples
License: MIT No Attribution
Version: 0.13
Using the merged-cells example:
headers = table.get_header_field_names()
raises:
AttributeError: 'Table' object has no attribute 'get_header_field_names'
I am trying to use the analyze_document method from the amazon-textract-code-samples repository with a document stored in S3. I followed the example from the paystub.ipynb notebook, but I changed the Document parameter to use S3Object instead of Bytes. However, when I run the code, I get a ParamValidationError saying that QueriesConfig is an unknown parameter.
This is the code block from the https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/queries/paystub.ipynb notebook that I used as a reference:
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the year to date gross pay", "Alias": "PAYSTUB_YTD_GROSS"},
            {"Text": "What is the current gross pay?", "Alias": "PAYSTUB_CURRENT_GROSS"},
            {"Text": "What is the current net pay?", "Alias": "PAYSTUB_CURRENT_NET"},
            {"Text": "What is the current social security tax?", "Alias": "PAYSTUB_CURRENT_SOCIAL_SECURITY_TAX"},
            {"Text": "How much is the current medicare?", "Alias": "PAYSTUB_MEDICARE_TAX"},
            {"Text": "What is the vacation hours balance?", "Alias": "PAYSTUB_VACATION_HOURS"},
            {"Text": "What is the sick hours balance?", "Alias": "PAYSTUB_SICK_HOURS"},
            {"Text": "What is the employee name?", "Alias": "PAYSTUB_EMPLOYEE_NAME"}
        ]
    })
This is the code block that I am using:
response = client.analyze_document(
    Document={
        'S3Object': {'Bucket': bucket, 'Name': document}
    },
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            {'Text': 'What is the Name ?', 'Alias': 'PATIENT_NAME'},
            {'Text': 'What is the Test Name ?', 'Alias': 'TEST_NAME'},
        ]
    }
)
The only difference is that I am using S3Object instead of Bytes for the Document parameter.
This is the error message that I get:
ParamValidationError: Parameter validation failed:
Unknown parameter in input: "QueriesConfig", must be one of: Document, FeatureTypes, HumanLoopConfig
This is the full traceback of the error:
I expected the code to run without errors and return the results of the queries for the document in S3.
The code raises a ParamValidationError and does not return any results. I searched for similar issues on GitHub, but I could not find any. I also checked the documentation for the analyze_document method, but I did not see any mention of QueriesConfig being incompatible with S3Object. I wonder if this is a bug or a limitation of the API.
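For what it's worth, this error is usually not about Bytes versus S3Object at all: QueriesConfig only exists in newer botocore service models, and an older boto3/botocore installation rejects it as an unknown parameter regardless of the Document type. A minimal sketch of the check (the helper name is mine; the "old model" member list mirrors the error message above):

```python
def supports_queries(input_members):
    """input_members: member names of AnalyzeDocument's input shape,
    e.g. obtained from botocore's service model (see comment below)."""
    return "QueriesConfig" in input_members

# With botocore installed, the real member list comes from:
#   import botocore.session
#   model = botocore.session.get_session().get_service_model("textract")
#   members = model.operation_model("AnalyzeDocument").input_shape.members
old_model = ["Document", "FeatureTypes", "HumanLoopConfig"]  # what the error lists
new_model = old_model + ["QueriesConfig"]                    # current API model
print(supports_queries(old_model), supports_queries(new_model))  # False True
```

If the check comes back False, `pip install --upgrade boto3 botocore` should make the parameter known.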
Using textract-trp 0.1.3. When parsing a "get_document_analysis" response, the following output is generated:
Traceback (most recent call last):
File "G:\dev\OCR\main.py", line 17, in <module>
result = (textract.receive_document_result('52c4a450c667a18d89f4e26a1cf4b56859ad239f1a63279bec8f60458ae2284e'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\dev\OCR\textract.py", line 62, in receive_document_result
return Document(response)
^^^^^^^^^^^^^^^^^^
File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 633, in __init__
self._parse()
File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 667, in _parse
page = Page(documentPage["Blocks"], self._blockMap)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 516, in __init__
self._parse(blockMap)
File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 530, in _parse
l = Line(item, blockMap)
^^^^^^^^^^^^^^^^^^^^
File "G:\dev\OCR\venv\Lib\site-packages\trp\__init__.py", line 142, in __init__
if(blockMap[cid]["BlockType"] == "WORD"):
~~~~~~~~^^^^^
KeyError: '9e2f5e38-f865-4b79-a37b-ac8ed7a19f02'
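The KeyError above is the classic pagination symptom: get_document_analysis returns at most one page of blocks per call, and child IDs can point at blocks in pages you have not fetched yet. trp's Document accepts a list of response pages, so collect them all first. A minimal sketch (the function name is mine):

```python
def fetch_all_pages(client, job_id):
    """Collect every response page of an async Textract job by
    following NextToken; feed the whole list to trp.Document."""
    pages, token = [], None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if token:
            kwargs["NextToken"] = token
        resp = client.get_document_analysis(**kwargs)
        pages.append(resp)
        token = resp.get("NextToken")
        if token is None:
            break
    return pages

# doc = trp.Document(fetch_all_pages(textract_client, job_id))
```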
Dear Team,
I want to extract TABLES and FORMS data from a PDF file. I could not find any API where I can pass those parameters (FORMS and TABLES) for a PDF file. Could you please help me out?
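For PDFs, the relevant API is the asynchronous StartDocumentAnalysis (the synchronous AnalyzeDocument only accepts single-page image bytes); FeatureTypes is where FORMS and TABLES go. A sketch of building the request, with placeholder bucket and key names:

```python
def build_analysis_request(bucket, key):
    """Request body for start_document_analysis on a PDF stored in S3."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["TABLES", "FORMS"],
    }

# job = client.start_document_analysis(**build_analysis_request("my-bucket", "doc.pdf"))
# ...then poll client.get_document_analysis(JobId=job["JobId"]) for the blocks.
req = build_analysis_request("my-bucket", "doc.pdf")
print(req["FeatureTypes"])  # ['TABLES', 'FORMS']
```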
Textract Queries seems very helpful and interesting, but I would like to know whether it can handle other languages.
E.g. in Swedish the invoice number is "fakturanummer", so can I get a response for "What is the fakturanummer?"?
Thanks!
It would be nice to have samples for extracting table data in Java.
You will receive AttributeError: 'NoneType' object has no attribute 'text' in the case of an empty key or value in a KEY_VALUE_SET.
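A small guard avoids the crash: when a key or value in a KEY_VALUE_SET is empty, you can be handed None instead of a field part, so check before touching .text. A sketch with a stand-in class (FakePart and field_text are hypothetical names, not trp API):

```python
def field_text(part):
    """Return part.text, or '' when the key/value part is missing (None)."""
    return part.text if part is not None else ""

class FakePart:  # stand-in for a trp field key/value object
    def __init__(self, text):
        self.text = text

print(field_text(FakePart("Name:")), repr(field_text(None)))  # Name: ''
```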
bucket = event['Records'][0]['s3']['bucket']['name']
# bucket = 'textract-input-files'
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
try:
    textract = boto3.client('textract')
    textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },
The above code fails to execute when I pass a file that has spaces in the file name:
An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
The same code works fine if I remove the spaces from the filename.
Error getting object Arkilo and Pierce.pdf from bucket textract-input-files. Make sure they exist and your bucket is in the same region as this function.
[ERROR] InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 37, in lambda_handler
raise e
File "/var/task/lambda_function.py", line 30, in lambda_handler
'SNSTopicArn': 'arn:aws:sns:us-east-**************:SNStopicTextract'
File "/opt/python/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/opt/python/botocore/client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
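A likely cause (an assumption on my part, based on how S3 event notifications encode keys): object keys arrive in the event with spaces encoded as "+", so the name passed to Textract must be decoded first, which is exactly what urllib.parse.unquote_plus does:

```python
from urllib.parse import unquote_plus

raw_key = "Arkilo+and+Pierce.pdf"  # how a key with spaces appears in the S3 event
key = unquote_plus(raw_key, encoding="utf-8")
print(key)  # Arkilo and Pierce.pdf
```

If the key is never decoded (or is re-encoded somewhere before the StartDocumentTextDetection call), the S3 lookup fails and Textract surfaces it as InvalidParameterException.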
When running the notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb you get the following error:
`Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 5, in
    from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string
  File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/__init__.py", line 3, in
    from .t_pretty_print import Pretty_Print_Table_Format as Pretty_Print_Table_Format
  File "/opt/conda/lib/python3.7/site-packages/textractprettyprinter/t_pretty_print.py", line 1, in
    import trp
  File "/opt/conda/lib/python3.7/site-packages/trp/__init__.py", line 31
    print ip
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(ip)?`
This is fixed by adding:
!pip install textract-trp
Once that is fixed, you get the following error:
AttributeError: 'Table' object has no attribute 'get_header_field_names'
When parsing some multi-page outputs, there is a bug in the trp.py file, due to which the keys in the blockMap at line 119 are not found and a KeyError exception is thrown.
I ran this code on a sample PDF document, but it never completes. The PDF is valid, since it works from the console. When I terminate the job, this is what I get:
Started job with id: eb64ef4b16fd56ae8756387b9aff71b1f661a55a6128e5c4de8a2b43c8a2e397
Job status: IN_PROGRESS
^CTraceback (most recent call last):
File "run.py", line 50, in
if(isJobComplete(jobId)):
File "run.py", line 23, in isJobComplete
time.sleep(5)
KeyboardInterrupt
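Large PDFs can legitimately stay IN_PROGRESS for a while, but an unbounded polling loop hides real failures. A sketch of a poll-with-deadline helper (the names are mine; get_status would wrap get_document_text_detection and return the JobStatus field):

```python
import time

def wait_for_job(get_status, timeout=600, every=5):
    """Poll get_status() until it leaves IN_PROGRESS or the deadline passes.
    get_status() should return 'IN_PROGRESS', 'SUCCEEDED' or 'FAILED'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status != "IN_PROGRESS":
            return status
        time.sleep(every)
    raise TimeoutError("Textract job still IN_PROGRESS after %ss" % timeout)
```

With a timeout in place, a genuinely stuck job surfaces as a TimeoutError instead of requiring a manual Ctrl-C.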
I am reading the PDF as a bytearray and passing it to my analyser as follows:
response = self.textract.start_document_analysis(DocumentLocation={'Bytes': {docBytes}}, FeatureTypes=["TABLES", "FORMS"], NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
But this gives the following error:
TypeError: unhashable type: 'bytearray'
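Two separate problems are stacked here. First, {'Bytes': {docBytes}} wraps the bytes in a set literal, which is what triggers the unhashable-type error. Second, start_document_analysis only accepts an S3Object in DocumentLocation, never raw bytes; for in-memory bytes, the synchronous AnalyzeDocument is the call that takes Document={'Bytes': ...}. A sketch of the first fix:

```python
doc_bytes = bytearray(b"%PDF-1.4 ...")  # placeholder content

try:
    bad = {"Bytes": {doc_bytes}}        # inner {...} is a SET literal
except TypeError as err:
    print(err)                          # unhashable type: 'bytearray'

good = {"Bytes": bytes(doc_bytes)}      # plain dict value, as boto3 expects
# response = client.analyze_document(Document=good, FeatureTypes=["TABLES", "FORMS"])
print(type(good["Bytes"]).__name__)     # bytes
```

For a multi-page PDF, upload to S3 first and use DocumentLocation={'S3Object': {...}} with the async API instead.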
In the notebook https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract-Table-Merged-Cells-And-Headers.ipynb, the provided S3 object "s3://textract-table-merged-cells-data-sample/Textract-MergeCell-Statement.pdf" throws an error even when you have proper access. The workaround is to have the sample PDF in your own bucket, or local to the notebook.
InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
Python 3 and up do not support the print "..." statement syntax.
After fixing that, it threw a scraper error; I installed scraper, but still get the error.
I followed this example (same code, same image/pdf):
https://aws.amazon.com/blogs/machine-learning/merge-cells-and-column-headers-in-amazon-textract-tables/
print(df) shows:
Date Description Details Credits Debits Balance
0 2/4/2022 Life Insurance Payments Credit 445 9500.45
1 2/4/2022 Property Management Credit 300 9945.45
2 2/4/2022 Retail Store 4 Credit 65.75 10245.45
3 2/3/2022 Electricity Bill Credit 245.45 10311.2
4 2/3/2022 Water Bill Credit 312.85 10556.65
5 2/3/2022 Rental Deposit Credit 3000 10869.5
6 2/2/2022 Retail Store 3 Credit 125 7869.5
7 2/2/2022 Retail Store 2 Refund Debit 5.5 7994.5
8 2/2/2022 Retail Store 1 Credit 45.5 8000
9 2/1/2022 Shoe Store Refund Credit 33 8045.5
10 2/1/2022 Snack Vending Machine Debit 4 8012.5
Note the lack of "Amount".
Can anyone shed some light on this?
Hello!
Thank you for such a wonderful library; we are using it extensively. We have one issue at hand: if we run a multipage PDF, say of 200 pages, and any page in between is blank, it breaks the complete PDF conversion.
Please suggest a way we could avoid this, so that the PDF gets converted by skipping the blank page or the page with the error. Please guide.
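Until the underlying issue is fixed, one pragmatic workaround is to isolate the per-page work so a blank or broken page cannot abort the remaining pages. A sketch (parse_page is a stand-in for whatever per-page conversion you run):

```python
def convert_pages(pages, parse_page):
    """Run parse_page on every page; a blank/broken page yields None
    instead of aborting the remaining pages."""
    results = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append(parse_page(page))
        except Exception as exc:
            print("skipping page %d: %s" % (number, exc))
            results.append(None)
    return results
```

The None placeholders keep the output aligned with page numbers, so you can report exactly which pages were skipped.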
The overlayer is missing, which fails the notebook execution.
The repo seems great, no doubt about that. What would make it greater is a list of the libraries/modules that need to be installed for the files to run smoothly. Something like a requirements.txt would help.
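As a starting point, a requirements.txt inferred from the imports that appear across these issues and notebooks might look like the sketch below (the package list and the absence of version pins are assumptions, not tested minimums):

```
boto3
textract-trp
amazon-textract-caller
amazon-textract-prettyprinter
```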
When I start a Textract StartDocumentTextDetection job and try to set the NotificationChannel as follows:
response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        },
    },
    ClientRequestToken='string',
    JobTag='string',
    NotificationChannel={
        'RoleArn': snsRoleArn,
        'SNSTopicArn': snsTopicArn
    }
)
I get an InvalidParameterException:
An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
python 3.7
boto3 1.12.35
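One cause worth checking (an assumption on my part, based on the managed AmazonTextractServiceRole policy being scoped to topic ARNs matching AmazonTextract*): the SNS topic name used in NotificationChannel generally has to start with "AmazonTextract", and literal placeholder values like ClientRequestToken='string' are best replaced with real ones. A tiny pre-flight check (the helper name is mine):

```python
def check_notification_channel(role_arn, topic_arn):
    """Warn when the topic name is likely outside the default Textract
    service role's sns:Publish scope (AmazonTextract* only)."""
    topic_name = topic_arn.rsplit(":", 1)[-1]
    ok = topic_name.startswith("AmazonTextract")
    if not ok:
        print("topic %r may be outside the role's sns:Publish scope" % topic_name)
    return ok

print(check_notification_channel(
    "arn:aws:iam::123456789012:role/TextractRole",
    "arn:aws:sns:us-east-1:123456789012:AmazonTextractJobs"))  # True
```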
In this example the number of columns is manually defined (2 columns). I have a case where the pages of the PDF file vary from 1 to 5 columns; how can I detect this? Could you please give me an example with this code?
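The column count does not have to be hard-coded: every CELL block in the response carries a ColumnIndex (and a ColumnSpan for merged cells), so the widest cell tells you how many columns a page's table has. A sketch over raw response blocks (the function name is mine):

```python
def detect_column_count(blocks, page=1):
    """Max right edge (ColumnIndex + ColumnSpan - 1) over the CELL
    blocks of one page; 0 when the page has no table."""
    return max(
        (b["ColumnIndex"] + b.get("ColumnSpan", 1) - 1
         for b in blocks
         if b["BlockType"] == "CELL" and b.get("Page", 1) == page),
        default=0,
    )

sample = [
    {"BlockType": "CELL", "ColumnIndex": 1, "Page": 1},
    {"BlockType": "CELL", "ColumnIndex": 2, "ColumnSpan": 2, "Page": 1},
    {"BlockType": "CELL", "ColumnIndex": 1, "Page": 2},
]
print(detect_column_count(sample, page=1), detect_column_count(sample, page=2))  # 3 1
```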
Hi,
I found an issue while extracting tables from a document using AnalyzeDocument. Textract identified the proper table with a correct BBox. However, if I use the same info and try to extract the text, I am missing some information. Here are the samples for that.
Below Image is the cropped image which I got using Bbox info from textract ocr output.
Analyze Document output (after some postprocessing like including markdowns) : "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"
If you observe the image and output closely, "Internal Carbo" is missing in the first row, and "Incorporation" and "Remuneration" are missing in the second row of the first cell.
To address this, I tried to apply a canvas with the page size from which I fetched the table and created the image below. It still gives me the same output.
Next, I tried adding thresholding and got this image as an output.
Interestingly, this provided proper output - "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"
Here is how I created the sample threshold:
_, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
But there is a problem with colored images: the fixed-threshold solution I proposed won't work there, as it makes things worse. This is the issue I found, along with a hack; if there is a better approach, please feel free to post the solution. I ask the AWS team to have a look and fix this issue.
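For colored scans, a single global cut-off is indeed the wrong tool; converting to grayscale first and thresholding relative to the local brightness degrades much more gracefully (with OpenCV, that would be cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)). A dependency-free toy sketch of a mean-relative cut-off on a rows-of-pixels grayscale image:

```python
def binarize_mean_relative(gray, bias=10):
    """Toy binarization: compare each pixel to the image mean minus a
    bias. Real code should use a local window (as cv2.adaptiveThreshold
    does) so shaded/colored regions get their own cut-off."""
    flat = [p for row in gray for p in row]
    cut = sum(flat) / len(flat) - bias
    return [[255 if p > cut else 0 for p in row] for row in gray]

page = [[200, 210, 40], [190, 30, 220]]  # bright paper, dark ink
print(binarize_mean_relative(page))      # [[255, 255, 0], [255, 0, 255]]
```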
I'm using this example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/01-detect-text-local.py
I'm getting:
this is one
line
second line
more here
How can we group 'LINE' tokens into a single line?
The output I'm looking for is:
this is one line
second line more here
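LINE blocks can be re-joined by geometry: blocks whose BoundingBox.Top values are (nearly) equal sit on the same visual line. A sketch over raw blocks (the function name is mine, and the tolerance is a guess you would tune per document):

```python
def join_visual_lines(line_blocks, tol=0.01):
    """Merge LINE blocks sharing (approximately) the same Top into one
    text line, ordered top-to-bottom then left-to-right."""
    def box(b):
        return b["Geometry"]["BoundingBox"]
    rows = []
    for blk in sorted(line_blocks, key=lambda b: (box(b)["Top"], box(b)["Left"])):
        if rows and abs(rows[-1][0] - box(blk)["Top"]) < tol:
            rows[-1][1].append(blk["Text"])      # same visual row: append
        else:
            rows.append((box(blk)["Top"], [blk["Text"]]))
    return [" ".join(texts) for _, texts in rows]

def line(text, top, left):  # helper to build minimal LINE blocks
    return {"Text": text, "Geometry": {"BoundingBox": {"Top": top, "Left": left}}}

sample = [line("this is one", 0.10, 0.1), line("line", 0.103, 0.5),
          line("second line", 0.20, 0.1), line("more here", 0.202, 0.5)]
print(join_visual_lines(sample))  # ['this is one line', 'second line more here']
```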
Hello,
Currently I am performing OCR on a 1-page document that has multiple entities with the same name, each with a checkbox in front of it. I am able to detect all the values and whether each checkbox is selected using FORMS in AWS Textract, but I am not getting the data in sequence.
Below I have attached 2 files with the same data; in both files all entities are detected, but in random order.
Here is the code I am using:
import boto3
import sys
import re
import json
from collections import defaultdict

def get_kv_map(file_name):
    with open(file_name, 'rb') as file:
        img_test = file.read()
    bytes_test = bytearray(img_test)
    print('Image loaded', file_name)
    # process using image bytes
    client = boto3.client('textract')
    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])
    # Get the text blocks
    blocks = response['Blocks']
    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map

def get_kv_relationship(key_map, value_map, block_map):
    kvs = defaultdict(list)
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)
        kvs[key].append(val)
    return kvs

def find_value_block(key_block, value_map):
    for relationship in key_block['Relationships']:
        if relationship['Type'] == 'VALUE':
            for value_id in relationship['Ids']:
                value_block = value_map[value_id]
    return value_block

def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text

def print_kvs(kvs):
    for key, value in kvs.items():
        print(key, ":", value)

def search_value(kvs, search_key):
    for key, value in kvs.items():
        if re.search(search_key, key, re.IGNORECASE):
            return value

def main(file_name):
    key_map, value_map, block_map = get_kv_map(file_name)
    # Get Key Value relationship
    kvs = get_kv_relationship(key_map, value_map, block_map)
    print("\n\n== FOUND KEY : VALUE pairs ===\n")
    print_kvs(kvs)
    return kvs

if __name__ == "__main__":
    file_name = sys.argv[1]
    d = main("./data.png")
So how can I get the details in sequence rather than in random order?
For the buyer entity, this is the data from one file: ['', '', '', '', '', '', '', '', '', '', '', '', 'X ', '', 'X ', '', '']
For the same data, this is the buyer response from the other file: ['', '', '', '', '', '', '', '', '', 'X ', '', '', '', 'X ', '', '', '']
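key_map is a plain dict keyed by block ID, so iterating it follows the response's block order, not reading order. Sorting the KEY blocks by their geometry before building the pairs gives a stable top-to-bottom, left-to-right sequence. A sketch that would slot in front of get_kv_relationship (the function name is mine; rounding Top groups keys on the same visual row):

```python
def keys_in_reading_order(key_map):
    """KEY blocks sorted by page, then top-to-bottom, then left-to-right."""
    def pos(block):
        box = block["Geometry"]["BoundingBox"]
        return (block.get("Page", 1), round(box["Top"], 2), box["Left"])
    return sorted(key_map.values(), key=pos)

# for key_block in keys_in_reading_order(key_map):
#     value_block = find_value_block(key_block, value_map)
#     ...
```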
I'm having trouble using the call_textract() function for paginated outputs in the recent package you created.
Referring to https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
After successfully installing the dependencies, what am I supposed to import?
I've been trying to set up my textract with python, and I'm using PyCharm IDE for this.
So I wanted to test the first file on my local machine and ran "01-detect-text-local.py", but for some reason it tells me "you must specify a region", failing inside botocore\region.py. But this is supposed to be running locally, so why is it trying to find an AWS region at this point? Am I missing something here? What does it mean to run on a local machine?
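"Local" in that sample only refers to where the image file lives; 01-detect-text-local.py still sends the bytes to the Textract service over the network, so botocore needs a region like any other AWS call. Either configure one globally or pass it explicitly (us-east-1 below is just an example region, not a requirement):

```python
import os

# Option 1: environment variable (also settable via `aws configure`):
os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1")

# Option 2: explicit on the client:
#   client = boto3.client("textract", region_name="us-east-1")
print(os.environ["AWS_DEFAULT_REGION"])
```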
When getting more blocks than the "MaxResults" parameter allows, parsing of the document fails. Therefore the parser must be able to support broken references.
The failure is reproduced in the following repository:
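Beyond fetching every NextToken page before parsing, a tolerant lookup is the defensive fix on the parser side: resolve child IDs with dict.get so a reference into a not-yet-fetched page yields nothing instead of a KeyError. A sketch (the function name is mine):

```python
def child_words(block, block_map):
    """WORD texts of a block's children, skipping dangling IDs that
    point into response pages beyond MaxResults."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cid in rel["Ids"]:
            child = block_map.get(cid)  # .get(): missing IDs are skipped
            if child and child["BlockType"] == "WORD":
                words.append(child["Text"])
    return words

bmap = {"w1": {"BlockType": "WORD", "Text": "hello"}}
line_blk = {"Relationships": [{"Type": "CHILD", "Ids": ["w1", "missing-id"]}]}
print(child_words(line_blk, bmap))  # ['hello']
```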
How can we write integration tests for them?