When parsing some multi-page outputs, there's a bug in the t

Bug in parsing for Document about amazon-textract-code-samples HOT 6 CLOSED

aws-samples commented on September 22, 2024

Bug in parsing for Document

from amazon-textract-code-samples.

Comments (6)

stephenh20 commented on September 22, 2024

I had the same problem. Even though the code loops over the pages first, it cannot seem to find anything on subsequent pages. I placed a check, if cid in blockMap:, around line 119, which resulted in getting only the keys found on page 1 of a three-page PDF.
If I split the PDF into three separate pages, I get the expected results for each page.
The code has a loop around the pages, but it can't seem to find anything past page 1 in trp.Document(response).

tshrjn commented on September 22, 2024

@stephenh20 Did you find a workaround, or do you have a script that splits the PDF into separate pages, runs the parser on each one separately, and stitches the results together afterwards?

A small workaround, though I've yet to test it properly:

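        from trp import Document

        document_by_pages = []
        # DocumentMetadata.Pages is an integer page count, so iterate over 1..N
        for page in range(1, response['DocumentMetadata']['Pages'] + 1):
            page_resp = {k: v for k, v in response.items() if k != 'Blocks'}
            # Use a fresh metadata dict so the original response is not mutated
            page_resp['DocumentMetadata'] = {'Pages': 1}  # since we're splitting into multiple documents

            # Blocks without a 'Page' key are treated as belonging to page 1
            page_resp['Blocks'] = [x for x in response['Blocks'] if x.get('Page', 1) == page]
            document_by_pages.append(Document(page_resp))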

stephenh20 commented on September 22, 2024

I split the PDF into pages and then work on the pages separately.
I then search through each page to see whether the info I need is there.

mcaires2 commented on September 22, 2024

Hey guys,
I am from Brazil. I am not an expert on coding (I only do it for my personal projects).
Take a look at my GitHub:

https://github.com/mcaires2/textratc_with_aws_pdf_multiple_pages_tables_text/blob/master/textract_tables_pdf_multiple_pages_to_excel.py

From the English README, look at this part:

"Extracting text from a PDF went smoothly even when the file had many pages.
The same cannot be said for extracting tables. In this case, when the returned result has more than one batch ('token'), a key lookup failed while building tables whose information was spread across the first and second batches.
We got around this with a simple technique: looping forwards through the 'pages' list and adding (+) (not appending) all the blocks ['Blocks'] before building the tables."

contador = 0  # "contador" means "counter" in English; it is just an integer variable
csv = ''
blocks = []
for item in pages:
    # Concatenate the blocks of every batch (needed when tables span more than one token)
    blocks = blocks + pages[contador]['Blocks']
    contador = contador + 1
    print(contador)
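
The same idea, merging the blocks of every batch before building any tables, can be sketched as a small function. This is only an illustration under stated assumptions: `merge_textract_pages` is a hypothetical helper name, `pages` is assumed to be a list of raw response dicts (one per token), and the sample blocks are made up.

```python
from typing import Any, Dict, List


def merge_textract_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine a list of paginated Textract responses into one response dict."""
    if not pages:
        return {"Blocks": []}
    # Keep top-level fields (e.g. DocumentMetadata) from the first batch,
    # but drop the per-batch Blocks and NextToken.
    merged = {k: v for k, v in pages[0].items() if k not in ("Blocks", "NextToken")}
    merged["Blocks"] = []
    for page in pages:
        merged["Blocks"].extend(page.get("Blocks", []))
    return merged


# Made-up example: a TABLE block in batch 1 whose CELL child arrives in batch 2.
batch_1 = {
    "DocumentMetadata": {"Pages": 2},
    "NextToken": "abc",
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1"]}]},
    ],
}
batch_2 = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [{"Id": "c1", "BlockType": "CELL"}],
}

merged = merge_textract_pages([batch_1, batch_2])
# With all blocks in one list, a lookup by Id finds children from later batches.
block_map = {b["Id"]: b for b in merged["Blocks"]}
```

Building the Id lookup only after merging is what avoids the missing-key error when a table's cells land in a later batch.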

tshrjn commented on September 22, 2024

Actually, the issue I was facing was a bit different. When I searched for the missing block ID, the Textract response did not contain that block at all, even though the ID was listed in CHILD relationships.

There were many LINE blocks (~9%, across 200 docs) whose CHILD relationships listed block IDs for the words in the sentence that were missing from the API's response.

So, the issue in my case came from Textract's response.
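
A quick way to check whether a response has this problem is to compare every CHILD Id against the set of block Ids actually present. The following is a minimal sketch, assuming the usual response layout (`Id`, `Relationships`, `Type`, `Ids`); `find_missing_child_ids` is a hypothetical helper name and the sample blocks are made up.

```python
from typing import Any, Dict, List


def find_missing_child_ids(blocks: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each parent block Id to the CHILD Ids that are absent from `blocks`."""
    known_ids = {block["Id"] for block in blocks}
    missing: Dict[str, List[str]] = {}
    for block in blocks:
        # Relationships may be absent (or None) on leaf blocks such as WORD.
        for rel in block.get("Relationships") or []:
            if rel.get("Type") != "CHILD":
                continue
            absent = [cid for cid in rel.get("Ids", []) if cid not in known_ids]
            if absent:
                missing.setdefault(block["Id"], []).extend(absent)
    return missing


# Made-up example: the LINE references two words, but only one is in the response.
sample_blocks = [
    {"Id": "line-1", "BlockType": "LINE",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
    {"Id": "word-1", "BlockType": "WORD"},
]
report = find_missing_child_ids(sample_blocks)
```

Run over the full, merged block list, an empty report suggests any lookup failure comes from pagination handling, while a non-empty one points at the response itself.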

schadem commented on September 22, 2024

I added a package and a method to call Textract that also works for paginated output: https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
Just use call_textract from that package and you'll be good.
Closing this for now.
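
For readers who want to see what handling paginated output involves, here is a hand-written sketch of following NextToken with boto3's `get_document_analysis`. It is not the code of the package linked above. It assumes the asynchronous job has already succeeded, `fetch_all_blocks` is a hypothetical helper name, and `FakeTextractClient` is a made-up stand-in so the loop can run without AWS access.

```python
from typing import Any, Dict, List


def fetch_all_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Follow NextToken until the job's results are exhausted; return all blocks."""
    blocks: List[Dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks


class FakeTextractClient:
    """Stand-in for boto3.client('textract'), returning two canned batches."""

    def __init__(self) -> None:
        self._batches = {
            None: {"Blocks": [{"Id": "a"}], "NextToken": "t1"},
            "t1": {"Blocks": [{"Id": "b"}]},
        }
        self.calls = 0

    def get_document_analysis(self, JobId: str, NextToken: Any = None) -> Dict[str, Any]:
        self.calls += 1
        return self._batches[NextToken]


fake = FakeTextractClient()
all_blocks = fetch_all_blocks(fake, job_id="hypothetical-job-id")
```

With a real client, `boto3.client("textract")` would be passed in place of the fake, and the combined block list could then be handed to the parser as a single response.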
