Comments (6)

stephenh20 avatar stephenh20 commented on September 22, 2024

I had the same problem. Even though the code loops over the pages first, it does not seem to find things that are on subsequent pages. I placed an "if cid in blockMap:" check around line 119, which resulted in getting only the keys found on page 1 of a three-page PDF.
If I split the PDF into three separate pages, I get the expected results for each page.
The code has a loop around the pages, but it can't seem to find anything past page 1 in trp.Document(response).
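
One quick way to check whether the later pages made it into the response at all is to count blocks per page. This is only a sketch: blocks_per_page is a made-up helper name, the sample dict is hand-written rather than real Textract output, and treating a block without a 'Page' key as page 1 is an assumption.

```python
from collections import Counter


def blocks_per_page(response):
    """Count blocks per page number in a Textract-style response dict."""
    # Assumption: a block without a 'Page' key belongs to page 1.
    return Counter(block.get('Page', 1) for block in response.get('Blocks', []))


# Tiny hand-made response for illustration only (not real Textract output).
sample = {'Blocks': [
    {'Id': 'a', 'BlockType': 'PAGE', 'Page': 1},
    {'Id': 'b', 'BlockType': 'LINE', 'Page': 1},
    {'Id': 'c', 'BlockType': 'PAGE', 'Page': 2},
]}
print(dict(blocks_per_page(sample)))  # {1: 2, 2: 1}
```

If only page 1 shows up in the counts, the blocks for the other pages were probably never fetched, which points at the calling code rather than at the parser.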

from amazon-textract-code-samples.

tshrjn avatar tshrjn commented on September 22, 2024

@stephenh20 Did you find a workaround, or do you have a script that splits the PDF into separate pages, runs the parser on each one separately, and stitches the results together afterwards?

A small workaround, though I've yet to test it properly:

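        from trp import Document  # amazon-textract-response-parser

        document_by_pages = []
        num_pages = response['DocumentMetadata']['Pages']  # an integer page count, not a list
        for page in range(1, num_pages + 1):  # Textract page numbers start at 1
            page_resp = {k: v for k, v in response.items() if k != 'Blocks'}
            # Copy the metadata so the original response is left untouched;
            # Pages is 1 since we're splitting into multiple documents.
            page_resp['DocumentMetadata'] = dict(response['DocumentMetadata'], Pages=1)
            page_resp['Blocks'] = [x for x in response['Blocks'] if x['Page'] == page]
            document_by_pages.append(Document(page_resp))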

stephenh20 avatar stephenh20 commented on September 22, 2024

I split the pdf into pages then work on the pages separately.
I then search through each page to see if the info I need is there.

mcaires2 avatar mcaires2 commented on September 22, 2024

Hey guys,
I am from Brazil. I am not an expert at coding (I only do it for my personal projects).
Take a look at my GitHub:

https://github.com/mcaires2/textratc_with_aws_pdf_multiple_pages_tables_text/blob/master/textract_tables_pdf_multiple_pages_to_excel.py

From the English README, look at this part:

"Extracting text from a PDF went smoothly even when the file had many pages.
The same cannot be said for extracting tables. In that case, when the returned result comes in more than one batch ('token'), a key lookup failed while building tables whose information was split between the first and second batches.
We got around this with a simple technique: looping forward through the 'pages' list and concatenating (+, not append) all the ['Blocks'] lists before building the tables."

        contador = 0  # "contador" is Portuguese for "counter"; it is just an integer variable
        csv = ''
        blocks = []
        for item in pages:
            # Concatenate the blocks of every batch; needed when tables span more than one token
            blocks = blocks + pages[contador]['Blocks']
            contador = contador + 1
            print(contador)
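
The same idea without the counter, as a minimal sketch. It assumes "pages" is a list of partial response dicts that each carry a 'Blocks' list, as in the snippet above; merge_blocks is a hypothetical helper name and the sample data is hand-written, not real Textract output.

```python
def merge_blocks(pages):
    """Concatenate the 'Blocks' lists of several partial responses, in order."""
    blocks = []
    for page in pages:
        # extend() adds the individual blocks, like "+" above;
        # append() would nest a whole list as a single element.
        blocks.extend(page.get('Blocks', []))
    return blocks


# Two hand-made batches for illustration only.
pages = [
    {'Blocks': [{'Id': 'table-1'}, {'Id': 'cell-1'}]},
    {'Blocks': [{'Id': 'cell-2'}]},
]
print([b['Id'] for b in merge_blocks(pages)])  # ['table-1', 'cell-1', 'cell-2']
```

Building the tables only after all batches are merged means a table in one batch can resolve cell IDs that arrived in a later batch.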

tshrjn avatar tshrjn commented on September 22, 2024

Actually, the issue I was facing was a bit different. When I searched for the missing block ID, the Textract response didn't actually contain that block, yet the ID did appear in CHILD relationships.

There were many LINE blocks (~9%, from 200 docs) whose CHILD relationships listed child block IDs, for the words in the sentence, that were missing from the API's response.

So, in my case, the issue came from Textract's response itself.
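
A check along these lines can be sketched as follows. It assumes the usual block layout with 'Id' and 'Relationships' entries of the form {'Type': 'CHILD', 'Ids': [...]}; find_missing_children is a hypothetical helper name and the sample blocks are hand-written, not real Textract output.

```python
def find_missing_children(blocks):
    """Map each block Id to the CHILD Ids it references that are not in `blocks`."""
    known_ids = {block['Id'] for block in blocks}
    missing = {}
    for block in blocks:
        # 'Relationships' may be absent (or None) on leaf blocks such as WORD.
        for rel in block.get('Relationships') or []:
            if rel.get('Type') != 'CHILD':
                continue
            absent = [cid for cid in rel.get('Ids', []) if cid not in known_ids]
            if absent:
                missing.setdefault(block['Id'], []).extend(absent)
    return missing


# Hand-made blocks for illustration: 'w2' is referenced but never delivered.
blocks = [
    {'Id': 'line-1', 'BlockType': 'LINE',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1', 'w2']}]},
    {'Id': 'w1', 'BlockType': 'WORD'},
]
print(find_missing_children(blocks))  # {'line-1': ['w2']}
```

Running this on the merged blocks of a complete response helps tell the two cases apart: IDs that are only missing because a later batch was never fetched, and IDs that are absent from the full response.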

schadem avatar schadem commented on September 22, 2024

I added a package and a method for calling Textract that also works with paginated output: https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
Just use call_textract from that package and you'll be good.
Closing this for now
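
For readers who want to see what handling paginated output involves, here is a rough sketch that follows NextToken by hand. It is not how call_textract is implemented. fetch_all_blocks is a hypothetical helper, "client" is assumed to behave like a boto3 Textract client exposing get_document_analysis, and the job is assumed to have already finished successfully (no JobStatus check is made).

```python
def fetch_all_blocks(client, job_id):
    """Collect the blocks of every result batch of an asynchronous analysis job."""
    blocks = []
    kwargs = {'JobId': job_id}
    while True:
        resp = client.get_document_analysis(**kwargs)
        blocks.extend(resp.get('Blocks', []))
        token = resp.get('NextToken')
        if not token:
            # No NextToken means this was the last batch.
            return blocks
        kwargs['NextToken'] = token
```

With boto3 this would be called as fetch_all_blocks(boto3.client('textract'), job_id). The combined list can then be put back into a single response dict under 'Blocks' before it is handed to the parser.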
