Comments (6)
Had the same problem. Even though the code loops over pages first, it seems unable to find things that are on subsequent pages. I placed "if cid in blockMap:" around line 119, which resulted in keys being found for page 1 only in a three-page PDF.
If I split the PDF into 3 separate pages, I get the expected results for each page.
The code has a loop around the pages, but it can't seem to find anything past page 1 in trp.Document(response).
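One way to check whether the response itself stops at page 1 is to count blocks per page before handing it to the parser. This is only a minimal sketch: the helper names blocks_per_page and missing_pages are hypothetical, and it assumes the response is a plain dict with the usual 'DocumentMetadata' and 'Blocks' keys.

```python
from collections import Counter


def blocks_per_page(response):
    """Count blocks per page in a Textract-style response dict."""
    # Assumption: blocks that carry no 'Page' key belong to page 1.
    return Counter(block.get('Page', 1) for block in response.get('Blocks', []))


def missing_pages(response):
    """Return page numbers announced in DocumentMetadata that no block refers to."""
    expected = response['DocumentMetadata']['Pages']
    seen = blocks_per_page(response)
    return [page for page in range(1, expected + 1) if seen[page] == 0]
```

If missing_pages(response) is not empty, the dict probably holds only one part of a paginated result, which would explain why nothing past page 1 is found.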
from amazon-textract-code-samples.
@stephenh20 Did you find a workaround, or do you have a script that splits the PDF into separate pages, runs the parser on each one, and stitches the results together later?
A small workaround, though I've yet to test it properly:
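from trp import Document

# 'Pages' is an integer page count, so iterate over page numbers 1..N
num_pages = response['DocumentMetadata']['Pages']
document_by_pages = []
for page in range(1, num_pages + 1):
    page_resp = {k: v for k, v in response.items() if k != 'Blocks'}
    # copy the metadata instead of mutating the original; each split document has one page
    page_resp['DocumentMetadata'] = {**response['DocumentMetadata'], 'Pages': 1}
    page_resp['Blocks'] = [x for x in response['Blocks'] if x.get('Page', 1) == page]
    document_by_pages.append(Document(page_resp))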
I split the pdf into pages then work on the pages separately.
I then search through each page to see if the info I need is there.
Hey Guys,
I am from Brazil. I am not an expert at coding (I only do it for my personal projects).
Look at my github:
In the English README, see this part:
"Extracting text from a PDF went smoothly when the file has many pages.
The same cannot be said when we went to extract tables. In this case, when the return file has more than one batch ('token'), an error occurred to find a key when building the tables whose information was in the first and second batch.
We get around this using a simple technique of looping forwards in the 'pages' list and adding (+)(not append) all the blocks ['Blocks'] before building the tables."
contador = 0  # "contador" means "counter" in English; it is just an integer variable
csv = ''
blocks = []
for item in pages:
    blocks = blocks + pages[contador]['Blocks']  # when you have tables on more than one Token
    contador = contador + 1
    print(contador)
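For context, the 'pages' list above would typically come from following NextToken until the job has no more parts. The following is a rough sketch, not the code from the README: get_all_blocks is a hypothetical helper, client is assumed to be a boto3 Textract client (or anything exposing get_document_analysis), and the job is assumed to have already finished successfully.

```python
def get_all_blocks(client, job_id):
    """Collect Blocks from every paginated part of an asynchronous Textract job."""
    blocks = []
    next_token = None
    while True:
        kwargs = {'JobId': job_id}
        if next_token:
            kwargs['NextToken'] = next_token
        part = client.get_document_analysis(**kwargs)
        # extend (like + above) keeps blocks as one flat list across all parts
        blocks.extend(part.get('Blocks', []))
        next_token = part.get('NextToken')
        if not next_token:
            return blocks
```

Building the tables from the combined list means a cell whose ID lives in the second part can still be resolved from a table that started in the first.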
Actually, the issue I was facing was a bit different. When I searched for the missing block ID, the Textract response didn't contain that block, even though the ID was listed in CHILD relationships.
There were many "LINE" blocks (~9% from 200 docs) whose CHILD relationships listed child block IDs for the words in the sentence, and all of those child blocks were missing from the API's response.
So, in my case, the issue came from Textract's response.
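A check like the one described can be sketched as follows. This is an illustration rather than the script actually used: find_dangling_children is a hypothetical name, and blocks is assumed to be the complete 'Blocks' list with all paginated parts already combined.

```python
def find_dangling_children(blocks):
    """Map each block Id to the CHILD Ids it references that are absent from blocks."""
    known_ids = {block['Id'] for block in blocks}
    dangling = {}
    for block in blocks:
        # 'Relationships' may be missing or None on leaf blocks such as WORD
        for rel in block.get('Relationships') or []:
            if rel.get('Type') != 'CHILD':
                continue
            missing = [cid for cid in rel.get('Ids', []) if cid not in known_ids]
            if missing:
                dangling.setdefault(block['Id'], []).extend(missing)
    return dangling
```

An empty result means every CHILD reference resolves; a non-empty one points at the blocks a parser would fail on with a KeyError.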
I added a package and method to call Textract which also works for paginated output. https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
Just use call_textract from that and you'll be good.
Closing this for now