hegghammer / dair
R package for Google Document AI
Home Page: https://dair.info/
License: Other
I followed the various vignettes and created and stored the processor ID, but I get the following error when trying to use dai_async():
Input: A PDF
Output: A PDF
The output will look like the input, yet will have invisible OCR'd text in the same position as the text-as-image. I guess this could be accomplished using the bounding boxes from Google's JSON output and then using something like rMarkdown to recreate the PDFs (i.e., add the image to a page, then add the OCR'd text to the page)?
There is an example Python script here that essentially seems to do the job: https://shreevatsa.net/post/add-ocr-layer-to-pdf/
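As a rough sketch of the coordinate step such a feature would need: it assumes the JSON output gives each token's bounding box as normalized vertices (x and y between 0 and 1, origin at the top left of the page), and that the PDF page uses points with the origin at the bottom left. The function name norm_to_pdf_box and the page size below are hypothetical and not part of daiR.

```r
# Hypothetical helper (not part of daiR): convert normalized bounding-box
# vertices (origin top-left, values in [0, 1]) to a PDF rectangle in points
# (origin bottom-left).
norm_to_pdf_box <- function(xs, ys, page_w, page_h) {
  stopifnot(length(xs) == length(ys), length(xs) >= 2)
  c(
    x_left   = min(xs) * page_w,
    y_bottom = (1 - max(ys)) * page_h,  # flip the vertical axis
    x_right  = max(xs) * page_w,
    y_top    = (1 - min(ys)) * page_h
  )
}

# Example: one token on a US Letter page (612 x 792 points, assumed size)
box <- norm_to_pdf_box(
  xs = c(0.10, 0.50, 0.50, 0.10),
  ys = c(0.20, 0.20, 0.25, 0.25),
  page_w = 612, page_h = 792
)
print(box)
```

The invisible text for each token would then be placed inside that rectangle on top of the page image; the drawing itself is left out because it depends on the PDF library chosen.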
Great docs and very excited to start using this package.
I'm having a problem with setup.
I get this error when I run dai_sync
> response1 <- dai_sync("test-jtm-1.pdf")
File submitted at 2021-06-01 15:02:26. HTTP status: 404 - unsuccessful.
Error: "Processor with id '4***********4' not found."
I set the processor ID variable using DAI_PROCESSOR_ID= in .Renviron.
I am able to process the document in the application console.
I'm sure I'm not giving everything I need to identify the problem...sorry.
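One thing that may be worth ruling out is whether the running R session actually sees the value stored in .Renviron. The sketch below uses only base R; the helper name check_processor_id is hypothetical, and the checks (empty value, stray whitespace, stray quotes) are guesses at common causes rather than a confirmed diagnosis of this error.

```r
# Hypothetical helper (not part of daiR): inspect the environment variable
# as the current R session sees it.
check_processor_id <- function(var = "DAI_PROCESSOR_ID") {
  value <- Sys.getenv(var, unset = "")
  list(
    is_set         = nzchar(value),            # FALSE if R never picked it up
    has_whitespace = grepl("\\s", value),      # spaces or tabs in the value
    has_quotes     = grepl("[\"']", value),    # quote characters kept in the value
    n_chars        = nchar(value)              # compare with the ID in the console
  )
}

str(check_processor_id())
```

R reads .Renviron at startup, so after editing the file either restart the session or call readRenviron("~/.Renviron") before checking again.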
The vignette on mathematical splitting gives this function call: new_block_df <- split_block(df = block_df, block = 12, cut_point = 50)
However, the data frame argument does not take df =, and including it gives an error. Should the vignette be updated to new_block_df <- split_block(block_df, block = 12, cut_point = 50)?
Hi Thomas, it's me again. Obviously I'm enjoying daiR!
I'm working on multipage journals with complex layouts. I can reorder and split blocks. However I'm finding I'm getting a bit lost when I make a lot of changes. I wonder if you might be able to suggest a procedure that would redraw blocks based on my reordering and splits? That way, I could confirm that I have things in the right order.
I'm thinking that what's needed is to update my .json file using the new tokens and blocks, then pass that revised file through draw_blocks. But I can't see how. Any suggestions?
If it's not too complex, this might be a useful addition to the vignette, as I'd assume that some others would have the same need.
Thanks for your work! Will
Hi Thomas, thank you for this amazing package. I was trying to replicate the User Guide example daiR::extracting_tables with the following code:
library(daiR)
dest_path <- file.path(tempdir(), "tobacco.pdf")
download.file("https://archive.org/download/tobacco_lpnn0000/lpnn0000.pdf",
destfile = dest_path,
mode = "wb")
resp <- dai_sync_tab(dest_path)
However, the function dai_sync_tab always results in an error message:
File submitted at 2021-12-21 18:16:25. HTTP status: 408 - unsuccessful.
Did you encounter this before? Do you know what causes this problem?
Thanks in advance
Tobias
P.S. FYI, I set up authentication following this guide (https://dair.info/articles/setting_up_google_storage.html) and everything seems to be working fine.
Hi Thomas,
First of all thanks for the package and for the very detailed vignettes on how to use it.
I have encountered a problem when using draw_blocks(). I get the following error:
Error in base64enc::base64decode(page_imgs[i], outconn) : I can only decode base64 strings
Here is the code I use:
resp <- gcs_upload("image1.pdf")
resp1 <- dai_async_tab("image1.pdf", bucket = bucket,
dest_folder = "process_json/")
our_file <- "process_json/image1.pdf-output-page-1-to-1.json"
json_file <- "process_json/image1.json"
gcs_get_object(our_file,
saveToDisk = json_file,
overwrite = TRUE)
draw_blocks(json_file, dir = tempdir())
Do you know what causes this problem?
Thanks in advance!
When a processor ID is entered incorrectly in the environment variable and you submit a request using dai_sync, the request hangs with no timeout.