hegghammer / dair

R package for Google Document AI

Home Page: https://dair.info/

License: Other

R 98.99% CSS 0.03% TeX 0.97%

google-cloud ocr r

dair's Issues

[PROBLEM]: Processor ID not found

I followed the various vignettes and created and stored the processor ID, but I get the following error when trying to use dai_async():

Error: "Processor with id 'XYZ' not found."

macOS Ventura 13.4
R version 4.3.0
RStudio 2023.06.0+421

[FEATURE]: Make PDF Searchable: Overlay OCR'd text over Image?

Input: A PDF
Output: A PDF

The output would look like the input but carry invisible OCR'd text in the same position as the text in the image. I guess this could be accomplished by taking the bounding boxes from Google's JSON output and using something like R Markdown to recreate the PDF (i.e., add the image to a page, then add the OCR'd text on top)?

There is an example Python script here that essentially seems to do the job: https://shreevatsa.net/post/add-ocr-layer-to-pdf/
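
One step of the overlay idea can be sketched in base R: turning a bounding box into PDF page coordinates. This is a minimal sketch, assuming the box is given as normalized vertices (x and y between 0 and 1, origin at the top-left of the page) and that the target PDF measures from the bottom-left in points. The function name norm_box_to_pdf and the US Letter page size are illustrative assumptions, not part of daiR.

```r
# Hypothetical helper, not part of daiR: convert normalized bounding-box
# vertices (origin top-left) into a PDF rectangle (origin bottom-left).
norm_box_to_pdf <- function(xs, ys, page_width, page_height) {
  list(
    left   = min(xs) * page_width,
    right  = max(xs) * page_width,
    # Flip the y axis: a small normalized y is near the top of the page.
    bottom = (1 - max(ys)) * page_height,
    top    = (1 - min(ys)) * page_height
  )
}

# Example box on an assumed 612 x 792 pt page (US Letter).
box <- norm_box_to_pdf(
  xs = c(0.10, 0.50, 0.50, 0.10),
  ys = c(0.20, 0.20, 0.25, 0.25),
  page_width = 612,
  page_height = 792
)
print(box)
```

The invisible text for each token would then be placed inside the resulting rectangle; how to write that text layer is left open here.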

Processor ID problem

Great docs and very excited to start using this package.

I'm having a problem with setup.

I get this error when I run dai_sync:

> response1 <- dai_sync("test-jtm-1.pdf")
File submitted at 2021-06-01 15:02:26. HTTP status: 404 - unsuccessful.
Error: "Processor with id '4***********4' not found."

I set the processor id variable using DAI_PROCESSOR_ID= in .Renviron.

I am able to process the document in the application console.

I'm sure I'm not giving everything I need to identify the problem...sorry.
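
One thing worth ruling out with this kind of setup problem is whether the running R session actually sees the value written to .Renviron, since that file is only read when R starts. The following is a minimal sketch using base R only; env_var_is_set is a hypothetical helper name, not a daiR function, and it does not check whether the ID is valid on Google's side.

```r
# Hypothetical helper, not part of daiR: report whether an environment
# variable is visible (and non-empty) in the current R session.
env_var_is_set <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    message(name, " is empty or unset; restart R after editing .Renviron.")
    return(FALSE)
  }
  # Print only the length so the ID itself is not exposed in logs.
  message(name, " is set (", nchar(value), " characters).")
  TRUE
}

env_var_is_set("DAI_PROCESSOR_ID")
```

If this returns TRUE and the 404 persists, the cause lies elsewhere, for example in the ID itself or in the project it belongs to.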

vignette: mathematical splitting

The vignette on mathematical splitting gives this function: new_block_df <- split_block(df = block_df, block = 12, cut_point = 50)

However, naming the data frame argument with df = gives an error; the function works when the data frame is passed without a name. Update the vignette to new_block_df <- split_block(block_df, block = 12, cut_point = 50)?
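
The behaviour reported here is what R does whenever a call names an argument that the function does not define. A minimal sketch with a stand-in function follows; the signature of split_block_demo is an assumption made for illustration and is not taken from daiR's source.

```r
# Stand-in with a hypothetical signature; this is NOT daiR's split_block().
split_block_demo <- function(block_df, block, cut_point) {
  paste0("splitting block ", block, " at ", cut_point)
}

blocks <- data.frame(block = 1:12)

# Naming the first argument "df" fails, because no formal argument has
# that name (and "df" is not a prefix of "block_df").
named_call <- tryCatch(
  split_block_demo(df = blocks, block = 12, cut_point = 50),
  error = function(e) conditionMessage(e)
)

# Passing the data frame positionally works.
positional_call <- split_block_demo(blocks, block = 12, cut_point = 50)

print(named_call)
print(positional_call)
```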

redraw blocks?

Hi Thomas, it's me again. Obviously I'm enjoying daiR!

I'm working on multipage journals with complex layouts. I can reorder and split blocks. However, I find I get a bit lost when I make a lot of changes. Could you suggest a procedure that would redraw the blocks based on my reordering and splits? That way, I could confirm that I have things in the right order.

I'm thinking that what's needed is to update my .json file using the new tokens and blocks, then pass that revised file through draw_blocks. But I can't see how. Any suggestions?

If it's not too complex, this might be a useful addition to the vignette, as I'd assume that some others would have the same need.

Thanks for your work! Will

HTTP status: 408 - unsuccessful.

Hi Thomas, thank you for this amazing package. I was trying to replicate the User Guide example daiR::extracting_tables with the following code:

library(daiR)
dest_path <- file.path(tempdir(), "tobacco.pdf")
download.file("https://archive.org/download/tobacco_lpnn0000/lpnn0000.pdf",
              destfile = dest_path,
              mode = "wb")

resp <- dai_sync_tab(dest_path)

However, the function dai_sync_tab always results in an error message:
File submitted at 2021-12-21 18:16:25. HTTP status: 408 - unsuccessful.

Did you encounter this before? Do you know what causes this problem?

Thanks in advance
Tobias

P.S: Fyi, I set up authentication following this guide (https://dair.info/articles/setting_up_google_storage.html) and everything seems to be working fine.

Problem in draw_blocks

Hi Thomas,
First of all thanks for the package and for the very detailed vignettes on how to use it.
I have encountered a problem when using draw_blocks(). I get the following error:
Error in base64enc::base64decode(page_imgs[i], outconn) : I can only decode base64 strings

Here is the code I use:

resp <- gcs_upload("image1.pdf")
resp1 <- dai_async_tab("image1.pdf", bucket = bucket,
                       dest_folder = "process_json/")
our_file <- "process_json/image1.pdf-output-page-1-to-1.json"
json_file <- "process_json/image1.json"
gcs_get_object(our_file, 
               saveToDisk = json_file,
               overwrite = TRUE)

draw_blocks(json_file, dir = tempdir())

Do you know what causes this problem?

Thanks in advance!

Long timeout

When the processor ID is entered incorrectly in the environment variable, a request submitted with dai_sync hangs indefinitely, with no timeout.
