dair's People

Contributors

hegghammer

dair's Issues

[PROBLEM]: Processor ID not found

I followed the various vignettes and created and stored the processor ID, but I get the following error when trying to use dai_async():

  • Error: "Processor with id 'XYZ' not found."

  • macOS Ventura 13.4
  • R version 4.3.0
  • RStudio 2023.06.0+421

[FEATURE]: Make PDF Searchable: Overlay OCR'd text over Image?

Input: A PDF
Output: A PDF

The output will look like the input, but will carry invisible OCR'd text in the same position as the text-as-image. I guess this could be accomplished by taking the bounding boxes from Google's JSON output and then using something like R Markdown to recreate the PDF (i.e., add the image to a page, then add the OCR'd text on top of it)?

There is an example Python script here that essentially seems to do the job: https://shreevatsa.net/post/add-ocr-layer-to-pdf/

Processor ID problem

Great docs and very excited to start using this package.

I'm having a problem with setup.

I get this error when I run dai_sync():

> response1 <- dai_sync("test-jtm-1.pdf")
File submitted at 2021-06-01 15:02:26. HTTP status: 404 - unsuccessful.
Error: "Processor with id '4***********4' not found."

I set the processor ID by adding a DAI_PROCESSOR_ID= line to .Renviron.

I am able to process the document in the application console.

I'm sure I'm not giving you everything you need to identify the problem... sorry.
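
One thing worth ruling out with this kind of 404 is whether the running R session actually sees the value, since .Renviron is only read at startup (or after an explicit readRenviron() call). The sketch below uses base R only; check_processor_id() is a hypothetical helper written for illustration and is not part of daiR. It will not catch a mismatch on the Google side, only a missing or malformed value locally.

```r
# Hypothetical helper (not part of daiR): reports whether the variable is
# visible in this session and whether it carries stray whitespace or quotes.
check_processor_id <- function(var = "DAI_PROCESSOR_ID") {
  raw <- Sys.getenv(var, unset = "")
  # Strip surrounding whitespace, then one leading/trailing quote character.
  cleaned <- gsub("^[\"']|[\"']$", "", trimws(raw))
  list(
    is_set = nzchar(raw),
    has_stray_chars = !identical(raw, cleaned),
    n_chars = nchar(cleaned)
  )
}

# If the file was edited after R started, reload it first:
# readRenviron("~/.Renviron")
str(check_processor_id())
```

If is_set comes back FALSE, restarting R or reloading .Renviron is the first thing to try before looking at the processor itself.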

vignette: mathematical splitting

The vignette on mathematical splitting gives this call: new_block_df <- split_block(df = block_df, block = 12, cut_point = 50)

However, passing the data frame as df = gives an error; the call works when the data frame is supplied positionally. Could the vignette be updated to new_block_df <- split_block(block_df, block = 12, cut_point = 50)?

redraw blocks?

Hi Thomas, it's me again. Obviously I'm enjoying daiR!

I'm working on multipage journals with complex layouts. I can reorder and split blocks. However, I find I get a bit lost when I make a lot of changes. Could you suggest a procedure that would redraw the blocks based on my reordering and splits? That way, I could confirm that I have things in the right order.

I'm thinking that what's needed is to update my .json file using the new tokens and blocks, then pass that revised file through draw_blocks. But I can't see how. Any suggestions?

If it's not too complex, this might be a useful addition to the vignette, as I'd assume that some others would have the same need.

Thanks for your work! Will

HTTP status: 408 - unsuccessful.

Hi Thomas, thank you for this amazing package. I was trying to replicate the User Guide example daiR::extracting_tables with the following code:

library(daiR)
dest_path <- file.path(tempdir(), "tobacco.pdf")
download.file("https://archive.org/download/tobacco_lpnn0000/lpnn0000.pdf",
              destfile = dest_path,
              mode = "wb")

resp <- dai_sync_tab(dest_path)

However, the function dai_sync_tab always results in an error message:
File submitted at 2021-12-21 18:16:25. HTTP status: 408 - unsuccessful.

Did you encounter this before? Do you know what causes this problem?

Thanks in advance
Tobias

P.S. FYI, I set up authentication following this guide (https://dair.info/articles/setting_up_google_storage.html) and everything seems to be working fine.

Problem in draw_blocks

Hi Thomas,
First of all, thanks for the package and for the very detailed vignettes on how to use it.
I have encountered a problem when using draw_blocks(). I get the following error:
Error in base64enc::base64decode(page_imgs[i], outconn) : I can only decode base64 strings

Here is the code I use:

resp <- gcs_upload("image1.pdf")
resp1 <- dai_async_tab("image1.pdf", bucket = bucket,
                       dest_folder = "process_json/")
our_file <- "process_json/image1.pdf-output-page-1-to-1.json"
json_file <- "process_json/image1.json"
gcs_get_object(our_file, 
               saveToDisk = json_file,
               overwrite = TRUE)

draw_blocks(json_file, dir = tempdir())

Do you know what causes this problem?

Thanks in advance!

Long timeout

When a processor ID is entered incorrectly in the environment variable, a request submitted with dai_sync hangs indefinitely; there is no timeout.
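
Until the package enforces a timeout itself, one possible stopgap is to wrap the call in a time limit on the caller's side. The sketch below uses only base R's setTimeLimit(); with_time_limit() is a hypothetical wrapper, not a daiR function. One caveat: R only checks the limit at points where a user interrupt could occur, so a call that is blocked deep inside compiled code may not be cut off.

```r
# Hypothetical wrapper (not part of daiR): evaluates `expr` under an elapsed
# time limit and returns a list saying whether it finished.
with_time_limit <- function(expr, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again, whether or not `expr` finished.
  on.exit(setTimeLimit(elapsed = Inf), add = TRUE)
  tryCatch(
    # `expr` is a promise, so it is only evaluated here, under the limit.
    list(ok = TRUE, value = expr),
    # Any error, including the time-limit error, is reported as text.
    error = function(e) list(ok = FALSE, value = conditionMessage(e))
  )
}

# Stand-in for a hanging request: sleeps longer than the limit allows.
res <- with_time_limit(Sys.sleep(2), seconds = 0.5)
print(res$ok)

# With daiR, the same wrapper would be used along these lines:
# resp <- with_time_limit(dai_sync("test-jtm-1.pdf"), seconds = 60)
```

Returning a list rather than re-raising keeps a batch loop over many files running when a single request stalls.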
