
pdfsearch's Issues

Reading columns correctly.

I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to work on and maintain it.

I noted that one of the limitations (as you mentioned) is text fragmentation when the text in the PDF is laid out in columns (e.g. most scientific articles). I came across the function tabulizer::extract_text(file), which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. The tabulizer function will still cause issues with tables and image/table captions, but at least it gets the flow of the main text correct.
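
A minimal sketch of the comparison (assuming the tabulizer package and its Java dependency are installed; the file name is hypothetical):

library(pdftools)
library(tabulizer)

file <- 'two_column_article.pdf'   # hypothetical two-column PDF

# pdftools::pdf_text() reads each line across the full page width, so the two
# columns get interleaved and sentences are fragmented.
pdf_text(file)

# tabulizer::extract_text() tends to follow the column layout, keeping the flow
# of the main text intact (tables and image/table captions may still break).
extract_text(file)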

thank you

heading_search is reporting the incorrect line_num

I have tested this with multiple PDFs that were loaded into R as character vectors. In particular, there is a PDF (character vector) that has a "CONTENTS" page on page 6. When previewing the text using head(text), the 6th element (page) of the text is the contents page. Searching for it using

heading_search(text, "CONTENTS")

returns

keyword    page_num
CONTENTS          7

I tried using the function directly with the source PDF and the same result occurs.
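
A minimal way to reproduce the check (a sketch; the file name is hypothetical and assumes the PDF was read with pdftools::pdf_text):

library(pdfsearch)
library(pdftools)

text <- pdf_text('report_with_contents.pdf')   # hypothetical file; one element per page

which(grepl('CONTENTS', text))                 # physical page that actually contains the heading

heading_search(text, headings = 'CONTENTS')    # page_num reported by pdfsearch (off by one here)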

pdf search error "Bad annotation"

Hi,

I am new to R and RStudio, so sorry if this is a slightly dumb question, but when I try to run the code below, I get most of my results but also a bunch of errors: "PDF error: Bad annotation destination".

What is it due to? How can I fix this?

My code:

library(pdfsearch)
library(dplyr)
library(writexl)

dest <- "C:/Users/me/Documents/2019/"

result_table <- data.frame(keyword_directory(dest,
                                             keyword = c("informatique"),
                                             surround_lines = 1,
                                             full_names = TRUE))

result_clean <- result_table %>% select(ID, pdf_name, keyword, line_text)

write_xlsx(x = result_clean, path = "C:/Users/me/Documents/2019/rod_2019.xlsx", col_names = TRUE)

Thank you

page_num

Hi, I'm not sure if page_num is working properly. I tried it with different PDFs but didn't get good results at all.

Verify line and page numbers with sentence splitting

Verify the page and line numbers are correct when splitting document into sentences.

pdf_url <- "https://arxiv.org/pdf/1610.00147.pdf"

search_result_sent <- keyword_search(pdf_url, keyword = c('measurement error'),
  path = TRUE, remove_hyphen = TRUE, convert_sentence = TRUE)
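
One possible check (a sketch, assuming the arXiv URL above is still reachable): run the same search without sentence conversion and compare the reported locations.

search_result_line <- keyword_search(pdf_url, keyword = c('measurement error'),
  path = TRUE, remove_hyphen = TRUE, convert_sentence = FALSE)

# page_num should agree between the two result sets; line_num can legitimately
# differ because a sentence may begin on an earlier line than the keyword itself.
search_result_sent$page_num
search_result_line$page_num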

Page numbers

Hey Brandon, thanks for making this package, I've found it incredibly useful.

I'm not sure this is an issue, maybe more of a question. I was wondering how the page numbers are defined for found text. For example (using v0.3.0):

library(pdfsearch)
download.file('https://www.tescoplc.com/media/757589/tesco_annual_report_2021.pdf', destfile = 'tesco_2021.pdf')

keyword_search(
  'tesco_2021.pdf', 
  keyword = 'Tesco was built to be a champion for customers', 
  path = TRUE, 
  ignore_case = TRUE
)

# A tibble: 1 × 5
  keyword                                        page_num line_num line_text token_text
  <chr>                                             <int>    <int> <list>    <list>    
1 Tesco was built to be a champion for customers        2        4 <chr [1]> <list [1]>

The result returned here is page 2, which is sort of correct: the text is first found on the 2nd numbered page. But there are two pages before the numbering begins, and I'd prefer those were included in the page count (in my case because I want to point something like tabulizer at a specific location in the document).

Just wondered if there was any way to change how page numbers are defined. Maybe this is already possible?

Anyways, thanks again for all your work with this. It's a terrific package that works a treat!

Alastair

Header search

Incorporate into the master function.

  • Include a flag to enable this functionality
  • Finalize output formats

Corrections/typos in JOSS article

Corrections for JOSS review

  • Paragraph 1, line 3: add a comma after "ubiquitous".
  • Paragraph 1, line 5: no comma after "library"; add a comma after "documents".
  • Paragraph 2, line 3: remove the parentheses in "files(s)", because if the "s" were removed the sentence would be ungrammatical.
  • The phrase "than is currently done" is ambiguous. Please explain what research practice you are improving on.
  • The sentence including "example keywords searches and discusses" is not parallel and therefore ungrammatical.
  • Please expand the caption in Figure 1 to explain what the sample search was. What keywords were you searching for, and was the search over a single file or over a directory?

Corrections to vignette

Issue for JOSS review.

  • Please give the vignette a title.
  • It would also be good to rename the vignette file to something less generic, like intro-to-pdfsearch or so on.

See the attached screenshot for how the vignette currently shows up:

[screenshot, 2018-05-08]

Simplify directory search interface?

Issue for JOSS review.

Why is it necessary for the user to specify full_names = TRUE when using the keyword_directory() function? When I passed in an absolute path to a directory, the function returned an error from normalizePath() unless I used that argument. It seems the default could be made easier for the user.
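
A sketch of the scenario (the directory path here is hypothetical):

library(pdfsearch)

pdf_dir <- '/home/reviewer/papers'   # an absolute path to a directory of PDFs

# As reported above, this errored via normalizePath() at the time of review:
keyword_directory(pdf_dir, keyword = 'reproducible')

# ...while the same call with full_names = TRUE worked:
keyword_directory(pdf_dir, keyword = 'reproducible', full_names = TRUE)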

Change depends to imports

Issue from JOSS review

The package currently uses Depends in the DESCRIPTION for its dependencies. Best practice is to use Imports, and then to refer to the function by its namespace: e.g., somepackage::some_func(). See Hadley Wickham's R Packages.
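
A sketch of the change (the dependency names below are illustrative, not the package's full dependency list):

# DESCRIPTION: move dependencies from Depends to Imports
#
#   Imports:
#       pdftools,
#       tibble
#
# Package code then refers to imported functions by an explicit namespace:
pages <- pdftools::pdf_text('some_file.pdf')   # hypothetical file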

Split PDF needs further testing

Currently the splitting of PDFs with a two-column layout does not work well. This functionality does not work at all with the remove_hyphen argument.

Need to explore removal of empty character strings when splitting on white space.
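
A minimal illustration of the empty-string problem (plain base R; the example line is made up):

line <- 'end of left column   start of right column'   # three spaces between the columns

# Splitting on a single space leaves empty strings wherever runs of spaces occur:
strsplit(line, ' ')[[1]]

# One option is to drop zero-length strings after splitting:
parts <- strsplit(line, ' ')[[1]]
parts[nzchar(parts)]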

Explain package significance and show research uses

Issue for JOSS review.

This package and paper are currently lacking "a clear statement of need that illustrates the purpose of the software." The draft paper does explain that this package could be used to make keyword searching more reproducible, and I buy that claim. But I am not aware of a research application for keyword searching beyond looking things up, which would not need to be reproducible. What kinds of research work require this package? I'm willing to be persuaded, of course, but at the moment I don't understand why I would want to use this package instead of, e.g., just using my operating system's search feature.

Ideally the author would provide a worked example with a genuine research problem, either replacing or augmenting the demo example in the vignette.

The statement of need could perhaps be addressed by also meeting this JOSS guideline: "Mentions (if applicable) of any ongoing research projects using the software or recent scholarly publications enabled by it." Could the author please explain what research applications he, and if possible, others, are using this package for?

This explanation of the research purpose definitely needs to go in the JOSS paper, but could also be adapted for the README.

Enhancing this package with OCR and translation

We have experimented with two commercial Azure APIs to OCR scanned PDFs and to translate them into English when they are not already in English. In my opinion they work well enough for doing keyword searches on the results.

I will try to integrate this functionality into the package for our purposes and work on it in this fork:
https://github.com/openefsa/pdfsearch

I could potentially contribute this as open source here, if you are interested.
Of course, a potential user would need to bring their own Azure API token in order to use the functionality.

In case you do not want to couple the package too closely to a commercial provider such as Azure, it might be useful to have two extension points in this package that allow plugging in (see the sketch after this list):

  • an OCR provider
  • a translation provider
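
A rough sketch of what such extension points could look like (every function and argument name here is hypothetical, not part of pdfsearch):

# Hypothetical wrapper: callers supply their own provider functions.
keyword_search_scanned <- function(file, keyword,
                                   ocr_fun = NULL,        # e.g. a function wrapping an Azure OCR call
                                   translate_fun = NULL,  # e.g. a function wrapping a translation API
                                   ...) {
  text <- if (is.null(ocr_fun)) pdftools::pdf_text(file) else ocr_fun(file)
  if (!is.null(translate_fun)) text <- translate_fun(text)
  pdfsearch::keyword_search(text, keyword = keyword, path = FALSE, ...)
}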

Example produces NAs

This:

# install.packages("pdfsearch")
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
                         keyword = c('measurement', 'error'),
                         path = TRUE)

head(result$line_text, n = 2)

Results in:

[[1]]
[1] "Reiter, Maria DeYoreo∗ arXiv:1610.00147v1 [stat.ME] 1 Oct 2016 Abstract Often in surveys, key items are subject to measurement errors. "

[[2]]
[1] "In NA NA dividuals with high quality measurements of the error-prone survey items. "

Notice the NAs in the second item.
