lebebr01 / pdfsearch
Search pdf files for keywords.
License: Other
I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to work and maintain it.
I noted that one of the limitations (as you mentioned) is text fragmentation when the text in the PDF is in columns (e.g., most scientific articles). I came across the function tabulizer::extract_text(file),
which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. This tabulizer function will still cause issues with tables and images/table captions, but at least it will get the flow of the main text correct.
thank you
Evidence that the search features only work for English. Need to explore ability to include search for other languages.
Use the tokenizers package for this.
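A rough sketch of what that could look like (this is not the package's current implementation; the example strings are illustrative): the tokenizers package handles Unicode word and sentence boundaries, so non-English text tokenizes cleanly.

```r
library(tokenizers)

# Sample multilingual text; tokenizers uses language-aware boundary rules
text <- c("Dies ist ein Beispieltext über Messfehler.",
          "Ceci est une phrase d'exemple.")

# Word-level tokens, lowercased for matching against keywords
tokenize_words(text, lowercase = TRUE)

# Sentence-level splits, useful for the convert_sentence option
tokenize_sentences(text)
```

Swapping a tokenizer like this in for the current whitespace/newline splitting could be one route to supporting other languages.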
Need to return NA when there is no match between the header location and the keyword_search results.
I have tested this with multiple PDFs that were loaded into R as character vectors. In particular, there is a PDF (character vector) that has a "CONTENTS" page on page 6. When previewing the text using head(text), the 6th element (page of the text) is the contents page. When searching for it using
heading_search('text',"CONTENTS")
returns
keyword page_num
CONTENTS 7
I tried using the function directly with the source PDF and the same result occurs.
More exploration needed for splitting and combining pdfs with multiple columns.
Hi,
I am new to R and RStudio, so sorry if this is a slightly dumb question, but when I try to run the code below, I get most of my results but also a bunch of errors: "PDF error : Bad annotation destination".
What causes this? How can I fix it?
My code :
library(pdfsearch)
library(dplyr)
library(writexl)
dest <- "C:/Users/me/Documents/2019/"
result_table <- data.frame(keyword_directory(dest,
keyword = c("informatique"),
surround_lines = 1, full_names = TRUE))
result_clean <- result_table %>% select(ID, pdf_name, keyword, line_text)
write_xlsx(x = result_clean, path = "C:/Users/me/Documents/2019/rod_2019.xlsx", col_names = TRUE)
Thank you
I got this message when executing keyword_directory() and keyword_search().
Hi, I'm not sure if page_num is working properly. I tried it with different PDFs but didn't get good results at all.
If I run the example from the README I see different output:
Is this expected? I noticed you split by "\r\n",
which is Windows-specific, I think? Did you test the package on non-Windows machines?
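One portable alternative, sketched here with a made-up example string (this is not the package's current code), is to split on a pattern that accepts either line-ending convention:

```r
# "\r?\n" matches both Windows (CRLF) and Unix (LF) line endings,
# so the same split works on text produced on either platform.
page_text <- "first line\nsecond line\r\nthird line"
strsplit(page_text, split = "\r?\n")[[1]]
# "first line" "second line" "third line"
```

Splitting only on "\r\n" would leave Unix-style text as a single unsplit string, which could explain platform-dependent line numbering.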
Verify the page and line numbers are correct when splitting document into sentences.
pdf_url <- "https://arxiv.org/pdf/1610.00147.pdf"
search_result_sent <- keyword_search(pdf_url, keyword = c('measurement error'),
path = TRUE, remove_hyphen = TRUE, convert_sentence = TRUE)
Hey Brandon, thanks for making this package, I've found it incredibly useful.
I'm not sure this is an issue; maybe it's more of a question. I was wondering how page numbers are defined in found text. For example (using v0.3.0):
library(pdfsearch)
download.file('https://www.tescoplc.com/media/757589/tesco_annual_report_2021.pdf', destfile = 'tesco_2021.pdf')
keyword_search(
'tesco_2021.pdf',
keyword = 'Tesco was built to be a champion for customers',
path = TRUE,
ignore_case = TRUE
)
# A tibble: 1 × 5
keyword page_num line_num line_text token_text
<chr> <int> <int> <list> <list>
1 Tesco was built to be a champion for customers 2 4 <chr [1]> <list [1]>
The result returned here is page 2 - that's sort of correct, as the text is first found on the 2nd numbered page. But there are 2 pages before the numbering begins which I'd prefer were included in the page count (in my case because I want to point something like tabulizer
at a specific location in the doc).
Just wondered if there was any way to change how page numbers are defined. Maybe this is already possible?
Anyways, thanks again for all your work with this. It's a terrific package that works a treat!
Alastair
Incorporate into the master function.
Corrections for JOSS review:

- "ubiquitous"
- "library"
- Add comma after "documents".
- "files(s)", because if the "s" were removed the sentence would be ungrammatical.
- "than is currently done" is ambiguous. Please explain what research practice you are improving on.
- "including example keywords searches and discusses" is not parallel and therefore ungrammatical.

Issue for JOSS review.
The vignette shows up as intro-to-pdfsearch or so on. See this image for what the vignette currently shows up as:
Issue for JOSS review.
Why is it necessary for the user to specify full_names = TRUE when using the keyword_directory() function? When I passed an absolute path to a directory, the function returned an error from normalizePath() unless I used that argument. It would seem the default option could be made easier for the user.
Issue from JOSS review
The package currently uses Depends in the DESCRIPTION for its dependencies. Best practice is to use Imports, and then to refer to each function by its namespace, e.g., somepackage::some_func(). See Hadley Wickham's R Packages.
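For illustration only (the package names below are examples, not the package's actual dependency list), the suggested change might look like this in the DESCRIPTION file:

```
Imports:
    pdftools,
    tokenizers
```

Calls in the code would then be namespace-qualified, e.g. pdftools::pdf_text(file), so nothing needs to be attached to the user's search path.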
Currently, splitting a PDF with a two-column layout does not work well, and the function does not work at all with the remove_hyphen argument.
Need to explore removal of empty character strings when splitting on white space.
Issue for JOSS review.
This package and paper are currently lacking "a clear statement of need that illustrates the purpose of the software." The draft paper does explain that this package could be used to make keyword searching more reproducible, and I buy that claim. But I am not aware of a research application for keyword searching beyond looking things up, which would not need to be reproducible. What kinds of research work require this package? I'm willing to be persuaded, of course, but at the moment I don't understand why I would want to use this package instead of, e.g., just using my operating system's search feature.
Ideally the author would provide a worked example with a genuine research problem, either replacing or augmenting the demo example in the vignette.
The statement of need could perhaps be addressed by also meeting this JOSS guideline: "Mentions (if applicable) of any ongoing research projects using the software or recent scholarly publications enabled by it." Could the author please explain what research applications he, and if possible, others, are using this package for?
This explanation of the research purpose definitely needs to go in the JOSS paper, but could also be adapted for the README.
pdfsearch:::remove_equation(c("a", "b", "c"))
# returns character(0)
This is even the default behaviour.
We have experimented with 2 commercial APIs from Azure to OCR scanned PDFs and translate them into English, if they are not already in English. They work, in my opinion, "well enough" for doing keyword search in the results.
I will try to integrate this functionality for our purposes and work on it in this fork:
https://github.com/openefsa/pdfsearch
I could potentially contribute this as open source here, if you are interested.
Of course, a potential user would need to bring their own Azure API token in order to use the functionality.
In case you do not want to couple the package too closely to a commercial provider such as Azure,
it might be useful to have 2 extension points on this package,
which allow users to "plug in" their own providers.
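One possible shape for such extension points, sketched with hypothetical names (keyword_search_ocr, ocr_fun, and translate_fun are not part of the current API):

```r
# Hypothetical plug-in hooks: callers supply their own OCR and translation
# functions (e.g. wrappers around Azure's APIs), so the package itself stays
# provider-agnostic and needs no API tokens of its own.
keyword_search_ocr <- function(x, keyword, ocr_fun = NULL,
                               translate_fun = NULL, ...) {
  if (!is.null(ocr_fun)) {
    x <- ocr_fun(x)            # e.g. scanned PDF -> character vector of text
  }
  if (!is.null(translate_fun)) {
    x <- translate_fun(x)      # e.g. non-English text -> English
  }
  pdfsearch::keyword_search(x, keyword = keyword, ...)
}
```

With defaults of NULL, existing behaviour is unchanged unless a user opts in with their own functions.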
Add regex support to keyword_directory().
This:
# install.packages("pdfsearch")
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result <- keyword_search(file,
keyword = c('measurement', 'error'),
path = TRUE)
head(result$line_text, n = 2)
Results in:
[[1]]
[1] "Reiter, Maria DeYoreo∗ arXiv:1610.00147v1 [stat.ME] 1 Oct 2016 Abstract Often in surveys, key items are subject to measurement errors. "

[[2]]
[1] "In NA NA dividuals with high quality measurements of the error-prone survey items. "
Notice the NA's in the second item.