lebebr01 / pdfsearch
Search pdf files for keywords.
License: Other
I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to work and maintain it.
I noted that one of the limitations (as you mentioned) is text fragmentation when the text in the PDF is in columns (e.g., most scientific articles). I came across the function tabulizer::extract_text(file),
which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. This tabulizer function will still cause issues with tables and images/table captions, but at least it will get the flow of the main text correct.
thank you
Evidence that the search features only work for English. Need to explore ability to include search for other languages.
Use the tokenizers package for this.
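A rough sketch of what that could look like (this is not the package's current implementation; the example strings are illustrative): the tokenizers package handles Unicode word and sentence boundaries, so non-English text tokenizes cleanly.

```r
library(tokenizers)

# Sample multilingual text; tokenizers uses language-aware boundary rules
text <- c("Dies ist ein Beispieltext über Messfehler.",
          "Ceci est une phrase d'exemple.")

# Word-level tokens, lowercased for matching against keywords
tokenize_words(text, lowercase = TRUE)

# Sentence-level splits, useful for the convert_sentence option
tokenize_sentences(text)
```

Swapping a tokenizer like this in for the current whitespace/newline splitting could be one route to supporting other languages.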
Need to return NA when there is no match between the header location and the keyword_search results.
I have tested this with multiple PDFs that were loaded into R as character vectors. In particular, there is a PDF (character vector) that has a "CONTENTS" page on page 6. When previewing the text using head(text), the 6th element (page of the text) is the contents page. When searching for it using
heading_search('text',"CONTENTS")
returns
keyword page_num
CONTENTS 7
I tried using the function directly with the source PDF and the same result occurs.
More exploration needed for splitting and combining pdfs with multiple columns.
Hi,
I am new to R and RStudio, so sorry if this is a slightly dumb question, but when I try to run the code below, I get most of my results but also a bunch of errors: "PDF error : Bad annotation destination".
What causes this? How can I fix it?
My code :
library(pdfsearch)
library(dplyr)
library(writexl)
dest <- "C:/Users/me/Documents/2019/"
result_table <- data.frame(keyword_directory(dest,
keyword = c("informatique"),
surround_lines = 1, full_names = TRUE))
result_clean <- result_table %>% select(ID, pdf_name, keyword, line_text)
write_xlsx(x = result_clean, path = "C:/Users/me/Documents/2019/rod_2019.xlsx", col_names = TRUE)
Thank you
I got this message when executing keyword_directory() and keyword_search().
Hi, I'm not sure if page_num is working properly. I tried it with different PDFs but didn't get good results at all.
If I run the example from the README I see different output:
Is this expected? I noticed you split by "\r\n",
which is Windows-specific, I think? Did you test the package on non-Windows machines?
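One portable alternative, sketched here with a made-up example string (this is not the package's current code), is to split on a pattern that accepts either line-ending convention:

```r
# "\r?\n" matches both Windows (CRLF) and Unix (LF) line endings,
# so the same split works on text produced on either platform.
page_text <- "first line\nsecond line\r\nthird line"
strsplit(page_text, split = "\r?\n")[[1]]
# "first line" "second line" "third line"
```

Splitting only on "\r\n" would leave Unix-style text as a single unsplit string, which could explain platform-dependent line numbering.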
Verify the page and line numbers are correct when splitting document into sentences.
pdf_url <- "https://arxiv.org/pdf/1610.00147.pdf"
search_result_sent <- keyword_search(pdf_url, keyword = c('measurement error'),
path = TRUE, remove_hyphen = TRUE, convert_sentence = TRUE)
Hey Brandon, thanks for making this package, I've found it incredibly useful.
I'm not sure this is an issue; maybe it's more of a question. I was wondering how page numbers are defined in found text. For example (using v0.3.0):
library(pdfsearch)
download.file('https://www.tescoplc.com/media/757589/tesco_annual_report_2021.pdf', destfile = 'tesco_2021.pdf')
keyword_search(
'tesco_2021.pdf',
keyword = 'Tesco was built to be a champion for customers',
path = TRUE,
ignore_case = TRUE
)
# A tibble: 1 × 5
keyword page_num line_num line_text token_text
<chr> <int> <int> <list> <list>
1 Tesco was built to be a champion for customers 2 4 <chr [1]> <list [1]>
The result returned here is page 2 - that's sort of correct, as the text is first found on the 2nd numbered page. But there are 2 pages before the numbering begins which I'd prefer were included in the page count (in my case because I want to point something like tabulizer
at a specific location in the doc).
Just wondered if there was any way to change how page numbers are defined. Maybe this is already possible?
Anyways, thanks again for all your work with this. It's a terrific package that works a treat!
Alastair
Incorporate into the master function.
Corrections for JOSS review:

- "ubiquitous"
- "library"
- Add comma after "documents".
- "files(s)", because if the "s" were removed the sentence would be ungrammatical.
- "than is currently done" is ambiguous. Please explain what research practice you are improving on.
- "including example keywords searches and discusses" is not parallel and therefore ungrammatical.

Issue for JOSS review.
The vignette shows up as intro-to-pdfsearch or so on. See this image for what the vignette currently shows up as:
Issue for JOSS review.
Why is it necessary for the user to specify full_names = TRUE when using the keyword_directory() function? When I passed an absolute path to a directory, the function returned an error from normalizePath() unless I used that argument. It would seem the default option could be made easier for the user.
Issue from JOSS review
The package currently uses Depends in the DESCRIPTION for its dependencies. Best practice is to use Imports, and then to refer to each function by its namespace, e.g., somepackage::some_func(). See Hadley Wickham's R Packages.
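For illustration only (the package names below are examples, not the package's actual dependency list), the suggested change might look like this in the DESCRIPTION file:

```
Imports:
    pdftools,
    tokenizers
```

Calls in the code would then be namespace-qualified, e.g. pdftools::pdf_text(file), so nothing needs to be attached to the user's search path.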
Currently, splitting a PDF with a two-column layout does not work well, and the function does not work at all with the remove_hyphen argument.
Need to explore removal of empty character strings when splitting on white space.
Issue for JOSS review.
This package and paper are currently lacking "a clear statement of need that illustrates the purpose of the software." The draft paper does explain that this package could be used to make keyword searching more reproducible, and I buy that claim. But I am not aware of a research application for keyword searching beyond looking things up, which would not need to be reproducible. What kinds of research work require this package? I'm willing to be persuaded, of course, but at the moment I don't understand why I would want to use this package instead of, e.g., just using my operating system's search feature.
Ideally the author would provide a worked example with a genuine research problem, either replacing or augmenting the demo example in the vignette.
The statement of need could perhaps be addressed by also meeting this JOSS guideline: "Mentions (if applicable) of any ongoing research projects using the software or recent scholarly publications enabled by it." Could the author please explain what research applications he, and if possible, others, are using this package for?
This explanation of the research purpose definitely needs to go in the JOSS paper, but could also be adapted for the README.
pdfsearch:::remove_equation(c("a", "b", "c"))
# returns character(0)
This is even the default behaviour.
We have experimented with 2 commercial APIs from Azure to OCR scanned PDFs and translate them into English, if they are not already in English. They work, in my opinion, "well enough" for doing keyword search in the results.
I will try to integrate this functionality for our purposes and work on it in this fork:
https://github.com/openefsa/pdfsearch
I could potentially contribute this as open source here, if you are interested.
Of course, a potential user would need to bring their own Azure API token in order to use the functionality.
In case you do not want to couple the package too closely to a commercial provider such as Azure,
it might be useful to have 2 extension points on this package,
which allow users to "plug in" their own providers.
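One possible shape for such extension points, sketched with hypothetical names (keyword_search_ocr, ocr_fun, and translate_fun are not part of the current API):

```r
# Hypothetical plug-in hooks: callers supply their own OCR and translation
# functions (e.g. wrappers around Azure's APIs), so the package itself stays
# provider-agnostic and needs no API tokens of its own.
keyword_search_ocr <- function(x, keyword, ocr_fun = NULL,
                               translate_fun = NULL, ...) {
  if (!is.null(ocr_fun)) {
    x <- ocr_fun(x)            # e.g. scanned PDF -> character vector of text
  }
  if (!is.null(translate_fun)) {
    x <- translate_fun(x)      # e.g. non-English text -> English
  }
  pdfsearch::keyword_search(x, keyword = keyword, ...)
}
```

With defaults of NULL, existing behaviour is unchanged unless a user opts in with their own functions.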
Add regex support to keyword_directory().
This:
# install.packages("pdfsearch")
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result <- keyword_search(file,
keyword = c('measurement', 'error'),
path = TRUE)
head(result$line_text, n = 2)
Results in:
[[1]]
[1] "Reiter, Maria DeYoreo∗ arXiv:1610.00147v1 [stat.ME] 1 Oct 2016 Abstract Often in surveys, key items are subject to measurement errors. "

[[2]]
[1] "In NA NA dividuals with high quality measurements of the error-prone survey items. "
Notice the NA's in the second item.