allancameron / pdfr
An R package to extract text from PDF.
License: Other
Hi
I've just installed this package and tried to use pdfpage() and pdfdoc() to read in some text.
In both cases I immediately get the error
Error in .pdfdoc(pdf) : Couldn't find string in dictionary.
and no other output.
Can you help?
Some fonts have no encoding specified except in the font program itself (e.g. some book-form PDFs from Project Gutenberg). Although much of the font program is binary, its header contains a text-form encoding map that may be used for ligatures etc.
The Unicode ligatures and digraphs then need to be converted to letter pairs.
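As a rough illustration of the conversion step only, here is a minimal sketch in base R that replaces a handful of Unicode ligature and digraph code points with their plain-letter equivalents. The function name expand_ligatures() and the lookup vectors are hypothetical and are not part of PDFR; the table is deliberately incomplete, and reading the encoding map out of the font program is not covered here.

```r
# Hypothetical lookup vectors (not part of PDFR): each ligature or digraph
# code point is paired by position with its plain-letter expansion.
ligatures <- c("\uFB00", "\uFB01", "\uFB02", "\uFB03", "\uFB04",
               "\u0132", "\u0133")
expansions <- c("ff", "fi", "fl", "ffi", "ffl",
                "IJ", "ij")

# Replace every known ligature in a character vector with its letters.
# fixed = TRUE treats each ligature as a literal string, not a regex.
expand_ligatures <- function(x) {
  for (i in seq_along(ligatures)) {
    x <- gsub(ligatures[i], expansions[i], x, fixed = TRUE)
  }
  x
}

# Usage: strings without ligatures pass through unchanged.
expand_ligatures(c("o\uFB03ce", "\uFB01sh", "plain"))
```

Because each ligature is a single code point, the order of the replacements does not matter in this sketch.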
I downloaded PDFR, and it seems to be installed correctly. I tried pdfboxes("my.pdf", 1) and received Error: Couldn't open file. Note that pdftools can read this file.
I am using the latest release of RStudio on a Surface Pro 7 running Windows 10.
RStudio 2023.03.0+386 "Cherry Blossom" Release (3c53477afb13ab959aeb5b34df1f10c237b256c3, 2023-03-09) for Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2023.03.0+386 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36
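For the "Couldn't open file" report above, one thing that may be worth ruling out is that the relative path "my.pdf" does not resolve from the current working directory. The sketch below uses only base R; check_pdf_path() is a hypothetical helper, not a PDFR function, and it makes no claim about how PDFR itself opens files.

```r
# Hypothetical diagnostic helper (not part of PDFR): reports why a path
# might fail to open, or "ok" if it looks like a readable PDF file.
check_pdf_path <- function(path) {
  # Resolve the path against the working directory without erroring
  # when the file is missing.
  full <- normalizePath(path, winslash = "/", mustWork = FALSE)
  if (!file.exists(full)) {
    return(paste0("not found: ", full,
                  " (working directory is ", getwd(), ")"))
  }
  if (dir.exists(full)) {
    return(paste0("is a directory: ", full))
  }
  # mode = 4 tests read permission; file.access() returns 0 on success.
  if (file.access(full, mode = 4) != 0) {
    return(paste0("not readable: ", full))
  }
  # A PDF file normally starts with the bytes "%PDF-".
  header <- readBin(full, what = "raw", n = 5)
  if (!identical(header, charToRaw("%PDF-"))) {
    return(paste0("no %PDF- header: ", full))
  }
  "ok"
}

# Usage: check the same path that was passed to pdfboxes().
check_pdf_path("my.pdf")
```

If this returns "ok", passing the full path reported by normalizePath() instead of the relative name would be a reasonable next thing to try.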
install.packages("pak")
Installing package into 'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.2/pak_0.4.0.zip'
Content type 'application/zip' length 11106824 bytes (10.6 MB)
downloaded 10.6 MB
package 'pak' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\sbalakrishnan\AppData\Local\Temp\Rtmp6bsTg6\downloaded_packages
pak::pkg_install("AllanCameron/PDFR")
! Using bundled GitHub PAT. Please add your own PAT using gitcreds::gitcreds_set().
✔ Updated metadata database: 4.94 MB in 12 files.
✔ Updating metadata database ... done
→ Will install 1 package.
→ Will download 1 package with unknown size.
stop_task_build(state, worker)
Hello @AllanCameron! I really appreciate you creating this package. I'm using it to extract text from some older reports at work and reformat the text into a tabular structure. When I went to look up the documentation for pdfpage(), I realized that exporting the documentation is one of the things that remained incomplete in the package.
I just forked the package and made the following changes to fill in the documentation and get the package to pass devtools::check() without errors. Here are all the changes I made:
codemetar::write_codemeta()
use_tidy_description() (and added URLs + Authors)
use_mit_license()
utils::globalVariables()
use_data()
draw_glyph()
If this all looks good, I'm happy to open a pull request.
I also noticed that the size parameter in pdfgraphics() may need to be changed to linewidth to work with the most recent version of ggplot2 (see this post for more information). I can test this out and add it to the same pull request, or open a separate issue if you'd like to discuss.