allancameron / pdfr

An R package to extract text from pdf.

License: Other

R 3.78% C++ 96.22%

pdf-format pdf extract-text data-scientists

pdfr's Issues

Couldn't find string in dictionary

I've just installed this package and tried to use pdfpage() and pdfdoc() to read in some text.
In both cases I immediately get the error

Error in .pdfdoc(pdf) : Couldn't find string in dictionary.

and no other output.

Can you help?

Get encodings from type 1 fonts

Some fonts have no encodings specified except in the font program (e.g. some book-form pdfs from Project Gutenberg). Although a chunk of the font program is binary, the header contains a text-form encoding map that may be used for ligatures etc.
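
As a rough illustration of the idea, the clear-text header of a Type 1 font program can list its encoding as entries of the form `dup <code> /<glyphname> put`. The sketch below collects those entries into a map from character code to glyph name. It is not PDFR's code: the function name `ParseType1Encoding` is hypothetical, it assumes the header has already been extracted as a string, and it ignores fonts that simply declare `/Encoding StandardEncoding def`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect "dup <code> /<name> put" entries from the
// clear-text part of a Type 1 font program into a code -> glyph name map.
std::map<int, std::string> ParseType1Encoding(const std::string& header) {
  // Split the header into whitespace-separated tokens.
  std::istringstream stream(header);
  std::vector<std::string> tokens;
  for (std::string tok; stream >> tok;) tokens.push_back(tok);

  std::map<int, std::string> encoding;
  for (std::size_t i = 0; i + 3 < tokens.size(); ++i) {
    if (tokens[i] != "dup" || tokens[i + 3] != "put") continue;
    const std::string& code = tokens[i + 1];
    const std::string& name = tokens[i + 2];
    // The glyph name must look like "/fi"; the code must be 1 to 3 digits.
    if (name.size() < 2 || name[0] != '/') continue;
    if (code.empty() || code.size() > 3 ||
        code.find_first_not_of("0123456789") != std::string::npos) {
      continue;
    }
    encoding[std::stoi(code)] = name.substr(1);  // drop the leading '/'
  }
  return encoding;
}
```

The glyph names recovered this way (for example `fi` or `fl`) would still need to be mapped to Unicode, for instance through a glyph-name lookup table.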

Fix ligatures

The Unicode ligatures and digraphs need to be converted to their component letters.
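
One possible shape for this conversion is a small lookup table over decoded code points. The sketch below is an assumption rather than PDFR's implementation: the function name `ExpandLigatures` is hypothetical, the text is assumed to be available as UTF-32, and only the Latin ligatures U+FB00 to U+FB06 are covered. Digraphs could be handled by adding entries to the same table.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper: replace the Latin ligatures U+FB00..U+FB06 with
// their component letters; every other code point is copied unchanged.
std::u32string ExpandLigatures(const std::u32string& text) {
  static const std::unordered_map<char32_t, std::u32string> kLigatures = {
      {U'\uFB00', U"ff"},  {U'\uFB01', U"fi"},  {U'\uFB02', U"fl"},
      {U'\uFB03', U"ffi"}, {U'\uFB04', U"ffl"},
      {U'\uFB05', U"st"},  // long s + t, folded to a plain "st" here
      {U'\uFB06', U"st"},
  };

  std::u32string out;
  out.reserve(text.size());
  for (char32_t c : text) {
    auto it = kLigatures.find(c);
    if (it != kLigatures.end()) {
      out += it->second;  // ligature: append its letters
    } else {
      out += c;           // ordinary character: keep as is
    }
  }
  return out;
}
```

For example, a string holding `e`, U+FB03 and `cient` would come back as `efficient`.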

Couldn't open file.

I downloaded PDFR, and it seems to be installed correctly. I tried pdfboxes("my.pdf", 1) and received Error: Couldn't open file. Note that pdftools can read this file.

Installation Problem: Failed to build PDFR 0.1.0

I am using the latest release of RStudio on a Surface Pro 7 running Windows 10.
RStudio 2023.03.0+386 "Cherry Blossom" Release (3c53477afb13ab959aeb5b34df1f10c237b256c3, 2023-03-09) for Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2023.03.0+386 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36

install.packages("pak")
Installing package into 'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.2/pak_0.4.0.zip'
Content type 'application/zip' length 11106824 bytes (10.6 MB)
downloaded 10.6 MB

package 'pak' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\sbalakrishnan\AppData\Local\Temp\Rtmp6bsTg6\downloaded_packages

pak::pkg_install("AllanCameron/PDFR")
! Using bundled GitHub PAT. Please add your own PAT using gitcreds::gitcreds_set().
✔ Updated metadata database: 4.94 MB in 12 files.
✔ Updating metadata database ... done
→ Will install 1 package.
→ Will download 1 package with unknown size.

PDFR 0.1.0 [bld][cmp][dl] (GitHub: 9d9806c)
ℹ Getting 1 pkg with unknown size
✔ Got PDFR 0.1.0 (source) (2.07 MB)
✔ Downloaded 1 package (2.07 MB)in 1.9s
ℹ Packaging PDFR 0.1.0
✔ Packaged PDFR 0.1.0 (17.9s)
ℹ Building PDFR 0.1.0
✖ Failed to build PDFR 0.1.0
Error:
! error in pak subprocess
Caused by error in stop_task_build(state, worker):
! Failed to build source package 'PDFR'
Full installation output:

installing source package 'PDFR' ...
** using non-staged installation via StagedInstall field
** libs
g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-42~1.3/include" -DNDEBUG -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/Rcpp/include' -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/testthat/include' -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -Wall -pedantic -fdiagnostics-color=always -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-42~1.3/include" -DNDEBUG -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/Rcpp/include' -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/testthat/include' -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -Wall -pedantic -fdiagnostics-color=always -c adobetounicode.cpp -o adobetounicode.o
g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-42~1.3/include" -DNDEBUG -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/Rcpp/include' -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/testthat/include' -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -Wall -pedantic -fdiagnostics-color=always -c box.cpp -o box.o
In file included from box.cpp:13:
box.h: In constructor 'Box::Box(std::vector)':
box.h:119:40: error: 'runtime_error' is not a member of 'std'
119 | if (floats.size() != 4) throw std::runtime_error needs four floats");
| ^~~~~~~~~~~~~
box.h: In member function 'float Box::Edge(int) const':
box.h:145:27: error: 'runtime_error' is not a member of 'std'
145 | default: throw std::runtime_erroralid box index");
| ^~~~~~~~~~~~~
make: *** [C:/PROGRA~1/R/R-42~1.3/etc/x64/Makeconf:260: box.o] Error 1
ERROR: compilation failed for package 'PDFR'
removing 'C:/Users/SBALAK~1/AppData/Local/Temp/RtmpMzQTtA/pkg-lib52243e9e7641/PDFR'
Type .Last.error to see the more details.
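
Both compiler errors in the log report that `runtime_error` is not a member of `std` inside box.h. `std::runtime_error` is declared in the standard header `<stdexcept>`, so a likely cause is that box.h relies on some other header to pull it in, which not every toolchain does. The sketch below shows the pattern with the include in place. It is not PDFR's actual box.h: `BoxSketch`, its members and the error messages are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>  // declares std::runtime_error
#include <vector>

// Hypothetical stand-in for the class in box.h, for illustration only.
class BoxSketch {
 public:
  // Build a box from exactly four edge coordinates.
  explicit BoxSketch(const std::vector<float>& floats) : edges_(floats) {
    if (edges_.size() != 4) {
      throw std::runtime_error("box needs four floats");  // illustrative text
    }
  }

  // Return one of the four edges by index (0 to 3).
  float Edge(int side) const {
    if (side < 0 || side > 3) {
      throw std::runtime_error("invalid box index");  // illustrative text
    }
    return edges_[static_cast<std::size_t>(side)];
  }

 private:
  std::vector<float> edges_;
};
```

If the same cause applies here, adding `#include <stdexcept>` near the top of box.h in a local copy of the source should let the build get past box.cpp.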

Add documentation and minor refactoring to allow PDFR to pass `devtools::check()` without errors

Hello @AllanCameron! I really appreciate you creating this package – I'm using it to extract text from some older reports at work and reformat the text into a tabular structure. When I went to look up the documentation for pdfpage() I realized that exporting the documentation is one of the things that remained incomplete with the package.

I just forked the package and went ahead and made the following changes to get the documentation filled in and get the package to pass devtools::check() without errors. Here are all the changes I made:

Remove existing NAMESPACE file to replace with NAMESPACE generated by roxygen2
Add httr and grDevices to Imports and move Rcpp11 to Suggests
Add package level documentation to handle imports from ggplot2, Rcpp, and httr
Add package level metadata with codemetar::write_codemeta()
Update function documentation to import from grid and grDevices
Reformatted the DESCRIPTION with use_tidy_description() (and added URLs + Authors)
Update license with use_mit_license()
Added a utils.R file to add a call to utils::globalVariables()
Replace the markdown README with a Rmd file
Exported the testfiles data with use_data()
Disabled execution of a broken test in test-pdrf.R
Disabled execution of a broken example for draw_glyph()

If this all looks good, I'm happy to open a pull request.

I also noticed that the size parameter in pdfgraphics() may need to be changed to linewidth to work with the most recent version of ggplot2 (see this post for more information). I can test this out and add it to the same pull request or open a separate issue if you'd like to discuss.