pdfio.jl's Introduction

PDFIO

PDFIO is a native Julia implementation for reading PDF files. It's a 100% Julia implementation of the PDF specification. Other than a few well-established algorithms like flate decode (zlib library) or cryptographic operations (openssl library), almost all of the APIs are written in native Julia.

If you are using this work, you may cite it as follows:

@article{Dash_2019,
	doi = {10.21105/joss.01453},
	url = {https://doi.org/10.21105%2Fjoss.01453},
	year = 2019,
	month = {nov},
	publisher = {The Open Journal},
	volume = {4},
	number = {43},
	pages = {1453},
	author = {Sambit Dash},
	title = {{PDFIO}: {PDF} Reader Library for native Julia},
	journal = {Journal of Open Source Software}
} 

Need for a PDF Reader API

The following are some of the benefits of utilizing this approach:

  1. PDF files have been in existence for over three decades. Implementations of PDF writers do not always follow the specification, and they may vary significantly from vendor to vendor. Every time you get a new PDF file, there is a possibility that it may not work with the best interpretation of the specification. A script-based language makes it easier for consumers to quickly modify the code and enhance it for their specific needs.

  2. When a higher-level scripting language wraps a C/C++ PDF library API, the scope is kept limited to achieving certain high-level tasks like graphics or text extraction; annotation or signature content extraction; or page extraction or merging.

    However, PDFIO represents the PDF specification as a model in the Model, View and Controller parlance. A PDF file can be represented as a collection of interconnected Julia structures. Those structures can be utilized in granular tasks or simply can be used to understand the structure of the PDF document.

    As per the PDF specification, text can be presented as part of the page content stream or inside PDF page annotations. An API like PDFIO can create two categories of object types: one representing the text object inside the content stream and the other representing the text inside an annotation object. This gives flexibility to the API user.

  3. Since the API is written as an object model of PDF documents, it's easier to extend with additional PDF write or update capabilities. Although the current implementation does not provide PDF writing capabilities, the foundation has been laid for future extension.

There are also certain downsides to this approach:

  1. Any API that represents an object model of a document tends to carry the complexity of introducing abstract objects. These can be opaque objects (handles) whose representation is specific to the API and that may not have any functional meaning on their own. The methods are granular and may not complete one user-level task, so the amount of code needed to complete a user-level task can be substantially higher.

    For example, to extract text from a PDF document in PDFIO, the following steps have to be carried out:

    a. Open the PDF document and obtain the document handle.
    b. Query the document handle for all the pages in the document.
    c. Iterate the pages and obtain the page object handles for each of the pages.
    d. Extract the text from the page objects and write to a file IO.
    e. Close the document ensuring all the document resources are reclaimed.

  2. The API user may need to refer to the PDF specification (PDF-32000-1:2008)[@Adobe:2008] for semantic understanding of PDF files in accomplishing some of the tasks. For example, the workflow of PDF text extraction above is a natural extension from how text is represented in a PDF file as per the specification. A PDF file is composed of pages and text is represented inside each page content object. The object model of PDFIO is a Julia language representation of the PDF specification.

Installation

The package can be added to a project by the command below:

julia> using Pkg; Pkg.add("PDFIO")

The current version of the API requires Julia 1.0. The detailed list of packages PDFIO depends on can be seen in the Project.toml file.

Sample Code

The code below takes a PDF file src as input and writes the text data into a file out. It enumerates all the pages in the document, extracts the text from each page, and writes the extracted text to the output file.

using PDFIO

"""
    getPDFText(src, out) -> Dict

- src - Input PDF file path from where text is to be extracted
- out - Output TXT file path where the output will be written
return - A dictionary containing metadata of the document
"""
function getPDFText(src, out)
    # Handle that can be used for subsequent operations on the document.
    doc = pdDocOpen(src)

    # Metadata extracted from the PDF document.
    # This value is retained and returned as the return from the function.
    docinfo = pdDocGetInfo(doc)
    open(out, "w") do io

        # Returns number of pages in the document
        npage = pdDocGetPageCount(doc)

        for i = 1:npage

            # Handle to the specific page given the number index.
            page = pdDocGetPage(doc, i)

            # Extract text from the page and write it to the output file.
            pdPageExtractText(io, page)

        end
    end
    # Close the document handle.
    # The doc handle should not be used after this call.
    pdDocClose(doc)
    return docinfo
end

Interactive Code Examples

One can also execute the following interactive commands on a Julia REPL to access objects of a PDF file.

Getting Document Handle

julia> doc = pdDocOpen("test/sample-google-doc.pdf")

PDDoc ==>

CosDoc ==>
	filepath:		/home/sambit/.julia/dev/PDFIO/test/sample-google-doc.pdf
	size:			21236
	hasNativeXRefStm:	 true
	Trailer dictionaries: 

Catalog:
4 0 obj
<<
	/Pages	14 0 R
	/Type	/Catalog
>>
endobj

isTagged: none

Getting Document Info

julia> info = pdDocGetInfo(doc)
Dict{String,Union{CDDate, String, CosObject}} with 1 entry:
  "Producer" => "Skia/PDF m79"

Getting the Number of Pages

julia> npage = pdDocGetPageCount(doc)
1

Get the Page Handle

julia> page = pdDocGetPage(doc, 1)
PDFIO.PD.PDPageImpl(
...
)

View Page Text Contents

julia> pdPageExtractText(stdout, page);
        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut 
        labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco 
        laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in 
        voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non 
        proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

As can be seen above, PDFIO provides granular APIs that can be used in combination to achieve a desired task. For details, please refer to the Architecture and Design.

Features

PDFIO is implemented in layers, enabling the following features:

  1. Extract and render the contents of a PDF page. This ensures the contents are organized in a hierarchical grouping that can be used for rendering of the content. Rendering is used here in a generic sense and is not confined to painting on a raster device. For example, extracting document text can also be considered a rendering task. pdPageExtractText is an apt example of this.
  2. Provide functional tasks for PDF document access, for example getting the page count (pdDocGetPageCount) or extracting document metadata (pdDocGetInfo).
  3. Access low level PDF objects and obtain information when high level APIs do not exist.

The Architecture and Design discusses some of these scenarios.

Licensing

PDFIO is developed to contribute to both commercial activities and scientific research alike. However, we strongly discourage usage of this product for any illegal, immoral or unethical purposes. While the PDFIO License provides rights under a permissive MIT Expat License, it is conditioned upon maintaining strong moral, ethical and legal standards of the final outcome.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)

Contribution

Contributions in the form of PRs are welcome for any feature you would like to develop for the PDFIO library. Please review the GitHub Issues section to understand the known issues. You can take up a few of the issues, work on them and submit a PR. If you come across a bug or are unable to use the APIs in any manner, feel free to submit an issue.

Similar Packages

Taro.jl is an alternative Julia package for reading and extracting content from PDF files.

Reference to Adobe

It's almost impossible to talk about PDF without reference to Adobe. All copyrights or trademarks owned by Adobe or ISO that have been referred to inadvertently without stating ownership remain the property of their owners. The author was part of Adobe's development culture early in his career, working specifically on PDF technology for about 2 years. However, the author has not been part of any activities related to PDF development since 2003. Hence, this API can be considered a clean room development. Words like Carousel and Cos are public knowledge, and a large number of references to them can be found on industry-related websites.

The package contains Adobe Font Metrics (AFM) for 14 Core Adobe fonts.

Test files

Not all PDF files used to test the library are owned by the author. Hence, the author cannot make those files available to the general public for distribution under the source code license. The author is nevertheless grateful for the PDF document library maintained by [email protected], although these files are no longer available at the original link.

Some files are also included from openpreserve. These files can be distributed with CC0.

However, test files may have different licensing than PDFIO. Hence, we have now uploaded most test files to another project under PDFTest.

pdfio.jl's People

Contributors

alexhanna, fredrikekre, gwierzchowski, juliatagbot, mkitti, sambitdash

pdfio.jl's Issues

Extracting text with a specific font with a rectangular region as selection area

It is common to use different fonts to denote semantic meaning (e.g., italics for emphasis or a larger font size for section titles). Is it possible to extract text that is in a specific font and size? Also, is it possible to specify a region to extract from? I would like to be able to, for example, extract all the italic text inside a region.

Outlines from PDF documents should be extracted

PDF document outlines can be extracted from 3 distinct sources:

  1. PDF bookmarks, which show up in Adobe Reader as the TOC.
  2. PDF structure from marked content in tagged PDFs.
  3. Document structure analysis by learning or heuristics.

The scope of PDFIO is only 1 and 2. Item 3 can be created as a separate module over PDFIO to address knowledge-oriented problems. Eventually, text extraction APIs should move into the new module.

Move the Zlib and OpenSSL dependency to JuliaBinaryWrappers

JuliaBinaryWrappers provides pre-built binaries of Zlib and OpenSSL. Instead of building them, it may be ideal to pick up the pre-built binaries. That reduces unnecessary build time and gives a consistent test experience. However, the minimum Julia release has to be 1.3.

Once Julia 1.3 is GA, this can be taken up.

Support Forms XObjects

A Form XObject is PDF content embedded as a whole in a PDF page's content. XObjects of this kind can also contain text and hence may be relevant to text extraction.

Better support for T3 fonts

Some PDF files use T3 fonts that do not have an embedded toUnicode mapping. Text in such fonts cannot be extracted from the document effectively. In such cases, an OCR library such as tesseract may help with the extraction. Any library used must not violate the MIT Expat License of PDFIO.

Feature request: add support for reading attachments

This may be low on your priority list, but being able to read PDF attachments would be great. I deal with a lot of PDFs that have xml or excel attachments with the source data used to generate the PDF. There just aren't many tools for dealing with attachments - it seems most people use command line tools.

`cosDocGetPageNumbers` crashes when there is no `PageLabels` in PDF catalog

Hi. While working on the Outline support implementation, I got the following error:

        @test begin
            filename="files/1.pdf"
            DEBUG && println(filename)
            doc = pdDocOpen(filename)
            @assert length(pdDocGetPageRange(doc, "1")) >= 1
            pdDocClose(doc)
            length(utilPrintOpenFiles()) == 0
        end

Fails with:
MethodError: no method matching get(::CosNullType, ::CosName)

I think pdDocGetPageRange should fall back to parsing the label as a number and return the appropriate page if there is no labels dictionary in the PDF, or at least return an empty vector.

I have a fix ready and would like to submit a PR.
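
The fallback described above can be sketched in plain Julia. This is only an illustration of the idea, not PDFIO's code: `fallback_page_range` and its arguments are hypothetical names, and a real fix would live inside pdDocGetPageRange.

```julia
# Hypothetical helper, not part of PDFIO: interpret a page label as a plain
# page number when the document has no PageLabels dictionary.
function fallback_page_range(label::AbstractString, npages::Integer)
    n = tryparse(Int, strip(label))
    # Not a number, or outside the document: return an empty vector.
    (n === nothing || n < 1 || n > npages) && return Int[]
    return [n]
end

fallback_page_range("1", 3)     # [1]
fallback_page_range("iv", 3)    # Int[]
fallback_page_range("7", 3)     # Int[]
```

Returning an empty vector instead of throwing keeps the caller's `length(...)` check meaningful for documents without labels.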

build fails

Following the procedure for building the package on macOS 10.15, I get the following error:

(base)
in ~ vlad🅒 base
 mkdir test && cd test
(base)
in ~/test vlad🅒 base
 julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

(v1.2) pkg> activate .
Activating new environment at `~/test/Project.toml`

(test) pkg> add PDFIO
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/test/Project.toml`
  [4d0d745f] + PDFIO v0.1.8
  Updating `~/test/Manifest.toml`
  [1520ce14] + AbstractTrees v0.2.1
  [715cd884] + AdobeGlyphList v0.1.1
  [9e28174c] + BinDeps v0.8.10
  [34da2185] + Compat v2.2.0
  [2e475f56] + LabelNumerals v0.1.0
  [4d0d745f] + PDFIO v0.1.8
  [27ebfcd6] + Primes v0.4.0
  [9a9db56c] + Rectangle v0.1.2
  [37834d88] + RomanNumerals v0.3.1
  [30578b45] + URIParser v0.4.0
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [b77e0a4c] + InteractiveUtils
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [1a1011a3] + SharedArrays
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode

(test) pkg> test PDFIO
   Testing PDFIO
 Resolving package versions...
    Status `/var/folders/k3/hy22jxt17xd4hggsb2fqhs4m0000gn/T/jl_OC5Fsf/Manifest.toml`
  [1520ce14] AbstractTrees v0.2.1
  [715cd884] AdobeGlyphList v0.1.1
  [9e28174c] BinDeps v0.8.10
  [b99e7846] BinaryProvider v0.5.8
  [34da2185] Compat v2.2.0
  [2e475f56] LabelNumerals v0.1.0
  [4d0d745f] PDFIO v0.1.8
  [27ebfcd6] Primes v0.4.0
  [9a9db56c] Rectangle v0.1.2
  [37834d88] RomanNumerals v0.3.1
  [30578b45] URIParser v0.4.0
  [a5390f91] ZipFile v0.8.3
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [ade2ca70] Dates  [`@stdlib/Dates`]
  [8bb1440f] DelimitedFiles  [`@stdlib/DelimitedFiles`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [76f85450] LibGit2  [`@stdlib/LibGit2`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [a63ad114] Mmap  [`@stdlib/Mmap`]
  [44cfe95a] Pkg  [`@stdlib/Pkg`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [3fa0cd96] REPL  [`@stdlib/REPL`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [ea8e919c] SHA  [`@stdlib/SHA`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [1a1011a3] SharedArrays  [`@stdlib/SharedArrays`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [2f01184e] SparseArrays  [`@stdlib/SparseArrays`]
  [10745b16] Statistics  [`@stdlib/Statistics`]
  [8dfed614] Test  [`@stdlib/Test`]
  [cf7118a7] UUIDs  [`@stdlib/UUIDs`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: LoadError: PDFIO not properly installed. Please run Pkg.build("PDFIO")
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
 [3] include at ./boot.jl:328 [inlined]
 [4] include_relative(::Module, ::String) at ./loading.jl:1094
 [5] include at ./Base.jl:31 [inlined]
 [6] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:1
 [7] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
 [8] include at ./boot.jl:328 [inlined]
 [9] include_relative(::Module, ::String) at ./loading.jl:1094
 [10] include at ./Base.jl:31 [inlined]
 [11] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:3
 [12] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
 [13] include at ./boot.jl:328 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1094
 [15] include(::Module, ::String) at ./Base.jl:31
 [16] top-level scope at none:2
 [17] eval at ./boot.jl:330 [inlined]
 [18] eval(::Expr) at ./client.jl:432
 [19] top-level scope at ./none:3
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
ERROR: LoadError: Failed to precompile PDFIO [4d0d745f-9d9a-592e-8d18-1ad8a0f42b92] to /Users/vlad/.julia/compiled/v1.2/PDFIO/cmOJE.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1253
 [3] _require(::Base.PkgId) at ./loading.jl:1013
 [4] require(::Base.PkgId) at ./loading.jl:911
 [5] require(::Module, ::Symbol) at ./loading.jl:906
 [6] include at ./boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1094
 [8] include(::Module, ::String) at ./Base.jl:31
 [9] include(::String) at ./client.jl:431
 [10] top-level scope at none:5
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/test/runtests.jl:2
ERROR: Package PDFIO errored during testing

(test) pkg>

and this is what happens when I am trying to build the package:

(test) pkg> build PDFIO
  Building PDFIO → `~/.julia/packages/PDFIO/LF83Q/deps/build.log`
┌ Error: Error building `PDFIO`:
│
│ signal (6): Abort trap: 6
│ in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/deps/build.jl:76
│ __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
│ Allocations: 10469954 (Pool: 10467735; Big: 2219); GC: 21
└ @ Pkg.Operations ~/julia/usr/share/julia/stdlib/v1.2/Pkg/src/backwards_compatible_isolation.jl:647

(test) pkg>

Does this error look familiar to anyone?

Writing/modifying pdfs

Is it possible to modify the parsed pdf and write it to a file? Specifically I'm interested in the ideas from here: open-source-ideas/ideas#46. Julia has excellent support for neural networks, so it would be interesting to experiment with something like this.

Google Docs PDF fails at pdPageExtractText

Trying to extract text from a simple Google Docs PDF,

julia> pdPageExtractText(stdout, pdDocGetPage(pdDocOpen("Downloads/GoogleDocs.pdf"), 1))

fails with:

ERROR: MethodError: no method matching setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{UInt16},Int64}, ::Float32, ::Rectangle.Interval{UInt16})
Closest candidates are:
  setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{K},V}, ::V, ::Rectangle.Interval{K}) where {K, V} at /home/jarvist/.julia/packages/Rectangle/SnGUM/src/interval.jl:117
Stacktrace:
 [1] get_cid_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:204
 [2] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:164
 [3] PDFont at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFonts.jl:391 [inlined]
 [4] get_pd_font!(::PDFIO.PD.PDDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDDocImpl.jl:112
 [5] get_font(::PDFIO.PD.PDPageImpl, ::CosName) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:320
 [6] evalContent!(::PDPageElement{:Tf}, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:735
 [7] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
 [8] evalContent!(::PDPageTextObject, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:680
 [9] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
 [10] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:146
 [11] pdPageEvalContent at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:145 [inlined]
 [12] pdPageExtractText(::Base.TTY, ::PDFIO.PD.PDPageImpl) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:179
 [13] top-level scope at REPL[30]:1

pdDocGetInfo() crash (PDF with empty properties)

pdDocGetInfo() crashes when used against PDF with empty properties:

(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1:4]
Stacktrace:
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:484
 [2] checkbounds at ./abstractarray.jl:449 [inlined]
 [3] getindex at ./array.jl:737 [inlined]
 [4] convert(::Type{String}, ::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:11
 [5] String(::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:34
 [6] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/PDDoc.jl:135
 [7] top-level scope at none:0

Attached affected pdf file (it is no longer available on-line).
ALM-2009-Aug.pdf

Improve the performance `pdPageExtractText`

The pdPageExtractText API is one of the core APIs of PDFIO. However, a large number of small allocations makes it a bit slow. This code needs to be refactored to improve text extraction speed further.

Any inputs, proposals and PRs in this direction will be highly appreciated.

Extracting boxes

I have a pdf with important points drawn in boxes using path commands. For example:

Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q

How can I extract these?
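
One possible starting point, assuming the decoded content stream is already available as text, is to scan it for `x y w h re` operators. The sketch below is plain Julia and does not use any PDFIO API; `find_rectangles` is a hypothetical helper. The coordinates it returns are in the user space current at that point in the stream, so any transformation matrix in effect still has to be applied by the caller.

```julia
# Sketch only: scan decoded content-stream text for `x y w h re` operators.
# `find_rectangles` is a hypothetical helper, not a PDFIO function.
function find_rectangles(content::AbstractString)
    num = raw"([-+]?\d*\.?\d+)"
    pattern = Regex(join(fill(num, 4), raw"\s+") * raw"\s+re\b")
    rects = NamedTuple{(:x, :y, :w, :h), NTuple{4, Float64}}[]
    for m in eachmatch(pattern, content)
        x, y, w, h = parse.(Float64, m.captures)
        push!(rects, (x = x, y = y, w = w, h = h))
    end
    return rects
end

sample = """
Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
q
"""
rects = find_rectangles(sample)
length(rects)   # 2
```

Each match is one thin filled rectangle; grouping four such edges into a box is a separate heuristic step.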

Support for JPEG filter

Content filter for JPEG and JPEG2000 should be supported.

Since these are special-purpose filters, it should be reviewed whether to decode them or to stream them directly into the graphics channel for rendering.

CDDate(test) == CDDate(test) returns false

While working on a unit test for PR #26, I noticed the following unexpected behavior:

(v1.0) julia> test = "D:20090807192622";

(v1.0) julia> CDDate(test) == CDDate(test)
false

Looking at the code, I see the following 2 problems:

  1. The regex for the date:
r"D\s*:\s*(?<dt>\d{12})\s*(?<ut>[+-Z])\s*((?<tzh>\d{2})'\s*(?<tzm>\d{2}))?"

does not strictly conform to the Adobe PDF date spec.
More correct would be:

r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"

  2. == for CDDate falls back to identity. I would like to implement:
(==)(x::T, y::T) where {T<: CDDate} 

I'm working on a fix. Let me know if you welcome a PR, and whether it is better to do a separate PR or combine it with the corrected PR #26.
Thanks, Best Regards, GW
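
Both points can be tried out in plain Julia without PDFIO. The pattern below is the one proposed in this issue; `DemoDate` is a hypothetical stand-in for CDDate (the real type's fields may differ), used only to show one way the default `==` can end up comparing identity and how a field-wise method changes that.

```julia
# The proposed pattern, copied from the issue text.
const PDF_DATE = r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"

m = match(PDF_DATE, "D:20090807192622")
m[:YYYY], m[:MM], m[:SS]   # ("2009", "08", "22")
m[:ut] === nothing         # true: no timezone part in this string

# Hypothetical stand-in for CDDate; PDFIO's actual fields may differ.
mutable struct DemoDate
    year::Int
    month::Int
end

DemoDate(2009, 8) == DemoDate(2009, 8)   # false: default == falls back to ===

# Field-wise equality, in the spirit of the (==) method proposed in the issue.
Base.:(==)(x::DemoDate, y::DemoDate) = x.year == y.year && x.month == y.month
Base.hash(d::DemoDate, h::UInt) = hash((d.year, d.month), h)

DemoDate(2009, 8) == DemoDate(2009, 8)   # true
```

Defining `hash` alongside `==` keeps the type usable as a dictionary key.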

Not able to execute any functions on a basic PDF. Error: Found ' (32)' Expected '<' here

Hi there.
I am getting an error when I try to execute getPDFText() or pdDocOpen() or any other function. This is the error:
Found ' (32)' Expected '<' here

And here are the first few lines of the stack trace:
[1] error(::String) at ./error.jl:33
[2] skipv at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/BufferParser.jl:25 [inlined]
[3] read_trailer(::IOStream, ::Int64) at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/CosDoc.jl:382

I have searched for this error and come up with nothing. Any ideas on where to go from here?
Thank you.

pdPageExtractText() crash on file created using LaTeX

pdPageExtractText() raises the following error when used on a file created by LaTeX:

"/home/grzegorz-neo/Dokumenty/Projekty/MatFiz/pdfio-test/outline.pdf"
(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> item_pg = pdDocGetPage(doc, 3);
(v1.0) julia> buf = IOBuffer();
(v1.0) julia> pdPageExtractText(buf, item_pg)
ERROR: InexactError: Int64(Int64, 312.5)
Stacktrace:
 [1] Type at ./float.jl:700 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] setindex!(::Array{Int64,1}, ::Float32, ::Int64) at ./array.jl:769
 [4] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDFontMetrics.jl:164
....

Changing
d[i+1] = widths[ix] to d[i+1] = round(Int,widths[ix]) in PDFontMetrics.jl fixes this issue.
The fix is included in the PR with the implementation for Outlines.
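
The failure and the fix can be reproduced in isolation. The snippet below is a minimal sketch with illustrative values (only 312.5 comes from the error message above); `widths` and `d` are stand-ins for the arrays in PDFontMetrics.jl, not the actual code.

```julia
widths = Float32[500.0, 312.5]   # illustrative values; 312.5 is from the error
d = zeros(Int, 2)

# Assigning a non-integral float into an Int array throws InexactError.
failed = try
    d[2] = widths[2]
    false
catch err
    err isa InexactError
end

# Rounding first, as the fix does, always yields an Int.
for ix in eachindex(widths)
    d[ix] = round(Int, widths[ix])
end
d   # [500, 312]; ties round to even, so 312.5 becomes 312
```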

Implement Filespec properly to address the EFF attribute of the security handler

The crypto code decrypts the streams by recursively accessing the indirect objects. For external files, it may not be easy to determine whether a file stream is an embedded file from the attributes of the extent dictionary of the stream, as all the keys are more or less optional. So Filespecs should be implemented properly to identify the cases where the EFF flag has to be used.

Unable to read in PDF

I am attempting to read in this pdf. Unfortunately, the code seems stuck on the first page. Any thoughts on why this is? I was able to run this code on other PDFs.

using PDFIO

fname = "16-969_o7jp.pdf"
doc = pdDocOpen(fname)

open("tmp.txt", "w") do io
    page = pdDocGetPage(doc, 1)
    pdPageExtractText(io, page)
end

pdDocClose(doc)

I've also tried on other pages of the PDF and see similar results - Julia works (indefinitely), but I see no error messages, and nothing is printed to the file.

Tests fail

(v1.1) pkg> test PDFIO
   Testing PDFIO
 Resolving package versions...
    Status `/var/folders/t6/ddh10c6j5r54sg19jlc59n580000gn/T/tmpCjMIM2/Manifest.toml`
  [1520ce14] AbstractTrees v0.2.1
  [715cd884] AdobeGlyphList v0.1.1
  [9e28174c] BinDeps v0.8.10
  [b99e7846] BinaryProvider v0.5.4
  [e1450e63] BufferedStreams v1.0.0
  [34da2185] Compat v2.1.0
  [ffbed154] DocStringExtensions v0.7.0
  [e30172f5] Documenter v0.22.4
  [0862f596] HTTPClient v0.2.1
  [682c06a0] JSON v0.20.0
  [2e475f56] LabelNumerals v0.1.0
  [b27032c2] LibCURL v0.5.0
  [522f3ed2] LibExpat v0.5.0
  [2ec943e9] Libz v1.0.0
  [4d0d745f] PDFIO v0.1.3
  [27ebfcd6] Primes v0.4.0
  [9a9db56c] Rectangle v0.1.1
  [37834d88] RomanNumerals v0.3.1
  [30578b45] URIParser v0.4.0
  [c17dfb99] WinRPM v0.4.2
  [a5390f91] ZipFile v0.8.1
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [ade2ca70] Dates  [`@stdlib/Dates`]
  [8bb1440f] DelimitedFiles  [`@stdlib/DelimitedFiles`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [76f85450] LibGit2  [`@stdlib/LibGit2`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [a63ad114] Mmap  [`@stdlib/Mmap`]
  [44cfe95a] Pkg  [`@stdlib/Pkg`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [3fa0cd96] REPL  [`@stdlib/REPL`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [ea8e919c] SHA  [`@stdlib/SHA`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [1a1011a3] SharedArrays  [`@stdlib/SharedArrays`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [2f01184e] SparseArrays  [`@stdlib/SparseArrays`]
  [10745b16] Statistics  [`@stdlib/Statistics`]
  [8dfed614] Test  [`@stdlib/Test`]
  [cf7118a7] UUIDs  [`@stdlib/UUIDs`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: could not open file /Users/malmaud/.julia/packages/ZipFile/YHTbb/deps/deps.jl
Stacktrace:
 [1] include at ./boot.jl:326 [inlined]
 [2] include_relative(::Module, ::String) at ./loading.jl:1038
 [3] include at ./sysimg.jl:29 [inlined]
 [4] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:26
 [5] top-level scope at none:0
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include at ./sysimg.jl:29 [inlined]
 [9] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:36
 [10] top-level scope at none:0
 [11] include at ./boot.jl:326 [inlined]
 [12] include_relative(::Module, ::String) at ./loading.jl:1038
 [13] include(::Module, ::String) at ./sysimg.jl:29
 [14] top-level scope at none:2
 [15] eval at ./boot.jl:328 [inlined]
 [16] eval(::Expr) at ./client.jl:404
 [17] top-level scope at ./none:3
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:50
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:43
ERROR: LoadError: Failed to precompile ZipFile [a5390f91-8eb1-5f08-bee0-b1d1ffed6cea] to /Users/malmaud/.julia/compiled/v1.1/ZipFile/cOum2.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] include(::String) at ./client.jl:403
 [10] top-level scope at none:0
in expression starting at /Users/malmaud/.julia/packages/PDFIO/28lLV/test/runtests.jl:6
ERROR: Package PDFIO errored during testing

Extract text content from the PDF

Extract text content from PDF. Here are some of the high level use cases.

  1. Pure text-based documents should be easily converted to standard text formats like text, word document etc.
  2. Document structure existing in a PDF document should be preserved as much as possible. Being a PDL, PDF text rendering does not depend on the content order. Hence, any marked information in the document should be preserved.
  3. Ideally text for reading vs. text used for clippath or artwork should be distinguishable.

Expose FontDescription flags from PDFonts

It is often necessary to know whether a font is bold, italic, fixed-width, all caps, small caps, etc. Ideally, these properties should be captured in TextLayout for subsequent processing.
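
As a rough illustration of what such flags look like, the PDF specification stores several of these properties as bits of the integer `/Flags` entry in a font descriptor. The sketch below decodes that integer in plain Julia. The names `FONT_FLAG_BITS` and `decode_font_flags` are hypothetical and are not PDFIO functions. The specification defines a ForceBold bit rather than a plain "bold" bit, so a real implementation would probably also need to consult the font weight or font name.

```julia
# 1-based bit positions of selected font descriptor flags, as listed in the
# PDF specification (ISO 32000-1, font descriptor flags table).
const FONT_FLAG_BITS = (
    fixed_pitch = 1,
    serif       = 2,
    symbolic    = 3,
    script      = 4,
    nonsymbolic = 6,
    italic      = 7,
    all_cap     = 17,
    small_cap   = 18,
    force_bold  = 19,
)

# Hypothetical helper (not part of PDFIO): turn the /Flags integer into a
# NamedTuple of booleans, one per flag listed above.
decode_font_flags(flags::Integer) =
    map(bit -> ((flags >> (bit - 1)) & 1) == 1, FONT_FLAG_BITS)

# Example: a fixed-pitch, italic font with ForceBold set.
flags = (1 << 0) | (1 << 6) | (1 << 18)
decoded = decode_font_flags(flags)
println(decoded.italic, " ", decoded.serif)
```

Returning a NamedTuple keeps the result cheap and lets callers write `decoded.italic` instead of repeating bit masks.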

Table picker for PDF

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike in a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. If document tagging is enabled, the table picker can use the tagged text.
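
One possible shape for such a heuristic, offered only as a sketch: group positioned text fragments into rows by baseline, order each row left to right, and emit an HTML table when every row has the same number of cells. The `Fragment` type, the function names and the tolerance value are all assumptions made for illustration. None of them is part of PDFIO, and a practical picker would need to handle spanning cells, ruling lines and HTML escaping.

```julia
# Hypothetical positioned text fragment (not a PDFIO type).
struct Fragment
    x::Float64      # left edge, in page units
    y::Float64      # baseline, in page units (PDF y grows upward)
    text::String
end

# Group fragments into rows: a fragment joins the current row when its
# baseline is within `tol` of the first fragment in that row.
function group_rows(frags::Vector{Fragment}; tol::Real = 2.0)
    rows = Vector{Vector{Fragment}}()
    for f in sort(frags; by = g -> -g.y)           # top of the page first
        if !isempty(rows) && abs(rows[end][1].y - f.y) <= tol
            push!(rows[end], f)
        else
            push!(rows, [f])
        end
    end
    foreach(r -> sort!(r; by = g -> g.x), rows)    # left to right in each row
    return rows
end

# Emit an HTML table only when the rows look tabular: at least two rows,
# at least two cells, and the same cell count in every row.
# Cell text is not HTML-escaped in this sketch.
function html_table(rows::Vector{Vector{Fragment}})
    (length(rows) < 2 || length(rows[1]) < 2) && return nothing
    all(r -> length(r) == length(rows[1]), rows) || return nothing
    io = IOBuffer()
    println(io, "<table>")
    for r in rows
        println(io, "  <tr>", join("<td>" * f.text * "</td>" for f in r), "</tr>")
    end
    print(io, "</table>")
    return String(take!(io))
end

frags = [Fragment(10, 100, "a"), Fragment(50, 100.5, "b"),
         Fragment(10, 80, "c"), Fragment(50, 80, "d")]
println(html_table(group_rows(frags)))
```

Returning `nothing` for non-tabular input lets a caller fall back to ordinary text extraction.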

tocPDF

I have created a repository that aims to auto-generate bookmarks from the table of contents already present at the beginning of PDF files:
https://github.com/aminya/tocPDF

For now, I plan to start with available software (e.g. k2pdfopt), and later make the functionality Julia native (when you add PDF write capability).

Current algorithm plan: https://github.com/aminya/tocPDF#automated

I looked at the PDFIO documentation, but it is long and has many functions. Could you help me get started with PDFIO?

If anyone is interested in participating, that would be awesome. (@kskyten @sambitdash)

pdDocGetInfo() crash (PDF without properties)

pdDocGetInfo() crashes when used against a PDF without any properties:

(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: type CosNullType has no field val
Stacktrace:
 [1] getproperty(::Any, ::Symbol) at ./sysimg.jl:18
 [2] get(::PDFIO.Cos.CosNullType) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/CosObject.jl:39
 [3] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDDoc.jl:133
 [4] top-level scope at none:0

The fix seems easy; I will send a PR.
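
The trace suggests that a generic accessor reads a `val` field on a null object that has no such field. The sketch below reproduces that pattern with stand-in types and shows one way to guard against it through dispatch. `Obj`, `NullObj`, `ValueObj`, `safe_value` and `info_or_empty` are hypothetical names chosen for illustration. They are not PDFIO's types, and this is not necessarily how the actual fix was made.

```julia
# Hypothetical stand-ins for illustration only; these are NOT PDFIO's types.
abstract type Obj end
struct NullObj <: Obj end            # has no `val` field, like the null object in the trace
struct ValueObj{T} <: Obj
    val::T
end

# Generic accessor that assumes every object has a `val` field.
# Calling it with a NullObj raises a "has no field val" error.
unsafe_value(o::Obj) = o.val

# Dispatch-based accessor: the null object maps to `nothing`.
safe_value(o::ValueObj) = o.val
safe_value(::NullObj) = nothing

# A caller can then treat a missing info dictionary as "no properties".
info_or_empty(o::Obj) = something(safe_value(o), Dict{String,Any}())

println(info_or_empty(NullObj()))
```

Handling the null case in its own method keeps the accessor for ordinary objects unchanged.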

Validate the document for tagged PDF.

Tagged PDF has important properties that can help in good text and graphics extraction for usage elsewhere. Hence, it's important to extract such information from PDFs.

Move all the test files to the PDFTest repository.

PDFIO is MIT licensed. However, some of the test files may carry other licenses that make them unsafe to ship with PDFIO. The test files will be kept separate from PDFIO and downloaded on demand for test purposes only.

pdDocOpen() crash (PDF created by pdfLatex)

pdDocOpen() crashes with the following error:

ArgumentError: extra characters after whitespace in "1502\n6"

when called against a file created by LaTeX (attached).
LaTeX version: pdfTeX 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (KDE Neon / Ubuntu 16.04 based)
I will submit a PR to fix this soon.
outline.pdf
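
The message above is the one Julia's integer `parse` raises when more non-whitespace text follows the number. One plausible reading is that a string holding two newline-separated numbers reached an integer parse. The sketch below reproduces the message and shows a tolerant alternative that reads only the first whitespace-delimited token. `parse_leading_int` is a hypothetical helper, not a PDFIO function, and the actual fix may differ.

```julia
# Strict parsing rejects the string from the report.
strict = try
    parse(Int, "1502\n6")
catch err
    err                                  # an ArgumentError
end
println(strict)

# `tryparse` signals the same failure by returning `nothing`.
println(tryparse(Int, "1502\n6") === nothing)

# Hypothetical helper (not part of PDFIO): parse only the first
# whitespace-delimited token and ignore whatever follows it.
function parse_leading_int(s::AbstractString)
    tokens = split(s)                    # splits on any whitespace, including '\n'
    isempty(tokens) && throw(ArgumentError("no integer found in $(repr(s))"))
    return parse(Int, first(tokens))
end

println(parse_leading_int("1502\n6"))    # prints 1502
```

Whether discarding the trailing token is correct depends on what the second number means in the file, so a real fix would first need to identify which field is being read.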
