UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Home Page: https://github.com/UglyToad/PdfPig/wiki

License: Apache License 2.0

C# 99.60% HTML 0.36% Batchfile 0.01% PowerShell 0.04%

pdfbox pdf pdf-document csharp netstandard pdf-extractor pdf-document-processor pdf-files alto-xml hocr

pdfpig's Issues

Rework public API for letters

A letter in a PDF has the following information:

A placement origin position (x, y)
A bounding box which entirely surrounds the actual visible shape of the glyph
A width by which the rendering advances (advance width) to place the next character which may be greater or less than the width of the bounding box

To illustrate this, consider the following SVG of a character 'o' or '0' taken from a PDF:

The red dot marks the placement origin and the blue box marks the bounding box of the glyph itself. Notice how the box extends below the origin; it can also extend to the left of the origin or, as in this case, not include the origin at all. The advance width for this character would probably be greater than the bounding box width, since the origin lies outside the character.

A letter should have origin as PdfPoint, glyph bounding box as PdfRectangle and Width as decimal with comments explaining the above.
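A minimal sketch of what such a letter type could look like. Only the names PdfPoint, PdfRectangle and Width come from the description above; the member names, constructors and the simplified struct shapes are assumptions made for illustration, not the library's actual API.

```csharp
// Hypothetical shapes for illustration only; not the library's real types.
public readonly struct PdfPoint
{
    public decimal X { get; }
    public decimal Y { get; }
    public PdfPoint(decimal x, decimal y) { X = x; Y = y; }
}

public readonly struct PdfRectangle
{
    public decimal Left { get; }
    public decimal Bottom { get; }
    public decimal Right { get; }
    public decimal Top { get; }
    public decimal Width => Right - Left;

    public PdfRectangle(decimal left, decimal bottom, decimal right, decimal top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

public sealed class Letter
{
    /// <summary>The text of this letter.</summary>
    public string Value { get; }

    /// <summary>The placement origin (x, y) at which the glyph is positioned.</summary>
    public PdfPoint Origin { get; }

    /// <summary>
    /// The box which entirely surrounds the visible shape of the glyph.
    /// It may extend below or to the left of the origin, or exclude the origin.
    /// </summary>
    public PdfRectangle GlyphRectangle { get; }

    /// <summary>
    /// The advance width: how far rendering moves to place the next character.
    /// It may be greater or less than the width of the glyph rectangle.
    /// </summary>
    public decimal Width { get; }

    public Letter(string value, PdfPoint origin, PdfRectangle glyphRectangle, decimal width)
    {
        Value = value;
        Origin = origin;
        GlyphRectangle = glyphRectangle;
        Width = width;
    }
}
```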

Issues reading unicode document properties?

Hi,

I tried loading some simple test documents into PdfPig 0.0.6, and noticed that Unicode document properties don't seem to be handled correctly.

e.g., if I open the attached minimal.pdf in Acrobat reader it displays:

but in PdfDocument.Information, I get:

Would this be expected to work?

Thanks.

Create 2 page PDF document with wrapped Lorem Ipsum placeholder text

As part of the document creation epic for the next release we should handle wrapping text automatically (nothing fancy like working out the right place to line-break). Create a document that shows the following capabilities:

Two or more pages
Text wrapping
Different font sizes, weights and faces
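As a rough illustration of the "nothing fancy" wrapping mentioned above, the sketch below breaks text greedily at whitespace. It counts characters per line, which is an assumption made to keep the example short; a real document builder would presumably measure glyph advance widths against the page width instead. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NaiveTextWrapper
{
    // Greedy wrap: keep adding words to the current line until the next
    // word would push it past maxCharsPerLine, then start a new line.
    // A single word longer than the limit is placed on a line of its own.
    public static List<string> Wrap(string text, int maxCharsPerLine)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // +1 accounts for the space that would join the word to the line.
            if (current.Length > 0 && current.Length + 1 + word.Length > maxCharsPerLine)
            {
                lines.Add(current.ToString());
                current.Clear();
            }

            if (current.Length > 0)
            {
                current.Append(' ');
            }

            current.Append(word);
        }

        if (current.Length > 0)
        {
            lines.Add(current.ToString());
        }

        return lines;
    }
}
```

For example, wrapping "Lorem ipsum dolor sit amet" at 11 characters gives the three lines "Lorem ipsum", "dolor sit" and "amet".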

Try using floats instead of decimals for calculated values

Due to the poor performance of PdfPig for end-user scenarios we should see what impact replacing decimals with floats has where the values are used in calculations (all TransformationMatrix based code).

If the benefits from #64 aren't considered good enough then it may be that calculated values are better off being float based.

CFF font format support

Type 1 and CID fonts can use the Compact Font Format: http://wwwimages.adobe.com/www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5176.CFF.pdf

Currently two tests are failing because CFF parsing is not yet supported. Implement it.

Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion Test fails

Changeset: 7fab13e

Test Name:	UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion
Test FullName:	UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion
Test Source:	C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig.Tests\Fonts\Type1\Type1FontParserTests.cs : line 15
Test Outcome:	Failed
Test Duration:	0:00:00.005

Result StackTrace:	
at UglyToad.PdfPig.Fonts.Type1.Parser.Type1Tokenizer.ReadNextToken() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1Tokenizer.cs:line 59
   at UglyToad.PdfPig.Fonts.Type1.Parser.Type1Tokenizer.GetNext() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1Tokenizer.cs:line 33
   at UglyToad.PdfPig.Fonts.Type1.Parser.Type1EncryptedPortionParser.Parse(IReadOnlyList`1 bytes, Boolean isLenientParsing) in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1EncryptedPortionParser.cs:line 40
   at UglyToad.PdfPig.Fonts.Type1.Parser.Type1FontParser.Parse(IInputBytes inputBytes, Int32 length1, Int32 length2) in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1FontParser.cs:line 149
   at UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig.Tests\Fonts\Type1\Type1FontParserTests.cs:line 19
Result Message:	System.InvalidOperationException : Encountered an end of string ')' outside of string.

Ensure CFF parser handles PDFBox support case strange document

Once CFF parser is implemented in #6 ensure it handles the case detailed in this ticket: https://issues.apache.org/jira/browse/PDFBOX-4330

Fix Bezier curve behaviour in Type 2 charstring for CFF font

It looks like the interpretation of Bezier curves in Type2CharStringParser has gone slightly wrong:

This is probably in one of the curveto commands.

This B is from the PigProductionHandbookTests.CanReadContent test. Take a look and see where the error is and fix it.

Letters extracted with no Value

a.pdf

In this file all letters on pages 1-54 have Value=null. I guess this is due to font "TTdcr10"?
Starting from page 55 letters are extracted correctly (when font is changing to ArialMT).

Is this a known issue? It seems to be working in PdfBox.

Relatively slow processing

This is not exactly an issue, more of a general question. While extracting the Letters collection I noticed that overall the process runs about 4-5 times slower than PdfBox. I run PdfBox through Ikvm, so I was expecting it to be the other way around :-)
Of course there can be many things contributing to this, but I did one quick test: I ran a mass replace of the word "decimal" with "double" across the whole code base. And yes, the speed came right on par with PdfBox! Changing it to "float" made it even a little faster (probably due to the smaller memory footprint).
Sure, double/float is not precise, but personally I often need to run extraction over hundreds of thousands of PDFs, so speed is crucial and the time difference is substantial. On the other hand I think that letter/line positions and dimensions would be OK with 2 digits of precision at most (ok, maybe 3 :-))
I see a few ways to approach this:

Change all properties to float. This is quick, but not very user friendly and may have some minor issues if anyone ever tries to compare numbers directly.
Keep values internally as Int multiplied by 100 (or 1000). Do all calculations on Int, then return to the user as decimal (divide by 100) - this may be better, but probably harder to do and potentially more confusing.
Keep everything as is and suggest getting a better server :-) (or optimize somewhere else)
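The second option could be sketched roughly as below. The type name, the scale of 1000 and the rounding behaviour are all assumptions for illustration. Only addition is shown; multiplication (as needed for matrix work) would require rescaling after each product, which is part of why this approach is likely harder to get right.

```csharp
using System;

// Hypothetical fixed-point coordinate: the value is held as an integer
// number of thousandths and only converted to decimal at the API boundary.
public readonly struct FixedCoordinate
{
    private const int Scale = 1000;
    private readonly long scaled;

    private FixedCoordinate(long scaled)
    {
        this.scaled = scaled;
    }

    // Rounds to the nearest thousandth (Math.Round uses banker's rounding
    // for exact midpoints).
    public static FixedCoordinate FromDecimal(decimal value)
    {
        return new FixedCoordinate((long)Math.Round(value * Scale));
    }

    // Integer addition; no rescaling needed because both operands share a scale.
    public static FixedCoordinate operator +(FixedCoordinate a, FixedCoordinate b)
    {
        return new FixedCoordinate(a.scaled + b.scaled);
    }

    // Converts back to decimal for the caller.
    public decimal ToDecimal()
    {
        return (decimal)scaled / Scale;
    }
}
```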

Thoughts? Thank you for your time and all the work put in this library!

ArgumentOutOfRangeException occurs when execute document.GetPage(i + 1)

Hello there,

When I run the samples you provided, no matter which one,
an ArgumentOutOfRangeException occurs when executing var page = document.GetPage(i + 1);,
but when document.NumberOfPages is used to fetch the page count,
the number of pages obtained is correct. The relevant information is as follows:

The stack trace:
at System.DateTime.Add(Double value, Int32 scale)
at UglyToad.PdfPig.Fonts.TrueType.TrueTypeDataBytes.ReadInternationalDate() in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\TrueTypeDataBytes.cs:line 114
at UglyToad.PdfPig.Fonts.TrueType.Tables.HeaderTable.Load(TrueTypeDataBytes data, TrueTypeHeaderTable table) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Tables\HeaderTable.cs:line 97
at UglyToad.PdfPig.Fonts.TrueType.Parser.TrueTypeFontParser.ParseTables(Decimal version, IReadOnlyDictionary`2 tables, TrueTypeDataBytes data) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Parser\TrueTypeFontParser.cs:line 59
at UglyToad.PdfPig.Fonts.TrueType.Parser.TrueTypeFontParser.Parse(TrueTypeDataBytes data) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Parser\TrueTypeFontParser.cs:line 35
at UglyToad.PdfPig.Fonts.Parser.Parts.CidFontFactory.ReadDescriptorFile(FontDescriptor descriptor) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Parts\CidFontFactory.cs:line 114
at UglyToad.PdfPig.Fonts.Parser.Parts.CidFontFactory.Generate(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Parts\CidFontFactory.cs:line 56
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type0FontHandler.ParseDescendant(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Handlers\Type0FontHandler.cs:line 128
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Handlers\Type0FontHandler.cs:line 34
at UglyToad.PdfPig.Fonts.FontFactory.Get(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\FontFactory.cs:line 51
at UglyToad.PdfPig.Content.ResourceContainer.LoadFontDictionary(DictionaryToken fontDictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\ResourceContainer.cs:line 93
at UglyToad.PdfPig.Content.ResourceContainer.LoadResourceDictionary(DictionaryToken resourceDictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\ResourceContainer.cs:line 33
at UglyToad.PdfPig.Parser.PageFactory.LoadResources(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 215
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 67
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\Pages.cs:line 62
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\PdfDocument.cs:line 158
at DocumentLayoutAnalysis.ImageTest.Run(String path) in F:\IISC\My Lab\DocumentLayoutAnalysis-master\DocumentLayoutAnalysis\DocumentLayoutAnalysis\ImageTest.cs:line 25
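The trace suggests the exception comes from adding a font header date value to a DateTime. TrueType header dates are stored as a count of seconds since midnight on 1 January 1904, so a corrupt or unusual value can fall outside what DateTime can represent. Below is a minimal sketch of a tolerant conversion that returns null instead of throwing. The class and method names are hypothetical, and this is not necessarily how the library handles it.

```csharp
using System;

public static class TrueTypeDates
{
    // TrueType "long date time" values count seconds from this epoch.
    private static readonly DateTime Epoch =
        new DateTime(1904, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Returns the corresponding DateTime, or null when the value lies
    // outside the range DateTime can represent (for example in a font
    // with a garbage date field).
    public static DateTime? FromSecondsSince1904(long seconds)
    {
        long minSeconds = (long)(DateTime.MinValue - Epoch).TotalSeconds;
        long maxSeconds = (long)(DateTime.MaxValue - Epoch).TotalSeconds;

        if (seconds < minSeconds || seconds > maxSeconds)
        {
            return null;
        }

        return Epoch.AddSeconds(seconds);
    }
}
```

Since the created/modified dates of an embedded font are rarely needed for text extraction, treating an unreadable date as missing seems a reasonable trade-off under lenient parsing.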

Details on the system :
OS : MS Windows v10
VS : VS 2017 C#
.NET version : .NET framework 4.6.1

19937571.pdf

Incorrect number format when using a non-"en-US" number style

Hi. Thank you for this library, it is good, but I have a problem.
When I use a PC whose number format is not "en-US", an exception is thrown by default when I try to open any PDF file.
I tried to fix this but have not yet found all the parsing functions.

For example:

    private static decimal ReadDecimal(IInputBytes input)
    {
        decimal result;

        var str = ReadString(input);

        Decimal.TryParse(str, NumberStyles.Any, new CultureInfo("en-US"), out result); // <-

        return result;
    }
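An alternative to constructing an "en-US" culture on every call is CultureInfo.InvariantCulture, which also uses '.' as the decimal separator and does not depend on the machine's settings. The helper below is a sketch with a hypothetical name; NumberStyles.Float is an assumption about which number forms should be accepted.

```csharp
using System.Globalization;

public static class PdfNumberParsing
{
    // Parses a number written with '.' as the decimal separator,
    // regardless of the current thread culture. Returns false and
    // leaves value at 0 when the text is not a valid number.
    public static bool TryParsePdfDecimal(string text, out decimal value)
    {
        return decimal.TryParse(
            text,
            NumberStyles.Float,
            CultureInfo.InvariantCulture,
            out value);
    }
}
```

The same culture argument would need to be passed wherever numbers are parsed or formatted with decimal.Parse, double.Parse, or ToString.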

Sorry for my english. Thank you for your work.

GetPage fails with error "Could not find a name for this font"

Steps to reproduce:

Download this PDF File: Tackling the Poor Assumptions of Naive Bayes Text Classifiers
Call PdfDocument.Open(...)
Call document.GetPage(1)

The call to GetPage fails with the following error:

UglyToad.PdfPig.Fonts.Exceptions.InvalidFontFormatException: Could not find a name for this font (/Type, /Font) (/Subtype, /Type1) (/FirstChar, COSInt{0}) (/LastChar, COSInt{127}) (/Widths, COSObject{325, 0}) (/BaseFont, COSObject{331, 0}) (/FontDescriptor, COSObject{332, 0}) .
   at UglyToad.PdfPig.Fonts.Parser.FontDictionaryAccessHelper.GetName(PdfDictionary dictionary, FontDescriptor descriptor)
   at UglyToad.PdfPig.Fonts.Parser.Handlers.Type1FontHandler.Generate(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Fonts.FontFactory.Get(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Content.ResourceContainer.LoadFontDictionary(PdfDictionary fontDictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Content.ResourceContainer.LoadResourceDictionary(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, PdfDictionary dictionary, PageTreeMembers pageTreeMembers, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
   at FactReapUsage.Program.Main(String[] args) in C:\Code\FactReapUsage\FactReapUsage\FactReapUsage\Program.cs:line 19

Optimize TryReadStream

In profiling done for #47, PdfTokenScanner.TryReadStream was called a total of 1,148 times and took up 1/6th of the total time for parsing a set of 26 documents. This is probably low-hanging fruit for performance optimization since in general we know the length of the stream ahead of time.

Letter width/font is incorrect

Hi, I'm trying to extract text letters and positions from PDFs. For most documents it's working great, but for the attached sample (and many others) it's returning Letter.Width=0 and Letter.FontSize=1.
Any ideas how to work around this? Thank you!
letter_size_problem.pdf

            using (PdfDocument document = PdfDocument.Open(@"letter_size_problem.pdf"))
            {
                var page = document.GetPage(1);

                Letter l = page.Letters[0];

                decimal x = l.Location.X;
                decimal y = l.Location.Y;
                decimal width = l.Width;
                decimal fontSize = l.FontSize;
            }

Add word extraction

A barrier to adoption of the library is probably the lack of a "batteries included" text extraction API. We support retrieving letters and their size but each client must write their own word generation logic. We should include a naive default with pluggable interface. For example:

var document = PdfDocument.Open("somedocument.pdf");
var page = document.GetPage(1);
IEnumerable<Word> words = page.GetWords();

where the GetWords method takes an optional parameter:

IEnumerable<Word> GetWords(IWordExtractor extractor = null)

which, if not set, uses the internal library implementation. It doesn't need to be great for now but should at least do the obvious things right...
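A sketch of how the naive default and the pluggable interface might fit together. All type names here are hypothetical stand-ins (prefixed "Sketch") rather than the library's Letter, Word or IWordExtractor, and words are returned as plain strings for brevity. The default simply splits on letters whose value is whitespace, which assumes the document contains explicit space characters; it does not look at letter positions.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a letter; only the text value is modelled.
public sealed class SketchLetter
{
    public string Value { get; }
    public SketchLetter(string value) { Value = value; }
}

// Hypothetical pluggable extractor interface.
public interface ISketchWordExtractor
{
    IEnumerable<string> GetWords(IReadOnlyList<SketchLetter> letters);
}

// Naive default: a run of non-whitespace letters forms one word.
public sealed class WhitespaceWordExtractor : ISketchWordExtractor
{
    public IEnumerable<string> GetWords(IReadOnlyList<SketchLetter> letters)
    {
        var builder = new StringBuilder();

        foreach (var letter in letters)
        {
            if (string.IsNullOrWhiteSpace(letter.Value))
            {
                // Whitespace (or a letter with no value) ends the current word.
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }

                continue;
            }

            builder.Append(letter.Value);
        }

        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}

public static class SketchPage
{
    // Falls back to the built-in extractor when none is supplied.
    public static IEnumerable<string> GetWords(
        IReadOnlyList<SketchLetter> letters,
        ISketchWordExtractor extractor = null)
    {
        return (extractor ?? new WhitespaceWordExtractor()).GetWords(letters);
    }
}
```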

Support for custom document properties in PDFs?

Hi,

Can I ask if there are any plans and/or possibility of supporting custom document properties in PDF files as well as the 'known' ones (Author and Keywords and such)?

I haven't used PDFBox, but the documentation for PDDocumentInformation does seem to have functions to access custom properties.

Optimize TransformationMatrix

As seen in #47, multiplication operations on TransformationMatrix take over 1/3rd of the total parsing time for PdfPig. We will investigate the optimizations/tradeoffs of using floats instead of decimals, which may result in a large speedup. It's also worth checking the performance impact of storing the values directly (9 internal decimals rather than an array), which may either be slower due to the larger value to copy or faster due to removing repeated array access.
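To make the "values directly" variant concrete, here is a sketch of a 3x3 matrix held in nine fields with a fully unrolled multiplication. It uses double rather than decimal, combining both ideas from this ticket; the type name and layout (row-major, rows A B C / D E F / G H I) are assumptions, and whether it is actually faster would need to be measured.

```csharp
// Hypothetical 3x3 matrix stored as nine fields instead of an array.
public readonly struct SketchMatrix
{
    public readonly double A, B, C, D, E, F, G, H, I;

    public SketchMatrix(double a, double b, double c,
                        double d, double e, double f,
                        double g, double h, double i)
    {
        A = a; B = b; C = c;
        D = d; E = e; F = f;
        G = g; H = h; I = i;
    }

    public static SketchMatrix Identity =>
        new SketchMatrix(1, 0, 0, 0, 1, 0, 0, 0, 1);

    // Returns this * m with every term written out, so there is no
    // array indexing or looping on the hot path.
    public SketchMatrix Multiply(SketchMatrix m)
    {
        return new SketchMatrix(
            A * m.A + B * m.D + C * m.G, A * m.B + B * m.E + C * m.H, A * m.C + B * m.F + C * m.I,
            D * m.A + E * m.D + F * m.G, D * m.B + E * m.E + F * m.H, D * m.C + E * m.F + F * m.I,
            G * m.A + H * m.D + I * m.G, G * m.B + H * m.E + I * m.H, G * m.C + H * m.F + I * m.I);
    }
}
```

Because PDF transformation matrices always have a third column of 0, 0, 1, a further optimization would be to store only six values, at the cost of a less general type.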

Some paths are missing in page

Hi,

I am trying to retrieve all the paths from this PDF document, but it seems some of them are missing.

When drawing all the bounding boxes found, this is what I get (the PdfPath.BezierCurve are in red, and the PdfPath.Line are in blue):

As you can see, for each of the charts, only one line contains bounding boxes, the others seem to be ignored. Same issue for grid lines: some are drawn and some are not.

Am I doing something wrong, or are they really missing?
Thanks,

The code I used is the following:

        using (PdfDocument document = PdfDocument.Open(path))
        {
            for (var i = 0; i < document.NumberOfPages; i++)
            {
                var page = document.GetPage(i + 1);
                var paths = page.ExperimentalAccess.Paths;

                using (var bitmap = converter.GetPage(i + 1, zoom))
                using (var graphics = Graphics.FromImage(bitmap))
                {
                    var imageHeight = bitmap.Height;

                    foreach (var p in paths)
                    {
                        if (p == null) continue;
                        var commands = p.Commands;

                        foreach (var command in commands)
                        {
                            if (command is PdfPath.Line line)
                            {
                                var bbox = line.GetBoundingRectangle();
                                if (bbox.HasValue)
                                {
                                    var rect = new Rectangle(
                                        (int)(bbox.Value.Left * (decimal)zoom),
                                        imageHeight - (int)(bbox.Value.Top * (decimal)zoom),
                                        (int)(bbox.Value.Width == 0 ? 1 : bbox.Value.Width * (decimal)zoom),
                                        (int)(bbox.Value.Height == 0 ? 1 : bbox.Value.Height * (decimal)zoom));
                                    graphics.DrawRectangle(bluePen, rect);
                                }
                            }
                            else if (command is PdfPath.BezierCurve curve)
                            {
                                var bbox = curve.GetBoundingRectangle();
                                if (bbox.HasValue)
                                {
                                    var rect = new Rectangle(
                                        (int)(bbox.Value.Left * (decimal)zoom),
                                        imageHeight - (int)(bbox.Value.Top * (decimal)zoom),
                                        (int)(bbox.Value.Width == 0 ? 1 : bbox.Value.Width * (decimal)zoom),
                                        (int)(bbox.Value.Height == 0 ? 1 : bbox.Value.Height * (decimal)zoom));
                                    graphics.DrawRectangle(redPen, rect);
                                }
                            }
                        }
                    }
                }
            }
        }

Make content stream operators public

The classes in the UglyToad.PdfPig.Graphics.Operations namespace represent all operations a page's content stream can contain. Finish implementing writing for all of them and make them public. Use a reflection based test to ensure they can all be written.

Support reading file from PDFBox support ticket

Reading the file in this support ticket currently throws. Add the necessary steps to support reading it:
https://issues.apache.org/jira/browse/PDFBOX-4299

Page.Letters is empty for document which contains text

Using attached document and program below nothing is written to the console. Sample PDF came from a commercial HTML to PDF library one of our customers uses.

Getting_Started.pdf

using System;
using UglyToad.PdfPig;

namespace ExtractTest
{
    class Program
    {
        static void Main(string[] args)
        {
            using (PdfDocument document = PdfDocument.Open("Getting_Started.pdf"))
            {
                for (var i = 0; i < document.NumberOfPages; i++)
                {
                    var page = document.GetPage(i + 1);
                    foreach (var letter in page.Letters)
                    {
                        Console.WriteLine(letter.Value);
                    }
                }
            }
        }
    }
}

Test, refactor and prepare the FontDescriptor class for public API

From the spec:

A font descriptor specifies metrics and other attributes of a simple font or a CIDFont as a whole, as distinct from the metrics of individual glyphs. These font metrics provide information that enables a consumer application to synthesize a substitute font or select a similar font when the font program is unavailable. The font descriptor may also be used to embed the font program in the PDF file.

We have a class to represent the FontDescriptor but it's a bit of a mess. Test it, tidy up its creation and generally improve the public API for this class (it is currently internal; aim to make it public for 0.0.2).

Complete access to images from the PDF

Somewhere in the code I added support for reading images from the PDF just as the raw object stream bytes. We won't add much more than this for now but this should be nicely wrapped in an image class with a type enum, size and position on the page if this doesn't require a PNG decoder or something fancy.

Any other easily exposed metadata should be included. Add documents to test this.

BoundingBox for Images

Hi,

Unless I am mistaken, there is no support for getting an image's BoundingBox. Is there any plan to add this functionality?

Thanks!

Type 1 CharString performance improvements and subr support

I think I forgot to cache the bounding rectangles for Type 1 CharString decryption and they don't currently interpret calls to the subr (subroutine) command. This should be added and performance profiled.

This code is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig/Fonts/Type1/CharStrings/Type1CharStrings.cs

Test odd page numbered documents

A PDF document can be created containing only the pages 3, 5 and 7. Test how the current pages API handles this and make any necessary changes to allow consumption of documents which miss pages.

Enable loading a document from a PdfDocument into a PdfDocumentBuilder

In order for the library to be useful it must support editing as well as creating and reading. For this we need a way to read an existing document and convert it to a document builder.

Support AES-256bit encryption

I really like this library. The letters feature that shows me the glyph rectangle is very helpful. The problem I'm having is that a lot of my PDF documents come encrypted with 256-bit AES. Please support it! Thank you!

Inspect Type 1 CFF glyph positions and sizes

The Pig Production Handbook.pdf page 1 has a visual verification test:

As you can see the glyph boxes are way out. Investigate, add a test that asserts the positions of some letters using either xfinium pdf inspector or pdfbox to find the expected positions and fix this.

The test is:

GenerateLetterBoundingBoxImages.PigProductionCompactFontFormat

Some text is missing

For the attached PDF the charts and the text around them are missing. For example the text "Historical Arrears by Month" is not in the Letters collection or the page's extracted text, and neither are any of the numbers/labels on the charts. The lines (paths collection) are also missing everything around the charts area. Is there possibly a sub-stream which is not being processed?

missing_text_sample.pdf

How to get page orientation?

First of all, thank you! This library is great! But it seems there are some minor issues :-)

It seems page width/height is not correctly reported when page is rotated.
E.g. in the attached document it's showing width=612 and height = 792, but it's in landscape. So should it be reversed? Or have some orientation flag similar to PdfBox "page.findRotation()" method?

letter_size_problem.pdf

            
using (PdfDocument document = PdfDocument.Open(@"letter_size_problem.pdf"))
{
    var page = document.GetPage(1);
    decimal width = page.Width;
    decimal height = page.Height;
}
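Assuming the landscape appearance comes from the page's /Rotate entry (which the PDF format restricts to multiples of 90 degrees), the displayed size could be derived from the reported width and height as sketched below. The helper name is hypothetical and it takes the rotation as a plain integer, since how the library would expose that value is not settled here.

```csharp
public static class PageOrientation
{
    // Returns the size as displayed: width and height are swapped when
    // the page is rotated by 90 or 270 degrees. Negative rotations and
    // values of 360 or more are normalized into the range 0..359 first.
    public static (decimal Width, decimal Height) GetDisplayedSize(
        decimal width, decimal height, int rotationDegrees)
    {
        int normalized = ((rotationDegrees % 360) + 360) % 360;
        bool swap = normalized == 90 || normalized == 270;

        return swap ? (height, width) : (width, height);
    }
}
```

With the values from this report, a 612 by 792 page rotated by 90 degrees would be displayed as 792 by 612.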

Support PDF documents using named system fonts

The heights of letters in the document Multiple Page - from Mortality Statistics.pdf are currently wrong because it uses a TrueType font which is not included in the document (ArialMT). In this case the provider is meant to use the font files from the host operating system.

Line 78 of TrueTypeFontHandler has a TODO describing what PDFBox does in this situation, we should do the same thing.

Optimize SystemFontFinder

In profiling done for #47 SystemFontFinder.GetTrueTypeFontNamed was called 18 times for a total of 4 seconds of the 29 second total.

This code is slow because it has to scan all fonts on the host operating system, but it can be optimized trivially by using a static cache rather than a per-call one. It may also be quicker to use File.ReadAllBytes rather than a FileStream as the input to TrueType parsing.
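A sketch of the static-cache idea, with the expensive scan passed in as a delegate so the example stays self-contained. The class name, the delegate parameter and the choice to cache raw bytes are assumptions; the real finder could equally cache parsed fonts or a name-to-path index.

```csharp
using System;
using System.Collections.Concurrent;

public static class CachedFontLookup
{
    // Shared across all calls and documents for the lifetime of the process.
    private static readonly ConcurrentDictionary<string, byte[]> Cache =
        new ConcurrentDictionary<string, byte[]>();

    // Returns the cached bytes for fontName, invoking scanSystemFonts
    // only when the name has not been seen before. Under concurrent
    // access the delegate may run more than once for the same name,
    // but only one result is stored.
    public static byte[] GetOrLoad(string fontName, Func<string, byte[]> scanSystemFonts)
    {
        return Cache.GetOrAdd(fontName, scanSystemFonts);
    }
}
```

In use, the delegate would wrap the existing scan of the operating system's font directories, so repeated requests for the same name (18 in the profile above) would pay the cost once.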

Enable creation of document containing AcroForm elements

Support the inclusion of AcroForm elements when using the document builder.

Prepare TrueType fonts for public read-only API

We read TrueType fonts from PDF files, however the current class is a mess. Tidy it up and make it public.

Make all the token classes public. Expose via a StructureExplorer class or similar.

It will be useful for more advanced users to directly access the underlying PDF tokens and objects to work around currently unsupported behaviour.

Suggested API would be something like:

document.ContentExplorer

Which would provide access to the xref table to navigate directly to objects as well as inspecting the tokens forming those objects and being able to decode streams with filters.

To this end the classes in the UglyToad.PdfPig.Tokenization.Tokens namespace should be moved to UglyToad.PdfPig.Tokens, gaps in test coverage fixed and any mutability prevented. A general sanity check before exposing on the public API.

Please add a License file too

Which license type do you choose for your project?

Apache 2, as in the original?

Thank you

Document metadata/XMP access?

Hi,

Any plans or thoughts about adding a direct means of getting the XMP metadata of a document?
Looks like I can get hold of the data by doing something like

doc.Structure.Catalog.CatalogDictionary.TryGet<IndirectReferenceToken>(NameToken.Metadata, out var token);

var objectToken = doc.Structure.GetObject(token.Data);
var streamToken = objectToken.Data as UglyToad.PdfPig.Tokens.StreamToken;

and then parsing streamToken.Data with XmpCore, but it might be useful to be able to get at the data more directly (not sure what the best format to expose it as would be though).

Enable reading values from the document's AcroForm

Support retrieval of information from a document's AcroForm including textboxes, select lists, radio buttons, checkboxes and other form elements.

Enable SourceLink for the next release

SourceLink enables end users to debug the code for libraries in NuGet packages. We should enable it to make everyone's life easier.

https://www.hanselman.com/blog/ExploringNETCoresSourceLinkSteppingIntoTheSourceCodeOfNuGetPackagesYouDontOwn.aspx

Enable SourceLink and test with another solution to check it works.

[enhancement] Add PdfRectangle.IntersectsWith

It would be helpful to have bool PdfRectangle.IntersectsWith(PdfRectangle other) added.

I often need to extract text from a given location so being able to check the bounding boxes using this would be convenient.

(IntersectsWith is what System.Drawing.Rectangle uses as the name, so suggesting that to be consistent)
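A sketch of the requested check for axis-aligned rectangles. The struct below is a hypothetical stand-in for PdfRectangle with assumed Left/Bottom/Right/Top members. Rectangles that only touch along an edge are treated as not intersecting here; that is a choice of this sketch and could be relaxed by using <= instead of <.

```csharp
// Hypothetical stand-in for PdfRectangle (axis-aligned, PDF coordinates
// where Bottom is less than Top).
public readonly struct SketchRectangle
{
    public decimal Left { get; }
    public decimal Bottom { get; }
    public decimal Right { get; }
    public decimal Top { get; }

    public SketchRectangle(decimal left, decimal bottom, decimal right, decimal top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // Two rectangles overlap when their extents overlap on both axes.
    public bool IntersectsWith(SketchRectangle other)
    {
        return Left < other.Right
            && other.Left < Right
            && Bottom < other.Top
            && other.Bottom < Top;
    }
}
```

To pull text from a region, a caller could then keep only the letters whose glyph rectangle intersects the region of interest.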

Create "Hello World" PDF

The major feature of the next release should be the ability to create PDF documents, for now they will only support the addition of plain text.

This is the first ticket: implement enough of the API to create a single-page A4 PDF document containing the text "Hello World!" on a single line.

Inspect Type 1 glyph positions and locations

In the visual verification test for the LaTeX integration test document, glyph bounding boxes for Type 1 fonts appear to be roughly the right shape but in the wrong position and at the wrong scale.

It's possible this is down to not using the right font matrix for Type 1 fonts or something else entirely. Add a test or tests which assert against glyph positions from a 3rd party tool similar to the SinglePageNonLatinAcrobatDistillerTests. You can use https://www.xfiniumpdf.com/xfinium-pdf-downloads.html to get these bounding boxes.

If they prove to be incorrect fix them.

StackOverflowException reading corrupt PDF document

Hi,

I've been doing a few tests with PdfPig 0.0.6, and one of the things I tried was loading the invalid pdf file in corrupt.zip in it, and that seems to result in a StackOverflowException being thrown from mscorlib (via NameTokenizer.TryTokenize I think).

For reference, that file was generated by running SharpFuzz against the PDFClown library (which also fails with a stack overflow when trying to load that file).

GetPage fails with error : 'Cannot convert array to rectangle'

Hello,

When I try to open a PDF file and read it, I get an error:

UglyToad.PdfPig.Exceptions.PdfDocumentFormatException : 'Cannot convert array to rectangle, expected 4 values instead got: [ 0, 0 ].'

UglyToad.PdfPig.Exceptions.PdfDocumentFormatException
HResult=0x80131500
Message=Cannot convert array to rectangle, expected 4 values instead got: [ 0, 0 ].
Source=UglyToad.PdfPig
Call stack:
at UglyToad.PdfPig.Util.ArrayTokenExtensions.ToIntRectangle(ArrayToken array)
at UglyToad.PdfPig.Parser.PageFactory.GetMediaBox(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
at WindowsFormsApp2.Form1.button1_Click(Object sender, EventArgs e) in C:\Users\source\repos\WindowsFormsApp2\WindowsFormsApp2\Form1.cs:line 40
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.DebuggableCallback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr dwComponentID, Int32 reason, Int32 pvLoopData)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.Run(Form mainForm)
at WindowsFormsApp2.Program.Main() in C:\Users\source\repos\WindowsFormsApp2\WindowsFormsApp2\Program.cs:line 19

I hope this helps you to fix it.

testing page.text

page.Text did not give text with newlines.
I am using this code on the latest code of PdfPig:

    using (PdfDocument doc = PdfDocument.Open(textBox1.Text, new ParsingOptions { Password = textBox2.Text }))
    {
        var page = doc.GetPage(1);
        string pagetext = page.Text;
        File.WriteAllText("text.txt", pagetext);
        textBox3.Text = pagetext;
    }

Implement support for the gs content stream operator

PDF Page content streams can contain the gs operator:

Set[s] parameters from graphics state parameter dictionary

This currently has no effect which can lead to letters being given the wrong size.

Each entry in the parameter dictionary specifies the value of an individual graphics state parameter, as shown in Table 4.8. All entries need not be present for every invocation of the gs operator; the supplied parameter dictionary may include any combination of parameter entries.

Implement support for setting the graphics state from the graphics state parameter dictionary.

Make PdfRectangle rotatable

Currently PDF rectangle is always assumed to be horizontal. This does not work for rotated text. Make sure it supports angled rectangles too. The result of these changes can be assessed against the visual verification for Rotated Text Libre Office.pdf

Expected name as dictionary key, instead got: Ghostscript

Hi there,

I'm trying to extract the text of a PDF generated by Ghostscript. The pdf itself seems fine, I tried to display it with a PDF viewer, which works. Also text extraction with iTextSharp seems to work. However, if I try to read the PDF with PdfPig, then I get the following exception:

PdfDocumentFormatException: Expected name as dictionary key, instead got: Ghostscript

I've looked at the pdf source to look for references to 'Ghostscript' and found the following snippet:

<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(GPL Ghostscript 9.25)
/CreationDate(D:20190813110636Z00'00')
/ModDate(D:20190813110636Z00'00')
/Creator(OpenText Capture Recognition Engine \(RecoStar\) 7.8.0)>>endobj
xref

If I set a breakpoint and inspect the tokens, this indeed seems the place where the exception occurs. It seems that the parser cannot handle this kind of syntax. I must say, I don't have enough knowledge around the PDF format to know if this syntax is allowed, but in any case it exists in the wild with documents generated by RecoStar / Ghostscript.
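For what it is worth, under the PDF format's lexical rules a name token ends at whitespace or at any delimiter character, so /Producer(GPL ...) with no space before the opening parenthesis should be acceptable. One plausible reading of the exception is that the name tokenizer only stops at whitespace and so swallows "(GPL" into the name. The sketch below shows a name reader that stops at delimiters; it is an illustration with hypothetical names, not the library's tokenizer, and it ignores #xx escape sequences inside names.

```csharp
using System.Text;

public static class NameTokenSketch
{
    // PDF delimiter and whitespace characters that terminate a name.
    private const string Delimiters = "()<>[]{}/%";
    private const string Whitespace = "\0\t\n\f\r ";

    // Reads the name starting at input[start], which is expected to be '/'.
    // Returns the name without the leading slash; end receives the index
    // of the first character after the name.
    public static string ReadName(string input, int start, out int end)
    {
        var builder = new StringBuilder();
        int i = start + 1; // skip the leading '/'

        while (i < input.Length
            && Whitespace.IndexOf(input[i]) < 0
            && Delimiters.IndexOf(input[i]) < 0)
        {
            builder.Append(input[i]);
            i++;
        }

        end = i;
        return builder.ToString();
    }
}
```

Applied to the snippet above, reading from the '/' in "<</Producer(GPL Ghostscript 9.25)" yields the name "Producer" and leaves the reader positioned at the '(' that opens the string value.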

Do you have any advice?