uglytoad / pdfpig
Read and extract text and other content from PDFs in C# (port of PDFBox)
Home Page: https://github.com/UglyToad/PdfPig/wiki
License: Apache License 2.0
A letter in a PDF has the following information:
To illustrate this, consider the following SVG of a character 'o' or '0' taken from a PDF:
The red dot marks the placement origin and the blue box marks the bounding box of the glyph itself. Notice how the box extends below the origin; it can also extend to the left of the origin or, as in this case, not include the origin at all. The advance width for this character would probably be greater than the bounding box width, since the origin is outside the character.
A letter should have origin as PdfPoint, glyph bounding box as PdfRectangle and Width as decimal with comments explaining the above.
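A minimal sketch of how those three pieces of information could sit together. All type names here (SketchPoint, SketchRectangle, SketchLetter) are hypothetical stand-ins for illustration, not PdfPig's own PdfPoint, PdfRectangle or Letter:

```csharp
using System;

// Hypothetical stand-in for a point type; not PdfPig's PdfPoint.
public struct SketchPoint
{
    public readonly decimal X, Y;
    public SketchPoint(decimal x, decimal y) { X = x; Y = y; }
}

// Hypothetical stand-in for a rectangle type; not PdfPig's PdfRectangle.
public struct SketchRectangle
{
    public readonly decimal Left, Bottom, Right, Top;

    public SketchRectangle(decimal left, decimal bottom, decimal right, decimal top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public decimal Width => Right - Left;

    public bool Contains(SketchPoint p) =>
        p.X >= Left && p.X <= Right && p.Y >= Bottom && p.Y <= Top;
}

// Hypothetical letter type carrying the three values described above.
public class SketchLetter
{
    // Placement origin: the point the glyph is positioned from.
    public SketchPoint Origin { get; }

    // Box around the drawn glyph; it may extend below or to the left of
    // the origin, or not include the origin at all.
    public SketchRectangle GlyphBounds { get; }

    // Advance width: how far the next glyph's origin moves. It can be
    // larger than GlyphBounds.Width.
    public decimal Width { get; }

    public SketchLetter(SketchPoint origin, SketchRectangle glyphBounds, decimal width)
    {
        Origin = origin;
        GlyphBounds = glyphBounds;
        Width = width;
    }
}
```

For a glyph like the one pictured, GlyphBounds.Contains(Origin) would be false and Width would exceed GlyphBounds.Width.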
Hi,
I tried loading some simple test documents into PdfPig 0.0.6, and noticed that Unicode document properties don't seem to be handled correctly.
For example, if I open the attached minimal.pdf in Acrobat Reader it displays:
but in PdfDocument.Information, I get:
Would this be expected to work?
Thanks.
As part of the document creation epic for the next release we should handle wrapping text automatically (nothing fancy like working out the right place to line-break). Create a document that shows the following capabilities:
Due to the poor performance of PdfPig in end-user scenarios, we should measure the impact of replacing decimals with floats where the values are used in calculations (all TransformationMatrix-based code).
If the benefits from #64 aren't considered good enough then it may be that calculated values are better off being float-based.
Type 1 and CID fonts can use the Compact Font Format: http://wwwimages.adobe.com/www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5176.CFF.pdf
Currently two tests are failing because CFF parsing is not yet supported. Implement it.
Changeset: 7fab13e
Test Name: UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion
Test FullName: UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion
Test Source: C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig.Tests\Fonts\Type1\Type1FontParserTests.cs : line 15
Test Outcome: Failed
Test Duration: 0:00:00.005
Result StackTrace:
at UglyToad.PdfPig.Fonts.Type1.Parser.Type1Tokenizer.ReadNextToken() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1Tokenizer.cs:line 59
at UglyToad.PdfPig.Fonts.Type1.Parser.Type1Tokenizer.GetNext() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1Tokenizer.cs:line 33
at UglyToad.PdfPig.Fonts.Type1.Parser.Type1EncryptedPortionParser.Parse(IReadOnlyList`1 bytes, Boolean isLenientParsing) in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1EncryptedPortionParser.cs:line 40
at UglyToad.PdfPig.Fonts.Type1.Parser.Type1FontParser.Parse(IInputBytes inputBytes, Int32 length1, Int32 length2) in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1FontParser.cs:line 149
at UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig.Tests\Fonts\Type1\Type1FontParserTests.cs:line 19
Result Message: System.InvalidOperationException : Encountered an end of string ')' outside of string.
Once CFF parser is implemented in #6 ensure it handles the case detailed in this ticket: https://issues.apache.org/jira/browse/PDFBOX-4330
In this file all letters on pages 1-54 have Value=null. I guess this is due to font "TTdcr10"?
Starting from page 55 letters are extracted correctly (when font is changing to ArialMT).
Is this a known issue? It seems to be working in PdfBox.
This is not exactly an issue, but more like a general question. While extracting Letters collection I noticed that overall the process runs about 4-5 times slower than PdfBox. I run PdfBox through Ikvm, so I was expecting it to be the other way around :-)
Of course there can be many things contributing to this, but I did one quick test - I ran a mass replace of word "decimal" to "double" across the whole code base. And yes, the speed got right on par with PdfBox! Changing it to "float" made it even a little faster (probably due to smaller memory footprint).
Sure, double/float is not precise, but personally I often need to run extraction over hundreds of thousands of PDFs, so speed is crucial and the time difference is substantial. On the other hand, I think that letter/line positions and dimensions would be fine with 2 digits of precision at most (OK, maybe 3 :-))
I see a few ways to approach this:
Thoughts? Thank you for your time and all the work put in this library!
Hello there,
When I run any of the samples you provided, an ArgumentOutOfRangeException is thrown at var page = document.GetPage(i + 1);
even though document.NumberOfPages returns the correct number of pages. The relevant information is as follows.
The stack trace:
at System.DateTime.Add(Double value, Int32 scale)
at UglyToad.PdfPig.Fonts.TrueType.TrueTypeDataBytes.ReadInternationalDate() in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\TrueTypeDataBytes.cs:line 114
at UglyToad.PdfPig.Fonts.TrueType.Tables.HeaderTable.Load(TrueTypeDataBytes data, TrueTypeHeaderTable table) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Tables\HeaderTable.cs:line 97
at UglyToad.PdfPig.Fonts.TrueType.Parser.TrueTypeFontParser.ParseTables(Decimal version, IReadOnlyDictionary`2 tables, TrueTypeDataBytes data) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Parser\TrueTypeFontParser.cs:line 59
at UglyToad.PdfPig.Fonts.TrueType.Parser.TrueTypeFontParser.Parse(TrueTypeDataBytes data) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Parser\TrueTypeFontParser.cs:line 35
at UglyToad.PdfPig.Fonts.Parser.Parts.CidFontFactory.ReadDescriptorFile(FontDescriptor descriptor) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Parts\CidFontFactory.cs:line 114
at UglyToad.PdfPig.Fonts.Parser.Parts.CidFontFactory.Generate(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Parts\CidFontFactory.cs:line 56
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type0FontHandler.ParseDescendant(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Handlers\Type0FontHandler.cs:line 128
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Handlers\Type0FontHandler.cs:line 34
at UglyToad.PdfPig.Fonts.FontFactory.Get(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\FontFactory.cs:line 51
at UglyToad.PdfPig.Content.ResourceContainer.LoadFontDictionary(DictionaryToken fontDictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\ResourceContainer.cs:line 93
at UglyToad.PdfPig.Content.ResourceContainer.LoadResourceDictionary(DictionaryToken resourceDictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\ResourceContainer.cs:line 33
at UglyToad.PdfPig.Parser.PageFactory.LoadResources(DictionaryToken dictionary, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 215
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 67
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\Pages.cs:line 62
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber) in C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\PdfDocument.cs:line 158
at DocumentLayoutAnalysis.ImageTest.Run(String path) in F:\IISC\My Lab\DocumentLayoutAnalysis-master\DocumentLayoutAnalysis\DocumentLayoutAnalysis\ImageTest.cs:line 25
Details on the system :
OS : MS Windows v10
VS : VS 2017 C#
.NET version : .NET framework 4.6.1
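The trace ends in DateTime.Add called from ReadInternationalDate, which suggests the font's header table holds a timestamp that falls outside the range DateTime can represent. A minimal sketch of a range-checked conversion, assuming the value is a TrueType LONGDATETIME (seconds since midnight, 1 January 1904). The class and method names are hypothetical and this is not the library's actual fix:

```csharp
using System;

// Hypothetical helper; not part of PdfPig.
public static class TrueTypeDateSketch
{
    // LONGDATETIME values count seconds from this epoch.
    private static readonly DateTime Epoch =
        new DateTime(1904, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Smallest and largest second offsets that still fit in a DateTime.
    private static readonly long MinSeconds =
        (DateTime.MinValue.Ticks - Epoch.Ticks) / TimeSpan.TicksPerSecond;
    private static readonly long MaxSeconds =
        (DateTime.MaxValue.Ticks - Epoch.Ticks) / TimeSpan.TicksPerSecond;

    // Returns null instead of throwing when the stored value is out of range.
    public static DateTime? FromSecondsSince1904(long seconds)
    {
        if (seconds < MinSeconds || seconds > MaxSeconds)
        {
            return null;
        }

        return new DateTime(Epoch.Ticks + seconds * TimeSpan.TicksPerSecond, DateTimeKind.Utc);
    }
}
```

A caller could treat null as "date unknown" so that a bad timestamp in an embedded font does not prevent the page from loading.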
Hi. Thank you for this library. It is good, but I have a problem.
When I use a PC whose number format is not "en-US", an exception is thrown by default when I try to open any PDF file.
I tried to fix this but have not yet found all the parse functions.
For example:
private static decimal ReadDecimal(IInputBytes input)
{
    decimal result;
    var str = ReadString(input);
    Decimal.TryParse(str, NumberStyles.Any, new CultureInfo("en-US"), out result); // <-
    return result;
}
Sorry for my English. Thank you for your work.
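The same idea can be expressed without depending on the "en-US" culture being installed. A minimal sketch, assuming the input is a plain PDF-style number with an optional sign and a '.' decimal separator; the helper name is hypothetical and not part of PdfPig:

```csharp
using System;
using System.Globalization;

// Hypothetical helper; not part of PdfPig.
public static class InvariantNumberSketch
{
    // Parses with the invariant culture so the result does not depend on
    // the regional settings of the machine running the code.
    public static bool TryParsePdfNumber(string text, out decimal value)
    {
        return decimal.TryParse(
            text,
            NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture,
            out value);
    }
}
```

Calls such as decimal.Parse(text) with no culture argument use the current culture, which is why the behaviour changes between machines.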
Steps to reproduce:
PdfDocument.Open(...)
The call to GetPage fails with the following error:
UglyToad.PdfPig.Fonts.Exceptions.InvalidFontFormatException: Could not find a name for this font (/Type, /Font) (/Subtype, /Type1) (/FirstChar, COSInt{0}) (/LastChar, COSInt{127}) (/Widths, COSObject{325, 0}) (/BaseFont, COSObject{331, 0}) (/FontDescriptor, COSObject{332, 0}) .
at UglyToad.PdfPig.Fonts.Parser.FontDictionaryAccessHelper.GetName(PdfDictionary dictionary, FontDescriptor descriptor)
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type1FontHandler.Generate(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
at UglyToad.PdfPig.Fonts.FontFactory.Get(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
at UglyToad.PdfPig.Content.ResourceContainer.LoadFontDictionary(PdfDictionary fontDictionary, IRandomAccessRead reader, Boolean isLenientParsing)
at UglyToad.PdfPig.Content.ResourceContainer.LoadResourceDictionary(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, PdfDictionary dictionary, PageTreeMembers pageTreeMembers, IRandomAccessRead reader, Boolean isLenientParsing)
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
at FactReapUsage.Program.Main(String[] args) in C:\Code\FactReapUsage\FactReapUsage\FactReapUsage\Program.cs:line 19
In profiling done for #47, PdfTokenScanner.TryReadStream takes up 1/6th of the total time for parsing a set of 26 documents, being called a total of 1,148 times. This is probably low-hanging fruit for performance optimization since in general we know the length of the stream ahead of time.
Hi, I'm trying to extract text letters and positions from PDFs. For most documents it's working great, but for the attached sample (and many others) it's returning Letter.Width=0 and Letter.FontSize=1.
Any ideas how to work around this? Thank you!
letter_size_problem.pdf
using (PdfDocument document = PdfDocument.Open(@"letter_size_problem.pdf"))
{
    var page = document.GetPage(1);
    Letter l = page.Letters[0];
    decimal x = l.Location.X;
    decimal y = l.Location.Y;
    decimal width = l.Width;
    decimal fontSize = l.FontSize;
}
A barrier to adoption of the library is probably the lack of a "batteries included" text extraction API. We support retrieving letters and their size, but each client must write their own word generation logic. We should include a naive default with a pluggable interface. For example:
var document = PdfDocument.Open("somedocument.pdf");
var page = document.GetPage(1);
IEnumerable<Word> words = page.GetWords();
Where the get words method is using an optional parameter:
IEnumerable<Word> GetWords(IWordExtractor extractor = null)
Which if not set uses the internal library implementation. It doesn't need to be great for now but should at least do the obvious things right...
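As a rough illustration of what a naive default might do, here is a minimal sketch that splits letters into words on whitespace or on a large horizontal gap. It assumes the letters arrive in reading order on a single line; SketchGlyph, NaiveWordSketch and the gapFactor threshold are all hypothetical and not the library's implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical letter type; not PdfPig's Letter.
public class SketchGlyph
{
    public string Value { get; }
    public decimal X { get; }      // left edge of the glyph
    public decimal Width { get; }

    public SketchGlyph(string value, decimal x, decimal width)
    {
        Value = value; X = x; Width = width;
    }
}

// Hypothetical extractor; not part of PdfPig.
public static class NaiveWordSketch
{
    // Starts a new word at whitespace, or when the gap to the previous
    // glyph is wider than gapFactor times that glyph's width.
    public static List<string> GetWords(IEnumerable<SketchGlyph> letters, decimal gapFactor = 0.5m)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        SketchGlyph previous = null;

        foreach (var letter in letters)
        {
            bool isSpace = string.IsNullOrWhiteSpace(letter.Value);
            bool bigGap = previous != null
                && letter.X - (previous.X + previous.Width) > previous.Width * gapFactor;

            if ((isSpace || bigGap) && current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }

            if (!isSpace)
            {
                current.Append(letter.Value);
            }

            previous = isSpace ? null : letter;
        }

        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }

        return words;
    }
}
```

A real extractor would also need to handle line changes, rotated text and letters that are not emitted in reading order, which is where a pluggable interface would help.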
Hi,
Can I ask if there are any plans and/or possibility of supporting custom document properties in PDF files as well as the 'known' ones (Author and Keywords and such)?
I haven't used PDFBox, but the documentation for PDDocumentInformation does seem to have functions to access custom properties.
As seen in #47, multiplication operations on TransformationMatrix take over 1/3rd of the total parsing time for PdfPig. We will investigate the optimizations and tradeoffs of using floats instead of decimals, which may result in a large speedup. It's also worth checking the performance impact of storing the values directly (9 internal decimals rather than an array), which may either be slower due to the larger value to copy, or faster due to removing repeated array access.
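To make the "values stored directly" variant concrete, here is a minimal sketch of a 3x3 matrix held in nine fields rather than an array. It uses double purely for illustration; the type name and layout are hypothetical, it is not PdfPig's TransformationMatrix, and it makes no claim about which variant is actually faster:

```csharp
using System;

// Hypothetical matrix type; not PdfPig's TransformationMatrix.
// Row-major layout:
// | A B C |
// | D E F |
// | G H I |
public struct Matrix3Sketch
{
    public readonly double A, B, C, D, E, F, G, H, I;

    public static readonly Matrix3Sketch Identity =
        new Matrix3Sketch(1, 0, 0, 0, 1, 0, 0, 0, 1);

    public Matrix3Sketch(double a, double b, double c,
                         double d, double e, double f,
                         double g, double h, double i)
    {
        A = a; B = b; C = c;
        D = d; E = e; F = f;
        G = g; H = h; I = i;
    }

    // Returns this * m using field access only, with no array indexing.
    public Matrix3Sketch Multiply(Matrix3Sketch m)
    {
        return new Matrix3Sketch(
            A * m.A + B * m.D + C * m.G, A * m.B + B * m.E + C * m.H, A * m.C + B * m.F + C * m.I,
            D * m.A + E * m.D + F * m.G, D * m.B + E * m.E + F * m.H, D * m.C + E * m.F + F * m.I,
            G * m.A + H * m.D + I * m.G, G * m.B + H * m.E + I * m.H, G * m.C + H * m.F + I * m.I);
    }
}
```

Benchmarking this against an array-backed version, for both decimal and floating-point values, would give the comparison described above.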
Hi,
I am trying to retrieve all the paths from this PDF document, but it seems some of them are missing.
When drawing all the bounding boxes found, this is what I get (the PdfPath.BezierCurve are in red, and the PdfPath.Line are in blue):
As you can see, for each of the charts, only one line contains bounding boxes, the others seem to be ignored. Same issue for grid lines: some are drawn and some are not.
Am I doing something wrong, or are they really missing?
Thanks,
The code I used is the following:
using (PdfDocument document = PdfDocument.Open(path))
{
    for (var i = 0; i < document.NumberOfPages; i++)
    {
        var page = document.GetPage(i + 1);
        var paths = page.ExperimentalAccess.Paths;
        using (var bitmap = converter.GetPage(i + 1, zoom))
        using (var graphics = Graphics.FromImage(bitmap))
        {
            var imageHeight = bitmap.Height;
            foreach (var p in paths)
            {
                if (p == null) continue;
                var commands = p.Commands;
                foreach (var command in commands)
                {
                    if (command is PdfPath.Line line)
                    {
                        var bbox = line.GetBoundingRectangle();
                        if (bbox.HasValue)
                        {
                            var rect = new Rectangle(
                                (int)(bbox.Value.Left * (decimal)zoom),
                                imageHeight - (int)(bbox.Value.Top * (decimal)zoom),
                                (int)(bbox.Value.Width == 0 ? 1 : bbox.Value.Width * (decimal)zoom),
                                (int)(bbox.Value.Height == 0 ? 1 : bbox.Value.Height * (decimal)zoom));
                            graphics.DrawRectangle(bluePen, rect);
                        }
                    }
                    else if (command is PdfPath.BezierCurve curve)
                    {
                        var bbox = curve.GetBoundingRectangle();
                        if (bbox.HasValue)
                        {
                            var rect = new Rectangle(
                                (int)(bbox.Value.Left * (decimal)zoom),
                                imageHeight - (int)(bbox.Value.Top * (decimal)zoom),
                                (int)(bbox.Value.Width == 0 ? 1 : bbox.Value.Width * (decimal)zoom),
                                (int)(bbox.Value.Height == 0 ? 1 : bbox.Value.Height * (decimal)zoom));
                            graphics.DrawRectangle(redPen, rect);
                        }
                    }
                }
            }
        }
    }
}
The classes in the UglyToad.PdfPig.Graphics.Operations namespace represent all operations a page's content stream can contain. Finish implementing writing for all of them and make them public. Use a reflection-based test to ensure they can all be written.
Reading the file in this support ticket currently throws. Add the necessary steps to support reading it:
https://issues.apache.org/jira/browse/PDFBOX-4299
Using the attached document and the program below, nothing is written to the console. The sample PDF came from a commercial HTML to PDF library one of our customers uses.
using System;
using UglyToad.PdfPig;
namespace ExtractTest
{
    class Program
    {
        static void Main(string[] args)
        {
            using (PdfDocument document = PdfDocument.Open("Getting_Started.pdf"))
            {
                for (var i = 0; i < document.NumberOfPages; i++)
                {
                    var page = document.GetPage(i + 1);
                    foreach (var letter in page.Letters)
                    {
                        Console.WriteLine(letter.Value);
                    }
                }
            }
        }
    }
}
From the spec:
A font descriptor specifies metrics and other attributes of a simple font or a CIDFont as a whole, as distinct from the metrics of individual glyphs. These font metrics provide information that enables a consumer application to synthesize a substitute font or select a similar font when the font program is unavailable. The font descriptor may also be used to embed the font program in the PDF file.
We have a class to represent the FontDescriptor but it's a bit of a mess. Test it, tidy up its creation and generally improve the public API for this class (it is internal, but aim to make it public for 0.0.2).
Somewhere in the code I added support for reading images from the PDF just as the raw object stream bytes. We won't add much more than this for now but this should be nicely wrapped in an image class with a type enum, size and position on the page if this doesn't require a PNG decoder or something fancy.
Any other easily exposed metadata should be included. Add documents to test this.
Hi,
Unless I am mistaken, there is no support for getting an image's BoundingBox. Is there any plan to add this functionality?
Thanks!
I think I forgot to cache the bounding rectangles for Type 1 CharString decryption and they don't currently interpret calls to the subr (subroutine) command. This should be added and performance profiled.
This code is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig/Fonts/Type1/CharStrings/Type1CharStrings.cs
A PDF document can be created containing only the pages 3, 5 and 7. Test how the current pages API handles this and make any necessary changes to allow consumption of documents which miss pages.
In order for the library to be useful it must support editing as well as creating and reading. For this we need a way to read an existing document and convert it to a document builder.
I really like this library. The letters feature that shows me the glyph rectangle is very helpful. The problem I'm having is that a lot of my PDF documents come encrypted with 256-bit AES. Please support it! Thank you!
The Pig Production Handbook.pdf page 1 has a visual verification test:
As you can see the glyph boxes are way out. Investigate, add a test that asserts the positions of some letters using either xfinium pdf inspector or pdfbox to find the expected positions and fix this.
The test is:
GenerateLetterBoundingBoxImages.PigProductionCompactFontFormat
For attached PDF the charts and text around them are missing. For example the text "Historical Arrears by Month" is not in Letters collection or page extracted text, as well as all the numbers/labels on the charts. The lines (paths collection) are also missing everything around the charts area. Is there possibly a sub-stream, which is not being processed?
First of all, thank you! This library is great! But it seems there are some minor issues :-)
It seems the page width/height is not correctly reported when the page is rotated.
E.g. for the attached document it reports width=612 and height=792, but the page is in landscape. So should these be reversed? Or should there be an orientation flag similar to the PdfBox "page.findRotation()" method?
using (PdfDocument document = PdfDocument.Open(@"letter_size_problem.pdf"))
{
    var page = document.GetPage(1);
    decimal width = page.Width;
    decimal height = page.Height;
}
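Until something like an orientation flag exists, a caller who knows the page's rotation could derive the displayed size themselves. A minimal sketch, assuming the rotation is available as a multiple of 90 degrees; the helper and its parameters are hypothetical and not part of PdfPig:

```csharp
using System;

// Hypothetical helper; not part of PdfPig.
public static class PageSizeSketch
{
    // Swaps width and height when the page is rotated by 90 or 270 degrees.
    public static void GetDisplayedSize(
        decimal width, decimal height, int rotationDegrees,
        out decimal displayedWidth, out decimal displayedHeight)
    {
        // Normalise negative or large rotations into the range 0..359.
        int normalized = ((rotationDegrees % 360) + 360) % 360;
        bool swapped = normalized == 90 || normalized == 270;

        displayedWidth = swapped ? height : width;
        displayedHeight = swapped ? width : height;
    }
}
```

With width=612, height=792 and a rotation of 90, this reports a displayed size of 792 by 612, which matches a landscape page.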
The heights of letters in the document Multiple Page - from Mortality Statistics.pdf are currently wrong because it uses a TrueType font which is not included in the document (ArialMT). In this case the provider is meant to use the files from the host operating system.
Line 78 of TrueTypeFontHandler has a TODO describing what PDFBox does in this situation, we should do the same thing.
In profiling done for #47, SystemFontFinder.GetTrueTypeFontNamed was called 18 times for a total of 4 seconds of the 29 second total.
This code is slow because it has to scan all fonts on the host operating system, but it can be optimized trivially by using a static cache rather than scanning per call. It may also be quicker to use File.ReadAllBytes rather than a FileStream as the input to TrueType parsing.
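A minimal sketch of the static-cache idea, assuming fonts are looked up by name and the expensive step can be passed in as a delegate. The class, method and loader parameter are hypothetical and not PdfPig's SystemFontFinder:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical cache; not part of PdfPig.
public static class FontBytesCacheSketch
{
    // One shared cache for the process, keyed by font name (case-insensitive).
    // Lazy ensures the loader runs at most once per name even under concurrency.
    private static readonly ConcurrentDictionary<string, Lazy<byte[]>> Cache =
        new ConcurrentDictionary<string, Lazy<byte[]>>(StringComparer.OrdinalIgnoreCase);

    // Returns the cached bytes, invoking the loader only on the first request.
    public static byte[] GetOrLoad(string fontName, Func<string, byte[]> loader)
    {
        return Cache.GetOrAdd(fontName, name => new Lazy<byte[]>(() => loader(name))).Value;
    }
}
```

The loader could, for example, resolve the name to a file path and call File.ReadAllBytes on it. Whether the cache should hold raw bytes or the parsed font is a separate design choice.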
Support the inclusion of AcroForm elements when using the document builder.
We read TrueType fonts from PDF files however the current class is a mess. Tidy it up and make it public.
It will be useful for more advanced users to directly access the underlying PDF tokens and objects to work around currently unsupported behaviour.
Suggested API would be something like:
document.ContentExplorer
Which would provide access to the xref table to navigate directly to objects as well as inspecting the tokens forming those objects and being able to decode streams with filters.
To this end the classes in the UglyToad.PdfPig.Tokenization.Tokens namespace should be moved to UglyToad.PdfPig.Tokens, gaps in test coverage fixed and any mutability prevented. A general sanity check before exposing on the public API.
Which license did you choose for your project?
Apache 2, as in the original?
Thank you
:)
Hi,
Any plans or thoughts about adding a direct means of getting the XMP metadata of a document?
Looks like I can get hold of the data by doing something like
doc.Structure.Catalog.CatalogDictionary.TryGet<IndirectReferenceToken>(NameToken.Metadata, out var token);
var objectToken = doc.Structure.GetObject(token.Data);
var streamToken = objectToken.Data as UglyToad.PdfPig.Tokens.StreamToken;
and then parsing streamToken.Data with XmpCore, but it might be useful to be able to get at the data more directly (not sure what the best format to expose it as would be though).
Support retrieval of information from a document's AcroForm including textboxes, select lists, radio buttons, checkboxes and other form elements.
SourceLink enables end users to debug the code for libraries in NuGet packages. We should enable it to make everyone's life easier.
Enable SourceLink and test with another solution to check it works.
It would be helpful to have bool PdfRectangle.IntersectsWith(PdfRectangle other) added.
I often need to extract text from a given location so being able to check the bounding boxes using this would be convenient.
(IntersectsWith is what System.Drawing.Rectangle uses as the name, so suggesting that to be consistent)
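A minimal sketch of what such a check could look like for axis-aligned rectangles. BoxSketch is a hypothetical stand-in, not PdfPig's PdfRectangle, and treating rectangles that only touch at an edge as not intersecting is an assumption made here:

```csharp
using System;

// Hypothetical rectangle type; not PdfPig's PdfRectangle.
public struct BoxSketch
{
    public readonly decimal Left, Bottom, Right, Top;

    public BoxSketch(decimal left, decimal bottom, decimal right, decimal top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // True when the two rectangles overlap on both axes.
    // Strict comparisons mean shared edges alone do not count as overlap.
    public bool IntersectsWith(BoxSketch other)
    {
        return other.Left < Right
            && Left < other.Right
            && other.Bottom < Top
            && Bottom < other.Top;
    }
}
```

To pick out text in a region, a caller would test each letter's bounding box against the region of interest and keep the letters for which the check returns true.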
The major feature of the next release should be the ability to create PDF documents, for now they will only support the addition of plain text.
This is the first ticket to implement enough of the API to create a single page PDF A4 document containing the text "Hello World!" on a single line.
In the visual verification test for the LaTeX integration test document, glyph bounding boxes appear to be roughly the right shape but are in the wrong position and at the wrong scale for Type 1 fonts in PDF documents.
It's possible this is down to not using the right font matrix for Type 1 fonts or something else entirely. Add a test or tests which assert against glyph positions from a 3rd party tool similar to the SinglePageNonLatinAcrobatDistillerTests. You can use https://www.xfiniumpdf.com/xfinium-pdf-downloads.html to get these bounding boxes.
If they prove to be incorrect fix them.
Hi,
I've been doing a few tests with PdfPig 0.0.6, and one of the things I tried was loading the invalid pdf file in corrupt.zip in it, and that seems to result in a StackOverflowException being thrown from mscorlib (via NameTokenizer.TryTokenize I think).
For reference, that file was generated by running SharpFuzz against the PDFClown library (it also fails with a stackoverflow trying to load that file).
Hello,
When I try to open a PDF file and read it, I get an error:
UglyToad.PdfPig.Exceptions.PdfDocumentFormatException : 'Cannot convert array to rectangle, expected 4 values instead got: [ 0, 0 ].'
UglyToad.PdfPig.Exceptions.PdfDocumentFormatException
HResult=0x80131500
Message=Cannot convert array to rectangle, expected 4 values instead got: [ 0, 0 ].
Source=UglyToad.PdfPig
StackTrace:
at UglyToad.PdfPig.Util.ArrayTokenExtensions.ToIntRectangle(ArrayToken array)
at UglyToad.PdfPig.Parser.PageFactory.GetMediaBox(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
at WindowsFormsApp2.Form1.button1_Click(Object sender, EventArgs e) in C:\Users\source\repos\WindowsFormsApp2\WindowsFormsApp2\Form1.cs:line 40
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.DebuggableCallback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr dwComponentID, Int32 reason, Int32 pvLoopData)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.Run(Form mainForm)
at WindowsFormsApp2.Program.Main() in C:\Users\source\repos\WindowsFormsApp2\WindowsFormsApp2\Program.cs:line 19
I hope this helps you fix it.
page.Text did not give text with newlines.
I am using this code with the latest code of PdfPig:
using (PdfDocument doc = PdfDocument.Open(textBox1.Text, new ParsingOptions { Password = textBox2.Text }))
{
    var page = doc.GetPage(1);
    string pagetext = page.Text;
    File.WriteAllText("text.txt", pagetext);
    textBox3.Text = pagetext;
}
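One possible workaround is to rebuild the text from individual letters and insert a line break whenever the baseline moves. A minimal sketch, assuming each letter exposes its text and a baseline Y coordinate and that letters arrive in reading order; BaselineGlyph, LineBreakSketch and the tolerance value are hypothetical and not part of PdfPig:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical letter type; not PdfPig's Letter.
public class BaselineGlyph
{
    public string Value { get; }
    public decimal BaselineY { get; }

    public BaselineGlyph(string value, decimal baselineY)
    {
        Value = value; BaselineY = baselineY;
    }
}

// Hypothetical helper; not part of PdfPig.
public static class LineBreakSketch
{
    // Appends '\n' whenever the baseline changes by more than the tolerance.
    public static string JoinWithLineBreaks(IEnumerable<BaselineGlyph> letters, decimal tolerance = 1m)
    {
        var text = new StringBuilder();
        decimal? previousY = null;

        foreach (var letter in letters)
        {
            if (previousY.HasValue && Math.Abs(letter.BaselineY - previousY.Value) > tolerance)
            {
                text.Append('\n');
            }

            text.Append(letter.Value);
            previousY = letter.BaselineY;
        }

        return text.ToString();
    }
}
```

Superscripts, subscripts and multi-column layouts would confuse a fixed tolerance like this, so it is only a starting point.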
PDF page content streams can contain the gs operator:
Set[s] parameters from graphics state parameter dictionary
This currently has no effect, which can lead to letters being given the wrong size.
Each entry in the parameter dictionary specifies the value of an individual graphics state parameter, as shown in Table 4.8. All entries need not be present for every invocation of the gs operator; the supplied parameter dictionary may include any combination of parameter entries.
Implement support for setting the graphics state from the graphics state parameter dictionary.
Hi there,
I'm trying to extract the text of a PDF generated by Ghostscript. The pdf itself seems fine, I tried to display it with a PDF viewer, which works. Also text extraction with iTextSharp seems to work. However, if I try to read the PDF with PdfPig, then I get the following exception:
PdfDocumentFormatException: Expected name as dictionary key, instead got: Ghostscript
I've looked at the pdf source to look for references to 'Ghostscript' and found the following snippet:
<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(GPL Ghostscript 9.25)
/CreationDate(D:20190813110636Z00'00')
/ModDate(D:20190813110636Z00'00')
/Creator(OpenText Capture Recognition Engine \(RecoStar\) 7.8.0)>>endobj
xref
If I set a breakpoint and inspect the tokens, this indeed seems the place where the exception occurs. It seems that the parser cannot handle this kind of syntax. I must say, I don't have enough knowledge around the PDF format to know if this syntax is allowed, but in any case it exists in the wild with documents generated by RecoStar / Ghostscript.
Do you have any advice?