friesey / pdfeventprep

Prep for the OPF PDF Hackathon

License: GNU Affero General Public License v3.0

Java 100.00%

pdfeventprep's Introduction

PdfEventPrep

Preparation for the OPF PDF Hackathon, which takes place on 1 and 2 September in Hamburg. These snippets and tools deal with several PDF issues and might come in handy during the Hackathon.

jobs

simple analysis (is there a PDF header, which kind of PDF - PDF or PDF/A, which version, which file size)
more detailed analysis (encryption)
validation test (for PDF/A, if it is actually a PDF/A)
repair function (quite simple)
quality analysis after Migration

PDF Analysis

PdfHeaderChecker

Tests if the file starts with "%PDF". This tool works through a selected folder and any sub-folders. To avoid crashes, there are some other tests, e.g. for the file extension, for encryption, and for whether the file is a PDF/A. For more information see "documentation.md".
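
The header test itself can be sketched in a few lines. This is only a minimal illustration, not the code of PdfHeaderChecker: the class name `PdfHeaderSketch` and the method `startsWithPdfHeader` are made up for this example, and the additional checks (extension, encryption, PDF/A) are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of the repository.
final class PdfHeaderSketch {

    // The four bytes a PDF file is expected to start with.
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the first four bytes of the file are "%PDF". */
    static boolean startsWithPdfHeader(Path file) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    return false; // file is shorter than the header
                }
                read += n;
            }
        }
        return Arrays.equals(head, PDF_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + startsWithPdfHeader(Paths.get(arg)));
        }
    }
}
```

Only four bytes are read, so the check stays cheap even for very large or badly broken files.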

CreationSoftwareDetective

This tool detects which software was used to create a PDF and writes all creation software found to an "outpufile.txt" in the folder that was examined. It does not yet count how often each software was used, but this is planned.

Furthermore, it detects encrypted PDF files and can deal with some really broken PDF files. However, some PDF files still crash the program; a fix is planned (this is already fixed in "PdfHeaderChecker", which should no longer crash at all).

It would be handy to reuse some of the functions of this program during the Hackathon: some files (not only PDF files) can stop a program, and no time should be wasted during the Hackathon on handling all these exceptions before any work gets done.

The AGPL version of the iText library is used, which has to be considered when re-using this tool or snippets from it. The PDFBox library is used, too. This tool works through a selected folder and any sub-folders.
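
The planned counting of how often each creation software occurs could look roughly like the sketch below. The class `ProducerCountSketch` and its `count` method are hypothetical, and the sample names are invented; in the tool the names would come from the examined PDF files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the repository.
final class ProducerCountSketch {

    /** Counts how often each creation software name occurs; TreeMap keeps the output sorted. */
    static Map<String, Integer> count(List<String> producers) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String producer : producers) {
            Integer seen = counts.get(producer);
            counts.put(producer, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample values only, invented for this sketch.
        List<String> found = Arrays.asList("Acrobat Distiller", "iText", "Acrobat Distiller");
        for (Map.Entry<String, Integer> e : count(found).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```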

Jhove Statistics

Simple analysis of wordy JHOVE finding files.

PDF Validator Tool(s)

PdfAValidator

Checks via PDFBox whether a PDF/A is valid. Runs through a folder and picks out only the PDF/A files.

Migration Tool(s)

iTextRepairPdf

Takes a PDF file and copies the content page by page to a new, PDF/A-1-conformant PDF file. The XMP metadata is also copied.

This repairs possible issues with the structure of the PDF. JHOVE will consider the newly built PDF files well-formed and valid. PDFTron, however, will still detect problems with the newly built PDF files, mostly concerning fonts and images.

The AGPL version of the iText library is used, which has to be considered when re-using this tool or snippets from it.

This tool works through a selected folder and any sub-folders.

PdfToImageConverter

Converts the PDF files in a certain folder to JPEGs page by page. This is a prerequisite for later quality checking / visual comparison via e.g. matchbox or ImageMagick.

Quality Checking after Migrations

PdfTwinTest

First, two files are chosen. The program makes sure that both are PDF files that can be examined (the too-broken or too-big issue is avoided). The tool then compares the two PDFs line by line and reports the differences. This is handy for quality checking after migration. Usually, the PDF files created with the "iTextRepairPdf" tool do not show any differences.

Helping tools within the programs

PdfUtilities.java

The class contains commonly used methods and one commonly used BufferedReader. It will be extended to be more efficient.

Reusing external libraries

Third-party libraries and tools used:

Apache PDFBox
iText - note that this library is AGPL3 licensed

pdfeventprep's People

Forkers

skrug bitzl carlwilson

pdfeventprep's Issues

Overload method/function "FileHeaderTest" with Data Type file and String

It should be possible to call the PdfUtilities.FileHeaderTest function with both files and Strings. Right now it only works with the data type File.
The function should be overloaded.
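
A possible shape for the overload, as a sketch only: the real signature and behaviour of PdfUtilities.FileHeaderTest are not shown here, so the method name `fileHeaderTest`, the boolean return type, and the "%PDF" comparison are assumptions. The point is that the String variant only converts and delegates, so the logic lives in one place.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the method in PdfUtilities.
final class HeaderTestOverloadSketch {

    /** Variant taking a File: does the actual work. */
    static boolean fileHeaderTest(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            // Compare the first four bytes one by one against "%PDF".
            return in.read() == '%' && in.read() == 'P' && in.read() == 'D' && in.read() == 'F';
        }
    }

    /** Overload taking a path as String: only converts and delegates. */
    static boolean fileHeaderTest(String path) throws IOException {
        return fileHeaderTest(new File(path));
    }
}
```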

Set DEFAULT_MAX_FILE_LENGTH

1024 * 1024 * 16 is too big. I have a file in my stack which has 16.058.818 bytes; it is below that limit but still too big and causes out-of-memory problems.
I will have to test with slightly smaller PDF files to determine exactly which file sizes do not crash the program.

Program crashes due to rights issues

So far, the program (at least the Part "PdfHeaderChecker", others will follow this example) can deal with all strange files on my laptop (and there are some really strange ones).

However, it crashes if a folder is chosen which cannot be accessed due to rights issues (e.g. "C:/"). It should be taught to skip these folders: check whether it has access rights, print a short message if not, and then continue with the folders it is allowed to access.
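
One way to get this behaviour, sketched under the assumption that the folder traversal can be done with `java.nio.file.Files.walkFileTree` (available since Java 7): `visitFileFailed` is invoked for entries that cannot be visited, such as a directory that cannot be opened, and returning `CONTINUE` keeps the walk going. The class name `FolderWalkSketch` and the method `collectFiles` are made up for this example.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the repository.
final class FolderWalkSketch {

    /** Collects all regular files below root, skipping anything that cannot be accessed. */
    static List<Path> collectFiles(Path root) throws IOException {
        final List<Path> files = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    files.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException e) {
                // Called e.g. for folders without read permission: report and move on.
                System.err.println("Skipping (no access): " + file);
                return FileVisitResult.CONTINUE;
            }
        });
        return files;
    }
}
```

The default implementation of `visitFileFailed` rethrows the exception, which would abort the whole walk; overriding it is what turns an inaccessible folder into a short message.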

The push back buffer has to be increased

Some large files create problems, some of which could be solved. There is at least one left that does not crash the program but throws an error message like this:

"org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 123575 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize54" + file path of the file that causes this error.

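The message itself names a system property, so a first attempt could be to set it before PDFBox parses any file. The sketch below assumes the property name is org.apache.pdfbox.baseParser.pushBackSize, i.e. the name from the message without the trailing "54", and that the value is a size in bytes; the class `PushBackSizeSketch` and the chosen value are illustrative only.

```java
// Hypothetical helper, not part of the repository.
final class PushBackSizeSketch {

    // Property name as quoted in the error message (trailing "54" assumed not to belong to it).
    static final String PUSHBACK_PROPERTY = "org.apache.pdfbox.baseParser.pushBackSize";

    /** Sets the push back buffer size in bytes unless it was already set, e.g. via -D. */
    static void ensurePushBackSize(int bytes) {
        if (System.getProperty(PUSHBACK_PROPERTY) == null) {
            System.setProperty(PUSHBACK_PROPERTY, Integer.toString(bytes));
        }
    }

    public static void main(String[] args) {
        // 123575 bytes were needed in the reported case; 256 KiB is an arbitrary value above that.
        ensurePushBackSize(256 * 1024);
        System.out.println(System.getProperty(PUSHBACK_PROPERTY));
    }
}
```

The same property can be passed on the command line with `-D` instead of being set in code. Whether a larger buffer is enough for every problematic file would still have to be tested.
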
PdfReader creates OutOfMemoryError

PdfCreationSoftwareDetective
The instance of the PdfReader leads to a crash of the program because of "GC overhead limit exceeded". I would guess this is due to a PDF from the "difficult PDF files" folder, in which I have put all the worst PDF files I could get hold of, some of them really big and/or really broken.
Will investigate further soon.

PdfTwinTest: if one line is displaced, all the others are wrong

It would be nice if the program noticed when one line is displaced, which will inevitably lead to all following lines being reported as different.
Example:
Line 1 of the original is similar to Line 2 of the migrated file, because the migrated file has, for some reason, one empty line added before the actual beginning. This leads to as many differing lines as the PDF has lines.
This is a bit of a hassle to teach the program but should be possible. E.g. if a line differs, it could be compared to the next line and, if that one is similar, the comparison continues from that point on.
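
The idea from the last paragraph can be sketched as a comparison with a one-line lookahead. This is not the PdfTwinTest code: the class `LineResyncSketch`, the method `diff`, and the report texts are invented, and "similar" is simplified to exact equality of the lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the repository.
final class LineResyncSketch {

    /**
     * Compares two line lists. On a mismatch it checks whether the current original line
     * equals the next migrated line (or the other way round) and, if so, skips the extra
     * line instead of reporting every following line as different.
     */
    static List<String> diff(List<String> original, List<String> migrated) {
        List<String> report = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < original.size() && j < migrated.size()) {
            if (original.get(i).equals(migrated.get(j))) {
                i++;
                j++;
            } else if (j + 1 < migrated.size() && original.get(i).equals(migrated.get(j + 1))) {
                report.add("extra line in migrated file at " + (j + 1));
                j++; // resynchronise: skip the inserted line
            } else if (i + 1 < original.size() && original.get(i + 1).equals(migrated.get(j))) {
                report.add("line missing in migrated file, original line " + (i + 1));
                i++; // resynchronise: skip the dropped line
            } else {
                report.add("lines differ: original " + (i + 1) + ", migrated " + (j + 1));
                i++;
                j++;
            }
        }
        // Whatever is left over on either side has no counterpart.
        for (; i < original.size(); i++) {
            report.add("line missing in migrated file, original line " + (i + 1));
        }
        for (; j < migrated.size(); j++) {
            report.add("extra line in migrated file at " + (j + 1));
        }
        return report;
    }
}
```

A lookahead of one line only covers the case described above. Larger displacements would need a real diff algorithm, for example one based on the longest common subsequence.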

PdfUtilities file size check fails if file name passed as string.

The method

public static boolean checkPdfSize(String file) {
    // BUG: String.length() is the number of characters in the name, not the file size
    long filesize = file.length();
    if (filesize > 16000000) {
        System.out.println("File is bigger than 16 MB and therefore cannot be measured");
        return true;
    } else {
        return false;
    }
}

actually measures the length of the file name, NOT the length of the file.
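
A possible fix, sketched with the same limit and message as the method above: the String is turned into a `java.io.File` first, whose `length()` returns the size in bytes (0 if the file does not exist). The class name `PdfSizeCheckSketch` and the constant `MAX_BYTES` are made up for this example.

```java
import java.io.File;

// Hypothetical stand-in for the method in PdfUtilities.
final class PdfSizeCheckSketch {

    // Limit from the method above: 16 MB counted as 16,000,000 bytes.
    static final long MAX_BYTES = 16000000L;

    /** Path variant: converts to a File so that the file size is measured, not the name. */
    static boolean checkPdfSize(String path) {
        return checkPdfSize(new File(path));
    }

    /** Returns true if the file is larger than MAX_BYTES. */
    static boolean checkPdfSize(File file) {
        long filesize = file.length(); // size in bytes, 0 for a missing file
        if (filesize > MAX_BYTES) {
            System.out.println("File is bigger than 16 MB and therefore cannot be measured");
            return true;
        }
        return false;
    }
}
```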

Maven build does not specify required Java version.

The pdf-tools project requires Java 7: the Travis build fails with openJDK6, while the corresponding openJDK build passes. The results of both builds can be seen here.

It's possible to specify the required Java version in the Maven POM file as shown here. These changes should initially go into https://github.com/friesey/PdfEventPrep/blob/mavenise/pdf-tools/pom.xml.

XMLStreamWriter xmlfile puts all the output in one line

PdfHeaderChecker:
XMLStreamWriter xmlfile puts all the output in one line
Help online is hard to find because XML does not require any particular formatting. I'd guess it does not really hurt when working with the XML output afterwards (the XSLT will fix this), but it is not handy for test purposes, and it is not how an XML file should look.
This is stupid. I want to fix that.
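
One approach that stays within the JDK, given as a sketch rather than the fix for PdfHeaderChecker: XMLStreamWriter writes exactly what it is told and adds no line breaks, so the finished output can be re-serialised through an identity `javax.xml.transform.Transformer` with indentation switched on. The class `XmlIndentSketch` and the element names are invented; the indent-amount key is specific to the Xalan-based transformer built into the JDK, and other implementations may ignore it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the repository.
final class XmlIndentSketch {

    /** Writes a tiny document with XMLStreamWriter; no line breaks are added between elements. */
    static String writeSingleLine() throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("files"); // element names are made up for this sketch
        xml.writeStartElement("file");
        xml.writeCharacters("example.pdf");
        xml.writeEndElement();
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    /** Re-serialises XML text through an identity Transformer with indentation switched on. */
    static String indent(String xmlText) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        // Implementation-specific key; assumed to be honoured by the JDK's built-in transformer.
        t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xmlText)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indent(writeSingleLine()));
    }
}
```

An alternative would be to write line breaks and indentation by hand with `writeCharacters` between the elements, which avoids the second pass but clutters the writing code.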

The log4j system has to be initialized properly.

The PdfHeaderChecker program throws (usually once) the following error message:
log4j:WARN Please initialize the log4j system properly.
This does not affect the performance or the running of the program, but some Google research suggests that something is wrong and should be fixed.