Apache Maven 3.5.0
Java 1.8
Eclipse Neon
Windows 10 64 bit
Apache Tika 1.15
This repository contains the source code for a simple program that extracts text from an input PDF.
For this program, the input PDF should contain text rather than images.
A runnable jar that produces the output is available in this repository. You can see the output by running this jar on your machine.
You need to execute the runnable jar from the command prompt. Make sure you have java in your PATH variable (an environment variable).
Go into the Runnable Jar folder in the git repository to find the jar file.
The file is 57 MB because of the libraries bundled with Apache Tika. Apache Tika covers many use cases, but not all of those libraries are necessary for the use case covered here. Because I used Maven, all the dependencies are downloaded automatically.
Input
The input for this project is the absolute path of the PDF file.
java -jar insured-extractdata.jar "c:/file.pdf"
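Since the program expects a single absolute path, a check along the following lines could reject bad input early. This is only a sketch: the class name InputPathCheck and the method requireAbsolutePdf are hypothetical and are not taken from this repository's source.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class InputPathCheck {

    // Hypothetical helper: returns the path given on the command line, or
    // throws if the argument is missing, relative, or not a readable file.
    public static Path requireAbsolutePdf(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException(
                    "Usage: java -jar insured-extractdata.jar <absolute path to pdf>");
        }
        Path pdf = Paths.get(args[0]);
        if (!pdf.isAbsolute()) {
            throw new IllegalArgumentException("Path must be absolute: " + pdf);
        }
        if (!Files.isRegularFile(pdf) || !Files.isReadable(pdf)) {
            throw new IllegalArgumentException("Not a readable file: " + pdf);
        }
        return pdf;
    }

    public static void main(String[] args) {
        try {
            System.out.println("Input PDF: " + requireAbsolutePdf(args));
        } catch (IllegalArgumentException e) {
            // Print the reason and exit with a non-zero status.
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }
}
```

The check only looks at the path itself; it does not verify that the file is really a PDF.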
If you face any issues running this, kindly let me know at [email protected]
Output
Two files will be generated in the directory where you executed the command.
1. contents.txt - contains the text extracted from the PDF. The file will be empty if the input PDF contains images instead of text. The output in this file is unstructured.
2. metadata.txt - contains the metadata of the PDF.
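Writing the two files could look roughly like the sketch below. The class name OutputWriter, the method writeOutputs, and the "name: value" layout of metadata.txt are assumptions made for illustration; the actual program may format its output differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutputWriter {

    // Hypothetical helper: writes contents.txt and metadata.txt into dir.
    public static void writeOutputs(Path dir, String text, Map<String, String> metadata)
            throws IOException {
        // Extracted text goes into contents.txt as-is (empty if nothing was extracted).
        Files.write(dir.resolve("contents.txt"), text.getBytes(StandardCharsets.UTF_8));

        // One "name: value" line per metadata entry (assumed layout).
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            lines.add(entry.getKey() + ": " + entry.getValue());
        }
        Files.write(dir.resolve("metadata.txt"), lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf"); // placeholder value
        // An empty relative path resolves to the current working directory.
        Path currentDir = Paths.get("").toAbsolutePath();
        writeOutputs(currentDir, "placeholder text", metadata);
    }
}
```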
Apache Tika is used for extracting data from PDFs, images, and audio files. The use cases covered by Apache Tika are vast, and from reading some of its documentation I understand it is quite powerful.
The extracted data is unstructured and can be converted to a string.
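For illustration, extracting text and metadata with Tika's parser API might look like the sketch below. It assumes the tika-parsers dependency (1.15) is on the classpath; the class name TikaExtractSketch and the method extractText are hypothetical and do not necessarily match the code in this repository.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaExtractSketch {

    // Hypothetical helper: parses the stream, fills the given Metadata object,
    // and returns the extracted body text as a plain string.
    public static String extractText(InputStream in, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default write limit on the amount of extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in, metadata));
        }
        // Print every metadata field the parser reported.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

AutoDetectParser picks a parser based on the detected file type, so the same sketch would also accept formats other than PDF.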
I referred to this article to build this project.
Apache Tika also has a GUI version of its jar, in which we can drag and drop files and view the extracted content. It is useful for a basic understanding of how Apache Tika works. The GUI version can be downloaded from the official website.