
apachetika's Introduction

Apache Tika

Software used to build the project

Apache Maven 3.5.0

Java 1.8

Eclipse Neon

Windows 10 64 bit

Jars Used

Apache Tika 1.15

What is in this repository?

This repository has the source code for a simple program that extracts text from an input PDF.

For this program, the input PDF should contain text, not images.

A runnable jar that produces the output is available in this repository. You can see the output by running this jar on your machine.

How can I see the output of this project?

You need to execute the runnable jar from the command prompt.

Make sure Java is on your PATH environment variable.

Go into the Runnable Jar folder in the Git repository to find the jar file.

The jar file is 57 MB because it bundles the libraries that come with Apache Tika. Apache Tika covers many use cases, and the use case covered here does not need all of those libraries. Because I used Maven, all the dependencies were downloaded automatically.

The input for this project is the absolute path of the PDF file.

Sample command to run the jar

java -jar insured-extractdata.jar "c:/file.pdf"
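
Since the jar expects exactly one argument, the absolute path of a PDF, a check along the following lines could run before extraction starts. This is only a sketch using the standard library: the class name PdfArgumentCheck, its validate method, and the error messages are hypothetical and are not taken from this repository's source.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of the repository: checks the single
// command-line argument before any extraction is attempted.
final class PdfArgumentCheck {

    // Returns an error message, or Optional.empty() when the argument looks usable.
    static Optional<String> validate(String[] args) {
        if (args.length != 1) {
            return Optional.of("Usage: java -jar insured-extractdata.jar <absolute path to pdf>");
        }
        Path pdf;
        try {
            pdf = Paths.get(args[0]);
        } catch (InvalidPathException e) {
            return Optional.of("Not a valid path: " + args[0]);
        }
        if (!pdf.isAbsolute()) {
            return Optional.of("Path must be absolute: " + pdf);
        }
        Path name = pdf.getFileName(); // null for a bare root such as "c:/"
        if (name == null || !name.toString().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return Optional.of("Not a .pdf file: " + pdf);
        }
        if (!Files.isRegularFile(pdf)) {
            return Optional.of("File not found: " + pdf);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> error = validate(args);
        if (error.isPresent()) {
            System.err.println(error.get());
            System.exit(1);
        }
        System.out.println("Input looks valid: " + args[0]);
    }
}
```

With a check like this, a relative path such as "file.pdf" is rejected with a message instead of failing later during extraction.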

If you face any issues running this, kindly let me know at [email protected]

What is the output of this project?

Two files will be generated in the directory where you executed the command.

1. contents.txt - contains the text extracted from the PDF. The file will be empty if the input PDF contains images instead of text. The output in this file is unstructured.
2. metadata.txt - contains the metadata of the PDF.
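
The step of writing these two files can be sketched with the standard library alone. The code below is a minimal illustration, not the repository's actual implementation: the class name ExtractionOutput, the write method, and the "name: value" line format for metadata.txt are assumptions, and the extracted text and metadata are passed in as plain Java values rather than taken from Apache Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository: writes the two output files.
final class ExtractionOutput {

    // Writes contents.txt and metadata.txt into the given directory.
    static void write(Path dir, String text, Map<String, String> metadata) throws IOException {
        // contents.txt holds the raw, unstructured text (empty if nothing was extracted).
        Files.write(dir.resolve("contents.txt"), text.getBytes(StandardCharsets.UTF_8));

        // metadata.txt holds one "name: value" line per entry (assumed format).
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            lines.add(entry.getKey() + ": " + entry.getValue());
        }
        Files.write(dir.resolve("metadata.txt"), lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // The empty path resolves to the directory the command was run from.
        Path currentDir = Paths.get("").toAbsolutePath();

        // Placeholder values standing in for real extraction results.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        write(currentDir, "Sample extracted text", metadata);
    }
}
```

Keeping the writing step separate from the extraction step makes the empty-file case straightforward: an image-only PDF simply yields an empty string, and contents.txt is still created.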

About Apache Tika

Apache Tika is used for extracting data from PDFs, images, and audio files, among other formats. It covers a vast range of use cases and, from reading some of its documentation, I understand it is quite powerful.

The extracted data is unstructured and can be converted to a string.

I referred to this article to build this project.

Apache Tika also has a GUI version of the jar, in which we can drag and drop files and view the extracted content. It is useful for a basic understanding of how Apache Tika works. The GUI version can be downloaded from the official website.

apachetika's People

Contributors

divine1
