lucene-ir-engine

lucene-ir-engine is an extremely simple Java application based on Apache Tika and Apache Lucene. It provides the following features:

  • Parsing and extraction of metadata and text content from various documents;
  • Indexing of plain text and metadata in order to create an inverted index related to parsed documents;
  • Performing a simple search by term within a previously created inverted index.

To perform the tasks above, lucene-ir-engine uses two Java libraries:

  • Apache Tika (1.20) provides Java APIs to detect and extract metadata and text from heterogeneous file formats using existing parser libraries.
  • Apache Lucene (7.6.0) is a powerful Java library for indexing and searching text.
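
To make the term "inverted index" concrete, the following is a minimal plain-Java sketch of the indexing and term-search features listed above. It deliberately uses neither Tika nor Lucene, and it is not the code of lucene-ir-engine: the class name ToyInvertedIndex, the tokenization rule, and the sample documents are all assumptions made for illustration. Lucene stores its index in a far more compact on-disk format and applies configurable analyzers.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): maps each term to the documents containing it. */
final class ToyInvertedIndex {

    // term -> sorted set of document names (the "postings" of that term)
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Lowercases the text, splits it on non-alphanumeric characters, and records each token. */
    void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Returns the documents containing the term, or an empty set if there are none. */
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.emptySet() : Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        // In the real tool the text would come from Tika; here it is hard-coded sample data.
        index.add("report.pdf", "Annual budget report");
        index.add("memo.doc", "Budget memo for the annual meeting");
        index.add("notes.txt", "Meeting notes");

        System.out.println(index.search("budget"));   // [memo.doc, report.pdf]
        System.out.println(index.search("meeting"));  // [memo.doc, notes.txt]
        System.out.println(index.search("lucene"));   // []
    }
}
```

A search by term is then a single map lookup, which is why an inverted index answers "which documents contain this word?" without scanning every document.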

lucene-ir-engine is a Maven project organized as follows:

  • lib: This directory includes all the JAR files required at runtime. Currently, it contains only the package lucene-backward-codecs-5.3.0.jar for backwards compatibility.

  • pom.xml: The XML file that contains the project information and the configuration details Maven uses to build the project.

  • src: This directory includes the source files. It also contains the shell scripts that run the utilities provided by lucene-ir-engine. These scripts are located in src/main/bin.

  • README.txt: This README plain-text file.

Getting Started

To build the project, run the following command:

mvn clean install

To run the utilities of lucene-ir-engine, launch the following scripts (in src/main/bin):

  • indexer.sh indexes metadata and text extracted from heterogeneous documents:

./indexer.sh -i /path/to/data_dir -o /path/to/index_dir -l /path/to/log_file -p /path/to/jar [-update] [-fork] [-ocr]

  • searcher.sh runs search queries against previously built Lucene indexes:

./searcher.sh -i /path/to/index_dir -s seed

  • lister.sh extracts the list of keywords from the Lucene indexes:

./lister.sh -i /path/to/index_dir -o /path/to/output_file
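
As a rough illustration of what listing the keywords of an index involves, the sketch below enumerates every distinct term of a small in-memory collection in sorted order, together with the number of documents containing it. It is plain Java and does not read a Lucene index. The class name ToyTermLister, the tab-separated output format, and the document-frequency column are assumptions; the actual output format of lister.sh is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy keyword lister (hypothetical): prints each term with its document frequency. */
final class ToyTermLister {

    /** Returns one "term<TAB>documentFrequency" line per distinct term, in sorted term order. */
    static List<String> listTerms(List<String> documents) {
        Map<String, Integer> docFreq = new TreeMap<>();
        for (String text : documents) {
            // Collect distinct tokens first so a document counts once per term.
            Set<String> distinct = new HashSet<>(
                    Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
            distinct.remove("");
            for (String term : distinct) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<String> lines = new ArrayList<>();
        docFreq.forEach((term, df) -> lines.add(term + "\t" + df));
        return lines;
    }

    public static void main(String[] args) {
        // Hard-coded sample data standing in for indexed documents.
        List<String> docs = Arrays.asList("Budget report", "Budget memo", "Meeting memo memo");
        listTerms(docs).forEach(System.out::println);
    }
}
```

With the sample data, the terms come out as budget, meeting, memo, report, with document frequencies 2, 1, 2, 1; the repeated "memo" in the third document is counted once.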

Equivalent scripts for Microsoft Windows systems are provided in the same directory.

A suitable dataset for testing lucene-ir-engine is govdocs1, provided by [Digital Corpora](http://digitalcorpora.org/corpora/files).

Backwards Compatibility

The last release of lucene-ir-engine relies on Apache Lucene 5.3.0. Lucene 5.x still supports the numerous 4.x index formats, whereas support for 3.x indexes has been removed. Therefore, lucene-ir-engine can run queries against 4.x indexes if the package lucene-backward-codecs-5.3.0.jar is on the classpath. Currently, the script searcher.sh requires that package, which is located in the lib directory.

License

Apache License, version 2.0
