Code Monkey home page Code Monkey logo

grammar_improved_index's Introduction

Improved Grammar Index

This project is the implementation of a grammar-based index proposed in [1]. Self-indexes are Compact Data Structures that represent the text in a compressed way and allows us to count the occurrences of a pattern (count), return the positions of the original text where these occurrences appear (locate) and extract any substring of the text (display). This index uses O(G lg n) bits and can find the occ occurrences of patterns P[1..m] in time O((m^2 + occ)lg G), where G is the size of the grammar.

[1] Claude, F., Navarro, G., & Pacheco, A. (2020). Grammar-Compressed Indexes with Logarithmic Search Time. https://arxiv.org/pdf/2004.01032.pdf

Compilation ๐Ÿš€

First download the project

git clone https://github.com/apachecom/grammar_improved_index.git

In order to compile the program, it is necessary to download, compile and install the Succinct Data Structure Library (SDSL).

git clone https://github.com/simongog/sdsl-lite
cd sdsl-lite
./install-sh 

You also must replace some original files in the library with the files in the folder sdsl-files.

In order to run benchmark you will need some external libraries as gflags (https://github.com/gflags/gflags.git) and libcds (https://github.com/fclaude/libcds.git)

Now, it is posible to compile the binaries

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=RELEASE 
make 

In order to test the algorithm, you can use the collections located in the repository http://pizzachili.dcc.uchile.cl/repcorpus.html

wget  http://pizzachili.dcc.uchile.cl/repcorpus/real/einstein.en.txt.gz
gunzip einstein.en.txt.gz

You must process collections in order to remove forbidden characters.

./process_coll <input_file> <output_file>

Build the index

To build the index, you must run

./build_indices <collection_file> <index_out_file> 

Since we use google-benchmark, you can generate CSV files with the results

./build_indices <collection_file> <index_out_file> <mem_out_file> --benchmark_out=<results_file> --benchmark_out_format=csv

Display operation

To extract a substring, you must execute:

./display_indices <index_file> <positions_file> min_len max_len gap --benchmark_out=<output_result_file>  --benchmark_out_format=csv

where:

  • "index_file" is the path of the index file without the extension (Note that ./build_indices generates several versions of our index);

  • "positions_file" is a file with the positions to extract separated by a line jump;

  • "min" is the minimum length to extract;

  • "max" is the maximun length to extract;

  • "gap" is the gap betwen lenght;

example:

./display_indices f1 ptos 10 100 10 --benchmark_out=out.csv  --benchmark_out_format=csv

This will perform an experiment which we will measure the extraction procedure for lengths (10,20,30,...,100)

Locate operation

To locate a pattern, you must execute:

./locate_indices <index_file> <pattern_file> min_len max_len gap --benchmark_out=<output_result_file>  --benchmark_out_format=csv

where:

  • "pattern_file" is a file with the patterns to locate NOT separated by a line jump; The patterns must have length = max_len

The rest of the parameters are the same as for the extraction

grammar_improved_index's People

Contributors

apachecom avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.