Code Monkey home page Code Monkey logo

cf2's Introduction

Counting Filter on q-grams using SSE instructions (CF2)

About

All CF2 variants were designed for research purpose [SG16].

Paper is available at ResearchGate

There are a couple of variants of the solution:

  • cf_gfg - applied only for the matching with translocations and inversions problem [GFG11],
  • cf1 - counting filter, a standard implementation [JTU96, N97],
  • cf1_gram - the above with q-gram superalphabet,
  • cf2 - our naive approach,
  • cf2_sse - with SSE instructions,
  • cf2_sse_gram - with SSE instructions and q-gram superalphabet,
  • cf2_skip - cf2 that skips unnecessary steps,
  • cf2_skip_sse - the above with SSE instructions,
  • cf2_skip_sse_gram - the above with q-gram superalphabet.

This source code was written for research purpose and has a minimal error checking. The code may be not be very readable and comments may not be adequate. There is no warranty, your use of this code is at your own risk.

Citing

@incollection {SG16,
  author    = {R. Susik and {Sz.} Grabowski},
  title     = {Engineering the Counting Filter for String Matching Algorithms},
  booktitle = {Algorithms, Networking and Sensing for Data Processing, Mobile Computing and Applications},
  publisher = {\L{}\'od\'z University of Technology Press},
  year      = {2016},
  pages     = {125--140},
  isbn      = {978-83-7283-7},
}

Requirements

  • C++ Compiler - MS Visual Studio
  • CPU with SSE instructions support
  • Python (for testing)

Compilation

To compile all variants of algorithm simply run:

git clone https://github.com/rsusik/cf2 cf2
cd cf2
nmake all

Usage

The make all compilation generates multiple variants of the algorithm. All of them can be executed from command line as follows:

ALGORITHM_NAME PATTERNS_FILE PATERN_SIZE TEXT_FILE

where:

  • PATTERNS_FILE - location of file with patterns
  • PATERN_SIZE - pattern size (e.g. 32)
  • TEXT_FILE - text file

There are also additional parameters (Q_GRAM size and NEW_SIGMA) defined in cpp files. It is not recomended to change these parameters.

For more details please refer to mentioned paper.

Example:

cf2_skip_k1.exe "D:/patterns.r10/patterns.dna.200MB.32.bin" 8 "D:/sets/dna.200MB"

The files with patterns has all the patterns written one by one with no separator. Algorithm reads m fist bytes as the first pattern, then next m bytes as the second and so on.

Testing

For testing purpose test_cf_win.py can be used.

Example:

python test_cf_win.py cf2_k2.exe,cf2_simd_k2.exe dna.200MB 10 8,12,16,24,32,64 

Please replace the paths in the test_cf_win.py.

References

[SG16] R. Susik and Sz. Grabowski. Engineering the Counting Filter for String Matching Algorithms, pages 125–140. Lodz University of Technology Press, 2016.

[JTU96] P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms, pages 1439–1458. Software Practice & Experience, 1996.

[N97] G. Navarro. Multiple approximate string matching by counting, pages 125–139. In Proc. 4th South American Workshop on String Processing (WSP’97), 1997.

[GFG11] Sz. Grabowski, S. Faro, E. Giaquinta. String Matching with Inversions and Translocations in Linear Average Time (Most of the Time), pages 516–520. Information Processing Letters, 2011.

Authors

  • Robert Susik
  • Szymon Grabowski

cf2's People

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.