microsearch

A small search library.

Primarily intended to be a learning tool to teach the fundamentals of search.

Useful for embedding into Python apps where you don't want/need something as complex as Lucene.

Part of my (upcoming) 2012 PyCon talk - https://us.pycon.org/2012/schedule/presentation/66/

Requirements

  • Python 2.5+ or Python 3.2+
  • (Optional) simplejson
  • (Optional) unittest2 (Python 2.5 - for running the tests)

Usage

Example:

import microsearch

# Create an instance, pointing it to where the data should be stored.
ms = microsearch.Microsearch('/tmp/microsearch')

# Index some data.
ms.index('email_1', {'text': "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh"})
ms.index('email_2', {'text': 'Everyone,\n\nM-m-m-m-my red stapler has gone missing. H-h-has a-an-anyone seen it?\n\nMilton'})
ms.index('email_3', {'text': "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh"})
ms.index('email_4', {'text': 'How do you feel about becoming Management?\n\nThe Bobs'})

# Search on it.
ms.search('Peter')
ms.search('tps report')

Shortcomings

This library is meant to help others learn. While it has full test coverage, it may not be suitable for production use. Reasons you may not want to use it in Real Code(tm):

  • No concurrency support
    • Tries to work atomically with files
    • But there are no locks
    • So it's possible for writes to overlap between processes
  • Maybe thread-safe?
    • Pretty much everything is on an instance
    • But I haven't tested it extensively with threading
  • No support for deleting documents
    • If an existing document changes or gets deleted, stale data will be left in the index
    • A workaround would be blowing away the index directory, moving the docs out and reindexing them :/
  • Only n-grams are supported
    • Because writing a full Porter or Snowball stemmer is beyond the needs of this library
  • No clue on performance at scale
    • This is a proof-of-concept & learning tool, not Lucene!
    • With a 2011 MBP on the first 1.2K docs of the Enron corpus:
      • Indexing is pretty slow at ~1 document per second
      • Search is pretty fast at ~0.007 sec per query
      • RAM never exceeded 15 MB when indexing, 10 MB when searching
      • The benchmark script is in the source repo as enron_bench.py.
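
To make the "only n-grams" point concrete, here is a minimal sketch of edge n-gram tokenization. It is an illustration of the general technique, not necessarily how microsearch does it internally: the function names (edge_ngrams, tokenize), the gram lengths (3 to 6) and the punctuation handling are all assumptions made for this example.

```python
def edge_ngrams(token, min_gram=3, max_gram=6):
    """Return the leading substrings of a token, from min_gram to max_gram chars.

    Hypothetical helper: the default lengths are illustrative assumptions,
    not values taken from microsearch.
    """
    token = token.lower()
    longest = min(max_gram, len(token))
    # Tokens shorter than min_gram produce an empty list.
    return [token[:size] for size in range(min_gram, longest + 1)]


def tokenize(text, min_gram=3, max_gram=6):
    """Split on whitespace, trim surrounding punctuation, expand into n-grams."""
    grams = []
    for word in text.split():
        word = word.strip('.,!?;:"\'()')
        grams.extend(edge_ngrams(word, min_gram, max_gram))
    return grams


print(edge_ngrams('report'))
print(edge_ngrams('reports'))
print(tokenize('TPS reports!'))
```

With these settings, 'report' and 'reports' expand to the same grams, which is how an n-gram index can match singular and plural forms without a stemmer. The cost is a larger index, since every word is stored several times over.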

Running Tests

With a source checkout, run:

In Python 2:

python -m unittest2 tests

In Python 3:

python -m unittest tests

Tests should be passing at all times under both Python 2.7 & Python 3.2.

Contributions

If you wish to contribute to improving microsearch, the code you submit must:

  • Be your own work & BSD-licensed
  • Include a working fix/feature
  • Follow the existing style of the codebase
  • Include passing test coverage of the new code
  • Include documentation, if it's user-facing

Other submissions are welcome, but won't get merged until all of these requirements are met.

author: Daniel Lindsley <[email protected]>
date: 2011/02/22
