
search-tfidf-word2vec-poc's Introduction

Searching documents with TF-IDF

A demo with 500k+ documents should be running at search.lookies.io. Preview on test data.

Setup

# sudo apt-get install python3.4-dev
virtualenv -p /usr/local/bin/python3 py3env # see: which python3
source py3env/bin/activate
pip install Flask pymongo
pip install nltk    # for the stemmer (todo)

pip install gensim  # for word2vec # cython numpy word2vec
wget https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz
# mirror of: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

Run

python app.py
# listening on localhost:6001

Tests

green

I clearly need to transition from build then test/maintain to test then build/maintain...

  • The tokenizer is tested.
  • The rest still not so much...

Organisation and classes

Term-document data structure

  • All the logic is in the class term_document_matrix_abstract in term_document.py
  • Low-level storage details are declared as abstract methods and left to subclasses to implement.
  • An implementation using a dict-of-dicts is available.
  • A sparse matrix could be useful as well, and interesting for a comparison.
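
To make the split between shared logic and storage concrete, here is a minimal sketch of the idea. It is not the code from term_document.py: the class names, the method names (set_count, postings) and the (doc_id, tokens) input format are assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod
from collections import Counter, defaultdict


class TermDocumentMatrix(ABC):
    """Shared indexing logic; storage is left to subclasses (hypothetical names)."""

    def __init__(self, documents):
        # documents: any iterable of (doc_id, list_of_tokens) pairs (assumed format)
        self.n_documents = 0
        for doc_id, tokens in documents:
            self.n_documents += 1
            for term, count in Counter(tokens).items():
                self.set_count(term, doc_id, count)

    @abstractmethod
    def set_count(self, term, doc_id, count):
        """Store how many times `term` occurs in document `doc_id`."""

    @abstractmethod
    def postings(self, term):
        """Return a {doc_id: count} mapping for `term`."""


class DictTermDocumentMatrix(TermDocumentMatrix):
    """dict-of-dicts storage: term -> {doc_id -> count}."""

    def __init__(self, documents):
        self._counts = defaultdict(dict)  # must exist before indexing starts
        super().__init__(documents)

    def set_count(self, term, doc_id, count):
        self._counts[term][doc_id] = count

    def postings(self, term):
        return self._counts.get(term, {})


matrix = DictTermDocumentMatrix([
    ("d1", ["red", "apple", "red"]),
    ("d2", ["green", "apple"]),
])
print(matrix.postings("apple"))
```

A sparse-matrix variant would only need to provide the same two storage methods, which is what would make a side-by-side comparison straightforward.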

Tokenizer

  • A tokenizer is built by composing string transforms and token filters.
  • Stemming, lowercasing and a few others are shown as examples in tokenizers.py.
  • Tokenization is available through:
tokens = my_tokenizer.tokenize(string)
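
As an illustration of how transforms and filters might compose, here is a small sketch. Only the tokenize(string) call comes from this README; the Tokenizer class, its constructor arguments, the regex-based splitting and the stopword list are assumptions and may differ from tokenizers.py.

```python
import re


class Tokenizer:
    """Hypothetical tokenizer built from string transforms and token filters."""

    def __init__(self, transforms=(), filters=()):
        self.transforms = list(transforms)  # each: str -> str, applied to the whole string
        self.filters = list(filters)        # each: token -> bool, True means "keep"

    def tokenize(self, string):
        for transform in self.transforms:
            string = transform(string)
        tokens = re.findall(r"\w+", string)  # split on non-word characters
        return [t for t in tokens if all(keep(t) for keep in self.filters)]


STOPWORDS = {"the", "a", "of"}  # toy list, for the example only
my_tokenizer = Tokenizer(
    transforms=[str.lower],
    filters=[lambda t: t not in STOPWORDS, lambda t: len(t) > 1],
)
tokens = my_tokenizer.tokenize("The Return of the King")
print(tokens)  # ['return', 'king']
```

A stemmer would fit in as one more step applied to each token once the filters have run.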

Data loading

  • Extensible: we just need to pass the term_document_matrix_abstract constructor an iterable over documents.
  • Available: a JSON file reader and a reader fetching documents from MongoDB.
  • We could read the JSON in chunks, but since all the data is kept in memory anyway...
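
Since the constructor only needs an iterable, a loader can be a plain generator. The sketch below assumes the JSON file holds a single array of objects; the function name and the "_id" / "text" fields are hypothetical and not taken from the repository.

```python
import io
import json


def iter_json_documents(fileobj):
    """Yield documents one by one from a file holding a JSON array of objects."""
    # json.load parses the whole file at once, which matches the
    # "everything is kept in memory anyway" trade-off noted above.
    for document in json.load(fileobj):
        yield document


# An in-memory file stands in for a real JSON file on disk.
sample = io.StringIO(
    '[{"_id": 1, "text": "red apple"}, {"_id": 2, "text": "green pear"}]'
)
documents = list(iter_json_documents(sample))
print(len(documents))  # 2
```

A pymongo cursor, as returned by a collection's find(), is also iterable, so a MongoDB source could probably be passed in the same way.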

What is bad

  • It might be more SOLID to have the term-doc-freq data structure as a member of the main data structure.
  • A full-fledged class for documents instead of a dict could help.
  • Typing is poor (Python...).
  • Python 3 only.

Todo

  • Add a bigram transformation (continue work from train_word2vec..)
  • See how to improve performance.

Performance

  • Indexing: time should grow in O(tokens) ~= O(documents), assuming the average document length stays roughly constant.
  • Index size: the dict-of-dicts approach is memory-heavy...
  • Search: O(n * ln(n)), where n is the number of documents in which the query terms appear.
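
The search cost can be read off a sketch like the one below: scoring touches each matching document once per query term, and the final sort over the n matching documents is what would give the n * ln(n) factor. This is an assumed illustration, not the project's implementation; the toy index, the function name and the tf * ln(N / df) weighting are choices made for the example.

```python
import math
from collections import defaultdict

# Toy dict-of-dicts index: term -> {doc_id -> term frequency}
index = {
    "apple": {"d1": 1, "d2": 1},
    "red": {"d1": 2},
    "green": {"d2": 1, "d3": 1},
}
N_DOCUMENTS = 3


def search(query_terms, index, n_documents):
    """Rank documents by summed TF-IDF over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # unknown term: contributes nothing
        idf = math.log(n_documents / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sorting the n matching documents is the O(n * ln(n)) step.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = search(["red", "apple"], index, N_DOCUMENTS)
print([doc_id for doc_id, _ in results])  # ['d1', 'd2']
```

If only the top k results are needed, heapq.nlargest could replace the full sort and bring the ranking step down to O(n * ln(k)).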

search-tfidf-word2vec-poc's People

Contributors

arthur-flam
