latent-semantic-indexing's Introduction

Latent Semantic Analysis

Introduction

The program lsi.py implements a simple latent semantic analysis engine using singular value decomposition (SVD). With small changes, the program can also be used to try out non-negative matrix factorization or vector quantization.
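As a rough illustration of the SVD step, the sketch below factorises a small term-document count matrix with NumPy and keeps only the top z singular directions. The function name `lsa_embed` and the toy matrix are assumptions made for this example; they are not taken from lsi.py, which may organise this step differently.

```python
import numpy as np

def lsa_embed(term_doc, z):
    """Return rank-z term and document vectors (hypothetical helper)."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    z = min(z, len(s))
    term_vecs = U[:, :z] * s[:z]      # one row per term
    doc_vecs = Vt[:z, :].T * s[:z]    # one row per document
    return term_vecs, doc_vecs

# Toy corpus: rows are terms, columns are documents (raw counts).
counts = np.array([
    [2, 1, 0, 0],   # "cat"
    [1, 2, 0, 0],   # "dog"
    [0, 0, 3, 1],   # "stock"
    [0, 0, 1, 2],   # "market"
], dtype=float)

term_vecs, doc_vecs = lsa_embed(counts, z=2)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the reduced space, the two "pet" documents end up close to each other and far from the two "finance" documents, which is the effect the -z parameter controls.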

It implements the following:

  • Given a document title, it outputs k similar documents
  • Given any word, it outputs k related words from all the documents. If the word occurs in none of the documents, it outputs k random words.
  • Given a query, it outputs k relevant documents for the query.
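The three lookups above can be sketched as cosine-similarity searches in the reduced space. This is a minimal sketch under stated assumptions: the names `top_k`, `similar_docs`, `related_words` and `relevant_docs`, the toy vocabulary and the document titles are all hypothetical, and the query is folded in by projecting its term counts onto the truncated left singular vectors, which is one common choice and not necessarily what lsi.py does.

```python
import random
import numpy as np

def top_k(vec, matrix, k, exclude=None):
    """Indices of the k rows of `matrix` most cosine-similar to `vec`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
    sims = matrix @ vec / np.where(norms == 0, 1.0, norms)
    order = [i for i in np.argsort(-sims).tolist() if i != exclude]
    return order[:k]

# Hypothetical toy data: rows are terms, columns are documents.
vocab = ["cat", "dog", "stock", "market"]
titles = ["pets-1", "pets-2", "finance-1", "finance-2"]
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 0, 0],
                   [0, 0, 3, 1],
                   [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
z = 2
U_z = U[:, :z]
term_vecs = U_z * s[:z]
doc_vecs = Vt[:z].T * s[:z]

def similar_docs(title, k):
    i = titles.index(title)
    return [titles[j] for j in top_k(doc_vecs[i], doc_vecs, k, exclude=i)]

def related_words(word, k):
    if word not in vocab:
        # Unknown word: fall back to k random words, as described above.
        return random.sample(vocab, min(k, len(vocab)))
    i = vocab.index(word)
    return [vocab[j] for j in top_k(term_vecs[i], term_vecs, k, exclude=i)]

def relevant_docs(query, k):
    # Fold the query into the reduced space: q_hat = q @ U_z.
    q = np.array([query.split().count(w) for w in vocab], dtype=float)
    return [titles[j] for j in top_k(q @ U_z, doc_vecs, k)]
```

A query made up only of unseen words maps to the zero vector here, so every document scores 0 and the ranking is arbitrary.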

The code has been optimised to work well on large collections too. The additional steps of stopword removal, tf-idf weighting and normalisation are implemented in the same file but are commented out. Uncomment the ones you need; only slight modifications are required to get them working.
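For readers unfamiliar with these optional steps, here is a minimal sketch of what stopword removal, tf-idf weighting and normalisation typically look like. The stopword list, the function names and the particular idf formula (log of N over document frequency) are assumptions for illustration; the commented-out code in lsi.py may use different choices.

```python
import numpy as np

STOPWORDS = {"the", "a", "of", "and"}  # hypothetical list, for illustration only

def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf_normalised(counts):
    """Weight a term-document count matrix by idf, then scale each
    document (column) to unit Euclidean length."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)        # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))     # assumed idf variant
    weighted = counts * idf[:, None]
    norms = np.linalg.norm(weighted, axis=0)
    return weighted / np.where(norms == 0, 1.0, norms)

tokens = tokenize("The cat and the dog")
weights = tfidf_normalised(np.array([[1.0, 1.0],
                                     [2.0, 0.0]]))
```

A term that appears in every document gets an idf of 0 under this formula, so it contributes nothing to the weighted matrix.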

Setting up the environment

Requires SciPy, NumPy and a few other basic Python libraries. To avoid setting up the environment by hand, use the requirements.txt file to set up a Python virtual environment:

  • virtualenv venv
  • source venv/bin/activate
  • pip install -r requirements.txt

To deactivate the virtualenv use: deactivate

Running the latent semantic search engine

lsi.py can be run as follows:

python lsi.py -z 200 -k 10 --dir Directory --doc_in <name of input document file> --doc_out <name of output document file to be generated by code> --term_in <name of input term file> --term_out <name of output term file to be generated by code> --query_in <name of input query file> --query_out <name of output query file to be generated by code>

where
-z: Dimensionality of the lower-dimensional space
-k: Number of similar terms/documents to return
--dir: Directory containing the input documents
--doc_in: Input file containing a list of document titles (one per line) for which k similar documents are to be returned.
--doc_out: Each line of this output file will have the titles of k documents (separated by ';<tab>', i.e. a semicolon followed by a tab) that are similar to the document on the corresponding line of doc_in
--term_in: Input file containing a list of words (one per line) for which k similar words/terms are to be returned.
--term_out: Each line of this output file will have k words (separated by ';<tab>', i.e. a semicolon followed by a tab) that are similar to the word on the corresponding line of term_in
--query_in: Input file containing a list of queries (one per line) for which k relevant documents are to be returned.
--query_out: Each line of this output file will have the titles of k documents (separated by ';<tab>', i.e. a semicolon followed by a tab) that are relevant to the query on the corresponding line of query_in
Note: The documents in the directory must be numbered 1, 2, 3, 4, ..., n
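
To make the option list and the output format concrete, the sketch below declares the same flags with argparse and joins result items with a semicolon followed by a tab. It is only an illustration: `build_parser` and `format_line` are hypothetical names, the help strings are paraphrased, and whether lsi.py treats any option as required is not stated here, so all are left optional.

```python
import argparse

def build_parser():
    """Declare the command-line options listed above (hypothetical helper)."""
    p = argparse.ArgumentParser(description="Latent semantic search (sketch)")
    p.add_argument("-z", type=int, help="dimensionality of the reduced space")
    p.add_argument("-k", type=int, help="number of terms/documents to return")
    p.add_argument("--dir", help="directory containing the input documents")
    for name in ("doc", "term", "query"):
        p.add_argument(f"--{name}_in", help=f"input {name} file, one item per line")
        p.add_argument(f"--{name}_out", help=f"output {name} file to be generated")
    return p

def format_line(items):
    """Join k results as described above: semicolon followed by a tab."""
    return ";\t".join(items)

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["-z", "200", "-k", "10", "--dir", "Directory"])
line = format_line(["doc A", "doc B"])
```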

Contributing

  • Fork the repository
  • Create a new branch, named with an abbreviation of the feature
  • Implement the new feature
  • Send a pull request

License

MIT
