
LMW-tree: learning m-way tree

See the project homepage for the latest news!

LMW-tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms and data structures are generic to support different data representations such as dense real-valued vectors, bit vectors, and sparse vectors. Additionally, it can index any object type that can form a prototype representation of a set of objects.

The algorithms are primarily focused on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning, and other machine learning applications.

The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, and bit signatures. See the related PhD thesis for more details on these algorithms and representations.

LMW-tree is licensed under the BSD license.

See the ClueWeb09 clusters and the ClueWeb12 clusters for examples of clusters produced by the EM-tree algorithm. The ClueWeb09 dataset contains 500 million web pages and was clustered into 700,000 clusters. The ClueWeb12 dataset contains 733 million web pages and was clustered into 600,000 clusters. The document-to-cluster mappings and other related files are available at SourceForge.

The following people have contributed to the project (sorted lexicographically by last name):

  • Lance De Vine
  • Chris de Vries

Directory Structure

The LMW-tree project uses several external libraries. Its own source is split into several modules, each contained in a namespace.

Directory structure:

/

/src - all source contributed by this project where each subdirectory
       is a namespace
/src/lmw - LMW-tree data structures and algorithms
/src/indexer - English language document indexing

/external - all source for external 3rd party libraries
/external/Makefile - GNU Makefile to build external libraries
/external/packages - source packages for external libraries
/external/build - build directory for external libraries
/external/install - installation directory for external libraries

Building and Running

Build the external dependencies using the GNU Makefile (only tested on Linux):

$ cd external
$ make
$ cd ..

We use CMake to build the main project:

$ mkdir build
$ cd build
$ cmake ..
$ make
$ cd ..

Fetch some data to cluster

$ mkdir data
$ cd data
$ wget http://downloads.sourceforge.net/project/ktree/docclust_ir/inex_xml_mining_subset_2010.txt
$ wget http://downloads.sourceforge.net/project/ktree/docclust_ir/wikisignatures.tar.gz
$ tar xzf wikisignatures.tar.gz
$ cd ..

Run the program

$ LD_LIBRARY_PATH=./external/install/lib ./build/emtree


Issues

Use folly and jemalloc

Switch to folly and jemalloc to replace std::string and std::vector. Might be faster.

Introduce string and vector types in the lmwtree namespace.

Use likely and unlikely where useful.

Finish implementing indexer

It would be ideal for the indexer to output integer valued document vectors with term frequencies. These can be optionally written to disk in a compressed format using https://github.com/lemire/FastPFOR to allow for easy experimentation with different representation approaches. It would also output term collection statistics.

This would allow quick processing to convert the vectors to weighted vectors with TF-IDF, BM25, etc, or conversion to signatures.
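For example, converting a raw term-frequency vector to TF-IDF weights is a small post-processing step over the indexer's output. The sketch below uses hypothetical names and is not the planned indexer API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// tf[t]: raw frequency of term t in a document; df[t]: number of
// documents containing term t; numDocs: total documents in the
// collection. Returns TF-IDF weights: tf * log(N / df).
std::vector<double> tfidf(const std::vector<int>& tf,
                          const std::vector<int>& df,
                          int numDocs) {
    std::vector<double> w(tf.size(), 0.0);
    for (size_t t = 0; t < tf.size(); ++t)
        if (tf[t] > 0 && df[t] > 0)
            w[t] = tf[t] * std::log(double(numDocs) / df[t]);
    return w;
}
```

Because the integer vectors and collection statistics are stored separately, BM25 or signature generation can be swapped in here without re-running the indexer.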

It would be great to do the same with bi-grams and invent or reuse a weighting scheme that uses pointwise mutual information (http://nlpwp.org/book/chap-ngrams.xhtml#chap-ngrams-bigrams) in the weighting calculation.

Compile with GCC 4.9

GCC 4.9 brings useful new C++14 features like generic lambdas and return type deduction with auto. Refactor the code where this makes it cleaner.

Make CMake build cross platform

Currently the CMake build is specific for GCC.

To make it cross platform it needs to use the find_package() function instead of manually setting GCC compile and link flags.

Implement mini-batch stochastic gradient descent parallel streaming EM-tree

With a large enough mini-batch size, parallelism can be exploited within the mini-batch of say 1 million signatures.

Hopefully this will converge in 1 iteration or less for excessively large datasets like ClueWeb.

This also works in a distributed setting by simply batching and broadcasting updates to the tree as the parallel mini-batches proceed.

Fast randomization of the input signature file is also useful.
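The per-mini-batch update could follow the standard mini-batch k-means scheme, where each centroid moves toward its assigned points with a per-centroid learning rate of 1/count. This is a generic sketch of that technique with hypothetical names, not a description of the EM-tree implementation:

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Mini-batch k-means update: for each point in the batch, move its
// nearest centroid toward it with learning rate 1/count, where count
// is the total number of points assigned to that centroid so far.
void miniBatchUpdate(const std::vector<Vec>& batch,
                     const std::vector<size_t>& assignments, // nearest centroid per point
                     std::vector<Vec>& centroids,
                     std::vector<size_t>& counts) {
    for (size_t p = 0; p < batch.size(); ++p) {
        size_t c = assignments[p];
        double eta = 1.0 / double(++counts[c]);
        for (size_t i = 0; i < centroids[c].size(); ++i)
            centroids[c][i] += eta * (batch[p][i] - centroids[c][i]);
    }
}
```

Within a batch, the nearest-neighbor assignments are independent per point, which is where the intra-batch parallelism comes from; only the centroid updates need synchronization or broadcasting.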

Unroll loops for Hamming distance

Unrolling the loops for Hamming distance may give further performance improvements by keeping all integer execution units busy inside newer processors. This basically unrolls the XOR-then-POPCNT operation (result += popcount(chunk1 ^ chunk2)) applied to each pair of uint64_t chunks in the Hamming distance calculation.
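A sketch of the idea, using the GCC/Clang __builtin_popcountll intrinsic; the unroll factor of four is illustrative:

```cpp
#include <cstddef>
#include <cstdint>

// Hamming distance over 64-bit chunks, unrolled by four so the
// independent XOR + POPCNT chains can occupy multiple integer
// execution units. Assumes n is a multiple of 4 for brevity.
unsigned hammingUnrolled(const uint64_t* a, const uint64_t* b, size_t n) {
    unsigned d0 = 0, d1 = 0, d2 = 0, d3 = 0; // independent accumulators
    for (size_t i = 0; i < n; i += 4) {
        d0 += __builtin_popcountll(a[i]     ^ b[i]);
        d1 += __builtin_popcountll(a[i + 1] ^ b[i + 1]);
        d2 += __builtin_popcountll(a[i + 2] ^ b[i + 2]);
        d3 += __builtin_popcountll(a[i + 3] ^ b[i + 3]);
    }
    return d0 + d1 + d2 + d3;
}
```

Separate accumulators matter here: a single accumulator would serialize the additions into one dependency chain, defeating the purpose of the unroll.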

Reduce memory overhead of vector IDs

There is some memory overhead on vectors.

std::string is around 32 bytes, and we use it for the vector ID, but the vectors themselves are usually at most 512 bytes. So make the ID a template parameter and switch to char* for string IDs.
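The proposed change could look roughly like this; a hypothetical sketch, not the current lmw code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Making the ID a template parameter lets callers choose a cheap type
// (e.g. const char* or a 32-bit integer) instead of paying ~32 bytes
// of std::string overhead per vector.
template <typename IdType>
class SVector {
public:
    SVector(IdType id, size_t bits)
        : id_(id), data_(bits / 64, 0) {}
    IdType id() const { return id_; }
    size_t sizeBits() const { return data_.size() * 64; }
private:
    IdType id_;                   // 8 bytes for const char* or uint64_t
    std::vector<uint64_t> data_;  // the bit signature itself
};
```

With 512-byte signatures, shrinking the ID from ~32 bytes to 8 cuts per-vector memory overhead by roughly 4-5%.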

Create unit tests

Lower-level unit tests for vector types and other concepts would be useful.

Fix the K-tree implementation

The K-tree implementation was broken somewhere along the way. It needs to be fixed. Maybe it never worked completely in the first place.

Remove manual memory management

Use smart pointers in cases where the pointer overhead is not critical, such as tree nodes.

Switch to move semantics for vectors.
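For tree nodes, ownership via std::unique_ptr combined with move semantics for the contained vectors might look like the following. This is a hypothetical sketch, not the current node type:

```cpp
#include <memory>
#include <utility>
#include <vector>

// A node owns its children through unique_ptr, so destroying the root
// frees the whole subtree without any manual delete calls.
struct Node {
    std::vector<double> prototype;                 // cluster prototype
    std::vector<std::unique_ptr<Node>> children;   // owned subtrees

    explicit Node(std::vector<double> proto)
        : prototype(std::move(proto)) {}           // move, don't copy

    void addChild(std::unique_ptr<Node> child) {
        children.push_back(std::move(child));
    }
};
```

Taking the prototype by value and moving it lets callers hand over a freshly built vector without an extra copy, which matters when nodes are created in bulk during tree construction.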

Improve memory allocation and locality

Use memory pools, custom allocators, and allocation of nearby vectors in contiguous memory to reduce allocation overhead and improve locality of reference.
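A minimal bump-pointer arena conveys the idea; this is illustrative only, as a production allocator would also need alignment guarantees, growth, and deallocation handling:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump-pointer arena: vectors allocated together land in one contiguous
// buffer, avoiding per-allocation overhead and improving locality when
// nearby vectors in the tree are compared against each other.
class Arena {
public:
    explicit Arena(size_t bytes) : buffer_(bytes), used_(0) {}

    // Returns space for n uint64_t chunks, or nullptr if exhausted.
    // Allocation is just a pointer bump; no per-block bookkeeping.
    uint64_t* alloc(size_t n) {
        size_t bytes = n * sizeof(uint64_t);
        if (used_ + bytes > buffer_.size()) return nullptr;
        uint64_t* p = reinterpret_cast<uint64_t*>(buffer_.data() + used_);
        used_ += bytes;
        return p;
    }

    size_t used() const { return used_; }
private:
    std::vector<unsigned char> buffer_; // one contiguous backing block
    size_t used_;
};
```

Allocating all signatures of one tree node from the same arena means a clustering pass over that node walks memory sequentially rather than chasing scattered heap blocks.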

Setup regression testing

Create a set of standard test datasets and their expected quality in terms of internal and external measures. This can be used to test for any regressions modifications may introduce.

It also serves as documentation.

Get some non-document datasets from UCI.

Weighting functions for document vectors

  • TF-IDF
  • BM25 - probably quite useful with reflexive random indexing because it preserves the inner product space where BM25 works well
  • Log Likelihood from the TopSig paper
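For reference, the standard BM25 weight of a term in a document can be computed as below. This is the textbook formula with the usual k1 and b parameters; the helper itself is hypothetical and not part of the library:

```cpp
#include <cmath>

// BM25 weight for one term in one document:
//   idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
// tf: term frequency in the document; df: document frequency of the term;
// numDocs: collection size; docLen/avgDocLen: length normalization.
double bm25(double tf, double df, double numDocs,
            double docLen, double avgDocLen,
            double k1 = 1.2, double b = 0.75) {
    double idf = std::log((numDocs - df + 0.5) / (df + 0.5) + 1.0);
    double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
    return idf * tf * (k1 + 1.0) / (tf + norm);
}
```

The weight saturates as tf grows, unlike raw TF-IDF, which is one reason BM25 tends to behave better as an inner-product weighting for retrieval-style similarity.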
