
LMW-tree: learning m-way tree

See the project homepage for the latest news!

LMW-tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms and data structures are generic to support different data representations such as dense real-valued vectors, bit vectors, and sparse vectors. Additionally, it can index any object type that can form a prototype representation of a set of objects.

The algorithms are primarily focused on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning, and other machine learning applications.

The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, and bit signatures. See the related PhD thesis for more details on these algorithms and representations.

LMW-tree is licensed under the BSD license.

See the ClueWeb09 clusters and the ClueWeb12 clusters for examples of clusters produced by the EM-tree algorithm. The ClueWeb09 dataset contains 500 million web pages and was clustered into 700,000 clusters. The ClueWeb12 dataset contains 733 million web pages and was clustered into 600,000 clusters. The document-to-cluster mappings and other related files are available at SourceForge.

The following people have contributed to the project (sorted lexicographically by last name):

  • Lance De Vine
  • Chris de Vries

Directory Structure

The LMW-tree project uses several external libraries. Its own source is split into several modules, each contained in a namespace.

Directory structure:

/

/src - all source contributed by this project where each subdirectory
       is a namespace
/src/lmw - LMW-tree data structures and algorithms
/src/indexer - English language document indexing

/external - all source for external 3rd party libraries
/external/Makefile - GNU Makefile to build external libraries
/external/packages - source packages for external libraries
/external/build - build directory for external libraries
/external/install - installation directory for external libraries

Building and Running

Build the external dependencies using the GNU Makefile (only tested on Linux):

$ cd external
$ make
$ cd ..

We use CMake to build the main project:

$ mkdir build
$ cd build
$ cmake ..
$ make
$ cd ..

Fetch some data to cluster

$ mkdir data
$ cd data
$ wget http://downloads.sourceforge.net/project/ktree/docclust_ir/inex_xml_mining_subset_2010.txt
$ wget http://downloads.sourceforge.net/project/ktree/docclust_ir/wikisignatures.tar.gz
$ tar xzf wikisignatures.tar.gz
$ cd ..

Run the program

$ LD_LIBRARY_PATH=./external/install/lib ./build/emtree


Issues

Use folly and jemalloc

Switch to folly and jemalloc to replace std::string and std::vector. Might be faster.

Introduce string and vector types in the lmwtree namespace.

Use likely and unlikely where useful.

Finish implementing indexer

It would be ideal for the indexer to output integer valued document vectors with term frequencies. These can be optionally written to disk in a compressed format using https://github.com/lemire/FastPFOR to allow for easy experimentation with different representation approaches. It would also output term collection statistics.

This would allow quick processing to convert the vectors to weighted vectors with TF-IDF, BM25, etc, or conversion to signatures.
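For example, converting a raw term-frequency vector to TF-IDF weights is a small post-processing step over the indexer's output. The sketch below uses hypothetical names and is not the planned indexer API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// tf[t]: raw frequency of term t in a document; df[t]: number of
// documents containing term t; numDocs: total documents in the
// collection. Returns TF-IDF weights: tf * log(N / df).
std::vector<double> tfidf(const std::vector<int>& tf,
                          const std::vector<int>& df,
                          int numDocs) {
    std::vector<double> w(tf.size(), 0.0);
    for (size_t t = 0; t < tf.size(); ++t)
        if (tf[t] > 0 && df[t] > 0)
            w[t] = tf[t] * std::log(double(numDocs) / df[t]);
    return w;
}
```

Because the integer vectors and collection statistics are stored separately, BM25 or signature generation can be swapped in here without re-running the indexer.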

It would be great to do the same with bi-grams and invent or reuse a weighting scheme that uses pointwise mutual information (http://nlpwp.org/book/chap-ngrams.xhtml#chap-ngrams-bigrams) in the weighting calculation.

Compile with GCC 4.9

GCC 4.9 brings useful new C++14 features like generic lambdas and return type deduction with auto. Refactor the code where this makes it cleaner.

Make CMake build cross platform

Currently the CMake build is specific for GCC.

To make it cross platform it needs to use the find_package() function instead of manually setting GCC compile and link flags.

Implement mini-batch stochastic gradient descent parallel streaming EM-tree

With a large enough mini-batch size, parallelism can be exploited within the mini-batch of say 1 million signatures.

Hopefully this will converge in 1 iteration or less for excessively large datasets like ClueWeb.

This also works in a distributed setting by simply batching and broadcasting updates to the tree as the parallel mini-batches proceed.

Fast randomization of the input signature file is also useful.
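The per-mini-batch update could follow the standard mini-batch k-means scheme, where each centroid moves toward its assigned points with a per-centroid learning rate of 1/count. This is a generic sketch of that technique with hypothetical names, not a description of the EM-tree implementation:

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Mini-batch k-means update: for each point in the batch, move its
// nearest centroid toward it with learning rate 1/count, where count
// is the total number of points assigned to that centroid so far.
void miniBatchUpdate(const std::vector<Vec>& batch,
                     const std::vector<size_t>& assignments, // nearest centroid per point
                     std::vector<Vec>& centroids,
                     std::vector<size_t>& counts) {
    for (size_t p = 0; p < batch.size(); ++p) {
        size_t c = assignments[p];
        double eta = 1.0 / double(++counts[c]);
        for (size_t i = 0; i < centroids[c].size(); ++i)
            centroids[c][i] += eta * (batch[p][i] - centroids[c][i]);
    }
}
```

Within a batch, the nearest-neighbor assignments are independent per point, which is where the intra-batch parallelism comes from; only the centroid updates need synchronization or broadcasting.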

Unroll loops for Hamming distance

Unrolling the loops for Hamming distance may give further performance improvements by keeping all integer execution units busy inside newer processors. This basically unrolls the XOR-then-POPCNT operation (result += popcount(chunk1 ^ chunk2)) applied to each pair of uint64_t chunks in the Hamming distance calculation.
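A sketch of the idea, using the GCC/Clang __builtin_popcountll intrinsic; the unroll factor of four is illustrative:

```cpp
#include <cstddef>
#include <cstdint>

// Hamming distance over 64-bit chunks, unrolled by four so the
// independent XOR + POPCNT chains can occupy multiple integer
// execution units. Assumes n is a multiple of 4 for brevity.
unsigned hammingUnrolled(const uint64_t* a, const uint64_t* b, size_t n) {
    unsigned d0 = 0, d1 = 0, d2 = 0, d3 = 0; // independent accumulators
    for (size_t i = 0; i < n; i += 4) {
        d0 += __builtin_popcountll(a[i]     ^ b[i]);
        d1 += __builtin_popcountll(a[i + 1] ^ b[i + 1]);
        d2 += __builtin_popcountll(a[i + 2] ^ b[i + 2]);
        d3 += __builtin_popcountll(a[i + 3] ^ b[i + 3]);
    }
    return d0 + d1 + d2 + d3;
}
```

Separate accumulators matter here: a single accumulator would serialize the additions into one dependency chain, defeating the purpose of the unroll.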

Reduce memory overhead of vector IDs

There is some memory overhead on vectors.

std::string is around 32 bytes, and we use it for the vector ID, but the vectors themselves are usually at most 512 bytes. So make the ID a template parameter and switch to char* for string IDs.
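The proposed change could look roughly like this; a hypothetical sketch, not the current lmw code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Making the ID a template parameter lets callers choose a cheap type
// (e.g. const char* or a 32-bit integer) instead of paying ~32 bytes
// of std::string overhead per vector.
template <typename IdType>
class SVector {
public:
    SVector(IdType id, size_t bits)
        : id_(id), data_(bits / 64, 0) {}
    IdType id() const { return id_; }
    size_t sizeBits() const { return data_.size() * 64; }
private:
    IdType id_;                   // 8 bytes for const char* or uint64_t
    std::vector<uint64_t> data_;  // the bit signature itself
};
```

With 512-byte signatures, shrinking the ID from ~32 bytes to 8 cuts per-vector memory overhead by roughly 4-5%.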

Create unit tests

Lower-level unit tests for vector types and other concepts would be useful.

Fix the K-tree implementation

The K-tree implementation was broken somewhere along the way. It needs to be fixed. Maybe it never worked completely in the first place.

Remove manual memory management

Use smart pointers in cases where the pointer overhead is not critical, such as tree nodes.

Switch to move semantics for vectors.
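For tree nodes, ownership via std::unique_ptr combined with move semantics for the contained vectors might look like the following. This is a hypothetical sketch, not the current node type:

```cpp
#include <memory>
#include <utility>
#include <vector>

// A node owns its children through unique_ptr, so destroying the root
// frees the whole subtree without any manual delete calls.
struct Node {
    std::vector<double> prototype;                 // cluster prototype
    std::vector<std::unique_ptr<Node>> children;   // owned subtrees

    explicit Node(std::vector<double> proto)
        : prototype(std::move(proto)) {}           // move, don't copy

    void addChild(std::unique_ptr<Node> child) {
        children.push_back(std::move(child));
    }
};
```

Taking the prototype by value and moving it lets callers hand over a freshly built vector without an extra copy, which matters when nodes are created in bulk during tree construction.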

Improve memory allocation and locality

Use memory pools, custom allocators, and allocation of nearby vectors in contiguous memory to reduce allocation overhead and improve locality of reference.
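A minimal bump-pointer arena conveys the idea; this is illustrative only, as a production allocator would also need alignment guarantees, growth, and deallocation handling:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump-pointer arena: vectors allocated together land in one contiguous
// buffer, avoiding per-allocation overhead and improving locality when
// nearby vectors in the tree are compared against each other.
class Arena {
public:
    explicit Arena(size_t bytes) : buffer_(bytes), used_(0) {}

    // Returns space for n uint64_t chunks, or nullptr if exhausted.
    // Allocation is just a pointer bump; no per-block bookkeeping.
    uint64_t* alloc(size_t n) {
        size_t bytes = n * sizeof(uint64_t);
        if (used_ + bytes > buffer_.size()) return nullptr;
        uint64_t* p = reinterpret_cast<uint64_t*>(buffer_.data() + used_);
        used_ += bytes;
        return p;
    }

    size_t used() const { return used_; }
private:
    std::vector<unsigned char> buffer_; // one contiguous backing block
    size_t used_;
};
```

Allocating all signatures of one tree node from the same arena means a clustering pass over that node walks memory sequentially rather than chasing scattered heap blocks.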

Setup regression testing

Create a set of standard test datasets and their expected quality in terms of internal and external measures. This can be used to test for any regressions modifications may introduce.

It also serves as documentation.

Get some non-document datasets from UCI.

Weighting functions for document vectors

  • TF-IDF
  • BM25 - probably quite useful with reflexive random indexing because it preserves the inner product space where BM25 works well
  • Log Likelihood from the TopSig paper
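For reference, the standard BM25 weight of a term in a document can be computed as below. This is the textbook formula with the usual k1 and b parameters; the helper itself is hypothetical and not part of the library:

```cpp
#include <cmath>

// BM25 weight for one term in one document:
//   idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
// tf: term frequency in the document; df: document frequency of the term;
// numDocs: collection size; docLen/avgDocLen: length normalization.
double bm25(double tf, double df, double numDocs,
            double docLen, double avgDocLen,
            double k1 = 1.2, double b = 0.75) {
    double idf = std::log((numDocs - df + 0.5) / (df + 0.5) + 1.0);
    double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
    return idf * tf * (k1 + 1.0) / (tf + norm);
}
```

The weight saturates as tf grows, unlike raw TF-IDF, which is one reason BM25 tends to behave better as an inner-product weighting for retrieval-style similarity.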
