

SparkDistributedMatrix

This is a research project that aims to provide high-performance support for distributed matrix algebra on the Apache Spark system. Several distributed matrix formats, such as distributed vectors and block-partitioned matrices, extend MLlib, Apache Spark's machine learning library. For example, several partition schemes are implemented: MatrixRangePartitioner, MatrixRangeGreedyPartitioner, and BlockCyclicPartitioner.

These partitioners are tailored to provide efficient partitioning over the matrix data formats. MatrixRangePartitioner simply partitions a matrix by its rows/columns according to the underlying storage. MatrixRangeGreedyPartitioner greedily balances partitions so that each contains approximately the same number of nonzero elements; being greedy, it does not guarantee exactly equal nonzero counts per partition. For block-partitioned matrices, BlockCyclicPartitioner adopts a block-cyclic distribution strategy for load balancing.
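As an illustration of the block-cyclic idea, the assignment rule can be sketched in plain Scala. This is a hedged sketch, not the library's API: the object and function names here are hypothetical, and a 2 x 2 partition grid is assumed for the demonstration.

```scala
// Sketch: block-cyclic assignment of matrix blocks to partitions.
// A block at (rowBlk, colBlk) on a pr x pc partition grid goes to
// partition (rowBlk mod pr) * pc + (colBlk mod pc), so neighboring
// blocks are spread across partitions and the load stays balanced.
object BlockCyclicSketch {
  def partitionOf(rowBlk: Int, colBlk: Int, pr: Int, pc: Int): Int =
    (rowBlk % pr) * pc + (colBlk % pc)

  def main(args: Array[String]): Unit = {
    // a 4 x 4 grid of blocks distributed over a 2 x 2 partition grid
    val assignments =
      for (i <- 0 until 4; j <- 0 until 4) yield partitionOf(i, j, 2, 2)
    val counts = assignments.groupBy(identity).map { case (p, xs) => p -> xs.size }
    // every partition receives the same number of blocks
    assert(counts.values.forall(_ == 4))
    println(counts.toSeq.sorted)
  }
}
```

Note how the modulo in both dimensions interleaves neighboring blocks across partitions, which is what gives the load balancing.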

In particular, we aim to enhance the performance of matrix operations on block matrices. Block-partitioned matrices have better data locality than the other formats, and operations (e.g., multiplication) can better exploit the sparsity of the input matrices. For more details, please refer to the source code.
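To illustrate why block partitioning lets multiplication exploit sparsity, here is a minimal pure-Scala sketch (not the library's implementation): non-zero blocks are stored in a map keyed by block coordinates, so any block that is entirely zero is simply absent, and products involving it are never computed.

```scala
// Sketch: sparse block-matrix multiplication over maps of non-zero
// blocks, keyed by (blockRow, blockCol). Each block is a b x b dense
// matrix stored row-major. Missing keys are all-zero blocks, so any
// product touching them is skipped entirely.
object BlockMultiplySketch {
  type Block = Array[Double] // b x b dense block, row-major

  def multiply(a: Map[(Int, Int), Block],
               bMat: Map[(Int, Int), Block],
               b: Int): Map[(Int, Int), Block] = {
    val result = scala.collection.mutable.Map.empty[(Int, Int), Block]
    // only block pairs whose inner indices match contribute
    for (((i, k), ablk) <- a; ((k2, j), bblk) <- bMat if k == k2) {
      val acc = result.getOrElseUpdate((i, j), new Array[Double](b * b))
      // accumulate the dense product ablk * bblk into acc
      for (r <- 0 until b; c <- 0 until b; t <- 0 until b)
        acc(r * b + c) += ablk(r * b + t) * bblk(t * b + c)
    }
    result.toMap
  }
}
```

If matrix A has a non-zero block only at (0, 0) and matrix B only at (1, 0), no inner indices match and the result is empty without any arithmetic being performed.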

Usage

A typical usage of the library is as follows: first, load the data into block-partitioned matrices (recommended for better performance).

// load data items into an RDD of Entry objects
def generateRDD(sc: SparkContext, input: String): RDD[Entry] = {
    val lines = sc.textFile(input, 8)
    lines.map { line => 
        val strs = line.split("\t")
        // add customized logic to process each line
        ...
    }
}

def main(args: Array[String]): Unit = {
    // load data from generateRDD()...
    val sc = ... // create spark context
    val rdd = generateRDD(sc, input)
    // estimate a square block size from the matrix dimensions
    val blkSize = BlockPartitionMatrix.estimateBlockSizeWithDim(rowNum, colNum)
    val matrix1 = BlockPartitionMatrix.createFromCoordinateEntries(rdd, blkSize, blkSize, dim1, dim2)
    val matrix2 = ...
    var matrix3 = matrix1 + matrix2  // element-wise addition
    matrix3 = matrix1 %*% matrix2    // matrix multiplication
    // and many other operations are supported
}
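For reference, the elided per-line parsing logic in generateRDD might look like the following. This is only a sketch: CoordEntry is a hypothetical stand-in for the library's Entry type, assumed here to hold a row index, a column index, and a value.

```scala
// Sketch: parse a tab-separated "row \t col \t value" line into a
// coordinate entry. CoordEntry is a hypothetical stand-in for the
// library's Entry type.
case class CoordEntry(row: Long, col: Long, value: Double)

object LineParseSketch {
  def parse(line: String): CoordEntry = {
    val strs = line.split("\t")
    CoordEntry(strs(0).toLong, strs(1).toLong, strs(2).toDouble)
  }
}
```

The same function would be applied inside the lines.map { ... } body shown above, producing the RDD of coordinate entries that createFromCoordinateEntries consumes.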
