Code Monkey home page Code Monkey logo

spark-pagerank's Introduction

buildstatus

PageRank in Spark

This is an implementation of PageRank in Spark, using Spark's standard RDD API.

Features

  • Fast iterations
  • Parameterised "teleport" probability
  • Weighted edges (out-edge weights must be normalized)
  • Supports "dangling" vertices (no out-edges from a node)
  • Supports arbitrary (e.g. non-uniform) priors (as vertex values)
  • Various stopping criteria:
    • Number of iterations threshold
    • Convergence threshold (requires additional computation after each iteration)
  • Utilities for building, preparing, and validating input graphs (incl. out-edge normalization)

Usage

Include it as a dependency in your sbt project: "com.soundcloud" %% "spark-pagerank" % <version>

As A Library

You can use PageRank in Spark as a library and call it from within your own drivers. You will want to do this when you have some data preparation to do that does not conform with the built-in driver data interfaces.

More examples of usage as a library can be found in the source of the built-in drivers (see below).

As Drivers

We include several built-in drivers that operate on plain-text TSV input of labeled edges as a starting point. You will prepare the graph and run PageRank in the following sequence of drivers. Use --help to see the arguments and usage of each driver.

  1. com.soundcloud.spark.pagerank.GraphBuilderApp: Builds a PageRank graph from (non-normalized) weighted edges in TSV format (source, destination, weight), saving the resulting graph (edges and vertices) in Parquet files in preparation for next steps.
  2. com.soundcloud.spark.pagerank.PageRankApp: Runs PageRank on the graph produced using the functions in PageRankGraph or by using the GraphBuilderApp.
  3. com.soundcloud.spark.pagerank.ConvergenceCheckApp: Compares two PageRank vectors and lets the user determine if there is convergence by outputting the sum of the component-wise difference of the vectors. Note that this is an optional tool that is mostly used for debugging. If the user is concerned with iterating until convergence, the user can specify the convergence threshold at runtime to PageRank.

Performance

We run this library on one of our behavior graphs which consists of approximately 700M vertices and 15B edges. Using the following Spark configuration, and in-memory persistence of edge and vertex RDDs, we obtain iteration times on the order of 3-5 minutes each.

Configuration example:

  • Spark 2.1.1
  • YARN
  • Dynamic allocation: no
  • Number of executors: 256
  • Number of executor cores: 4
  • Executor memory: 28G

Performance Tuning

  • Persist the edges and vertices of the graph in memory and disk (as spill): StorageLevel.MEMORY_AND_DISK
  • Enable Kryo serialization: KryoSerialization.useKryo

Publishing and Releasing

To publish an artifact to the Sonatype/Maven central repository (a snapshot or release), you need to have a Sonatype account, PGP keys and sbt plugins set up. Please follow the sbt guideline for a complete guide to getting started. Once this is done, you can use the sbt-release plugin via the Makefile to publish snapshots and perform releases.

Publishing Snapshots

At any point in the development lifecycle, a snapshot can be published to the central repository.

make publish

Performing a Release

Once development of a version is complete, the artifact should be released to the central repository. This is a two stage process with the artifact first entering a staging repository, followed by a manual promotion process.

make release

After a release to the staging repository, the staging-to-release promotion process must be followed manually before the artifact is available in the central repository.

Versioning

This library aims to adhere to Semantic Versioning 2.0.0. Violations of this scheme should be reported as bugs. Specifically, if a minor or patch version is released that breaks backward compatibility, that version should be immediately removed and/or a new version should be immediately released that restores compatibility. Breaking changes to the public API will only be introduced with new major versions.

Contributing

We welcome contributions by pull requests and bug reports or feature requests as issues.

Authors

Contributors

License

Copyright (c) 2017 SoundCloud Ltd.

See the LICENSE file for details.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.