
PageRank in Spark

This is an implementation of PageRank in Spark, using Spark's standard RDD API.

Features

  • Fast iterations
  • Parameterised "teleport" probability
  • Weighted edges (out-edge weights must be normalized)
  • Supports "dangling" vertices (no out-edges from a node)
  • Supports arbitrary (e.g. non-uniform) priors (as vertex values)
  • Various stopping criteria:
    • Number of iterations threshold
    • Convergence threshold (requires additional computation after each iteration)
  • Utilities for building, preparing, and validating input graphs (incl. out-edge normalization)

Usage

Include it as a dependency in your sbt project: "com.soundcloud" %% "spark-pagerank" % <version>
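
For example, in build.sbt (keeping <version> as a placeholder for the release you want to use):

  // build.sbt: add the published artifact as a dependency
  libraryDependencies += "com.soundcloud" %% "spark-pagerank" % "<version>"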

As A Library

You can use PageRank in Spark as a library and call it from within your own drivers. You will want to do this when your data preparation does not conform to the built-in drivers' data interfaces.

More examples of usage as a library can be found in the source of the built-in drivers (see below).
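
As a rough sketch of what such a driver can look like (the graph-loading step and the exact parameter types are assumptions here; see PageRankGraph and the built-in drivers for the authoritative API):

  import com.soundcloud.spark.pagerank._

  // Sketch only: first build or load a prepared PageRankGraph (normalized
  // out-edges, priors as vertex values), e.g. with the helpers used by
  // GraphBuilderApp; `???` is a placeholder for that step.
  val graph: PageRankGraph = ???

  val teleportProb = 0.15          // "teleport" probability
  val maxIterations = 20           // upper bound on the number of iterations
  val convergenceThreshold = 0.001 // optional extra stopping criterion (wrap per the API if it expects an Option)

  // Argument order follows the example quoted in the issues below.
  val ranks = PageRank.run(graph, teleportProb, maxIterations, convergenceThreshold)
  ranks.collect().foreach(println) // resulting per-vertex PageRank values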

As Drivers

As a starting point, we include several built-in drivers that operate on plain-text TSV input of labeled edges. Prepare the graph and run PageRank with the following sequence of drivers. Use --help to see the arguments and usage of each driver.

  1. com.soundcloud.spark.pagerank.GraphBuilderApp: Builds a PageRank graph from (non-normalized) weighted edges in TSV format (source, destination, weight; an example follows this list), saving the resulting graph (edges and vertices) in Parquet files in preparation for the next steps.
  2. com.soundcloud.spark.pagerank.PageRankApp: Runs PageRank on the graph produced using the functions in PageRankGraph or by using the GraphBuilderApp.
  3. com.soundcloud.spark.pagerank.ConvergenceCheckApp: Compares two PageRank vectors and lets the user determine whether there is convergence by outputting the sum of the component-wise differences of the vectors. Note that this is an optional tool that is mostly used for debugging. If you want to iterate until convergence, you can instead specify a convergence threshold at runtime to PageRank.
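
For reference, the edge input to GraphBuilderApp is one weighted edge per line, with tab-separated source ID, destination ID, and weight; the weights do not need to be normalized, since out-edge normalization is done during graph building. For example:

  1	2	5.0
  1	3	1.0
  2	3	4.0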

Performance

We run this library on one of our behavior graphs, which consists of approximately 700M vertices and 15B edges. Using the following Spark configuration, with in-memory persistence of the edge and vertex RDDs, we obtain iteration times on the order of 3-5 minutes each.

Configuration example:

  • Spark 2.1.1
  • YARN
  • Dynamic allocation: no
  • Number of executors: 256
  • Number of executor cores: 4
  • Executor memory: 28G

Performance Tuning

  • Persist the edges and vertices of the graph in memory, spilling to disk when needed: StorageLevel.MEMORY_AND_DISK
  • Enable Kryo serialization: KryoSerialization.useKryo
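
A minimal sketch of applying both tips with plain Spark APIs (the library's KryoSerialization.useKryo helper is the intended way to configure Kryo; the direct serializer setting below is only the generic equivalent):

  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  // Generic Kryo setup; prefer KryoSerialization.useKryo from this library.
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  // Keep the graph's RDDs in memory, spilling to disk when they do not fit.
  def persistGraphRdds[E, V](edges: RDD[E], vertices: RDD[V]): Unit = {
    edges.persist(StorageLevel.MEMORY_AND_DISK)
    vertices.persist(StorageLevel.MEMORY_AND_DISK)
  }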

Publishing and Releasing

To publish an artifact to the Sonatype/Maven Central repository (a snapshot or a release), you need to have a Sonatype account, PGP keys, and the sbt plugins set up. Please follow the sbt guide for a complete walkthrough of getting started. Once this is done, you can use the sbt-release plugin via the Makefile to publish snapshots and perform releases.

Publishing Snapshots

At any point in the development lifecycle, a snapshot can be published to the central repository.

make publish

Performing a Release

Once development of a version is complete, the artifact should be released to the central repository. This is a two stage process with the artifact first entering a staging repository, followed by a manual promotion process.

make release

After a release to the staging repository, the staging-to-release promotion process must be followed manually before the artifact is available in the central repository.

Versioning

This library aims to adhere to Semantic Versioning 2.0.0. Violations of this scheme should be reported as bugs. Specifically, if a minor or patch version is released that breaks backward compatibility, that version should be immediately removed and/or a new version should be immediately released that restores compatibility. Breaking changes to the public API will only be introduced with new major versions.

Contributing

We welcome contributions via pull requests, and bug reports or feature requests as issues.

Authors

Contributors

License

Copyright (c) 2017 SoundCloud Ltd.

See the LICENSE file for details.

spark-pagerank's Issues

Cleanup and remove GraphX APIs

We used to rely on GraphX APIs, but since we had so many efficiency problems and bugs, we stopped using GraphX. It is still used in GraphUtils, though. If we choose to, we should remove all GraphX APIs and replace them with internal case classes and the like. We would then have two sets of APIs: one internal to PageRank and one for general graph operations. I prefer this because, internal to PageRank, we always need tuples (for the joins), while outside of PageRank tuples all over the place make the RDDs a bit messy. We could eventually consider Dataset/DataFrame APIs as well, but for now we stick to functional paradigms with RDDs.

Validate runtimes with production graph

  • Graph building: 0.2.0, 34 mins, 512 cores, 650M vertices, 14B edges
  • PageRank run: 0.2.0, 24 mins, 1024 cores, 650M vertices, 14B edges, 5 iterations
  • PageRank run: 0.2.0, 46 mins, 1024 cores, 650M vertices, 14B edges, 10 iterations

Cleanup GraphUtils

Only general edges/vertices stuff goes in here. Anything operating on the PageRankGraph as a whole should go in PageRankGraph instead.

Use robust checkpointing

Local checkpointing is not robust to executor failures, so we might lose information if executors go down and we do not have a redundant cache of the lost block(s).
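
A possible direction is Spark's reliable (directory-backed) checkpointing instead of local checkpointing; a sketch with a placeholder path:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Sketch: reliable checkpoints are written to a (typically HDFS) directory and
  // therefore survive executor loss, unlike local checkpoints.
  def reliableCheckpoint[T](sc: SparkContext, rdd: RDD[T]): RDD[T] = {
    sc.setCheckpointDir("hdfs:///path/to/checkpoints") // placeholder path
    rdd.checkpoint() // materialized on the next action over `rdd`
    rdd
  }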

Improve documentation for public consumption

Improve documentation such that it is consumable as an open-source project. This should include any basic rationale about tuning, why or when to use this, examples, dataset sizes we have, etc.

the Vertices value?

Hello!
I have a question about the resulting vertex values. The code is in the PageRank.scala file, for example:

  val actual = PageRank.run(
    graph,
    teleportProb,
    maxIterations,
    convergenceThreshold
  ).collect()

Does a larger value indicate stronger importance? If I use this value in a blacklist, does it indicate that the person is more dangerous? Is that right?
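
For reference: higher PageRank values do indicate more important (more central) vertices under the random-surfer model. A quick way to inspect the collected result, assuming each element exposes id and value fields (field names assumed here):

  // Sketch: list the ten highest-ranked vertices.
  actual.sortBy(-_.value).take(10).foreach(v => println(s"${v.id}\t${v.value}"))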

Don't use object files as app input and outputs

The options are: TSV, CSV, Parquet (or stay with object files). We need some interoperability with other formats in HDFS; for example, graph building reads from TSV in our case. What should the standard interface to the apps be? You can always build your own apps; the built-in ones are just there for reference and for handy usage/testing purposes.

/cc @maxjakob

Rewrite apps to provide better example

This should be a couple of apps that both provide real use-case examples and serve for ongoing testing on real-sized graphs:

  • remove hard coded paths
  • show building a graph
  • show running PageRank

Working example/generic usecase

This seems great. Can you provide a few lines of code as a working example, so it becomes easier for first-timers to use?

Thanks for the help!

Allow drop-in stopping criteria function

Right now we only support an iteration-count threshold and convergence as stopping criteria. In theory, this could be any function, including something like APA if the user is only concerned with stopping once the ranks are unchanged between iterations. Consider porting @maxjakob's APA in here as a plug-in option.
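
One hypothetical shape for such a plug-in (names and types below are illustrative, not an existing API): a predicate over the iteration number and the previous/current rank vectors, evaluated after each iteration.

  import org.apache.spark.rdd.RDD

  // Illustrative only: a user-supplied stopping criterion; returning true stops
  // the run. `(Long, Double)` stands in for (vertex id, rank value).
  type StoppingCriterion = (Int, RDD[(Long, Double)], RDD[(Long, Double)]) => Boolean

  // Example: stop once the summed component-wise difference falls below a threshold.
  def converged(threshold: Double): StoppingCriterion =
    (_, previous, current) =>
      previous.join(current).values.map { case (p, c) => math.abs(p - c) }.sum() < threshold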

Ensure we are using Kryo with custom types

I think that without specifying our custom types, we are just using plain Java (POJO) serialization. I am not sure how this works in Spark 2.1.0, so it is time to check it out. It should also be easy for third-party users of our case classes to get this optimization; maybe it just needs to be documented?
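
A sketch of explicit registration using plain Spark APIs (the exact case classes to register are whatever types flow through the shuffles; Edge and Vertex below are assumptions):

  import org.apache.spark.SparkConf

  // Sketch: force Kryo and register the case classes so they do not fall back to
  // generic serialization; the class names here are assumptions.
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true") // fail fast on unregistered types
    .registerKryoClasses(Array(
      classOf[com.soundcloud.spark.pagerank.Edge],
      classOf[com.soundcloud.spark.pagerank.Vertex]
    ))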

Build a graph from edges and an old priors vector

The cases are:

  • new vertices in the new EdgesRDD that are not in the old VertexRDD
  • vertices missing in the new EdgesRDD that are in the old VertexRDD

Both cases need to be supported if you want to start from a previous VertexRDD (with old priors) loaded from disk.
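
A rough sketch of handling both cases (all names are illustrative, not the library's API): vertices that disappeared from the new edge set are dropped, and brand-new vertices get a default prior.

  import org.apache.spark.rdd.RDD

  // Sketch: reconcile vertex ids present in the new edges with an old priors
  // vector. Ids missing from the new edges are dropped; new ids get a default.
  // The result typically still needs re-normalizing before use as priors.
  def mergePriors(
      newVertexIds: RDD[Long],          // distinct vertex ids from the new EdgesRDD
      oldPriors: RDD[(Long, Double)],   // (id, prior) pairs from the old VertexRDD
      defaultPrior: Double): RDD[(Long, Double)] =
    newVertexIds
      .map(id => (id, ()))
      .leftOuterJoin(oldPriors)
      .mapValues { case (_, priorOpt) => priorOpt.getOrElse(defaultPrior) }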

Implement missing graph modification functions

For every requirement of the graph structure, there should be a check/validation function as well as a function to correct any structure that needs it. Examples:

  /**
   * Removes any edges that are self-referencing the same vertex. That is, any
   * edges where the source and destination are the same. Any resulting vertices
   * that have no edges (in or out) will remain in the graph.
   */
  def removeSelfReferences(graph: Graph): Graph

  /**
   * Removes any vertices that have no in or out edges.
   */
  def removeDisconnectedVertices(graph: Graph): Graph
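
As an illustration of how small these functions can be, a sketch of the first one over stand-in types (the library's actual Edge/Graph definitions may differ):

  import org.apache.spark.rdd.RDD

  // Stand-in types for illustration only.
  case class Edge(srcId: Long, dstId: Long, weight: Double)
  case class Graph(edges: RDD[Edge], vertices: RDD[(Long, Double)])

  // Keep only edges whose source and destination differ; vertices are untouched,
  // so vertices left without any edges remain in the graph, as documented above.
  def removeSelfReferences(graph: Graph): Graph =
    graph.copy(edges = graph.edges.filter(e => e.srcId != e.dstId))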

Add a graph validation function

This should be stand-alone, but it can be run on occasion to validate the assumptions and structure of the graph that PageRank will run on. It should be a composition of all the individual tests, and it should act as a guard function so it can be used in a pipeline.

  /**
   * Validates the structure of the input PageRank graph. See: {{#run}}.
   */
  def validateGraphStructure(edges: Edges, vertices: Vertices): Unit = {
    val numSelfReferences = countSelfReferences(edges)
    val verticesAreNormalized = areVerticesNormalized(vertices)
    val numVerticesWithoutNormalizedOutEdges = countVerticesWithoutNormalizedOutEdges(edges)

    require(numSelfReferences == 0, "Number of vertices with self-referencing edges must be 0")
    require(verticesAreNormalized, "Input vertex values must be normalized")
    require(numVerticesWithoutNormalizedOutEdges == 0, "Number of vertices without normalized out edges must be 0")
  }

Add local RDD checkpointing

Add local RDD checkpointing to truncate the DAG/parent lineage after every iteration. This requires that dynamic allocation is off, so also validate this at the start of the run by looking in the conf if possible (and document all of this in the run function).
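
A sketch of what this could look like per iteration, using plain Spark APIs (names are illustrative):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Sketch: truncate the lineage of the per-iteration vertices RDD, after first
  // checking that dynamic allocation is off (local checkpoints are unsafe with it).
  def localCheckpointIteration[T](sc: SparkContext, rdd: RDD[T]): RDD[T] = {
    val dynamicAllocation =
      sc.getConf.getBoolean("spark.dynamicAllocation.enabled", defaultValue = false)
    require(!dynamicAllocation,
      "local RDD checkpointing requires dynamic allocation to be disabled")
    rdd.localCheckpoint()
  }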

Update documentation about double-release bug

When performing a release, since we cross-compile, the first "push changes to repo" prompt should be the only one accepted. The second time you are asked to tag and push to the remote, you should choose NOT to do it, since the versions are already updated. Furthermore, you should always make sure to specify the same release version; you will be asked twice (again, once per cross-compile version).
