spark-cassandra-collabfiltering
Illustrates:
- Collaborative filtering with MLlib on Spark
- The same Spark client code written in Java 7 and Java 8, showing the new Java 8 features that make Spark's functional style much easier
- Cassandra providing the data to Spark
- A synthesized training/validation set of employee ratings for companies
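The Java 7 vs. Java 8 contrast can be sketched without Spark on the classpath: the JDK's `java.util.function.Function` below stands in for Spark's `org.apache.spark.api.java.function` interfaces, which are written the same two ways. The class and method names here are illustrative only, not part of this project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LambdaDemo {
    // Java 7 style: an anonymous inner class, as Spark client code had
    // to be written before Java 8.
    public static Function<Integer, Integer> squareJava7 = new Function<Integer, Integer>() {
        @Override
        public Integer apply(Integer x) {
            return x * x;
        }
    };

    // Java 8 style: the same function as a lambda.
    public static Function<Integer, Integer> squareJava8 = x -> x * x;

    // Apply a function to every element, analogous to an RDD map().
    public static List<Integer> applyAll(List<Integer> xs, Function<Integer, Integer> f) {
        return xs.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        System.out.println(applyAll(input, squareJava7)); // [1, 4, 9]
        System.out.println(applyAll(input, squareJava8)); // [1, 4, 9]
    }
}
```

Both versions compile to the same behavior; the lambda form is what makes Spark's functional style so much lighter in Java 8.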
To set up (tested on Ubuntu 14.04):
- Install the Java 8 JDK: `sudo apt-get install oracle-java8-installer`
- Get Spark:
  - Download Spark 1.1.0 built for Hadoop 2.4. We will not be using Hadoop, even though this build supports it.
  - Untar the Spark tarball (e.g., into `~/dev`).
  - Test the installation with `./bin/run-example SparkPi`
- See QuickStart (referenced below) for more setup instructions and tutorials.
Get Eclipse:
- Download Eclipse Luna 4.4.1 Ubuntu 64 Bit (or 32 Bit) from Eclipse.org. Only the latest Eclipse supports Java 8.
- Untar, run Eclipse.
- Set your Java 8 JDK as the default JDK.
- Install Maven Integration for Eclipse (m2e):
  - Menu Help -> Install New Software…
  - Add this repository
  - Check Maven Integration for Eclipse, then install.
Project
- Right-click on `pom.xml` and choose Maven -> install.
  - This will download the Spark jars; it will take a while.
  - It will also set your Eclipse project's source level to Java 8.
Dataset
- `ratings.csv` is generated from `ratings.ods`, a spreadsheet for synthesizing data sets to test and fine-tune your model.
- Adjust `ratings.ods` and save it as CSV. See `readme.txt` in the data directory for instructions.
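As a sketch of how one generated CSV row might be parsed on the Java side — the `userId,companyId,rating` column layout below is an assumption for illustration only; the actual format is described in `readme.txt`:

```java
public class RatingLine {
    public final int userId;     // employee doing the rating
    public final int companyId;  // company being rated
    public final double stars;   // the rating value

    public RatingLine(int userId, int companyId, double stars) {
        this.userId = userId;
        this.companyId = companyId;
        this.stars = stars;
    }

    // Parse one CSV line of the assumed form "userId,companyId,rating".
    public static RatingLine parse(String line) {
        String[] fields = line.split(",");
        return new RatingLine(Integer.parseInt(fields[0].trim()),
                              Integer.parseInt(fields[1].trim()),
                              Double.parseDouble(fields[2].trim()));
    }
}
```

In the real driver each such triple would become an `org.apache.spark.mllib.recommendation.Rating` for MLlib.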
Cassandra
- Instructions for getting Cassandra: here
- Run Cassandra: `sudo /usr/bin/cassandra`
- We will be running Cassandra and Spark locally from the console, rather than remotely in a cluster as a daemon/service.
- Create the schema by running the attached CQL script: in the workspace root, run `cqlsh -f ./collabfilter/src/sql/collab_filter_schema.sql`
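The actual schema lives in `collab_filter_schema.sql`; as a rough illustration of the kind of CQL such a script contains (the keyspace, table, and column names below are hypothetical, not the project's real ones):

```sql
-- Hypothetical example only; the project's real schema is in
-- collab_filter_schema.sql.
CREATE KEYSPACE IF NOT EXISTS collab_filter
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS collab_filter.ratings (
  user_id    int,
  company_id int,
  rating     double,
  PRIMARY KEY (user_id, company_id)
);
```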
Running tests:
- Run `collabfilter.CollabFilterCassandraDriver.main` or the `CollabFilterTest` unit test.
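For orientation, a sketch of the kind of MLlib call the driver makes, assuming the Spark 1.1.0 Java API; the rank, iteration count, lambda, and sample data below are placeholders, not the project's actual settings:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("collabfilter").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In the real driver the ratings come from Cassandra; here we
        // just fake a tiny JavaRDD<Rating> of (user, company, rating).
        JavaRDD<Rating> training = sc.parallelize(Arrays.asList(
                new Rating(1, 1, 5.0), new Rating(1, 2, 1.0), new Rating(2, 1, 4.0)));

        int rank = 10;        // number of latent factors (placeholder)
        int iterations = 10;  // ALS iterations (placeholder)
        double lambda = 0.01; // regularization (placeholder)
        MatrixFactorizationModel model =
                ALS.train(JavaRDD.toRDD(training), rank, iterations, lambda);

        // Predict how user 2 would rate company 2.
        System.out.println(model.predict(2, 2));
        sc.stop();
    }
}
```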
More references:
- QuickStart has more on setup.
- You can find a collaborative filtering tutorial for Spark and a tutorial on the Spark-Cassandra Java connector, both of which I drew on.
- However, note that the example code in the Spark-Cassandra tutorial is outdated: the Java API class was moved to the `japi` subpackage.
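For the record, reading a Cassandra table through the `japi` entry point looks roughly like this (a sketch against the connector 1.1-era API; the keyspace and table names are placeholders):

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraRow;

public class JapiSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("japi-sketch").setMaster("local[*]")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Note the japi subpackage; older tutorials import
        // com.datastax.spark.connector.CassandraJavaUtil instead.
        JavaRDD<CassandraRow> rows =
                javaFunctions(sc).cassandraTable("some_keyspace", "some_table");
        System.out.println(rows.count());
        sc.stop();
    }
}
```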
- Guava version mismatch: the `pom.xml` specifies Guava 15 because the Guava 14 used by the Spark-Cassandra connector lacks methods that Spark expects from Guava 15 or above.
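The override in `pom.xml` is a standard Maven dependency pin along these lines (version per the note above):

```xml
<!-- Pin Guava so Spark sees the newer API it expects, overriding the
     Guava 14 pulled in transitively by the Spark-Cassandra connector. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>15.0</version>
</dependency>
```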