This repo contains the assignments and notes for the class Machine Learning at Scale. The assignments focus on implementing massively parallel processing (MPP) with the MapReduce framework. There are five assignments, with the following areas of focus:
- Assignment 1: Introduce the concept of parallel processing.
- Assignment 2: Use Hadoop MapReduce to implement the Naive Bayes algorithm.
- Assignment 3: Implement stateless algorithms in Spark to build stripes and an inverted index, and to compute similarity metrics.
- Assignment 4: Implement gradient descent to train a linear regression model using the PySpark RDD API.
- Assignment 5: Implement the PageRank algorithm using Spark on a cluster.
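
For readers new to the MapReduce pattern that the assignments build on, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the example. It is not taken from the assignments: the function names (`mapper`, `shuffle`, `reducer`, `map_reduce`) and the sample input are hypothetical, and it uses only the Python standard library rather than Hadoop or Spark.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map phase: emit a (word, 1) pair for each whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle phase: group all values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine the values for one key into a single count."""
    return key, sum(values)


def map_reduce(lines):
    """Run the three phases in sequence on an in-memory list of lines (hypothetical driver)."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Illustrative input only; not data from the course.
    print(map_reduce(["spark and hadoop", "hadoop at scale"]))
```

In a real cluster the mapper and reducer calls would run in parallel on separate partitions of the data, and the framework would handle the shuffle; this sketch only shows how the phases fit together.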