Survival prediction on the Titanic dataset using PySpark.
The dataset is from https://www.kaggle.com/c/titanic/data
This script trains a logistic regression model from the pyspark.ml package on this dataset and uses it to predict survival. The goal is to predict, for each passenger, whether he or she survived the Titanic tragedy, and to exercise the pipeline and feature functionality of pyspark.ml.
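As a rough illustration of the pipeline and feature functionality mentioned above, the sketch below chains a StringIndexer, a VectorAssembler and a LogisticRegression into one pyspark.ml Pipeline. It is not the script itself: the chosen feature columns, the "Sex_idx", "features" and "label" column names, maxIter=10, and the helper names build_pipeline and train_and_predict are all assumptions made for this example. It also assumes the Kaggle column names (Survived, Pclass, Sex, SibSp, Parch, Fare) and simply drops rows with missing feature values.

```python
# Assumed feature set for illustration; the actual script may use other columns.
RAW_COLS = ["Pclass", "SibSp", "Parch", "Fare", "Sex"]
FEATURE_COLS = ["Pclass", "SibSp", "Parch", "Fare", "Sex_idx"]


def build_pipeline(feature_cols=FEATURE_COLS, label_col="label"):
    """Return an unfitted Pipeline: index Sex, assemble features, fit LR.

    Must be called after a SparkContext exists, because pyspark.ml stages
    are backed by JVM objects.
    """
    # Imported here so this file can be loaded without Spark on the path.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Map the string column Sex to a numeric index column.
    sex_indexer = StringIndexer(inputCol="Sex", outputCol="Sex_idx")
    # Combine the numeric columns into the single vector column LR expects.
    assembler = VectorAssembler(inputCols=list(feature_cols),
                                outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col,
                            maxIter=10)
    return Pipeline(stages=[sex_indexer, assembler, lr])


def train_and_predict(train_df, test_df):
    """Fit on train_df and return test_df with a 'prediction' column added."""
    # VectorAssembler fails on nulls; dropping such rows is a simplification.
    train_clean = train_df.na.drop(subset=RAW_COLS)
    test_clean = test_df.na.drop(subset=RAW_COLS)
    # Cast the label to double, which LogisticRegression in Spark 1.6 expects.
    labelled = train_clean.withColumn(
        "label", train_clean["Survived"].cast("double"))
    model = build_pipeline().fit(labelled)
    return model.transform(test_clean)
```

Wrapping the stages in a Pipeline means the same indexing and assembling steps are applied to both the training data and the data being scored.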
The spark-csv package from Databricks is used to read the data from HDFS.
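A minimal sketch of such a read is shown below, assuming an existing SQLContext and a CSV file with a header row. The helper name read_titanic_csv and the HDFS path in the usage comment are hypothetical; the header and inferSchema options are the ones spark-csv provides for this purpose.

```python
# Data source name registered by the Databricks spark-csv package.
CSV_FORMAT = "com.databricks.spark.csv"


def read_titanic_csv(sql_context, path):
    """Load a CSV file into a DataFrame via spark-csv.

    sql_context: an existing pyspark.sql.SQLContext
    path: location of the file, for example an hdfs:// URI
    """
    return (sql_context.read
            .format(CSV_FORMAT)
            # Treat the first line as column names and guess column types.
            .options(header="true", inferSchema="true")
            .load(path))


# Usage (hypothetical path; requires the spark-csv package on the classpath,
# for example by starting Spark with
#   --packages com.databricks:spark-csv_2.10:1.5.0):
#   train_df = read_titanic_csv(sqlContext, "hdfs:///user/cloudera/titanic/train.csv")
```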
Spark version 1.6 is used, running on the Cloudera QuickStart virtual machine.