The main goal of this project is to explore the capabilities of Apache Spark. For this purpose, a dataset from the UCI Machine Learning Repository is used. The dataset consists of real estate valuation data from one of Taiwan's districts. It was used in the paper: Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
The provided notebook was run on Databricks. Data exploration is performed using both RDDs and DataFrames. After the initial data analysis, two regression models are fitted: first a simple linear regression, then a Random Forest regression model. In both cases, hyperparameters are tuned with grid search and 5-fold cross-validation.
Several Apache Spark modules are used throughout the notebook. As for the ML modeling, the Random Forest approach achieved better results than linear regression, which is attributed to nonlinearity in the data.