The goal of this project was to build a machine learning model that predicts, 15 minutes before the scheduled departure time, whether a flight will be delayed. We used multiple datasets comprising over 600 million records, which required a distributed framework for both data cleaning and model training. We used PySpark and MLlib for this.
Our final classifier was an ensemble of XGBoost, random forest, and logistic regression models. Its F1 score placed in the top 10% of our Berkeley class of roughly 30 teams.
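To illustrate the idea of combining three classifiers and scoring the result with F1, here is a minimal sketch in plain Python. It assumes a simple majority vote over binary predictions; this README does not state how the ensemble actually combined its models, so treat the voting rule as an assumption. The function names `majority_vote` and `f1_score` and the toy predictions are hypothetical and are not taken from the project code.

```python
from typing import List, Sequence


def majority_vote(*model_preds: Sequence[int]) -> List[int]:
    """Combine binary (0/1) predictions from several models by majority vote.

    Assumption: a simple unweighted vote. The real ensemble may have used
    a different rule (for example, averaged probabilities).
    """
    if not model_preds:
        raise ValueError("need predictions from at least one model")
    n = len(model_preds[0])
    if any(len(p) != n for p in model_preds):
        raise ValueError("all models must predict the same number of rows")
    # A flight is labeled delayed (1) when more than half of the models say so.
    return [int(2 * sum(votes) > len(model_preds)) for votes in zip(*model_preds)]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up predictions standing in for the three base models.
xgb_preds = [1, 0, 1, 1]
rf_preds = [1, 1, 0, 1]
lr_preds = [0, 0, 1, 1]
labels = [1, 0, 1, 1]

ensemble_preds = majority_vote(xgb_preds, rf_preds, lr_preds)
print(ensemble_preds, f1_score(labels, ensemble_preds))
```

With an odd number of models a binary majority vote never ties, which is one reason a three-model ensemble is convenient for this kind of rule.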
While this was a group project, all code in this repository is my own. The full repository, which includes my teammates' code, is private.
Our full results are available in the final report and presentation linked below.