This repository documents Team 3's participation in the mini-competition held at Data Science Retreat (Batch 25) on 2-4 February 2021. The team consisted of:
- Alberto Julián
- Gert-Jan Dobbelaere
- Sergio Vechi
This DSR mini-competition is based on a Kaggle competition which ran from Sep 30, 2015 to Dec 15, 2015:
https://www.kaggle.com/c/rossmann-store-sales/overview/
- Develop an end-to-end Data Science project
- Experience working as a Data Science team, sharing Python code and Jupyter notebooks through GitHub
- Present results to an audience
The Rossmann competition aims to predict sales for more than a thousand stores based on historical sales and additional information provided.
The competition is scored based on a composite of predictive accuracy, following a metric detailed below, and reproducibility.
Two CSV files were provided for training the models:
- train.csv
- store.csv
Both datasets are described in the EDA Jupyter notebook.
Additionally, a test dataset was provided to check the accuracy of the models. The holdout test period is from 2014-08-01 to 2015-07-31. The holdout test dataset has the same format as train.csv, and is called holdout.csv.
Apart from the aforementioned datasets, the following files have been created:
- data_cleaning_rossman.py: performs the data cleaning of the datasets
- feature_eng.py: performs the feature engineering of the cleaned datasets
- utils.py: plots sales for a selection of stores, grouped by month, day of the week, or week of the year
- EDA_rossman.ipynb: Exploratory Data Analysis of the datasets
- pipeline.ipynb: walks through the stages implemented in the Python files: data cleaning, feature engineering and modelling
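The groupings mentioned for utils.py (month, day of the week, week of the year) can be illustrated with a small standard-library sketch. This is not the code in utils.py: the function name `mean_sales_by`, the `(date, sales)` row layout and the sample figures are assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import date


def mean_sales_by(rows, key):
    """Average sales per calendar bucket (hypothetical helper).

    rows: iterable of (iso_date_string, sales) pairs.
    key: "month", "day_of_week" (1 = Monday) or "week_of_year" (ISO week).
    """
    extractors = {
        "month": lambda d: d.month,
        "day_of_week": lambda d: d.isoweekday(),
        "week_of_year": lambda d: d.isocalendar()[1],
    }
    extract = extractors[key]

    totals = defaultdict(float)
    counts = defaultdict(int)
    for iso_date, sales in rows:
        bucket = extract(date.fromisoformat(iso_date))
        totals[bucket] += sales
        counts[bucket] += 1
    return {bucket: totals[bucket] / counts[bucket] for bucket in totals}


# Made-up sales figures for one store, for illustration only.
sample_rows = [
    ("2015-07-27", 6000.0),  # Monday
    ("2015-07-31", 5000.0),  # Friday
    ("2015-08-03", 7000.0),  # Monday
]
by_weekday = mean_sales_by(sample_rows, "day_of_week")
by_month = mean_sales_by(sample_rows, "month")
```

The resulting dictionaries map each bucket (for example weekday 1 for Monday) to its mean sales, which is the kind of aggregate a per-month or per-weekday plot would show.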
Open a terminal, create and activate a conda environment, clone the repository, install the dependencies and start Jupyter:
conda create --name ROSSMANN_SALES python=3.7
conda activate ROSSMANN_SALES
git clone https://github.com/albertojulian/rossman-sales-pred
cd rossman-sales-pred
pip install -r requirements.txt
jupyter notebook
The repository contents should open in the browser. You can now run any of the Jupyter notebooks mentioned above.
The task is to predict the Sales of a given store on a given day.
Submissions are evaluated on the root mean square percentage error (RMSPE):
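import numpy as np

def metric(preds, actuals):
    # Flatten both arrays so they are compared element by element
    preds = preds.reshape(-1)
    actuals = actuals.reshape(-1)
    assert preds.shape == actuals.shape
    # RMSPE in percent: 100 * sqrt(mean(((actuals - preds) / actuals) ** 2))
    return 100 * np.linalg.norm((actuals - preds) / actuals) / np.sqrt(preds.shape[0])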
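To make the arithmetic behind RMSPE explicit, here is a dependency-free sketch that computes the same quantity with plain Python lists. The function name `rmspe` and the `skip_zero_actuals` flag are assumptions for illustration and are not part of the metric above; the flag simply drops rows whose actual value is 0, where the percentage error is undefined.

```python
import math


def rmspe(preds, actuals, skip_zero_actuals=True):
    """Root mean square percentage error, expressed in percent.

    skip_zero_actuals is a hypothetical flag: when True, pairs whose
    actual value is 0 are ignored so the division is always defined.
    When False, a zero actual raises ZeroDivisionError.
    """
    if len(preds) != len(actuals):
        raise ValueError("preds and actuals must have the same length")

    pairs = [(p, a) for p, a in zip(preds, actuals)
             if not (skip_zero_actuals and a == 0)]
    if not pairs:
        raise ValueError("no rows left to score")

    # Squared relative error for each remaining store-day
    squared_pct_errors = [((a - p) / a) ** 2 for p, a in pairs]
    return 100 * math.sqrt(sum(squared_pct_errors) / len(squared_pct_errors))


# Two store-days, each predicted 10% away from the actual sales.
example_score = rmspe(preds=[110.0, 90.0], actuals=[100.0, 100.0])
print(f"RMSPE: {example_score:.2f}%")
```

Since both predictions are off by 10% of the actual value, the score comes out at roughly 10, which gives a feel for the scale of the metric: it reads as a typical percentage deviation per store-day.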