akb's Introduction

In this project, we tackle the Kaggle Rossmann Store Sales challenge. The goal is to predict the Sales of a given store on a given day. Submissions are evaluated on the root mean square percentage error (RMSPE).
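As a minimal sketch of the metric, RMSPE is the square root of the mean of the squared relative errors. The function name `rmspe` and the sample numbers are illustrative assumptions, not taken from the repo. Days with zero sales are skipped here, as in the Kaggle evaluation, because the relative error is undefined for them.

```python
import numpy as np


def rmspe(y_true, y_pred):
    """Root mean square percentage error, ignoring rows where y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # relative error is undefined for zero sales
    pct_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_error ** 2)))


# Illustrative values only: two open days and one day with zero sales.
actual = [100.0, 200.0, 0.0]
predicted = [110.0, 180.0, 50.0]
score = rmspe(actual, predicted)
print(f"RMSPE: {score:.3f}")
```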

Data Files: The training data consists of two CSV files, store.csv and train.csv; a third file, holdout.csv, is held back for evaluation.

  • train.csv holds the sales info per day for each store.
  • store.csv holds info about each store.
  • holdout.csv holds "unseen" data that the model is evaluated on.
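The per-day sales table and the per-store table share a `Store` column, so they can be combined into one modelling table. The sketch below uses tiny in-memory stand-ins rather than the real files; the column values are made up, and joining on `Store` with a left merge is an assumption about how the two tables fit together, not a description of what main.py does.

```python
import pandas as pd

# Stand-in for the per-day sales table (in practice loaded with pd.read_csv).
daily_sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2014-01-01", "2014-01-02", "2014-01-01"],
    "Sales": [5263, 6064, 8314],
})

# Stand-in for the per-store table (one row per store).
store_info = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["a", "c"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Left join keeps every sales row; validate guards against duplicate stores.
merged = daily_sales.merge(store_info, on="Store", how="left",
                           validate="many_to_one")
print(merged)
```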

Script Files: The repo contains main.py, which runs the whole workflow from step one to the end. The script can be run right after cloning, since all the data it uses is in the repo. By default, the hyperparameter tuning section is commented out because it takes a long time to complete. The script can be run on its own, and its last printout is the RMSPE for the predictions on the holdout set.

The function.py file contains utility functions that are called in main.py.

The rossman_model.sav file contains the pickled, hyperparameter-tuned model with the lowest RMSPE.
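A pickled model is restored with `pickle.load`. The sketch below round-trips a scikit-learn `DummyRegressor` through a temporary file that merely reuses the name rossman_model.sav; the stand-in model and the temporary path are assumptions for illustration, and the real file in the repo holds a different estimator. Pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyRegressor

# Hypothetical stand-in for the tuned model: always predicts the training mean.
stand_in = DummyRegressor(strategy="mean").fit([[0], [1]], [10.0, 20.0])

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "rossman_model.sav"
    with open(model_path, "wb") as fh:
        pickle.dump(stand_in, fh)
    # Loading works the same way for any pickled estimator.
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)

prediction = model.predict([[5]])
print(prediction)
```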

The single steps of training the model are as follows:

  1. Exploring data: EDA and visualization

  2. Cleaning data:

    • drop data with no store
    • drop data with no DayOfWeek
    • drop data when the store is NOT open
    • drop data where promo is NaN
    • drop SchoolHoliday data where promo is NaN
    • drop parameters that don't seem useful:
      ('CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval')
    • drop all rows with NaNs - Approximately 3% of rows
    • convert columns to int where necessary
  3. Encoding:

    • add Month as dummies
    • add a feature for scaled CompetitionDistance
    • convert DayOfWeek to dummies
    • convert StateHoliday to dummies
    • convert StoreType to dummies
    • convert Assortment to dummies
  4. Looking at the correlations of Sales with the other parameters. Sales correlate significantly with:

    • Customers
    • DayOfWeek
    • StateHoliday
    • StoreType
    • Scaled CompetitionDistance etc.
  5. Test / Train Split

  6. Baseline Model and RandomForestRegression

  7. Feature Selection and Engineering

  8. Training models via Pipelines (RandomForestRegression, KNR and XGBoost Regressors)

  9. Hyperparameter Tuning of the best model from step 8
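The cleaning, encoding, split, baseline and random-forest steps above can be sketched end to end on synthetic data. Everything below is a hedged illustration under stated assumptions: the helper names (`clean`, `encode`, `rmspe`), the synthetic rows, the `Date` column used to derive Month, the max-based scaling of CompetitionDistance, the mean-sales baseline, and the forest settings are not taken from main.py or function.py.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

UNUSED = ["CompetitionOpenSinceMonth", "CompetitionOpenSinceYear",
          "Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]
CATEGORICAL = ["Month", "DayOfWeek", "StateHoliday", "StoreType", "Assortment"]


def rmspe(y_true, y_pred):
    """Root mean square percentage error, ignoring rows with zero sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2)))


def clean(df):
    """Step 2: drop incomplete rows, closed days and unused columns."""
    df = df.dropna(subset=["Store", "DayOfWeek", "Promo"])
    df = df[df["Open"] == 1]
    df = df.drop(columns=UNUSED, errors="ignore").dropna()
    return df.astype({"Store": int, "DayOfWeek": int, "Promo": int})


def encode(df):
    """Step 3: add Month, scale CompetitionDistance, one-hot encode."""
    df = df.copy()
    df["Month"] = pd.to_datetime(df["Date"]).dt.month
    # Assumption: divide by the maximum; the README does not name a scaler.
    df["CompetitionDistanceScaled"] = (
        df["CompetitionDistance"] / df["CompetitionDistance"].max())
    df = df.drop(columns=["Date"])
    return pd.get_dummies(df, columns=CATEGORICAL, dtype=int)


# Synthetic stand-in data: sales depend mostly on whether a promo is running.
rng = np.random.default_rng(0)
n = 400
raw = pd.DataFrame({
    "Store": rng.integers(1, 6, n).astype(float),
    "DayOfWeek": rng.integers(1, 8, n).astype(float),
    "Date": pd.date_range("2014-01-01", periods=n, freq="D"),
    "Open": 1.0,
    "Promo": rng.integers(0, 2, n).astype(float),
    "StateHoliday": "0",
    "StoreType": rng.choice(list("abcd"), n),
    "Assortment": rng.choice(list("abc"), n),
    "CompetitionDistance": rng.uniform(100, 5000, n),
    "PromoInterval": np.nan,
})
raw["Sales"] = 4000 + 1500 * raw["Promo"] + rng.normal(0, 200, n)
raw.loc[:9, "Promo"] = np.nan  # ten rows with a missing promo flag

cleaned = clean(raw)
data = encode(cleaned)
X, y = data.drop(columns=["Sales"]), data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 5-6: baseline that always predicts the mean training sales.
baseline_score = rmspe(y_test, np.full(len(y_test), y_train.mean()))

# Step 8: a pipeline with a scaler and a random forest.
model = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)
forest_score = rmspe(y_test, model.predict(X_test))

print(f"baseline RMSPE: {baseline_score:.3f}")
print(f"forest RMSPE:   {forest_score:.3f}")
```

Wrapping the estimator in a `Pipeline` keeps preprocessing and model together, so the same object can later be passed to a hyperparameter search or pickled as a single unit.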
