Code Monkey home page Code Monkey logo

kaggle-liberty's Introduction

Summary

  • This contest asks participants to predict a continuous 'Hazard' score for home inspections using 32 anonymized variables. Some variables are continuous, while others are categorical.
  • This repository provides not a specific solution, but a framework for implementing your own. Specifically, it stresses fast iteration and model composability. Machine learning models rarely stand alone: a diverse set of data, models, and model hyperparameters is almost always advised in a competition setting.
  • The basic paradigm for creating a solution is: Dataset + Model = Prediction.
    • A Dataset accepts a data_func, which returns X_train, y_train, and X_test sets - a pd.DataFrame, pd.Series, and pd.DataFrame, respectively.
    • A Model is assumed to be a scikit-learn regression model, exposing fit and predict methods.
    • A Prediction optionally accepts a TargetTransform object which, for example, may take the square root of our continuous target before the model is fit, then square the model's final predictions - and a FeatureSelector object which, for example, may remove several low-variance features in each cross-validation fold before fitting, then remove the union-set of these removed features before the final model is fit.

To compose your own solution

These objects made be composed as follows:

prediction = Prediction(
    dataset=Dataset(func=data_v9)
    model=xgb.XGBRegressor(**params),
    base_path=PREDICTION_PATH,
    target_transform=TrimOutliersTargetTransform,
    save=True
)
prediction.cross_validate()

Calling cross_validate on the Prediction object will compute out-of-folds predictions, leaderboard predictions, gini scores in each fold, and much more. This data is then persisted, such that if an identical Prediction is ever created, the computation will be skipped. "Never delete a model," they say.

Finally, predictions can be composed together with an Ensemble object, which combines the leaderboard predictions (and, seperately, out-of-folds predictions, for cross-validation) by means of a user-defined funtion, such as np.mean.

Some things I tried

  • Datasets:
    • Some continuous features varied sinusoidally with respect to the target, so I tried binning them. This is more important in linear models than it is in tree-based models.
    • Replacing categorical feature values with count of occurences.
    • "Leave-one-out encoding" of categorical features.
    • 2-way feature interactions.
  • Models:
    • XGBoostRegressor, ExtraTreesRegressor, Ridge Regression, Bayesian Ridge Regression, Linear Regression, GradientBoostingRegressor. As per often, XGBoostRegressor performed best.
  • Target transforms:
    • Square root transform, target transform, trim outliers transform.
  • Final ensemble:
    • In this contest, I trained thousands of models. I then selected only the most performant models - based on their out-of-folds gini score - giving me roughly 20.
    • I then used the out-of-folds predictions of each of these models as feature of a new model, called a "stacker." I then trained several hundred stackers.
    • To select my final stackers, I computed a correlation matrix between the out-of-folds predictions of each, and selected the top ~15 that were, on average, least correlated with the rest. Finally, I took the mean of the predictions of each of these stackers to create my final ensemble.

kaggle-liberty's People

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.