Code Monkey home page Code Monkey logo

capteam-iglo-titanic-ml's Introduction

Capteam Iglo - Towards a Unitary Titanic Death Determinant Theory

Code for Kaggle competition Titanic: Machine Learning from Disaster See report.pdf for more details.

Pipeline

Pipeline

How to run

  • Make sure you have Python 3.x.
  • Install the requirements from the requirements.txt file with pip3.
  • Run python3 main.py (trains everything)
  • Run python3 main_with_model_loading.py (uses pretrained models)

Settings

  • Select models to use: In main.py or main_with_model_loading.py, change contents of models list at bottom of file.
  • Select training mode (main_with_model_loading.py only): set self.training_mode to False to use pretrained models, set self.training_mode to True to train the models on the data, and save the model instance if the performance is better than of the currently saved model instance.

Feature interpretations

General

Below some points which were found by running some models (RF, SVM, GRBT, all with Gridsearch) with various features selected.

  • If a feature depicts categories, or the values are not a linear sequence or something -> Use one-hot encoding. E.g. the Title feature: Mrs category (ID 0) compared to the Master category (ID 1) is not better, worse, or logically seen before or after the Master category (with ID 1), so it is better to split the categories into different features with one-hot encoding.
  • If the feature depicts a linear increase/decrease in something, or it is a logical sequence, one-hot encoding (or binning?) seems unnecessary. E.g. FamilySize doesn't need one-hot encoding.
  • To capture interplay between features, some features can be binned (such as Age) to combine them with other features (such as Age*PClass), which may give a increased performance.
  • Simplifying features which indirectly capture the same feature to one feature can give better results by reducing the bias towards this certain feature, e.g.: FamilySize, FamilySize_cat, SibSp, Parch, IsAlone. These features can all be captured in the single feature FamilySize.
  • Outlier detection seems to help increase performance (atleast for KNN)

Model / feature specific

  • As KNN is of limited complexity it does not cope well with too much features.
  • KNN seems to work much best with binned features, E.g. replacing Age_cat, Fare_cat, etc for their non-binned version reduces test set performance by ~10%.
  • Bayes seems similar to KNN, doesn't like too many features
  • MLP, ExtraTrees seem to like as many features as you can throw at it, the more (good features) the better
  • One-hot encoding the Decks seems to lead to overfitting -> Train score higher, test score lower (tested on RF, SVM, GRBT). Probably because of lots of missing data
  • Same problem for Namelength as with Deck

capteam-iglo-titanic-ml's People

Contributors

dropiex avatar iglohut avatar luukru avatar thaije avatar zeppdev avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.