
Predict survival on the Titanic

About

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, I look at what sorts of people were likely to survive. In particular, I use the tools of machine learning to predict which passengers survived the tragedy.

This project stems from a Kaggle competition. Details are available here.

Models

My first thought was to use age, gender, and passenger class as factors in deciding the outcome of survival. Since many of the lifeboats were loaded "women and children first," it seems reasonable that age and gender would be important factors. Passenger class seemed relevant since first-class passengers were closer to the lifeboats than third-class passengers. Also, the movie Titanic seemed to indicate that third-class passengers were initially prevented from boarding while first-class passengers boarded (though who knows if this is accurate).

Building a Logistic Model from scratch

I decided first to try logistic regression. Having just completed Andrew Ng's Coursera course on machine learning, I programmed the algorithm by hand. To see this, run ./ManualModel.R grad_descent, which uses gradient descent to optimize the cost function. I also added an option to use R's built-in function optimizer instead: running ./ManualModel.R optim will use the built-in optim function.
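As a rough illustration of what that entails, the core of a hand-rolled logistic regression looks something like the sketch below (function and variable names here are illustrative, not the actual internals of ManualModel.R):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Cross-entropy cost for parameters theta, design matrix X, labels y in {0, 1}
cost <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  -mean(y * log(h) + (1 - y) * log(1 - h))
}

# Gradient of the cost: (1/m) * X'(h - y)
grad <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  t(X) %*% (h - y) / nrow(X)
}

# Plain batch gradient descent with a fixed learning rate alpha
grad_descent <- function(X, y, alpha = 0.1, iters = 5000) {
  theta <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    theta <- theta - alpha * grad(theta, X, y)
  }
  theta
}
```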

These two optimization methods produce different results. The built-in optim function performs slightly better (accurately predicting 80.93% of the test data, where the gradient descent method got 77.67%). I am not sure whether there is an error in my gradient descent, or whether optim simply finds a better solution.
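For comparison, handing the same cost and gradient to the built-in optimizer is only a few lines (again a sketch, reusing the cost and grad functions above):

```r
# BFGS uses the analytic gradient; optim forwards X and y through to cost/grad
theta0 <- rep(0, ncol(X))
fit <- optim(par = theta0, fn = cost, gr = grad, X = X, y = y, method = "BFGS")
theta_hat <- fit$par
```

One plausible explanation for the gap: optim's quasi-Newton methods choose their step sizes by line search, whereas a fixed-step gradient descent may stop short of the optimum.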

Regularization and Learning Curves

Next I played around with regularization. It was simple to add a regularization term to the cost function. However, increasing lambda (the amount of regularization) seemed to make the performance worse. This is not surprising: we are using a linear decision boundary, so this model is likely under-fitting the data. To verify this I wrote LearningCurve.R, which confirmed that the model is under-fitting the data (see Graphs/LearningCurves.pdf). Hence regularization won't help, as it is only useful when a model is over-fitting the data.
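The standard L2 term from the course adds lambda/(2m) times the sum of the squared weights (excluding the intercept) to the cost. A sketch, assuming the sigmoid function above (cost_reg is an illustrative name):

```r
# Regularized cross-entropy cost; theta[1] (the intercept) is not penalized
cost_reg <- function(theta, X, y, lambda) {
  m <- nrow(X)
  h <- sigmoid(X %*% theta)
  -mean(y * log(h) + (1 - y) * log(1 - h)) +
    (lambda / (2 * m)) * sum(theta[-1]^2)
}
```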

Using Training Packages (such as caret)

After this I decided to try some of the built-in training functions via caret (since I don't want to program all my machine learning algorithms from scratch). Running this against the test data set gave the same results as the manual method with optim. To run this, simply run ./RModels.R logistic omit. The omit means that any entries with missing data are omitted.
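A caret-based logistic fit amounts to something like the following (the column names match the Kaggle data; the data frame names are illustrative, not RModels.R's actual code):

```r
library(caret)

# method = "glm" with a binomial family is caret's wrapper for logistic regression
fit <- train(factor(Survived) ~ Age + Sex + Pclass,
             data      = train_data,   # training frame (illustrative name)
             method    = "glm",
             family    = binomial,
             na.action = na.omit)      # the "omit" option: drop incomplete rows

pred <- predict(fit, newdata = test_data)
```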

Missing Data - Imputing Age

Up to this point, I had simply ignored missing data. Any entry missing the age, gender, or passenger class was ignored. Perhaps my model would perform better if I imputed the age.

We do have more information - perhaps some will help predict age for the missing data. We have names for all the passengers, and these names include a title (such as "Mr.", "Miss", and so forth). For the passengers with known ages, I grouped them by title and computed the average age for each title group. Then, for each passenger with a missing age, I replaced their age with the appropriate title group's mean age.
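A sketch of that imputation (the column names follow the Kaggle data; the regex and helper name are illustrative and ignore edge cases such as titles that never appear with a known age):

```r
# Pull the title out of names formatted like "Braund, Mr. Owen Harris"
extract_title <- function(name) sub(".*, *([^.]+)\\..*", "\\1", name)

train_data$Title <- extract_title(train_data$Name)

# Mean age per title among passengers whose age is known
title_means <- tapply(train_data$Age, train_data$Title, mean, na.rm = TRUE)

# Fill each missing age with the mean age for that passenger's title group
missing <- is.na(train_data$Age)
train_data$Age[missing] <- title_means[train_data$Title[missing]]
```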

Running ./RModels.R logistic impute_age will impute the age data. However, in the end this did not make much of a difference, with an accuracy of 79.1% on the test set.
