data-3402-project's Introduction

Spaceship Titanic

This repository holds an attempt at the Spaceship Titanic Kaggle challenge.

image

Overview

  • The aim of the Spaceship Titanic challenge is to help rescue crews retrieve the lost passengers by predicting which passengers were transported by the anomaly, using records recovered from the spaceship’s damaged computer system.
  • We performed feature engineering and data visualization, and used three classifiers: logistic regression, a decision tree classifier, and a random forest classifier.
  • Our best model predicted which passengers were transported with about 80% accuracy.

Summary of Work Done

Data

  • Data:
    • The data is provided as CSV files.
    • It consists of training data and test data listing the passengers.
    • The size of the data is 1.24 MB.

train.csv:

  • PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
  • HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
  • CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
  • Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
  • Destination - The planet the passenger will be debarking to.
  • Age - The age of the passenger.
  • VIP - Whether the passenger has paid for special VIP service during the voyage.
  • RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
  • Name - The first and last names of the passenger.
  • Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
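The PassengerId and Cabin fields pack several values into one string, so they can be split into separate features. The snippet below is a minimal sketch of that idea, not the notebook's actual code; the helper names `split_passenger_id` and `split_cabin` and the sample values are hypothetical.

```python
from typing import Tuple


def split_passenger_id(passenger_id: str) -> Tuple[str, int]:
    """Split an Id of the form gggg_pp into (group, number within group)."""
    group, number = passenger_id.split("_")
    return group, int(number)


def split_cabin(cabin: str) -> Tuple[str, int, str]:
    """Split a cabin of the form deck/num/side into (deck, num, side)."""
    deck, num, side = cabin.split("/")
    return deck, int(num), side


# Illustrative values only, written to match the formats described above.
print(split_passenger_id("0001_01"))  # ('0001', 1)
print(split_cabin("B/0/P"))           # ('B', 0, 'P')
```

Keeping the group as a string preserves its leading zeros, which matters if it is later used to match passengers travelling together.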

test.csv:

Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

Preprocessing / Clean up

First, we plotted a few histograms to better understand the data. We then handled the null values, using SimpleImputer from sklearn.impute as part of feature engineering, and converted the categorical and boolean values into integer values.
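As a rough illustration of this clean-up step, the sketch below fills null values with SimpleImputer and then converts the boolean and categorical columns to integers. The tiny data frame is made up, and the imputation strategies (median for numeric columns, most frequent for the others) and the category-code encoding are assumptions; the notebook may use different choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows using column names from train.csv; the values are illustrative.
df = pd.DataFrame({
    # Kept as object dtype so the imputer sees plain Python values.
    "HomePlanet": pd.Series(["Earth", "Europa", np.nan, "Earth"], dtype=object),
    "CryoSleep": pd.Series([True, False, np.nan, False], dtype=object),
    "Age": [24.0, np.nan, 58.0, 33.0],
    "RoomService": [0.0, 109.0, np.nan, 43.0],
})

num_cols = ["Age", "RoomService"]
cat_cols = ["HomePlanet", "CryoSleep"]

# Assumed strategies: median for numeric columns, most frequent for the rest.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Convert the boolean column to 0/1 and the categorical column to integer codes.
df["CryoSleep"] = df["CryoSleep"].astype(int)
df["HomePlanet"] = df["HomePlanet"].astype("category").cat.codes.astype(int)

print(df)
```

Integer codes impose an arbitrary order on the planets, which tree-based models tolerate well; one-hot encoding is a common alternative for linear models.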

image

Figure 1: Number of null values for each variable.

image

Figure 2: Converting to integer columns.

Data Visualization

image

Figure 3: Passengers by home planet and whether or not they were transported.

image

Figure 4: Representation of the age of the passengers.

image

Figure 5: Outliers in the RoomService column are visible here.

Problem Formulation

  • Define:
    • Input / Output: The input is the passenger records and the output is the Transported value. The model is trained on the training data and then given the test data, for which it predicts which passengers were transported.
    • Models: I used logistic regression, a decision tree classifier, and a random forest classifier. The random forest classifier gave the best results.

    Logistic regression: Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

    DecisionTreeClassifier: A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes and leaf nodes.

    RandomForestClassifier: A random forest can be used for regression or classification problems. The algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample.
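To show how the three classifiers can be compared, here is a minimal sketch that fits each one and scores it on a held-out split. It uses synthetic data from `make_classification` as a stand-in for the cleaned passenger features, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions rather than the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features and the binary Transported target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}

# Fit each model on the training split and record validation accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scores on this synthetic data say nothing about the competition results; the point is only the shared fit, predict, and score loop.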

Training

  • Describe the training:
    • How you trained: I imported the classifiers from sklearn and used pandas, matplotlib and seaborn to organize and arrange the data.
    • Training took about 15-20 minutes.
    • We ran into no difficulties.

Logistic Regression

image

Decision Tree Classifier

image

Random Forest Classifier

image

Performance Comparison

As mentioned earlier, the random forest classifier performed the best of the three models.

image

Conclusions

  • The random forest classifier worked better than logistic regression and the decision tree classifier.

Future Work

  • In the future, I can learn to build neural network models or use deep learning techniques. I can also try different techniques to understand which features are most important and how to clean up the data better.

How to reproduce results

  • I used Google Colab because this was a group project with a classmate and Colab let us collaborate. We used Python with the sklearn, pandas, numpy, matplotlib and seaborn libraries.
  • Run the commands below to get the required libraries and modules.

image

  • Follow the steps in the Data Visualisation, Feature Engineering, Cleaning data, and Model Training sections of FinalExamProject.ipynb to get the required results.

Overview of files in repository

  • FinalExamProject.ipynb: file with all the code for training models and testing accuracy.
  • FinalEXamProject.py : file with all the code for training models and testing accuracy.
  • test.csv: CSV file of test data
  • train.csv: CSV file for training data
  • submission.csv: CSV file of predictions to submit

Software Setup

Python packages: numpy, pandas, math, sklearn, seaborn, matplotlib.pyplot, xgboost, lightgbm, joblib, keras

To install seaborn in Jupyter: pip install seaborn

Data

The data can be downloaded here: https://www.kaggle.com/competitions/spaceship-titanic/data

Contributors

aarti-darji
