
Drive & Predict

Photo by Pang Yuhao on Unsplash

Description

The study aims to predict distinct metrics and categories from driving data. How useful would it be to tell you whether you are a good or bad driver based on your driving data? Or to warn drivers to slow down during turns or to increase their braking distance?

Such a model could have multiple applications: insurance rating/pricing, driver rating systems, on-board warning systems, or companies wanting to track their fleet drivers. Some companies (Uber, Lyft, Progressive) are already building such models from their own private data.

As the study requires a lot of driving data, it has been difficult to find suitable publicly available datasets. The first step of this project was originally to build an app that would collect the data needed to train the model. Unfortunately I ran out of time and haven't been able to build it; I hope to do so in the near future.

During this study, we split the work into small, feasible tasks and look at behaviours such as braking, starting, and turning. To do so, I built a data-utils library under models/data.py containing functions that extract events (braking, acceleration, turning) and functions that compute metrics around those events. The rest of the code lives under the models module, which provides further functionality.
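
For illustration only, an event-extraction helper of this kind might look like the sketch below; the function name, column names, and deceleration threshold are assumptions, not the actual models/data.py API.

```python
import pandas as pd

def extract_braking_events(trip: pd.DataFrame, threshold: float = -2.0) -> pd.DataFrame:
    """Return the samples whose longitudinal deceleration is below the threshold (m/s^2)."""
    # Approximate acceleration from consecutive speed/time samples.
    accel = trip["speed_mps"].diff() / trip["time_s"].diff()
    return trip[accel < threshold]
```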

The main Jupyter notebook is available at notebook.ipynb. It details every step taken during the project.

Installation

To run the Jupyter notebook, you first need to download the dataset from https://www.kaggle.com/vitorrf/cartripsdatamining/downloads/cartripsdatamining.zip/1 and place it under the project's data folder (the dataset is not included in the repo because it is too big).
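
A minimal sketch of unpacking the downloaded archive into the data folder; the archive filename is taken from the download URL above and may differ on your machine.

```python
import zipfile
from pathlib import Path

archive = Path("cartripsdatamining.zip")   # as downloaded from Kaggle
Path("data").mkdir(exist_ok=True)
with zipfile.ZipFile(archive) as zf:
    zf.extractall("data")                  # unpack the 38 trip CSV files

print(sorted(p.name for p in Path("data").glob("*.csv"))[:3])  # quick sanity check
```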

Dataset Description

The original dataset is a zip file containing 38 CSV files corresponding to 38 car trips (~30 minutes each). A loading sketch follows the column list below.

Contents of CSV files:

  • Column 1: Time (in seconds)
  • Column 2: Vehicle’s speed (in m/s)
  • Column 3: Shift number (0 = intermediate position)
  • Column 4: Engine Load (% of max power)
  • Column 5: Total Acceleration (m/s^2)
  • Column 6: Engine RPM
  • Column 7: Pitch
  • Column 8: Lateral Acceleration (m/s^2)
  • Column 9: Passenger count (0 - 5)
  • Column 10: Car’s load (0 - 10)
  • Column 11: Air conditioning status (0 - 4)
  • Column 12: Window opening (0 - 10)
  • Column 13: Radio volume (0 - 10)
  • Column 14: Rain intensity (0 - 10)
  • Column 15: Visibility (0 - 10)
  • Column 16: Driver’s wellbeing (0 - 10)
  • Column 17: Driver’s rush (0 - 10)
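
As a hedged example, one trip file could be loaded with readable column names matching the list above; the filename and the assumption that the CSVs have no header row are illustrative.

```python
import pandas as pd

COLUMNS = [
    "time_s", "speed_mps", "shift", "engine_load_pct", "total_accel_mps2",
    "engine_rpm", "pitch", "lateral_accel_mps2", "passenger_count", "car_load",
    "air_conditioning", "window_opening", "radio_volume", "rain_intensity",
    "visibility", "driver_wellbeing", "driver_rush",
]

# Filename is illustrative; adjust to the actual files extracted from the archive.
trip = pd.read_csv("data/trip01.csv", header=None, names=COLUMNS)
print(trip.describe())
```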

Exploratory Data Analysis

The EDA was conducted in three distinct steps:

  • Loading the datasets and checking the structure, feature types and null values.
  • Looking at feature distribution (looking for outliers) and any feature correlations.
  • Extracting the driving events (braking, acceleration, turning), calculating the metrics around those events, and then plotting the metrics against our target value driver_rush.

Of the three event types, the most notable is braking. Even though the absolute numbers are small, the ratio of harsh braking to total braking events is five times higher when the driver is in a rush than when they are not. The other events (accelerations and turnings) did not show a real difference between rush and no rush when looking at the harsh-acceleration and harsh-turning ratios.
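
For illustration, the kind of ratio compared above can be computed as follows; the -4 m/s^2 cut-off and the column names are assumptions, not the notebook's actual values.

```python
import pandas as pd

def harsh_braking_ratio(trip: pd.DataFrame, harsh_threshold: float = -4.0) -> float:
    """Share of braking samples whose deceleration is below the harsh threshold."""
    accel = trip["speed_mps"].diff() / trip["time_s"].diff()
    braking = accel[accel < 0]
    return float((braking < harsh_threshold).mean()) if not braking.empty else 0.0

# Compare the ratio for trips labelled as "in a rush" vs. "not in a rush".
```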

Figures: harsh braking and harsh acceleration ratios (rush vs. not in a rush).

The EDA also highlights that the observations are not conclusive. It would have been helpful to have more data, especially data coming from distinct drivers. Here we only have measurements for one driver and one car, which is obviously not ideal.

Modelling

This step consists of finding the best model to predict our target value driver_rush. We took a pragmatic approach to the modelling phase and followed the steps below:

  • Select a list of well-known classification models
  • Run a baseline model for each of our pre-selected models (see the sketch after the model list below)
  • Pick the top models based on precision, recall, and F1-score
  • Tune the hyper-parameters of the top models using cross-validation and watch the scores increase

The list of pre-selected classification models is:

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • AdaBoost
  • K Nearest Neighbours
  • XGBoost
  • SVC (Support Vector Classification)
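
The baseline pass might look like the sketch below; the synthetic features stand in for the event metrics and the binary driver_rush target built during the EDA, so the numbers will differ from the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Placeholder features/labels standing in for the event metrics and driver_rush target.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "K Nearest Neighbours": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
    "SVC": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```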

Out of our baseline modelling runs, and after reducing the task to a binary classification, the best-performing models were the XGBoost Classifier, the Decision Tree Classifier, and Logistic Regression.

Figures: XGBoost Classifier, Logistic Regression, and Decision Tree Classifier results.

After dealing with the imbalanced target classes, the top performer turned out to be the XGBoost Classifier. Surprisingly, the model that benefited the most from the balanced training dataset was Random Forest, so I also kept it for the next step of the modelling phase.
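
The exact balancing technique used in the notebook isn't described here; one common option, shown purely as an assumption, is to reweight the classes inside the classifier rather than resampling the data.

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" reweights samples inversely to class frequency;
# for XGBoost, the scale_pos_weight parameter plays a similar role on binary targets.
balanced_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
```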

For each selected classifier (Random Forest and XGBoost), we ran a grid search with cross-validation to determine the best hyper-parameters.
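
A minimal sketch of that step for the Random Forest, reusing X_train and y_train from the baseline sketch above; the parameter grid and scoring choice are illustrative, not the notebook's actual settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```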

In the end, the best-performing classifier was the Random Forest, with an overall mean validation score of 0.707.

Conclusion

  • Random Forest was the best-performing classifier to predict the driver rush indicator.
  • Surprisingly, acceleration and turning events did not turn out to be useful for predicting our target value. This observation definitely needs further study, as we currently lack data.
  • Next steps include gathering more data from different drivers and finding a way to label those data. I would look at building an app for this and asking friends to participate.
  • It would also be interesting to train a neural network and see whether it can outperform the Random Forest.
