Photo by Pang Yuhao on Unsplash
The study aims to predict distinct metrics and categories from driving data. How useful would it be to know whether you are a good or bad driver based on your driving data? Or to be told to slow down during turns, or to increase your braking distance?
Such a model could have multiple applications: insurance rating and pricing, driver rating systems, on-board warning systems, and fleet driver tracking. Some companies (Uber, Lyft, Progressive) already use their own private data to build such models.
Because the study requires a lot of driving data, it was difficult to find available datasets. The original first step of this project was to build an app to collect the data needed to build the model. Unfortunately, I ran out of time and haven't been able to build it. I hope to do so in the near future.
During this study, we split the work into small, feasible tasks. We look at behaviours such as braking, starts and turning. To do so, I built a data utilities library under `models/data.py` containing functions that extract events (such as braking, acceleration, turning) and functions that calculate metrics around those events. The rest of the code can be found under the `models` module, which provides more functionality.
The main Jupyter notebook is available at `notebook.ipynb`. It details every step taken during the project.
To run the Jupyter notebook, you first need to download the following dataset https://www.kaggle.com/vitorrf/cartripsdatamining/downloads/cartripsdatamining.zip/1 into the project's `data` folder (the dataset is not included in the repo as it is too big).
The original dataset is a zip file containing 38 CSV files corresponding to 38 car trips (~30 mins each).
Contents of CSV files:
- Column 1: Time (in seconds)
- Column 2: Vehicle’s speed (in m/s)
- Column 3: Shift number (0 = intermediate position)
- Column 4: Engine Load (% of max power)
- Column 5: Total Acceleration (m/s^2)
- Column 6: Engine RPM
- Column 7: Pitch
- Column 8: Lateral Acceleration (m/s^2)
- Column 9: Passenger count (0 - 5)
- Column 10: Car’s load (0 - 10)
- Column 11: Air conditioning status (0 - 4)
- Column 12: Window opening (0 - 10)
- Column 13: Radio volume (0 - 10)
- Column 14: Rain intensity (0 - 10)
- Column 15: Visibility (0 - 10)
- Column 16: Driver’s wellbeing (0 - 10)
- Column 17: Driver’s rush (0 - 10)
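As a minimal sketch of how one trip file could be read with these 17 columns, the snippet below uses only the standard library. The column names in `COLUMNS` and the `read_trip` helper are hypothetical (they are not taken from `models/data.py`), and the comma delimiter and possible absence of a header row are assumptions about the file layout. The two sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical column names, in the order listed above.
COLUMNS = [
    "time", "speed", "shift", "engine_load", "total_acceleration",
    "engine_rpm", "pitch", "lateral_acceleration", "passenger_count",
    "car_load", "air_conditioning", "window_opening", "radio_volume",
    "rain_intensity", "visibility", "driver_wellbeing", "driver_rush",
]


def read_trip(handle, delimiter=","):
    """Parse one trip into a list of dicts keyed by COLUMNS (values as floats)."""
    rows = []
    for record in csv.reader(handle, delimiter=delimiter):
        if len(record) != len(COLUMNS):
            continue  # skip blank or malformed lines
        try:
            values = [float(v) for v in record]
        except ValueError:
            continue  # skip a header row, if the file has one
        rows.append(dict(zip(COLUMNS, values)))
    return rows


# Two made-up rows standing in for a real trip file.
sample = io.StringIO(
    "0.0,12.5,3,40,0.2,2100,0.01,0.05,1,2,0,0,3,0,10,8,2\n"
    "1.0,12.9,3,42,0.4,2150,0.01,0.04,1,2,0,0,3,0,10,8,2\n"
)
trip = read_trip(sample)
```

With a real file, the same function can be called on `open(path, newline="")` instead of the in-memory sample.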
The EDA was conducted in 3 distinct steps:
- Loading the datasets and checking the structure, feature types and null values.
- Looking at feature distribution (looking for outliers) and any feature correlations.
- Extracting the driving events (braking, acceleration, turning), calculating the metrics around those events and then plotting the metrics against our target value `driver_rush`.
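The event-extraction step can be sketched as follows for braking. This is an illustration only, not the code in `models/data.py`: it derives deceleration from consecutive speed samples, and the function names and both thresholds (`min_decel`, `harsh_decel`) are assumptions chosen for the example. The speed series is made up.

```python
def braking_events(times, speeds, min_decel=1.0):
    """Group consecutive samples whose deceleration reaches min_decel (m/s^2).

    Returns a list of (start_time, end_time, peak_deceleration) tuples.
    """
    events = []
    current = None  # [start_time, end_time, peak] of the event in progress
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            continue  # ignore duplicated or out-of-order timestamps
        decel = (speeds[i - 1] - speeds[i]) / dt
        if decel >= min_decel:
            if current is None:
                current = [times[i - 1], times[i], decel]
            else:
                current[1] = times[i]
                current[2] = max(current[2], decel)
        elif current is not None:
            events.append(tuple(current))
            current = None
    if current is not None:
        events.append(tuple(current))
    return events


def harsh_ratio(events, harsh_decel=3.0):
    """Share of events whose peak deceleration reaches harsh_decel."""
    if not events:
        return 0.0
    harsh = sum(1 for _, _, peak in events if peak >= harsh_decel)
    return harsh / len(events)


# Made-up samples: time in seconds, speed in m/s.
times = [0, 1, 2, 3, 4, 5, 6, 7]
speeds = [20.0, 20.0, 18.5, 17.0, 17.0, 13.0, 9.5, 9.5]
events = braking_events(times, speeds)
ratio = harsh_ratio(events)
```

The same pattern would apply to acceleration (sign flipped) and to turning (using the lateral acceleration column), with thresholds tuned to the data.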
Of the 3 event types, the most notable is braking. Even though the numbers are not large, the ratio of harsh braking events to total braking events is five times higher when the driver is in a rush than when not. The other events (accelerations and turns) did not show a clear difference between rush and not rush when looking at the harsh acceleration ratio and the harsh turning ratio.
(Plots: harsh braking, harsh acceleration.)
The EDA also highlights that the observations are not conclusive. It would have been helpful to have more data, especially from distinct drivers. Here we only have driving measurements for one driver in one car, which is not ideal.
This step consists of finding the best model to predict our target value `driver_rush`.
We took a pragmatic approach to the modelling phase and followed the steps below:
- Select a list of known classification models
- Run a baseline model for each of our pre-selected models
- Pick the top models based on precision, recall and F1 scores
- Tune the hyper-parameters of the top model using cross-validation and watch the score increase
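The three metrics used to rank the baseline models can be computed as in the sketch below. It is written in plain Python to show the arithmetic; the helper name `precision_recall_f1` and the example labels are hypothetical and do not come from the notebook.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 1 = in a rush, 0 = not in a rush.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Looking at all three scores rather than accuracy alone matters when one class is much rarer than the other, since a model that always predicts the majority class can still reach a high accuracy.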
The list of pre-selected classification models is:
- Logistic Regression
- Decision Tree
- Random Forest
- AdaBoost
- K Nearest Neighbours
- XGBoost
- SVC (Support Vector Classification)
From our baseline runs, and after reducing the classification task to a binary classification, the best-performing models were the XGBoost Classifier, Decision Tree Classifier and Logistic Regression.
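Reducing the 0 to 10 `driver_rush` score to a binary target could look like the sketch below. The cut-off is not stated here, so `threshold=5` and the helper name `to_binary_rush` are assumptions made for illustration; the scores are made up.

```python
from collections import Counter


def to_binary_rush(driver_rush, threshold=5):
    """Map a 0-10 rush score to 1 (in a rush) or 0 (not in a rush)."""
    return 1 if driver_rush >= threshold else 0


# Made-up rush scores, one per observation.
scores = [0, 2, 7, 10, 4, 5, 1, 0]
labels = [to_binary_rush(s) for s in scores]

# Counting the classes shows how imbalanced the binary target is.
class_counts = Counter(labels)
```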
(Plots: XGBoost, Logistic Regression, Decision Tree.)
After dealing with our imbalanced target classes, our top performer was the XGBoost Classifier. Surprisingly, the model that benefited most from the balanced training dataset was Random Forest. As such, I kept it for the next step of the modelling phase.
For each selected classifier (Random Forest and XGBoost), we ran a grid search with cross-validation to determine the best hyper-parameters.
In the end, the best-performing classifier was the Random Forest, with an overall mean validation score of 0.707.
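The balancing technique is not named here, so the following is only one possible approach: random oversampling, which duplicates minority-class rows until every class has as many rows as the largest one. The function name `oversample_minority` and the toy data are hypothetical.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy training set: 8 "not rush" rows and 2 "rush" rows.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

Resampling of this kind should be applied to the training split only, so that validation scores are still measured on data with the original class proportions.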
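To make the mechanism concrete, here is a small plain-Python sketch of grid search with k-fold cross-validation: every parameter combination is scored on each fold and the combination with the best mean score wins. scikit-learn's `GridSearchCV` packages the same idea; whether the notebook uses it is not stated here. The tiny 1-D nearest-neighbour classifier is a stand-in for Random Forest or XGBoost, and all names and data below are hypothetical.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop


def grid_search_cv(fit, score, X, y, param_grid, k=3):
    """Return (best_params, best_mean_score) over all combinations in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, valid in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(
                score(model, [X[i] for i in valid], [y[i] for i in valid])
            )
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_knn(X_train, y_train, n_neighbors):
    """Stand-in model: majority vote among the n_neighbors closest 1-D points."""
    def predict(x):
        order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        nearest = order[:n_neighbors]
        votes = sum(y_train[i] for i in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def accuracy(model, X_valid, y_valid):
    """Fraction of validation samples the model labels correctly."""
    hits = sum(1 for x, t in zip(X_valid, y_valid) if model(x) == t)
    return hits / len(y_valid)


# Made-up data: one harsh-braking ratio per trip, 1 = in a rush.
X = [0.05, 0.40, 0.10, 0.35, 0.08, 0.45, 0.12, 0.38, 0.07]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]
best_params, best_score = grid_search_cv(
    fit_knn, accuracy, X, y, {"n_neighbors": [1, 3, 5]}, k=3
)
```

The folds here are contiguous and unshuffled for simplicity; with real trips it would be safer to shuffle or stratify so that each fold contains both classes.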
- Random Forest was the best-performing classifier to predict the driver rush indicator.
- Surprisingly, acceleration and turning events did not appear to help predict our target values. This observation definitely needs more study, as we lack data for now.
- Next steps include gathering more data from different drivers and finding a way to label that data. I would look at building an app that does that and ask friends to participate.
- It would also be interesting to train a neural network and see whether it can perform better than Random Forest.