An exploration of the world of interpretable machine learning_
Machine learning has become ubiquitous. Media coverage on ML's effectiveness proliferate, both in professional networks and popular media while ML experts continue to be in high demand. Online courses place education on popular ML methods at anyone's fingertips - from the basics to the cutting edge. And tools like scikit-learn or high-level interfaces to TensorFlow and PyTorch, make training a supervised model as easy as calling .fit(X, y)
.
Fortunately, there are many real-life problems that can be solved by doing what these methods are great at โ learning complex functions that map from an input to an output space. These machine learning applications create enormous value today and many quickly become engineering rather than data science problems.
On the other hand, there is a vast array of business questions that cannot be easily solved using the standard ML toolbox: Why did the model reject this applicant? What would have happened had we not contacted the customer? What's the root cause of this quality issue? By not addressing these types of questions, data scientists and ML engineers ignore a large fraction of potential beneficiaries of data-informed decisions.
While many of these questions might require vastly different approaches than those referenced above (in particular, for causal inference), interpretable machine learning is becoming increasingly important to improve our understanding of what our tried and tested methods actually learn.
Today, there are great resources for interpretable machine learning available online, such as the Interpretable Machine Learning Book and various open-source tools (Shap, Lime, eli5, interpret, stratx or scikit-learn's inspection module, just to name a few in the Python universe).
This project builds on top of these methods and implementations to stress test and compare them on real-world data sets. By doing so, I hope to create a resource that can facilitate our understanding of how common interpretability methods work, where they are helpful and where they might fail.
- Source: Kaggle
- Sales calls to insurance customers
- Binary classification: Did the customer end up buying a car insurance or not?
- A few, easy-to-understand features on the customers being called and previous interactions
You can create a new conda environment containing the required dependencies by running conda env create -f environment.yml
Before you run the app, make sure that:
- You have the dataset you would like to use available in
data/raw
- You have the
interpret
environment activated
Then, just run streamlit run src/app.py
from the root of the directory and the app will be served at localhost:8501
.