Interpretable Machine Learning 2020

Lecture notes for 'Interpretable Machine Learning' at WUT and UoW. Summer semester 2019/2020

This document: http://tiny.cc/IML2020

Slack for this course: http://iml2020workspace.slack.com

XAI stories ebook: https://github.com/pbiecek/xai_stories

Introduction

The course consists of a lecture, computer laboratory and project.

The course is elective. The rules of passing may seem non-standard. Make sure that you understand them to avoid unpleasant consequences. I believe that one of the most important skills in building ML/XAI models is flexibility and a proactive approach to the problem. In this course, the assessment criteria will strongly reward both flexibility and a proactive approach.

Design Principles

The design of this course is based on four principles:

  • Mixing experiences during studies is good. It allows you to generate more ideas. Also, in mixed groups, we can improve our communication skills.
  • In XAI, the interface and aesthetics of the solution are important. XAI, like HCI (Human-Computer Interaction) before it, is on the borderline between technical, domain and cognitive aspects. Therefore, apart from the purely technical descriptions, the results must be grounded in the domain and communicated aesthetically and legibly.
  • Communication of results is important. Both in science and business, it is very important to be able to present the results concisely and legibly. In this course, it should translate into the ability to describe one XAI story in the form of a short chapter/article.
  • It is worth doing useful things. Let's look for new applications for XAI methods discussed on typical predictive problems.

Meetings

Plan for the summer semester 2019/2020. WUT classes are on Thursdays, UoW classes are on Fridays. We will meet online here: meet.google.com/yfq-hckf-pgu.

  • 2020-02-27/28 -- Introduction
  • 2020-03-05/06 -- Break Down / SHAP. EMA chapter, paper shap, paper break down
  • 2020-03-12/13 -- [XAI stories: first meeting, groups are assembled]
  • 2020-03-19/20 -- LIME. EMA chapter, paper lime
  • 2020-03-26/27 -- Ceteris Paribus profiles / Partial Dependence profiles. EMA chapter, paper pdp/ale
  • 2020-04-02/03 -- [XAI stories: first version of the solution]
  • 2020-04-08/09 -- Interactive Explanatory Model Analysis - how instance level methods complement each other
  • 2020-04-16/17 -- Variable importance. EMA chapter, paper pvi
  • 2020-04-23/24 -- Discussions related to XAI chapters / interactive XAI
  • 2020-04-30 -- TBA
  • 2020-05-08 -- [XAI stories: second version of the solution] (both groups)
  • 2020-05-14/15 -- Model diagnostic plots. EMA chapter, paper auditor
  • 2020-05-21/22 -- student presentations
  • 2020-05-28/29 -- student presentations
  • 2020-06-04/05 -- [XAI stories: final version of the solution]

How to get a good grade

Across all activities, you can earn from 0 to 100 points. 51 points are needed to pass this course. There are three key components.

Chapter in the 'XAI stories' [0-60 points]

  • quality of trained predictive models [0-10 points]
  • quality of dataset level explanations [0-10 points]
  • quality of instance level explanations [0-10 points]
  • quality of the charts/visuals/diagrams [0-10 points]
  • the relevance of the example [0-10 points]
  • presentation of key results during the final meeting [0-10 points]

Presentation of a selected XAI related article [0-10 points]

Homework [0-30 points]

  • Homework 1 for 0-5 points: Train a predictive model for a selected ML problem (see issues). Submit a knitr/notebook script to GitHub (directory Homeworks/H1/FirstnameLastname). Deadline: 2020-03-12
  • Homework 2 for 0-5 points. Deadline: 2020-03-26
  • Homework 3 for 0-5 points. Deadline: 2020-04-09
  • Homework 4 for 0-5 points. Deadline: 2020-04-16
  • Homework 5 for 0-5 points. Deadline: 2020-05-04
  • Homework 6 for 0-5 points. Deadline: 2020-05-14
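
The point scheme in this section can be checked with a few lines of arithmetic. The sketch below is only an illustration: the component names, the helper functions and the example student are hypothetical, while the maxima (60, 10, 30) and the pass threshold (51) are taken from the text above.

```python
# Maximum points per component, as listed in this section.
MAX_POINTS = {
    "xai_story_chapter": 60,     # six criteria, 0-10 points each
    "article_presentation": 10,
    "homework": 30,              # six assignments, 0-5 points each
}
PASS_THRESHOLD = 51  # points needed to pass the course


def total_points(scores):
    """Sum the scores after checking each one against its component maximum."""
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name}: {value} is outside 0-{MAX_POINTS[name]}")
    return sum(scores.values())


def passes(scores):
    """Return True when the total reaches the pass threshold."""
    return total_points(scores) >= PASS_THRESHOLD


# Hypothetical student: solid chapter, no article presentation, partial homework.
example = {"xai_story_chapter": 40, "article_presentation": 0, "homework": 12}
print(total_points(example), passes(example))  # 52 True
```

The example shows that the chapter alone (at most 60 points) can carry a student over the threshold, but only if it scores well on most of its criteria.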

Presentations

Presentations can be prepared by one or two students. Each group should present a single XAI related paper (journal or conference). Each group should choose a different paper. Here are some suggestions.

Projects

Project proposals are described as issues in this repository. Each issue is a single problem in which you need to train a few predictive models and explain them. Among the issues, you will find applications in different areas: some concern medical data, some concern financial data.

Each group of students should choose one issue they want to solve. After consultation with the lecturer, you can also propose your own project. Projects should be solved in groups. The ideal group consists of three people, one student from each university (PW, US, SGH). Data Scientists from McKinsey will help us with these projects. More details about the rules of cooperation will be given during classes.

The project ends with a short article prepared in English and a short presentation summarizing the key results. The study will be available to the public in the form of an open gitbook.

See Limitations of Interpretable Machine Learning Methods as an example to follow. During this course, we are going to gather several use-cases/success stories for explainable machine learning.

Phase 1

After the first meeting, each group should:

  • know what problem they'll be working on,
  • know how to communicate with every team member (own Slack channel, something else?),
  • initially share/distribute work on (1) finding similar solutions in literature, (2) generating models, (3) generating explanations, (4) describing models and explanations,
  • establish an internal work schedule for the next meeting (already in 3 weeks).

Literature

The literature will be added on an ongoing basis.

interpretablemachinelearning2020's Issues

Dataset: stock emotions

Problem

This is a regression problem.

The goal is to identify which statements made during press conferences by the Chairman of the US Federal Reserve were associated with the strongest market reactions, and to extract keywords or topics from these statements.

The dataset consists of transcripts and recordings of the press conferences as well as high-frequency trading data for two major US indexes.

Data

The data will be provided to interested students.
This project will be conducted in collaboration with Michal Dzielinski (https://www.sbs.su.se/).

Dataset: COMPAS Recidivism Algorithm

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the COMPAS scores.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

COMPAS Recidivism Risk Scores are suspected of being biased https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm. Let's check this.
The data can be downloaded from the ProPublica repository.
https://github.com/propublica/compas-analysis

Example solution

An interesting description of the analysis performed can be found here:
https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb
and here
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

Dataset: House Sale Prices

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the house sale prices.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the Kaggle website or OpenML website.
https://www.kaggle.com/harlfoxem/housesalesprediction
https://www.openml.org/d/42079

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Dataset: Hotel booking demand

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict Average Daily Rate.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the Kaggle competition website.
https://www.kaggle.com/jessemostipak/hotel-booking-demand

Example solution

This example addresses classification rather than regression, but it is still useful: https://juliasilge.com/blog/hotels-recipes/

Dataset: Leukemia and gene expression

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the type of leukemia.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

Source: Molecular Classification of Cancer by Gene Expression Monitoring. Gene expression dataset (Golub et al.)
https://www.kaggle.com/crawford/gene-expression#data_set_ALL_AML_independent.csv
The original authors used the data to classify the type of cancer in each patient by gene expressions.

Note

Due to the number of features, this dataset will be more interesting for people who have some experience in med/bio applications.

Dataset: FICO

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the chances of default (credit risk scoring).
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the FICO competition website. The form for applying for access to data can be found at
https://community.fico.com/s/explainable-machine-learning-challenge?tabset-3158a=2

Example solution

Two interesting solutions in the FICO competition are described under the links

Dataset: Medical Expenditure Panel Survey

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the expenses related to medical treatments (pricing).
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the MEPS website https://meps.ahrq.gov/mepsweb/
MEPS is the most complete source of data on the cost and use of health care and health insurance coverage.

Example solution

An interesting use case related to this data is available on the AIX360 website:
https://nbviewer.jupyter.org/github/IBM/AIX360/blob/master/examples/tutorials/MEPS.ipynb#LinRR

Dataset: Lung cancer mortality

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the odds of survival after lung cancer surgery.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

Data will be provided upon request. The data describe lung cancer survival in the Polish population.

Dataset: risk of suspension of operations based on CEIDG

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the risk of suspension of operations for a company based on CEIDG data (Centralna Ewidencja i Informacja o Działalności Gospodarczej).
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data will be provided to interested students.
This project will be conducted in collaboration with Bartłomiej Karaban.

Example

See for example https://pl.wikipedia.org/wiki/Modele_oceny_zagro%C5%BCenia_upad%C5%82o%C5%9Bci%C4%85

Dataset: COVID-19

Problem

This is a classification problem.
On the basis of the available data, a predictive model for COVID-19 mortality should be built and explained.
The most basic model should use gender, age and country as predictive variables. For more advanced models, be creative.

Data

The data can be downloaded from the website (data on individual level) https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=429276722
Note that there may be a larger number of interesting sources of data.

Dataset: historical marketing campaign

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the purchase uplift of a marketing offer (uplift modelling).
The best models should be explained using XAI tools at the instance level and at the data set level.
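
Uplift is the change in purchase probability caused by the offer, so it is estimated by comparing treated and untreated customers rather than by predicting purchases directly. The sketch below illustrates the two-model idea in its simplest form, with a per-segment purchase rate standing in for each model. The toy records, the segment names and the function names are hypothetical and are not taken from the 'Information' package.

```python
from collections import defaultdict

# Hypothetical toy records: (segment, treated, purchased).
records = [
    ("new", 1, 1), ("new", 1, 1), ("new", 1, 0), ("new", 1, 0),
    ("new", 0, 0), ("new", 0, 1), ("new", 0, 0), ("new", 0, 0),
    ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 1, 0),
    ("loyal", 0, 1), ("loyal", 0, 1), ("loyal", 0, 1), ("loyal", 0, 0),
]


def purchase_rates(rows):
    """Purchase rate for every (segment, treated) combination."""
    counts = defaultdict(lambda: [0, 0])  # key -> [purchases, customers]
    for segment, treated, purchased in rows:
        counts[(segment, treated)][0] += purchased
        counts[(segment, treated)][1] += 1
    return {key: bought / total for key, (bought, total) in counts.items()}


def uplift_by_segment(rows):
    """Uplift = purchase rate with the offer minus purchase rate without it."""
    rates = purchase_rates(rows)
    segments = sorted({segment for segment, _, _ in rows})
    return {s: rates[(s, 1)] - rates[(s, 0)] for s in segments}


uplift = uplift_by_segment(records)
print(uplift)  # {'loyal': 0.0, 'new': 0.25}
```

In this toy data the "loyal" segment buys often but gains nothing from the offer, while the "new" segment responds to it. A purchase classifier alone would rank "loyal" customers highest, which is why uplift needs its own modelling and its own explanations.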

Data

Datasets: (1) 'train' and (2) 'valid' from the R package 'Information'

Example solution

Two interesting solutions for this dataset are described under the links
https://www.profit-analytics.com/examples/ch-4-uplift-examples/uplift-modeling-example-two-model-approach/
https://humboldt-wi.github.io/blog/research/theses/uplift_modeling_blogpost/

Additional learning materials and implementations:

R grf package Generalized Random Forests https://github.com/grf-labs/grf
R uplift package: https://cran.r-project.org/web/packages/uplift/index.html
R tools4uplift package: https://cran.r-project.org/web/packages/tools4uplift/index.html
R BART package, vignettes: https://rdrr.io/cran/BART/
Python: Microsoft ALICE https://github.com/microsoft/EconML
Python: Uber's CausalML https://github.com/uber/causalml

Dataset: Women's Shoe Prices

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict shoe prices.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the Kaggle website
https://www.kaggle.com/datafiniti/womens-shoes-prices

You will get a list of 10,000 women's shoes and the prices at which they are sold.
