Interpretable Machine Learning 2020

Lecture notes for 'Interpretable Machine Learning' at WUT and UoW. Summer semester 2019/2020

This document: http://tiny.cc/IML2020

Slack for this course: http://iml2020workspace.slack.com

XAI stories ebook: https://github.com/pbiecek/xai_stories

Introduction

The course consists of a lecture, computer laboratory and project.

The course is elective. The rules of passing may seem non-standard. Make sure that you understand them to avoid unpleasant consequences. I believe that one of the most important skills in building ML/XAI models is flexibility and a proactive approach to the problem. In this course, the assessment criteria will strongly reward both flexibility and a proactive approach.

Design Principles

The design of this course is based on four principles:

Mixing experiences during studies is good. It allows you to generate more ideas. Also, in mixed groups, we can improve our communication skills.
In XAI, the interface/aesthetics of the solution is important. XAI, like HCI (Human-Computer Interaction) before it, is on the borderline between technical, domain and cognitive aspects. Therefore, apart from the purely technical descriptions, the results must be grounded in the domain and communicated aesthetically and legibly.
Communication of results is important. Both in science and business, it is very important to be able to present the results concisely and legibly. In this course, it should translate into the ability to describe one XAI story in the form of a short chapter/article.
It is worth doing useful things. Let's look for new applications for XAI methods discussed on typical predictive problems.

Meetings

Plan for the summer semester 2019/2020. WUT classes are on Thursdays, UoW classes are on Fridays. We will meet online here: meet.google.com/yfq-hckf-pgu.

2020-02-27/28 -- Introduction
2020-03-05/06 -- Break Down / SHAP. EMA chapter, paper shap, paper break down
2020-03-12/13 -- [XAI stories: first meeting, groups are assembled]
2020-03-19/20 -- LIME. EMA chapter, paper lime
2020-03-26/27 -- Ceteris Paribus profiles / Partial Dependence profiles. EMA chapter, paper pdp/ale
2020-04-02/03 -- [XAI stories: first version of the solution]
2020-04-08/09 -- Interactive Explanatory Model Analysis - how instance level methods complement each other
2020-04-16/17 -- Variable importance. EMA chapter, paper pvi
2020-04-23/24 -- Discussions related to XAI chapters / interactive XAI
2020-04-30 -- TBA
2020-05-08 -- [XAI stories: second version of the solution] (both groups)
2020-05-14/15 -- Model diagnostic plots. EMA chapter, paper auditor
2020-05-21/22 -- student presentations
2020-05-28/29 -- student presentations
2020-06-04/05 -- [XAI stories: final version of the solution]

How to get a good grade

You can earn from 0 to 100 points across different activities. 51 points are needed to pass this course. There are three key components.

Chapter in the 'XAI stories' [0-60 points]

quality of trained predictive models [0-10 points]
quality of dataset level explanations [0-10 points]
quality of instance level explanations [0-10 points]
quality of the charts/visuals/diagrams [0-10 points]
the relevance of the example [0-10 points]
presentation of key results during the final meeting [0-10 points]

Presentation of a selected XAI related article [0-10 points]

Homework [0-30 points]

homework 1 for 0-5 points: Train a predictive model for a selected ML problem (see issues). Submit a knitr/notebook script to GitHub (directory Homeworks/H1/FirstnameLastname). Deadline: 2020-03-12
homework 2 for 0-5 points. Deadline: 2020-03-26
homework 3 for 0-5 points. Deadline: 2020-04-09
homework 4 for 0-5 points. Deadline: 2020-04-16
homework 5 for 0-5 points. Deadline: 2020-05-04
homework 6 for 0-5 points. Deadline: 2020-05-14
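
The point scheme above reduces to simple arithmetic. Here is a minimal sketch in Python. The function names `total_points` and `passes` are hypothetical; the component caps and the 51-point threshold are taken from the rules above.

```python
# Maximum points per component, as listed in the grading rules above.
MAX_POINTS = {"chapter": 60, "presentation": 10, "homework": 30}
PASS_THRESHOLD = 51  # points needed to pass the course


def total_points(chapter, presentation, homework):
    """Sum the three components, checking each against its cap."""
    scores = {"chapter": chapter, "presentation": presentation, "homework": homework}
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


def passes(chapter, presentation, homework):
    """Return True when the total reaches the pass threshold."""
    return total_points(chapter, presentation, homework) >= PASS_THRESHOLD


# Hypothetical student: 35 for the chapter, 6 for the presentation, 12 for homework.
print(total_points(35, 6, 12), passes(35, 6, 12))
```

Note that the three caps add up to exactly 100 points, so no single component is enough to pass on its own except the chapter.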

Presentations

Presentations can be prepared by one or two students. Each group should present a single XAI related paper (journal or conference). Each group should choose a different paper. Here are some suggestions.

Revealing the Dark Secrets of BERT (note that you need some knowledge of NLP to take this topic)
Explanation in Artificial Intelligence: Insights from the Social Sciences by Tim Miller
Explainable AI for Trees: From Local Explanations to Global Understanding by Scott Lundberg
One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques by Vijay Arya et al
iNNvestigate neural networks by Maximilian Alber et al
Veridical Data Science by Bin Yu at NIPS19
Agency + Automation: Designing Artificial Intelligence into Interactive Systems by Jeff Heer at NIPS19
other papers from this list https://github.com/pbiecek/xai_resources
Software: https://captum.ai/

Projects

Project proposals are described as issues in this repository. Each issue is a single problem in which you need to train a few predictive models and explain them. Among the issues, you will find applications in different areas: some concern medical data, some concern financial data.

Each group of students should choose one issue they want to solve. After consultation with the lecturer, you can also submit your own project. Projects should be solved in groups. The ideal group consists of three people, one student from each university (PW, US, SGH). Data Scientists from McKinsey will help us with these projects. More details about the rules of cooperation will be given during classes.

The project ends with a small article prepared in English and a short presentation summarizing the key results. The study will be available to the public in the form of open-gitbook.

See Limitations of Interpretable Machine Learning Methods as an example to follow. During this course, we are going to gather several use-cases/success stories for explainable machine learning.

Phase 1

After the first meeting, each group should:

know what problem they'll be working on,
know how to communicate with every team member (own Slack channel, something else?),
initially share/distribute work on (1) finding similar solutions in literature, (2) generating models, (3) generating explanations, (4) describing models and explanations,
establish an internal work schedule for the next meeting (already in 3 weeks).

Literature

The literature will be added on an ongoing basis.

interpretablemachinelearning2020's Issues

interesting materials for edition 20/21

https://www.youtube.com/playlist?list=PLWjm4hHpaNg6c-W7JjNYDEC_kJK9oSp0Y

Dataset: stock emotions

Problem

This is a regression problem.

The goal is to identify which statements made during press conferences by the Chairman of the US Federal Reserve were associated with the strongest market reactions and to extract keywords or topics from these statements.

The dataset consists of transcripts and recordings of the press conferences as well as high-frequency trading data for two major US indexes.

Data

The data will be provided to interested students.
This project will be conducted in collaboration with Michal Dzielinski (https://www.sbs.su.se/).

Dataset: COMPAS Recidivism Algorithm

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the COMPAS scores.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

COMPAS Recidivism Risk Scores are suspected of being biased https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm. Let's check this.
The data can be downloaded from the propublica website.
https://github.com/propublica/compas-analysis
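
One possible way to check the suspected bias is to compare error rates between groups, for example the false positive rate (people flagged as high risk who did not reoffend). This is a minimal sketch in Python, not the ProPublica code. The record layout `(group, predicted_high_risk, reoffended)`, the function name, and the toy records are all assumptions made for illustration; how a score is thresholded into "high risk" is left to the analyst.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, predicted_high_risk, reoffended) tuples."""
    false_pos = defaultdict(int)  # predicted high risk, did not reoffend
    negatives = defaultdict(int)  # everyone who did not reoffend
    for group, predicted_high_risk, reoffended in records:
        if not reoffended:
            negatives[group] += 1
            if predicted_high_risk:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}


# Made-up toy records, NOT taken from the COMPAS data.
toy = [
    ("A", True, False), ("A", False, False), ("A", True, True), ("A", True, False),
    ("B", False, False), ("B", False, False), ("B", True, True), ("B", True, False),
]
rates = false_positive_rate_by_group(toy)
print(rates)
```

A large gap between groups in such a table is a signal worth explaining with instance-level and dataset-level tools; it is not by itself proof of bias.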

Example solution

An interesting description of performed analysis can be found here:
https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb
and here
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

Dataset: House Sale Prices

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the house sale prices.
The best models should be explained using XAI tools at the instance level and at the data set level.
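
As an illustration of a dataset-level explanation, the sketch below computes permutation-based variable importance: shuffle one column, measure how much the loss grows, and repeat. It is a minimal pure-Python sketch under stated assumptions. `toy_model` is a hypothetical stand-in for a trained price model, the two columns (area and a noise column) are invented, and RMSE is just one possible loss.

```python
import random


def rmse(model, X, y):
    """Root mean squared error of model predictions on rows X."""
    errors = [(model(row) - target) ** 2 for row, target in zip(X, y)]
    return (sum(errors) / len(errors)) ** 0.5


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Increase in RMSE after shuffling each column, averaged over repeats."""
    rng = random.Random(seed)
    baseline = rmse(model, X, y)
    importance = {}
    for j in range(len(X[0])):
        losses = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            losses.append(rmse(model, X_perm, y))
        importance[j] = sum(losses) / n_repeats - baseline
    return importance


# Hypothetical stand-in for a trained price model: it uses the area
# (column 0) and ignores the noise column (column 1).
def toy_model(row):
    return 3000.0 * row[0]


data_rng = random.Random(1)
X = [[data_rng.uniform(50, 200), data_rng.uniform(0, 1)] for _ in range(200)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

The column the model ignores gets an importance of zero, while shuffling the area column increases the error.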

Data

The data can be downloaded from the Kaggle website or OpenML website.
https://www.kaggle.com/harlfoxem/housesalesprediction
https://www.openml.org/d/42079

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Dataset: Hotel booking demand

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict Average Daily Rate.
The best models should be explained using XAI tools at the instance level and at the data set level.
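
To make the two levels concrete: a Ceteris Paribus profile (instance level) varies one feature for a single observation while holding the others fixed, and a Partial Dependence profile (dataset level) averages such profiles over many observations. The sketch below is a minimal pure-Python illustration. `toy_adr_model` and the feature names `adults` and `is_summer` are hypothetical stand-ins, not a model fitted to the hotel data.

```python
def ceteris_paribus(model, instance, feature, grid):
    """Predictions for one instance as `feature` varies over `grid`,
    with all other features held fixed."""
    profile = []
    for value in grid:
        modified = dict(instance)
        modified[feature] = value
        profile.append((value, model(modified)))
    return profile


def partial_dependence(model, data, feature, grid):
    """Average of the ceteris paribus profiles over all rows in `data`."""
    profiles = [ceteris_paribus(model, row, feature, grid) for row in data]
    return [
        (value, sum(p[i][1] for p in profiles) / len(profiles))
        for i, value in enumerate(grid)
    ]


# Hypothetical stand-in for a trained Average Daily Rate model.
def toy_adr_model(row):
    return 50.0 + 10.0 * row["adults"] + (20.0 if row["is_summer"] else 0.0)


booking = {"adults": 2, "is_summer": True}
cp = ceteris_paribus(toy_adr_model, booking, "adults", [1, 2, 3])

data = [{"adults": 2, "is_summer": True}, {"adults": 1, "is_summer": False}]
pd_profile = partial_dependence(toy_adr_model, data, "adults", [1, 2, 3])
print(cp)
print(pd_profile)
```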

Data

The data can be downloaded from the Kaggle competition website.
https://www.kaggle.com/jessemostipak/hotel-booking-demand

Example solution

For classification, but still useful https://juliasilge.com/blog/hotels-recipes/

Dataset: Leukemia and gene expression

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the type of leukemia.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

Source: Molecular Classification of Cancer by Gene Expression Monitoring. Gene expression dataset (Golub et al.)
https://www.kaggle.com/crawford/gene-expression#data_set_ALL_AML_independent.csv
The original authors used the data to classify the type of cancer in each patient by gene expressions.

Note

Due to the number of features, this dataset will be more interesting for people who have some experience in med/bio applications.

Presentation [MIMUW] Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Paper: https://arxiv.org/abs/1711.11279

Preferred date: 29.05

Dataset: Mushroom Classification

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict whether a mushroom is safe to eat or poisonous.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the Kaggle or OpenML website.
https://www.kaggle.com/uciml/mushroom-classification
https://www.openml.org/d/24

Presentation [PW] "iNNvestigate neural networks"

Marika Partyka, Karolina Seweryn

Preferred date: 28.05.2020

Dataset: FICO

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the chances of default (credit risk scoring).
The best models should be explained using XAI tools at the instance level and at the data set level.
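
For an instance-level explanation with only a few features, Shapley-style attributions can be computed exactly by averaging each feature's marginal contribution over all feature orderings. The sketch below is a minimal pure-Python illustration under stated assumptions: `toy_risk`, the feature names, and the single reference row are hypothetical, and using one reference row is a simplification of methods that average over a background dataset.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley attributions for `instance` relative to `baseline`.

    Averages each feature's marginal contribution over all feature
    orderings, so it is only practical for a handful of features.
    """
    features = list(instance)
    contributions = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = instance[f]  # switch feature f to the instance value
            value = model(current)
            contributions[f] += value - previous
            previous = value
    return {f: total / len(orderings) for f, total in contributions.items()}


# Hypothetical scoring function with an interaction term.
def toy_risk(row):
    return (0.1 + 0.2 * row["late_payments"]
            + 0.1 * row["late_payments"] * row["high_utilization"])


applicant = {"late_payments": 1, "high_utilization": 1}
reference = {"late_payments": 0, "high_utilization": 0}
phi = shapley_values(toy_risk, applicant, reference)
print(phi)
```

The attributions sum to the difference between the prediction for the applicant and the prediction for the reference row, and the interaction effect is split evenly between the two features.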

Data

The data can be downloaded from the FICO competition website. The form for applying for access to data can be found at
https://community.fico.com/s/explainable-machine-learning-challenge?tabset-3158a=2

Example solution

Two interesting solutions in the FICO competition are described under the links

Presentation [UW]: Demystifying Black-box Models with Symbolic Metamodels

Team members: Miłosz Michta
Paper: https://papers.nips.cc/paper/9308-demystifying-black-box-models-with-symbolic-metamodels.pdf
Repository: https://github.com/ahmedmalaa/Symbolic-Metamodeling

Presentation [MIMUW] "Artificial Intelligence Confronts a 'Reproducibility' Crisis"

https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis/
Preferred date: 29.05

Dataset: Medical Expenditure Panel Survey

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the expenses related to medical treatments (pricing).
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the MEPS website https://meps.ahrq.gov/mepsweb/
MEPS is the most complete source of data on the cost and use of health care and health insurance coverage.

Example solution

An interesting use-case related to this data is available at AIX360 website
https://nbviewer.jupyter.org/github/IBM/AIX360/blob/master/examples/tutorials/MEPS.ipynb#LinRR

Dataset: Lung cancer mortality

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the odds of survival after lung cancer surgery.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

Data will be provided upon request. The data describe lung cancer survival in the Polish population.

Dataset: risk of suspension of operations based on CEIDG

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the risk of suspension of operations for a company based on CEIDG data (Centralna Ewidencja i Informacja o Działalności Gospodarczej).
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data will be provided to interested students.
This project will be conducted in collaboration with Bartłomiej Karaban.

Example

See for example https://pl.wikipedia.org/wiki/Modele_oceny_zagro%C5%BCenia_upad%C5%82o%C5%9Bci%C4%85

Presentation [PW] "AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models"

Katarzyna Koprowska, Katarzyna Lorenc
based on the paper: https://arxiv.org/pdf/1909.09251.pdf
preferred date: 28.05.2020
@pbiecek is this topic acceptable?

Presentation [PW] SurvLIME: A method for explaining machine learning survival models

Preferred date: 28.05

Dataset: COVID-19

Problem

This is a classification problem.
On the basis of available data one needs to build and explain a predictive model for mortality of COVID-19.
The most basic model shall use gender, age and country as predictive variables. The most advanced - be creative.

Data

The data can be downloaded from the website (data on individual level) https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=429276722
Note that there may be a larger number of interesting sources of data.

Presentation [PW] One Pixel Attack for Fooling Deep Neural Networks

Paulina Tomaszewska
date: 21 May

Dataset: historical marketing campaign

Problem

This is a binary classification problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict the purchase uplift of marketing offer (uplift modelling).
The best models should be explained using XAI tools at the instance level and at the data set level.
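
The idea behind uplift modelling can be sketched with the two-model approach: estimate the purchase probability separately for customers who received the offer and for those who did not, then take the difference. In the minimal pure-Python sketch below, per-segment purchase rates stand in for the two fitted models. The segment names, function names, and toy rows are assumptions made for illustration and are not taken from the 'Information' package.

```python
from collections import defaultdict


def response_rates(rows):
    """Purchase rate per segment; stands in for a fitted response model."""
    bought = defaultdict(int)
    total = defaultdict(int)
    for segment, purchased in rows:
        total[segment] += 1
        bought[segment] += int(purchased)
    return {s: bought[s] / total[s] for s in total}


def two_model_uplift(treated_rows, control_rows):
    """Uplift per segment = P(purchase | treated) - P(purchase | control)."""
    treated = response_rates(treated_rows)
    control = response_rates(control_rows)
    return {s: treated[s] - control[s] for s in treated if s in control}


# Made-up toy data: (segment, purchased) pairs.
treated_rows = [("young", 1), ("young", 1), ("young", 0), ("young", 0),
                ("old", 1), ("old", 0), ("old", 0), ("old", 0)]
control_rows = [("young", 1), ("young", 0), ("young", 0), ("young", 0),
                ("old", 1), ("old", 0), ("old", 0), ("old", 0)]
uplift = two_model_uplift(treated_rows, control_rows)
print(uplift)
```

In a real project the two rate tables would be replaced by two predictive models, and the explanation task is then to describe which variables drive the difference between their predictions.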

Data

Datasets: (1) 'train' and (2) 'valid' from R package named 'Information'

Example solution

Two interesting solutions for this dataset are described under the links
https://www.profit-analytics.com/examples/ch-4-uplift-examples/uplift-modeling-example-two-model-approach/
https://humboldt-wi.github.io/blog/research/theses/uplift_modeling_blogpost/

Additional learning materials and implementations:

R grf package Generalized Random Forests https://github.com/grf-labs/grf
R uplift package: https://cran.r-project.org/web/packages/uplift/index.html
R tools4uplift package: https://cran.r-project.org/web/packages/tools4uplift/index.html
R BART package, vignettes: https://rdrr.io/cran/BART/
Python: Microsoft ALICE https://github.com/microsoft/EconML
Python: Uber's CausalML https://github.com/uber/causalml

Presentation [MIMUW] "HELOC Applicant Risk Performance Evaluation by Topological Hierarchical Decomposition"

TK
Date: 29.05

Dataset: Women's Shoe Prices

Problem

This is a regression problem.
On the basis of historical data, models (of varying degrees of complexity) should be developed to predict shoe prices.
The best models should be explained using XAI tools at the instance level and at the data set level.

Data

The data can be downloaded from the Kaggle website
https://www.kaggle.com/datafiniti/womens-shoes-prices

You will get the list of 10,000 women's shoes and the prices at which they are sold.

Presentation [PW]: Teaching AI, Ethics, Law and Policy

Teaching AI, Ethics, Law and Policy, 2019, Asher Wilk, arxiv.org/abs/1904.12470

Paper summary by: Kazimierz Wojciechowski

Date: 28.05.2020
