Awesome Data Valuation

💱 A curated list of data valuation (DV) resources for designing your next data marketplace. DV aims to understand the value of a data point for a given machine learning task and is an essential primitive in the design of data marketplaces and explainable AI.

Legend

💻 Code available

🎥 Talk / Slides

What is your data worth?

Shapley Value & Cooperative Game Theory

Towards Efficient Data Valuation Based on the Shapley Value Ruoxi Jia & David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, Costas J. Spanos 2019
Summary Jia et al. (2019) contribute theoretical and practical results for efficient methods for approximating the Shapley value (SV). They show that methods with a sublinear number of model evaluations are possible and that further reductions can be made for sparse SVs. Lastly, they introduce two practical SV estimation methods for ML tasks, one for uniformly stable learning algorithms and one for smooth loss functions.
Bibtex
@inproceedings{jia2019towards,
title={Towards efficient data valuation based on the shapley value},
author={Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Hynes, Nick and G{\"u}rel, Nezihe Merve and Li, Bo and Zhang, Ce and Song, Dawn and Spanos, Costas J},
booktitle={The 22nd International Conference on Artificial Intelligence and Statistics},
pages={1167--1176},
year={2019},
organization={PMLR}
}
💻
Data Shapley: Equitable Valuation of Data for Machine Learning Amirata Ghorbani, James Zou 2019
Summary Ghorbani & Zou (2019) introduce the (data) Shapley value to equitably measure the value of each training point to a supervised learner's performance. They further outline several benefits of the Shapley value, e.g. being able to capture outliers or inform what new data to acquire, and develop Monte Carlo and gradient-based methods for its efficient estimation.
Bibtex
@inproceedings{ghorbani2019data,
title={Data shapley: Equitable valuation of data for machine learning},
author={Ghorbani, Amirata and Zou, James},
booktitle={International Conference on Machine Learning},
pages={2242--2251},
year={2019},
organization={PMLR}
}
💻
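The Monte Carlo estimation mentioned in the summary can be illustrated with plain permutation sampling. The following is a minimal sketch, not the authors' implementation: `monte_carlo_shapley` and `toy_utility` are hypothetical names, and the additive toy utility stands in for the much costlier train-and-evaluate step of a real learner.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled permutations of the training points."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix: List[int] = []
        previous = utility(prefix)  # utility of the empty set
        for i in indices:
            prefix.append(i)
            current = utility(prefix)
            values[i] += current - previous  # marginal contribution of point i
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical utility for illustration only: each point adds a fixed amount.
# In a real setting this would train a model on `subset` and score it.
def toy_utility(subset: Sequence[int]) -> float:
    weights = [1.0, 2.0, 3.0]
    return sum(weights[i] for i in subset)


estimates = monte_carlo_shapley(3, toy_utility, n_permutations=50)
print(estimates)
```

Because the toy utility is additive, every marginal contribution of a point equals its weight, so the estimate coincides with the exact Shapley value. With a real learner each call to `utility` means retraining, so the number of permutations directly controls the cost.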
A Distributional Framework for Data Valuation Amirata Ghorbani, Michael P. Kim, James Zou 2020
Summary Ghorbani et al. (2020) formulate the Shapley value as a distributional quantity in the context of an underlying data distribution instead of a fixed dataset. They further introduce a novel sampling-based algorithm for the distributional Shapley value with strong approximation guarantees.
Bibtex
@inproceedings{ghorbani2020distributional,
title={A Distributional Framework for Data Valuation},
author={Ghorbani, Amirata and Kim, Michael P. and Zou, James},
booktitle={International Conference on Machine Learning},
year={2020}
}
💻
Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability Christopher Frye, Colin Rowat, Ilya Feige 2020
Summary Frye et al. (2020) incorporate causality into the Shapley value framework. Importantly, their framework can handle any amount of causal knowledge and does not require the complete causal graph underlying the data.
Bibtex
@article{frye2020asymmetric,
title={Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability},
author={Frye, Christopher and Rowat, Colin and Feige, Ilya},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
🎥
Collaborative Machine Learning with Incentive-Aware Model Rewards Rachael Hwee Ling Sim, Yehong Zhang, Mun Choon Chan, Bryan Kian Hsiang Low 2020
Summary Sim et al. (2020) introduce a data valuation method with separate ML models as rewards based on the Shapley value and information gain on model parameters given its data. They further define several conditions for incentives such as Shapley fairness, stability, individual rationality, and group welfare, that are suitable for the freely replicable nature of their model reward scheme.
Bibtex
@inproceedings{sim2020collaborative,
title={Collaborative machine learning with incentive-aware model rewards},
author={Sim, Rachael Hwee Ling and Zhang, Yehong and Chan, Mun Choon and Low, Bryan Kian Hsiang},
booktitle={International Conference on Machine Learning},
pages={8927--8936},
year={2020},
organization={PMLR}
}
Validation free and replication robust volume-based data valuation Xinyi Xu, Zhaoxuan Wu, Chuan Sheng Foo, Bryan Kian Hsiang Low 2021
Summary Xu et al. (2021) propose using data diversity via robust volume for measuring the value of data. This removes the need for a validation set and allows for guarantees on replication robustness but suffers from the curse of dimensionality and may ignore useful information in the validation set.
Bibtex
@article{xu2021validation,
title={Validation free and replication robust volume-based data valuation},
author={Xu, Xinyi and Wu, Zhaoxuan and Foo, Chuan Sheng and Low, Bryan Kian Hsiang},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
💻
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning Yongchan Kwon, James Zou 2021
Summary Kwon & Zou (2021) introduce Beta Shapley, a generalization of Data Shapley obtained by relaxing the efficiency axiom.
Bibtex
@article{kwon2021beta,
title={Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning},
author={Kwon, Yongchan and Zou, James},
journal={arXiv preprint arXiv:2110.14049},
year={2021}
}
Gradient-Driven Rewards to Guarantee Fairness in Collaborative Machine Learning Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, Bryan Kian Hsiang Low 2021
Summary Xu et al. (2021) propose the cosine gradient Shapley value to fairly evaluate the expected contribution of each agent's update in the federated learning setting, removing the need for an auxiliary validation dataset. They further introduce a novel training-time gradient reward mechanism with a fairness guarantee.
Bibtex
@article{xu2021gradient,
title={Gradient driven rewards to guarantee fairness in collaborative machine learning},
author={Xu, Xinyi and Lyu, Lingjuan and Ma, Xingjun and Miao, Chenglin and Foo, Chuan Sheng and Low, Bryan Kian Hsiang},
journal={Advances in Neural Information Processing Systems},
volume={34},
pages={16104--16117},
year={2021}
}
Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning Tianhao Wang, Yu Yang, Ruoxi Jia 2022
Summary Wang et al. (2022) propose a general framework to improve the effectiveness of sampling-based Shapley value (SV) or least core (LC) estimation heuristics. They propose learning to predict the performance of a learning algorithm (denoted data utility learning) and using this predictor to estimate learning performance without retraining, which makes SV and LC estimation cheaper.
Bibtex
@article{wang2021improving,
title={Improving cooperative game theory-based data valuation via data utility learning},
author={Wang, Tianhao and Yang, Yu and Jia, Ruoxi},
journal={arXiv preprint arXiv:2107.06336},
year={2021}
}

Efficient algorithms

Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas J. Spanos, Dawn Song 2019
Summary Jia et al. (2019) present algorithms to compute the Shapley value exactly in quasi-linear time and approximations in sublinear time for k-nearest-neighbor models. They empirically evaluate their algorithms at scale and extend them to several other settings.
Bibtex
@article{jia12efficient,
title={Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms},
author={Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Gurel, Nezihe Merve and Li, Bo and Zhang, Ce and Spanos, Costas J. and Song, Dawn},
journal={Proceedings of the VLDB Endowment},
volume={12},
number={11},
year={2019}
}
💻
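The exact computation for k-nearest-neighbor models described in the summary can be sketched for a single test point. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the unweighted KNN classification utility (the fraction of the K nearest neighbors whose label matches the test label) and uses the sort-then-recurse scheme, so the cost is dominated by the sort. The function name and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def knn_shapley_single_test(
    x_train: Sequence[Sequence[float]],
    y_train: Sequence[int],
    x_test: Sequence[float],
    y_test: int,
    k: int,
) -> List[float]:
    """Exact Shapley value of every training point for one test point,
    assuming the unweighted KNN utility."""
    n = len(x_train)
    # Sort training indices from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: math.dist(x_train[i], x_test))
    match = [1.0 if y_train[i] == y_test else 0.0 for i in order]

    s = [0.0] * n
    # The farthest point is handled directly ...
    s[n - 1] = match[n - 1] / n
    # ... and the remaining values follow by recursing towards the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at position `pos`
        s[pos] = s[pos + 1] + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank

    # Map values from sorted order back to the original training indices.
    values = [0.0] * n
    for pos, idx in enumerate(order):
        values[idx] = s[pos]
    return values


# Toy 1-D data (hypothetical): only the second point shares the test label.
x_train = [(0.0,), (1.0,), (3.0,)]
y_train = [0, 1, 0]
values = knn_shapley_single_test(x_train, y_train, x_test=(0.2,), y_test=1, k=1)
print(values)
```

Values for a whole test set would be obtained by averaging the per-test-point results, since the utility is an average over test points.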
Efficient computation and analysis of distributional Shapley values Yongchan Kwon, Manuel A. Rivas, James Zou 2021
Summary Kwon et al. (2021) develop tractable analytic expressions for the distributional data Shapley value for linear regression, binary classification, and non-parametric density estimation as well as new efficient methods for its estimation.
Bibtex
@inproceedings{kwon2021efficient,
title={Efficient computation and analysis of distributional Shapley values},
author={Kwon, Yongchan and Rivas, Manuel A and Zou, James},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={793--801},
year={2021},
organization={PMLR}
}
💻

Benchmarks, Criticism & Relaxations

Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification? Ruoxi Jia, Fan Wu, Xuehui Sun, Jiacen Xu, David Dao, Bhavya Kailkhura, Ce Zhang, Bo Li, Dawn Song 2021
Summary Jia et al. (2021) perform a theoretical analysis on the differences between leave-one-out-based and Shapley value-based methods as well as an empirical study across several ML tasks investigating the two aforementioned methods as well as exact Shapley value-based methods and Shapley over KNN Surrogates.
Bibtex
@misc{jia2021scalability,
title={Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?},
author={Ruoxi Jia and Fan Wu and Xuehui Sun and Jiacen Xu and David Dao and Bhavya Kailkhura and Ce Zhang and Bo Li and Dawn Song},
year={2021},
eprint={1911.07128},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
💻
Shapley values for feature selection: The good, the bad, and the axioms Daniel Fryer, Inga Strümke, Hien Nguyen 2021
Summary Fryer et al. (2021) call into question the appropriateness of using the Shapley value for feature selection and advise caution against the magical thinking that presenting its abstract general axioms as "favourable and fair" may introduce. They further point out that the four axioms of "efficiency", "null player", "symmetry", and "additivity" do not guarantee that the Shapley value is suited to feature selection and may sometimes even imply the opposite.
Bibtex
@misc{fryer2021shapley,
title={Shapley values for feature selection: The good, the bad, and the axioms},
author={Daniel Fryer and Inga Strümke and Hien Nguyen},
year={2021},
eprint={2102.10936},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Influence functions & LOO

Understanding Black-box Predictions via Influence Functions Pang Wei Koh, Percy Liang 2017
Summary Koh & Liang (2017) introduce the use of influence functions, a technique borrowed from robust statistics, to identify training points most responsible for a model's given prediction without needing to retrain. They further develop a simple and efficient implementation of influence functions that scales to large ML settings.
Bibtex
@inproceedings{koh2017understanding,
title={Understanding black-box predictions via influence functions},
author={Koh, Pang Wei and Liang, Percy},
booktitle={International Conference on Machine Learning},
pages={1885--1894},
year={2017},
organization={PMLR}
}
💻 🎥
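The idea of estimating a training point's effect without retraining can be sketched on a deliberately tiny model. The example below assumes a scalar least-squares model through the origin (an assumption made purely so the Hessian is a single number) and uses the standard first-order approximation: the influence of upweighting a training point z on the test loss is minus the test-loss gradient times the inverse Hessian times the gradient at z, and removing z corresponds to an upweighting of -1/n. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y)


def fit(data: List[Point]) -> float:
    """Closed-form minimizer of the mean of 0.5 * (theta * x - y)^2."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    return sxy / sxx


def loss(theta: float, point: Point) -> float:
    x, y = point
    return 0.5 * (theta * x - y) ** 2


def grad(theta: float, point: Point) -> float:
    """Derivative of the per-point loss with respect to theta."""
    x, y = point
    return (theta * x - y) * x


def influence_up_loss(data: List[Point], train_point: Point, test_point: Point) -> float:
    """Effect on the test loss of infinitesimally upweighting `train_point`."""
    theta = fit(data)
    hessian = sum(x * x for x, _ in data) / len(data)  # second derivative of the mean loss
    return -grad(theta, test_point) * grad(theta, train_point) / hessian


def predicted_change_if_removed(data: List[Point], train_point: Point, test_point: Point) -> float:
    """First-order estimate of the test-loss change after deleting `train_point`."""
    return -influence_up_loss(data, train_point, test_point) / len(data)


# Toy data (hypothetical) and a comparison against actual leave-one-out retraining.
data = [(1.0 + 0.1 * i, 2.0 * (1.0 + 0.1 * i) + (0.3 if i % 2 else -0.3)) for i in range(40)]
test_point = (2.0, 4.5)
removed = data[7]
predicted = predicted_change_if_removed(data, removed, test_point)
remaining = data[:7] + data[8:]
actual = loss(fit(remaining), test_point) - loss(fit(data), test_point)
print(predicted, actual)
```

The printed pair contrasts the influence-based prediction with the change observed after retraining without the point; the two differ by higher-order terms that shrink as the dataset grows.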
On the accuracy of influence functions for measuring group effects Pang Wei Koh*, Kai-Siang Ang*, Hubert H. K. Teo*, and Percy Liang 2019
Summary Koh et al. (2019) study influence functions to measure the effects of large groups of training points instead of individual points. They empirically find that predicted and actual effects are correlated, with the predictions often underestimating the actual effects, and theoretically show that this need not hold in general, realistic settings.
Bibtex
@article{koh2019accuracy,
title={On the accuracy of influence functions for measuring group effects},
author={Koh, Pang Wei and Ang, Kai-Siang and Teo, Hubert HK and Liang, Percy},
journal={arXiv preprint arXiv:1905.13289},
year={2019}
}
💻 🎥

Reinforcement Learning

Data Valuation using Reinforcement Learning Jinsung Yoon, Sercan Ö Arık, Tomas Pfister 2020
Summary Yoon et al. (2020) propose using reinforcement learning for data valuation to learn data values jointly with the predictor model.
Bibtex
@inproceedings{49189,
title={Data Valuation using Reinforcement Learning},
author={Jinsung Yoon and Sercan Arik and Tomas Pfister},
year={2020}
}
💻 🎥

Deep Neural Networks

DAVINZ: Data Valuation using Deep Neural Networks at Initialization Zhaoxuan Wu, Yao Shu, Bryan Kian Hsiang Low 2022
Summary Wu et al. (2022) introduce a validation-based and training-free method for efficient data valuation with large and complex deep neural networks (DNNs). They derive and exploit a domain-aware generalization bound for DNNs to characterize their performance without training and use this bound as the scoring function while keeping conventional techniques such as Shapley values as the valuation function.
Bibtex
@inproceedings{wu2022davinz,
title={DAVINZ: Data Valuation using Deep Neural Networks at Initialization},
author={Wu, Zhaoxuan and Shu, Yao and Low, Bryan Kian Hsiang},
booktitle={International Conference on Machine Learning},
pages={24150--24176},
year={2022},
organization={PMLR}
}
🎥

Surveys

Data Valuation in Machine Learning: “Ingredients”, Strategies, and Open Challenges Rachael Hwee Ling Sim*, Xinyi Xu*, Bryan Kian Hsiang Low 2022
Summary Sim et al. (2022) present a technical survey of data valuation and its "ingredients" and properties. The paper outlines common desiderata as well as some open research challenges.
Bibtex
@inproceedings{sim2022data,
title={Data valuation in machine learning: ``ingredients'', strategies, and open challenges},
author={Sim, Rachael Hwee Ling and Xu, Xinyi and Low, Bryan Kian Hsiang},
booktitle={Proc. IJCAI},
year={2022}
}
🎥

Designing data marketplaces

Data market system designs

A demonstration of sterling: a privacy-preserving data marketplace Nick Hynes, David Dao, David Yan, Raymond Cheng, Dawn Song 2018
Bibtex
@article{hynes2018demonstration,
title={A Demonstration of Sterling: A Privacy-Preserving Data Marketplace},
author={Hynes, Nick and Dao, David and Yan, David and Cheng, Raymond and Song, Dawn},
journal={Proceedings of the VLDB Endowment},
volume={11},
number={12},
year={2018}
}
DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang 2018
Bibtex
@article{dao2018databright,
title={Databright: Towards a global exchange for decentralized data ownership and trusted computation},
author={Dao, David and Alistarh, Dan and Musat, Claudiu and Zhang, Ce},
journal={arXiv preprint arXiv:1802.04780},
year={2018}
}
A Marketplace for Data: An Algorithmic Solution Anish Agarwal, Munther Dahleh, Tuhin Sarkar 2019
Bibtex
@inproceedings{agarwal2019marketplace,
title={A marketplace for data: An algorithmic solution},
author={Agarwal, Anish and Dahleh, Munther and Sarkar, Tuhin},
booktitle={Proceedings of the 2019 ACM Conference on Economics and Computation},
pages={701--726},
year={2019}
}
Computing a Data Dividend Eric Bax 2019
Bibtex
@misc{bax2019computing,
title={Computing a Data Dividend},
author={Eric Bax},
year={2019},
eprint={1905.01805},
archivePrefix={arXiv},
primaryClass={cs.GT}
}
Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards Sebastian Shenghong Tay, Xinyi Xu, Chuan Sheng Foo, Bryan Kian Hsiang Low 2021
Bibtex
@article{tay2021incentivizing,
title={Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards},
author={Tay, Sebastian Shenghong and Xu, Xinyi and Foo, Chuan Sheng and Low, Bryan Kian Hsiang},
journal={arXiv preprint arXiv:2112.09327},
year={2021}
}

Automatic data compliance

Data Capsule: A New Paradigm for Automatic Compliance with Data Privacy Regulations Lun Wang, Joseph P. Near, Neel Somani, Peng Gao, Andrew Low, David Dao, Dawn Song 2019
Bibtex
@misc{wang2019data,
title={Data Capsule: A New Paradigm for Automatic Compliance with Data Privacy Regulations},
author={Lun Wang and Joseph P. Near and Neel Somani and Peng Gao and Andrew Low and David Dao and Dawn Song},
year={2019},
eprint={1909.00077},
archivePrefix={arXiv},
primaryClass={cs.CY}
}
💻

Data valuation applications

A Principled Approach to Data Valuation for Federated Learning Tianhao Wang, Johannes Rausch, Ce Zhang, Ruoxi Jia, Dawn Song 2020
Bibtex
@misc{wang2020principled,
title={A Principled Approach to Data Valuation for Federated Learning},
author={Tianhao Wang and Johannes Rausch and Ce Zhang and Ruoxi Jia and Dawn Song},
year={2020},
eprint={2009.06192},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A Dunnmon, James Zou, Daniel L Rubin 2021
Bibtex
@article{tang2021data,
title={Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset},
author={Tang, Siyi and Ghorbani, Amirata and Yamashita, Rikiya and Rehman, Sameer and Dunnmon, Jared A and Zou, James and Rubin, Daniel L},
journal={Scientific reports},
volume={11},
number={1},
pages={1--9},
year={2021},
publisher={Nature Publishing Group}
}

Data markets and society

Economics of Data

Nonrivalry and the Economics of Data Charles I. Jones, Christopher Tonetti 2019
Bibtex
@article{10.1257/aer.20191330,
Author = {Jones, Charles I. and Tonetti, Christopher},
Title = {Nonrivalry and the Economics of Data},
Journal = {American Economic Review},
Volume = {110},
Number = {9},
Year = {2020},
Month = {September},
Pages = {2819-58},
DOI = {10.1257/aer.20191330},
URL = {https://www.aeaweb.org/articles?id=10.1257/aer.20191330}
}

Data Dignity

Chapter 5: Data as Labor, Radical Markets Eric A. Posner and E Glen Weyl 2019
Bibtex
@book{posner2019radical,
title={Radical Markets},
author={Posner, Eric A and Weyl, E Glen},
year={2019},
publisher={Princeton University Press}
}
Should We Treat Data as Labor? Moving beyond "Free" Imanol Arrieta-Ibarra, Leonard Goff, Diego Jiménez-Hernández, Jaron Lanier, E. Glen Weyl 2018
Bibtex
@article{10.1257/pandp.20181003,
Author = {Arrieta-Ibarra, Imanol and Goff, Leonard and Jiménez-Hernández, Diego and Lanier, Jaron and Weyl, E. Glen},
Title = {Should We Treat Data as Labor? Moving beyond "Free"},
Journal = {AEA Papers and Proceedings},
Volume = {108},
Year = {2018},
Month = {May},
Pages = {38-42},
DOI = {10.1257/pandp.20181003},
URL = {https://www.aeaweb.org/articles?id=10.1257/pandp.20181003}
}

Strategic adaptation

Performative prediction

Performative Prediction Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, Moritz Hardt 2020
Summary Perdomo et al. (2020) introduce the concept of "performative prediction" dealing with predictions that influence the target they aim to predict, e.g. through taking actions based on the predictions, causing a distribution shift. The authors develop a risk minimization framework for performative prediction and introduce the equilibrium notion of performative stability where predictions are calibrated against future outcomes that manifest from acting on the prediction.
Bibtex
@inproceedings{perdomo2020performative,
title={Performative prediction},
author={Perdomo, Juan and Zrnic, Tijana and Mendler-D{\"u}nner, Celestine and Hardt, Moritz},
booktitle={International Conference on Machine Learning},
pages={7599--7609},
year={2020},
organization={PMLR}
}
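Performative stability can be made concrete with a toy model in which deploying a prediction shifts the mean of the outcome it predicts. The sketch below is an illustrative assumption, not taken from the paper's experiments: the induced mean is `base_mean + sensitivity * theta`, the loss is squared error (so the risk minimizer on any distribution is its mean), and the model is repeatedly refit on the distribution its previous deployment induced. All names are hypothetical.

```python
from typing import List


def repeated_risk_minimization(
    base_mean: float,
    sensitivity: float,
    theta0: float = 0.0,
    n_rounds: int = 50,
) -> List[float]:
    """Repeatedly retrain on the distribution induced by the last deployed model."""
    theta = theta0
    history = [theta]
    for _ in range(n_rounds):
        # Deploying theta shifts the outcome distribution (toy assumption).
        induced_mean = base_mean + sensitivity * theta
        # Under squared loss, the risk minimizer on that distribution is its mean.
        theta = induced_mean
        history.append(theta)
    return history


history = repeated_risk_minimization(base_mean=1.0, sensitivity=0.5)
stable_point = 1.0 / (1.0 - 0.5)  # fixed point of theta = base_mean + sensitivity * theta
print(history[-1], stable_point)
```

In this toy model the iterates contract towards the fixed point whenever the absolute sensitivity is below 1. At that point the deployed prediction is optimal for the distribution it induces, which is the stability notion the summary describes.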
Stochastic Optimization for Performative Prediction Celestine Mendler-Dünner, Juan Perdomo, Tijana Zrnic, Moritz Hardt 2020
Summary Mendler-Dünner et al. (2020) look at stochastic optimization for performative prediction and prove convergence rates for greedily deploying models after each stochastic update (which may cause distribution shift affecting convergence to a stability point) or lazily deploying the model after several updates.
Bibtex
@article{mendler2020stochastic,
title={Stochastic optimization for performative prediction},
author={Mendler-D{\"u}nner, Celestine and Perdomo, Juan and Zrnic, Tijana and Hardt, Moritz},
journal={Advances in Neural Information Processing Systems},
volume={33},
pages={4929--4939},
year={2020}
}

Strategic classification

Strategic Classification is Causal Modeling in Disguise John Miller, Smitha Milli, Moritz Hardt 2020
Summary Miller et al. (2020) argue that strategic classification involves causal modelling and that designing incentives for improvement requires solving a non-trivial causal inference problem. The authors distinguish between gaming and improvement and provide a causal framework for strategic adaptation.
Bibtex
@inproceedings{miller2020strategic,
title={Strategic classification is causal modeling in disguise},
author={Miller, John and Milli, Smitha and Hardt, Moritz},
booktitle={International Conference on Machine Learning},
pages={6917--6926},
year={2020},
organization={PMLR}
}
Alternative Microfoundations for Strategic Classification Meena Jagadeesan, Celestine Mendler-Dünner, Moritz Hardt 2021
Summary Jagadeesan et al. (2021) show that standard microfoundations in strategic classification, which typically use individual-level behaviour to deduce aggregate-level responses, can lead to degenerate behaviour in aggregate: discontinuities in the aggregate response, stable points ceasing to exist, and maximizing social burden. The authors introduce a noisy response model inspired by performative prediction that mitigates these limitations for binary classification.
Bibtex
@inproceedings{jagadeesan2021alternative,
title={Alternative microfoundations for strategic classification},
author={Jagadeesan, Meena and Mendler-D{\"u}nner, Celestine and Hardt, Moritz},
booktitle={International Conference on Machine Learning},
pages={4687--4697},
year={2021},
organization={PMLR}
}

awesome-data-valuation's People

Contributors

daviddao, lfwa
