Awesome Data Valuation

💱 A curated list of data valuation (DV) resources for designing your next data marketplace. DV aims to understand the value of a data point for a given machine learning task and is an essential primitive in the design of data marketplaces and explainable AI.

Legend

💻 Code available

🎥 Talk / Slides

Table of Contents
1. What is your data worth? (DV Algorithms)
1.1. Shapley Value & Cooperative Game Theory
1.1.1. Efficient algorithms
1.1.2. Benchmarks, Criticism & Relaxations
1.2. Influence functions & LOO
1.3. Reinforcement Learning
1.4. Deep Neural Networks
1.5. Out-of-Bag score
1.6. Task Agnostic
2. Benchmarks
3. Libraries
3.1. Surveys
4. Designing data marketplaces
4.1. Data market system designs
4.2. Automatic data compliance
4.3. Data valuation applications
5. Data markets and society
5.1. Economics of Data
5.2. Data Dignity
6. Strategic adaptation
6.1. Performative prediction
6.2. Strategic classification
7. Data Valuation Researchers

What is your data worth?

Shapley Value & Cooperative Game Theory


Towards Efficient Data Valuation Based on the Shapley Value	Ruoxi Jia & David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, Costas J. Spanos	2019	Summary Jia et al. (2019) contribute theoretical and practical results on efficient methods for approximating the Shapley value (SV). They show that methods requiring only a sublinear number of model evaluations are possible and that further reductions can be made for sparse SVs. Lastly, they introduce two practical SV estimation methods for ML tasks, one for uniformly stable learning algorithms and one for smooth loss functions.	Bibtex @inproceedings{jia2019towards, title={Towards efficient data valuation based on the shapley value}, author={Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Hynes, Nick and G{\"u}rel, Nezihe Merve and Li, Bo and Zhang, Ce and Song, Dawn and Spanos, Costas J}, booktitle={The 22nd International Conference on Artificial Intelligence and Statistics}, pages={1167--1176}, year={2019}, organization={PMLR} }	💻
Data Shapley: Equitable Valuation of Data for Machine Learning	Amirata Ghorbani, James Zou	2019	Summary Ghorbani & Zou (2019) introduce the (data) Shapley value to equitably measure the value of each training point to a supervised learner's performance. They further outline several benefits of the Shapley value, e.g. being able to capture outliers or inform what new data to acquire, and develop Monte Carlo and gradient-based methods for its efficient estimation.	Bibtex @inproceedings{ghorbani2019data, title={Data shapley: Equitable valuation of data for machine learning}, author={Ghorbani, Amirata and Zou, James}, booktitle={International Conference on Machine Learning}, pages={2242--2251}, year={2019}, organization={PMLR} }	💻
A Distributional Framework for Data Valuation	Amirata Ghorbani, Michael P. Kim, James Zou	2020	Summary Ghorbani et al. (2020) formulate the Shapley value as a distributional quantity in the context of an underlying data distribution instead of a fixed dataset. They further introduce a novel sampling-based algorithm for the distributional Shapley value with strong approximation guarantees.	Bibtex @inproceedings{ghorbani2020distributional, title={A Distributional Framework for Data Valuation}, author={Ghorbani, Amirata and Kim, Michael P. and Zou, James}, booktitle={International Conference on Machine Learning}, year={2020} }	💻
Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability	Christopher Frye, Colin Rowat, Ilya Feige	2020	Summary Frye et al. (2020) incorporate causality into the Shapley value framework. Importantly, their framework can handle any amount of causal knowledge and does not require the complete causal graph underlying the data.	Bibtex @article{frye2020asymmetric, title={Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability}, author={Frye, Christopher and Rowat, Colin and Feige, Ilya}, journal={Advances in Neural Information Processing Systems}, volume={33}, year={2020} }		🎥
Collaborative Machine Learning with Incentive-Aware Model Rewards	Rachael Hwee Ling Sim, Yehong Zhang, Mun Choon Chan, Bryan Kian Hsiang Low	2020	Summary Sim et al. (2020) introduce a data valuation method with separate ML models as rewards based on the Shapley value and information gain on model parameters given its data. They further define several conditions for incentives such as Shapley fairness, stability, individual rationality, and group welfare, that are suitable for the freely replicable nature of their model reward scheme.	Bibtex @inproceedings{sim2020collaborative, title={Collaborative machine learning with incentive-aware model rewards}, author={Sim, Rachael Hwee Ling and Zhang, Yehong and Chan, Mun Choon and Low, Bryan Kian Hsiang}, booktitle={International Conference on Machine Learning}, pages={8927--8936}, year={2020}, organization={PMLR} }
Validation free and replication robust volume-based data valuation	Xinyi Xu, Zhaoxuan Wu, Chuan Sheng Foo, Bryan Kian Hsiang Low	2021	Summary Xu et al. (2021) propose using data diversity via robust volume for measuring the value of data. This removes the need for a validation set and allows for guarantees on replication robustness but suffers from the curse of dimensionality and may ignore useful information in the validation set.	Bibtex @article{xu2021validation, title={Validation free and replication robust volume-based data valuation}, author={Xu, Xinyi and Wu, Zhaoxuan and Foo, Chuan Sheng and Low, Bryan Kian Hsiang}, journal={Advances in Neural Information Processing Systems}, volume={34}, year={2021} }	💻
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning	Yongchan Kwon, James Zou	2021	Summary Kwon & Zou (2021) introduce Beta Shapley, a generalization of Data Shapley obtained by relaxing the efficiency axiom.	Bibtex @article{kwon2021beta, title={Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning}, author={Kwon, Yongchan and Zou, James}, journal={arXiv preprint arXiv:2110.14049}, year={2021} }
Gradient-Driven Rewards to Guarantee Fairness in Collaborative Machine Learning	Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, Bryan Kian Hsiang Low	2021	Summary Xu et al. (2021) propose cosine gradient Shapley value to fairly evaluate the expected contribution of each agent's update in the federated learning setting removing the need for an auxiliary validation dataset. They further introduce a novel training-time gradient reward mechanism with a fairness guarantee.	Bibtex @article{xu2021gradient, title={Gradient driven rewards to guarantee fairness in collaborative machine learning}, author={Xu, Xinyi and Lyu, Lingjuan and Ma, Xingjun and Miao, Chenglin and Foo, Chuan Sheng and Low, Bryan Kian Hsiang}, journal={Advances in Neural Information Processing Systems}, volume={34}, pages={16104--16117}, year={2021} }
Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning	Tianhao Wang, Yu Yang, Ruoxi Jia	2022	Summary Wang et al. (2022) propose a general framework to improve effectiveness of sampling-based Shapley value (SV) or Least core (LC) estimation heuristics. They propose learning to predict the performance of a learning algorithm (denoted data utility learning) and using this predictor to estimate learning performance without retraining for cheaper SV and LC estimation.	Bibtex @article{wang2021improving, title={Improving cooperative game theory-based data valuation via data utility learning}, author={Wang, Tianhao and Yang, Yu and Jia, Ruoxi}, journal={arXiv preprint arXiv:2107.06336}, year={2021} }
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning	Jiachen T. Wang, Ruoxi Jia	2023	Summary Wang & Jia (2023) propose using the Banzhaf value for data valuation, which provides better robustness against noisy performance scores and can be estimated efficiently using the Maximum Sample Reuse (MSR) principle.	Bibtex @InProceedings{pmlr-v206-wang23e, title={Data Banzhaf: A Robust Data Valuation Framework for Machine Learning}, author={Wang, Jiachen T. and Jia, Ruoxi}, booktitle={Proceedings of The 26th International Conference on Artificial Intelligence and Statistics}, pages={6388--6421}, year={2023}, editor={Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem}, volume={206}, series={Proceedings of Machine Learning Research}, month={25--27 Apr}, publisher={PMLR}, pdf={https://proceedings.mlr.press/v206/wang23e/wang23e.pdf}, url={https://proceedings.mlr.press/v206/wang23e.html} }	💻
A Multilinear Sampling Algorithm to Estimate Shapley Values	Ramin Okhrati, Aldo Lipani	2021	Summary Okhrati and Lipani (2021) propose a new sampling method for Shapley values based on a multilinear extension technique as applied in game theory. It provides more accurate estimations of the Shapley values by reducing the variance of the sampling statistics.	Bibtex @INPROCEEDINGS{9412511, title={A Multilinear Sampling Algorithm to Estimate Shapley Values}, author={Okhrati, Ramin and Lipani, Aldo}, booktitle={2020 25th International Conference on Pattern Recognition (ICPR)}, year={2021} }	💻
If You Like Shapley Then You’ll Love the Core	Tom Yan, Ariel D. Procaccia	2021	Summary Yan and Procaccia (2021) propose an alternative method for credit assignment in data valuation. They use the least core, which can be computed efficiently.	Bibtex @article{Yan_Procaccia_2021, title={If You Like Shapley Then You’ll Love the Core}, author={Yan, Tom and Procaccia, Ariel D.}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, year={2021} }
CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification	Stephanie Schoch, Haifeng Xu, Yangfeng Ji	2022	Summary Schoch et al. (2022) propose a new Shapley value that discriminates between training instances' in-class and out-of-class contributions.	Bibtex @inproceedings{schoch2022csshapley, title={{CS}-Shapley: Class-wise Shapley Values for Data Valuation in Classification}, author={Stephanie Schoch and Haifeng Xu and Yangfeng Ji}, booktitle={Advances in Neural Information Processing Systems}, year={2022} }	💻
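
Several entries above (for example Ghorbani & Zou, 2019) estimate Shapley values by Monte Carlo sampling. The sketch below is one minimal way to illustrate that idea, not the implementation from any of the listed papers: it averages each point's marginal contribution over random permutations. The function name `permutation_shapley`, the `utility` callable, and the additive toy game are assumptions introduced here for illustration; in practice `utility` would retrain a model on the subset and return a validation score.

```python
import random
from typing import Callable, FrozenSet, List, Optional


def permutation_shapley(
    n_points: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: Optional[int] = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal gains over random permutations."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        coalition = set()
        previous = utility(frozenset(coalition))
        for i in indices:
            # Marginal contribution of point i when added to the points before it.
            coalition.add(i)
            current = utility(frozenset(coalition))
            values[i] += current - previous
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical toy game: the utility of a subset is the sum of fixed per-point weights.
WEIGHTS = [0.5, 0.2, 0.0, -0.1]


def toy_utility(subset) -> float:
    return sum(WEIGHTS[i] for i in subset)


estimates = permutation_shapley(len(WEIGHTS), toy_utility, n_permutations=50)
print(estimates)
```

For an additive game like this one, every marginal contribution equals the point's weight, so the estimate recovers the weights up to floating-point error. For any utility, the estimates from this sampler sum to `utility(all) - utility(empty)`, because the marginal gains telescope within each permutation.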

Efficient algorithms


Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms	Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas J. Spanos, Dawn Song	2019	Summary Jia et al. (2019) present algorithms to compute the Shapley value exactly in quasi-linear time and approximations in sublinear time for k-nearest-neighbor models. They empirically evaluate their algorithms at scale and extend them to several other settings.	Bibtex @article{jia12efficient, title={Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms}, author={Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Gurel, Nezihe Merve and Li, Bo and Zhang, Ce and Spanos, Costas J. and Song, Dawn}, journal={Proceedings of the VLDB Endowment}, volume={12}, number={11}, year={2019} }	💻
Efficient computation and analysis of distributional Shapley values	Yongchan Kwon, Manuel A. Rivas, James Zou	2021	Summary Kwon et al. (2021) develop tractable analytic expressions for the distributional data Shapley value for linear regression, binary classification, and non-parametric density estimation as well as new efficient methods for its estimation.	Bibtex @inproceedings{kwon2021efficient, title={Efficient computation and analysis of distributional Shapley values}, author={Kwon, Yongchan and Rivas, Manuel A and Zou, James}, booktitle={International Conference on Artificial Intelligence and Statistics}, pages={793--801}, year={2021}, organization={PMLR} }	💻
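
To make the "exact in quasi-linear time" claim for k-nearest-neighbor models concrete, the sketch below follows the sorted recursion described by Jia et al. (2019) for an unweighted KNN classifier and a single test point, as best understood here. The function names, the toy data, and the brute-force checker are assumptions added for illustration; the utility is taken to be the fraction of the K nearest selected points whose label matches the test label.

```python
from itertools import combinations
from math import factorial
from typing import List, Sequence


def knn_shapley_single_test(
    distances: Sequence[float], labels: Sequence[int], test_label: int, k: int
) -> List[float]:
    """Exact Shapley values for one test point; the sort dominates the cost."""
    n = len(distances)
    order = sorted(range(n), key=lambda i: distances[i])  # nearest first
    match = [1.0 if labels[i] == test_label else 0.0 for i in order]
    sorted_values = [0.0] * n
    sorted_values[n - 1] = match[n - 1] / n  # farthest point
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at position `pos`
        sorted_values[pos] = (
            sorted_values[pos + 1]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = sorted_values[pos]
    return values


def knn_utility(subset, distances, labels, test_label, k) -> float:
    """Assumed utility: matching labels among the k nearest points in `subset`, divided by k."""
    nearest = sorted(subset, key=lambda i: distances[i])[:k]
    return sum(1.0 for i in nearest if labels[i] == test_label) / k


def brute_force_shapley(distances, labels, test_label, k) -> List[float]:
    """Exponential-time reference, only usable for tiny inputs."""
    n = len(distances)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = knn_utility(subset + (i,), distances, labels, test_label, k)
                gain -= knn_utility(subset, distances, labels, test_label, k)
                values[i] += weight * gain
    return values


# Hypothetical toy data: distances from five training points to one test point.
distances = [0.9, 0.1, 0.4, 0.7, 0.3]
labels = [1, 0, 1, 1, 0]
fast = knn_shapley_single_test(distances, labels, test_label=1, k=2)
slow = brute_force_shapley(distances, labels, test_label=1, k=2)
print(fast)
print(slow)
```

On this toy input the two nearest points carry the wrong label and receive negative values, while the three farther points with the correct label receive positive values. For a full test set, the per-test-point values would be averaged.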

Benchmarks, Criticism & Relaxations


Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?	Ruoxi Jia, Fan Wu, Xuehui Sun, Jiacen Xu, David Dao, Bhavya Kailkhura, Ce Zhang, Bo Li, Dawn Song	2021	Summary Jia et al. (2021) perform a theoretical analysis on the differences between leave-one-out-based and Shapley value-based methods as well as an empirical study across several ML tasks investigating the two aforementioned methods as well as exact Shapley value-based methods and Shapley over KNN Surrogates.	Bibtex @misc{jia2021scalability, title={Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?}, author={Ruoxi Jia and Fan Wu and Xuehui Sun and Jiacen Xu and David Dao and Bhavya Kailkhura and Ce Zhang and Bo Li and Dawn Song}, year={2021}, eprint={1911.07128}, archivePrefix={arXiv}, primaryClass={cs.LG} }	💻
Shapley values for feature selection: The good, the bad, and the axioms	Daniel Fryer, Inga Strümke, Hien Nguyen	2021	Summary Fryer et al. (2021) call into question the appropriateness of using the Shapley value for feature selection and advise caution against the magical thinking that presenting its abstract general axioms as "favourable and fair" may introduce. They further point out that the four axioms of "efficiency", "null player", "symmetry", and "additivity" do not guarantee that the Shapley value is suited to feature selection and may sometimes even imply the opposite.	Bibtex @misc{fryer2021shapley, title={Shapley values for feature selection: The good, the bad, and the axioms}, author={Daniel Fryer and Inga Strümke and Hien Nguyen}, year={2021}, eprint={2102.10936}, archivePrefix={arXiv}, primaryClass={cs.LG} }

Influence functions & LOO


Understanding Black-box Predictions via Influence Functions	Pang Wei Koh, Percy Liang	2017	Summary Koh & Liang (2017) introduce the use of influence functions, a technique borrowed from robust statistics, to identify training points most responsible for a model's given prediction without needing to retrain. They further develop a simple and efficient implementation of influence functions that scales to large ML settings.	Bibtex @inproceedings{koh2017understanding, title={Understanding black-box predictions via influence functions}, author={Koh, Pang Wei and Liang, Percy}, booktitle={International Conference on Machine Learning}, pages={1885--1894}, year={2017}, organization={PMLR} }	💻	🎥
On the accuracy of influence functions for measuring group effects	Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, Percy Liang	2019	Summary Koh et al. (2019) study influence functions to measure the effects of large groups of training points instead of individual points. Empirically, they find that predicted and actual effects are correlated, with the predictions often underestimating the actual effects. Theoretically, they show that this correlation need not hold in general, realistic settings.	Bibtex @article{koh2019accuracy, title={On the accuracy of influence functions for measuring group effects}, author={Koh, Pang Wei and Ang, Kai-Siang and Teo, Hubert HK and Liang, Percy}, journal={arXiv preprint arXiv:1905.13289}, year={2019} }	💻	🎥
Scaling Up Influence Functions	Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov	2022	Summary Schioppa et al. (2022) propose a new method to scale the computation of influence functions for large neural networks using the Arnoldi iteration. With this, they achieve successful implementation of influence functions on full-size Transformer models with hundreds of millions of parameters.	Bibtex @inproceedings{schioppa2022scaling, title={Scaling Up Influence Functions}, author={Schioppa, Andrea and Zablotskaia, Polina and Vilar, David and Sokolov, Artem}, booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, year={2022} }	💻
Studying large language model generalization with influence functions	Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al.	2023	Summary Grosse et al. (2023) use a method known as EK-FAC to approximate the Hessian of the loss of large language models. They apply this technique to study influence functions on large language models, up to 50 billion parameters.	Bibtex @article{grosse2023studying, title={Studying large language model generalization with influence functions}, author={Grosse, Roger and Bae, Juhan and Anil, Cem and Elhage, Nelson and Tamkin, Alex and Tajdini, Amirhossein and Steiner, Benoit and Li, Dustin and Durmus, Esin and Perez, Ethan and others}, journal={arXiv preprint arXiv:2308.03296}, year={2023} }
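
The leave-one-out (LOO) value named in this section's heading is the quantity that influence functions approximate without retraining: the change in utility when a single training point is removed. The sketch below computes it by brute-force re-evaluation. It is an illustrative baseline, not code from any listed paper; `leave_one_out_values`, `mean_fit_utility`, and the toy observations are hypothetical names and data, and the "model" is just a sample mean.

```python
from typing import Callable, FrozenSet, List


def leave_one_out_values(
    n_points: int, utility: Callable[[FrozenSet[int]], float]
) -> List[float]:
    """LOO value of point i = utility(all points) - utility(all points except i)."""
    everyone = frozenset(range(n_points))
    full_score = utility(everyone)
    return [full_score - utility(everyone - {i}) for i in range(n_points)]


# Hypothetical toy task: estimate a quantity whose true value is TARGET
# by the mean of the selected observations. The last observation is an outlier.
OBSERVATIONS = [1.0, 1.2, 0.9, 5.0]
TARGET = 1.0


def mean_fit_utility(subset: FrozenSet[int]) -> float:
    """Negative squared error of the subset mean (higher is better)."""
    if not subset:
        return 0.0  # arbitrary convention for the empty set, unused in this example
    mean = sum(OBSERVATIONS[i] for i in subset) / len(subset)
    return -((mean - TARGET) ** 2)


loo = leave_one_out_values(len(OBSERVATIONS), mean_fit_utility)
print(loo)
```

Removing the outlier improves the fit, so its LOO value is negative, while the three well-behaved observations receive positive values. Each value costs one full re-evaluation, which is the cost that influence-function approximations aim to avoid for large models.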

Reinforcement Learning


Data Valuation using Reinforcement Learning	Jinsung Yoon, Sercan Ö Arık, Tomas Pfister	2020	Summary Yoon et al. (2020) propose using reinforcement learning for data valuation to learn data values jointly with the predictor model.	Bibtex @inproceedings{49189, title={Data Valuation using Reinforcement Learning}, author={Jinsung Yoon and Sercan Arik and Tomas Pfister}, year={2020} }	💻	🎥

Deep Neural Networks


DAVINZ: Data Valuation using Deep Neural Networks at Initialization	Zhaoxuan Wu, Yao Shu, Bryan Kian Hsiang Low	2022	Summary Wu et al. (2022) introduce a validation-based and training-free method for efficient data valuation with large and complex deep neural networks (DNNs). They derive and exploit a domain-aware generalization bound for DNNs to characterize their performance without training and use this bound as the scoring function while keeping conventional techniques such as Shapley values as the valuation function.	Bibtex @inproceedings{wu2022davinz, title={DAVINZ: Data Valuation using Deep Neural Networks at Initialization}, author={Wu, Zhaoxuan and Shu, Yao and Low, Bryan Kian Hsiang}, booktitle={International Conference on Machine Learning}, pages={24150--24176}, year={2022}, organization={PMLR} }		🎥

Out-of-Bag score

Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

Yongchan Kwon, James Zou

2023

Summary

Kwon & Zou (2023) propose using the out-of-bag estimate of a bagging estimator for computationally efficient data valuation.

Bibtex

@inproceedings{DBLP:conf/icml/Kwon023, 
    author={Yongchan Kwon and James Zou}, 
    editor={Andreas Krause and Emma Brunskill and Kyunghyun Cho and Barbara Engelhardt and Sivan Sabato and Jonathan Scarlett},
    title={Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value},
    booktitle={International Conference on Machine Learning, {ICML} 2023, 23-29 July 2023, Honolulu, Hawaii, {USA}}, 
    series={Proceedings of Machine Learning Research}, 
    volume={202}, 
    pages={18135--18152},
    publisher={{PMLR}}, 
    year={2023}, 
    url={https://proceedings.mlr.press/v202/kwon23e.html}, 
    timestamp={Mon, 28 Aug 2023 17:23:08 +0200}, 
    biburl={https://dblp.org/rec/conf/icml/Kwon023.bib}, 
    bibsource={dblp computer science bibliography, https://dblp.org} 
}

💻

🎥
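
The out-of-bag idea can be sketched in a few lines: fit many models on bootstrap resamples and score each training point only with the models that did not see it. The code below is a minimal illustration under stated assumptions, not the authors' implementation. A 1-nearest-neighbour weak learner on 1-D features is swapped in to keep the example dependency-free, the score is plain label agreement, and `data_oob_values`, `one_nn_predict`, and the toy data are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def one_nn_predict(train_x: Sequence[float], train_y: Sequence[int], x: float) -> int:
    """Predict the label of the closest training point (assumed weak learner)."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]


def data_oob_values(
    xs: Sequence[float], ys: Sequence[int], n_models: int = 200, seed: int = 0
) -> List[Optional[float]]:
    """Average out-of-bag correctness per training point (None if never out of bag)."""
    rng = random.Random(seed)
    n = len(xs)
    correct = [0.0] * n
    oob_count = [0] * n
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        in_bag = set(sample)
        bag_x = [xs[j] for j in sample]
        bag_y = [ys[j] for j in sample]
        for i in range(n):
            if i in in_bag:
                continue  # only models that did not train on point i may score it
            oob_count[i] += 1
            if one_nn_predict(bag_x, bag_y, xs[i]) == ys[i]:
                correct[i] += 1.0
    return [correct[i] / oob_count[i] if oob_count[i] else None for i in range(n)]


# Hypothetical toy data: two clusters, with the point at index 4 (x = 0.35)
# carrying the label of the far cluster, i.e. a likely label error.
xs = [0.0, 0.1, 0.2, 0.3, 0.35, 1.0, 1.1, 1.2, 1.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1]
values = data_oob_values(xs, ys)
print(values)
```

In this toy setup the suspicious point at index 4 should receive the lowest value, since models trained without it almost always predict the label of its neighbours. Its clean neighbour at index 3 is also pulled down somewhat, because bags that contain the mislabeled point misclassify it.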

Task Agnostic


Fundamentals of Task-Agnostic Data Valuation	Mohammad Mohammadi Amiri, Frederic Berdoz, Ramesh Raskar	2023	Summary This paper addresses the challenge of valuing data without specific task assumptions, focusing on task-agnostic data valuation. It discusses valuing a data seller's dataset from a buyer's perspective without validation requirements. The approach involves estimating statistical differences through diversity and relevance measures without needing the raw data, and designing queries that maintain the seller's blindness to the buyer's raw data. The work is significant for practical scenarios where utility metrics like test accuracy on a validation set are not feasible.	Bibtex @article{Amiri2023FundamentalsOT, title={Fundamentals of Task-Agnostic Data Valuation}, author={Mohammad Mohammadi Amiri and Frederic Berdoz and Ramesh Raskar}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, volume={37}, pages={9226-9234}, year={2023}, doi={10.1609/aaai.v37i8.26106} }

Benchmarks

OpenDataVal: a Unified Benchmark for Data Valuation

Kevin Jiang, Weixin Liang, James Zou, Yongchan Kwon

2023

Summary

Jiang et al. (2023) provide a Python library for building and testing data evaluators across a range of datasets, models, and benchmark tasks.

Bibtex

@inproceedings{jiang2023opendataval,
      title={OpenDataVal: a Unified Benchmark for Data Valuation},
      author={Kevin Fu Jiang and Weixin Liang and James Zou and Yongchan Kwon},
      booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
      year={2023},
      url={https://openreview.net/forum?id=eEK99egXeB}
}

💻

🎥

Libraries


influenciae	Deel-AI	2023	Summary A stable implementation of influence functions in TensorFlow.	Bibtex	💻
pyDVL	appliedAI Institute	2023	Summary A library of stable and efficient implementations of algorithms for computing Shapley values and influence functions in PyTorch.	Bibtex	💻
datascope	Bojan Karlaš	2024	Summary A tool for repairing datasets using Shapley value as a measure of data importance.	Bibtex	💻

Surveys


Data Valuation in Machine Learning: “Ingredients”, Strategies, and Open Challenges	Rachael Hwee Ling Sim, Xinyi Xu, Bryan Kian Hsiang Low	2022	Summary Sim et al. (2022) present a technical survey of data valuation and its "ingredients" and properties. The paper outlines common desiderata as well as some open research challenges.	Bibtex @inproceedings{sim2022data, title={Data valuation in machine learning:“ingredients”, strategies, and open challenges}, author={Sim, Rachael Hwee Ling and Xu, Xinyi and Low, Bryan Kian Hsiang}, booktitle={Proc. IJCAI}, year={2022} }		🎥

Designing data marketplaces

Data market system designs


A demonstration of sterling: a privacy-preserving data marketplace	Nick Hynes, David Dao, David Yan, Raymond Cheng, Dawn Song	2018	Bibtex @article{hynes2018demonstration, title={A Demonstration of Sterling: A Privacy-Preserving Data Marketplace}, author={Hynes, Nick and Dao, David and Yan, David and Cheng, Raymond and Song, Dawn}, journal={Proceedings of the VLDB Endowment}, volume={11}, number={12}, year={2018} }
DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation	David Dao, Dan Alistarh, Claudiu Musat, Ce Zhang	2018	Bibtex @article{dao2018databright, title={Databright: Towards a global exchange for decentralized data ownership and trusted computation}, author={Dao, David and Alistarh, Dan and Musat, Claudiu and Zhang, Ce}, journal={arXiv preprint arXiv:1802.04780}, year={2018} }
A Marketplace for Data: An Algorithmic Solution	Anish Agarwal, Munther Dahleh, Tuhin Sarkar	2019	Bibtex @inproceedings{agarwal2019marketplace, title={A marketplace for data: An algorithmic solution}, author={Agarwal, Anish and Dahleh, Munther and Sarkar, Tuhin}, booktitle={Proceedings of the 2019 ACM Conference on Economics and Computation}, pages={701--726}, year={2019} }
Computing a Data Dividend	Eric Bax	2019	Bibtex @misc{bax2019computing, title={Computing a Data Dividend}, author={Eric Bax}, year={2019}, eprint={1905.01805}, archivePrefix={arXiv}, primaryClass={cs.GT} }
Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards	Sebastian Shenghong Tay, Xinyi Xu, Chuan Sheng Foo, Bryan Kian Hsiang Low	2021	Bibtex @article{tay2021incentivizing, title={Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards}, author={Tay, Sebastian Shenghong and Xu, Xinyi and Foo, Chuan Sheng and Low, Bryan Kian Hsiang}, journal={arXiv preprint arXiv:2112.09327}, year={2021} }

Automatic data compliance

Data Capsule: A New Paradigm for Automatic Compliance with Data Privacy Regulations

Lun Wang, Joseph P. Near, Neel Somani, Peng Gao, Andrew Low, David Dao, Dawn Song

2019

Bibtex

@misc{wang2019data,
      title={Data Capsule: A New Paradigm for Automatic Compliance with Data Privacy Regulations}, 
      author={Lun Wang and Joseph P. Near and Neel Somani and Peng Gao and Andrew Low and David Dao and Dawn Song},
      year={2019},
      eprint={1909.00077},
      archivePrefix={arXiv},
      primaryClass={cs.CY}
}

💻

Data valuation applications


A Principled Approach to Data Valuation for Federated Learning	Tianhao Wang, Johannes Rausch, Ce Zhang, Ruoxi Jia, Dawn Song	2020		Bibtex @misc{wang2020principled, title={A Principled Approach to Data Valuation for Federated Learning}, author={Tianhao Wang and Johannes Rausch and Ce Zhang and Ruoxi Jia and Dawn Song}, year={2020}, eprint={2009.06192}, archivePrefix={arXiv}, primaryClass={cs.LG} }
Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset	Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A Dunnmon, James Zou, Daniel L Rubin	2021		Bibtex @article{tang2021data, title={Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset}, author={Tang, Siyi and Ghorbani, Amirata and Yamashita, Rikiya and Rehman, Sameer and Dunnmon, Jared A and Zou, James and Rubin, Daniel L}, journal={Scientific reports}, volume={11}, number={1}, pages={1--9}, year={2021}, publisher={Nature Publishing Group} }
Efficient and Fair Data Valuation for Horizontal Federated Learning	Shuyue Wei, Yongxin Tong, Zimu Zhou, Tianshu Song	2020	Summary Availability of big data is crucial for modern machine learning applications and services. Federated learning is an emerging paradigm to unite different data owners for machine learning on massive data sets without worrying about data privacy. Yet data owners may still be reluctant to contribute unless their data sets are fairly valued and paid for. In this work, the authors adapt the Shapley value, a widely used data valuation metric, to valuing data providers in federated learning. Prior data valuation schemes for machine learning incur high computation costs because they require training extra models on all data set combinations. For efficient data valuation, the authors approximately construct all the models necessary for data valuation using the gradients from training a single model, rather than training an exponential number of models from scratch. On this basis, they devise three methods for efficient contribution index estimation. Evaluations show that their methods accurately approximate the contribution index while notably accelerating its calculation.	Bibtex @inbook{wei2020efficient, title={Efficient and fair data valuation for horizontal federated learning}, author={Wei, Shuyue and Tong, Yongxin and Zhou, Zimu and Song, Tianshu}, year={2020}, booktitle={Federated Learning: Privacy and Incentive}, pages={139--152}, publisher={Springer} }
Improving Fairness for Data Valuation in Horizontal Federated Learning	Zhenan Fan, Huang Fang, Zirui Zhou, Jian Pei, Michael P. Friedlander, Changxin Liu, Yong Zhang	2020	Summary Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. This paper focuses on fairness in data valuation within federated learning. The authors propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. This approach leverages the concepts and tools from optimization and provides both theoretical analysis and empirical evaluation to verify the improvement in fairness.	Bibtex @misc{fan2020improving, title={Improving Fairness for Data Valuation in Horizontal Federated Learning}, author={Zhenan Fan and Huang Fang and Zirui Zhou and Jian Pei and Michael P. Friedlander and Changxin Liu and Yong Zhang}, year={2020}, eprint={2109.09046}, archivePrefix={arXiv}, primaryClass={cs.LG} }
Data Valuation for Vertical Federated Learning: An Information-Theoretic Approach	Xiao Han, Leye Wang, Junjie Wu	2021	Summary Federated learning (FL) is a machine learning paradigm that enables privacy-preserving cross-party data collaboration. This work introduces "FedValue," the first privacy-preserving, task-specific, model-free data valuation method for vertical FL tasks. It incorporates Shapley-CMI, an information-theoretic metric, for assessing data values from a game-theoretic perspective. The paper also proposes a novel server-aided federated computation mechanism and techniques to accelerate Shapley-CMI computation. Extensive experiments demonstrate the effectiveness and efficiency of FedValue.	Bibtex @misc{han2021datavaluation, title={Data Valuation for Vertical Federated Learning: An Information-Theoretic Approach}, author={Xiao Han and Leye Wang and Junjie Wu}, year={2021}, eprint={URL or DOI link TBD}, }
Towards More Efficient Data Valuation in Healthcare Federated Learning Using Ensembling	Sourav Kumar, A. Lakshminarayanan, Ken Chang, Feri Guretno, Ivan Ho Mien, Jayashree Kalpathy-Cramer, Pavitra Krishnaswamy, Praveer Singh	2021	Summary This paper addresses the challenge of data valuation in federated learning within healthcare. The authors propose a method called SaFE (Shapley Value for Federated Learning using Ensembling), which is designed to be efficient in settings where the number of contributing institutions is manageable. SaFE approximates the Shapley value using gradients from training a single model and develops methods for efficient contribution index estimation. This approach is particularly relevant in medical imaging where data heterogeneity is common and fast, accurate data valuation is necessary for multi-institutional collaborations.	Bibtex @article{Kumar2021TowardsME, title={Towards More Efficient Data Valuation in Healthcare Federated Learning Using Ensembling}, author={Sourav Kumar and A. Lakshminarayanan and Ken Chang and Feri Guretno and Ivan Ho Mien and Jayashree Kalpathy-Cramer and Pavitra Krishnaswamy and Praveer Singh}, journal={ArXiv}, year={2021}, volume={abs/2209.05424} }
Data Debugging with Shapley Importance over Machine Learning Pipelines	Bojan Karlaš, David Dao, Matteo Interlandi, Sebastian Schelter, Wentao Wu, Ce Zhang	2024	Summary This paper focuses on repairing datasets with the goal of improving the quality of end-to-end machine learning pipelines. Data repairs are prioritized by Shapley value. The authors propose methods for efficiently computing the Shapley value for different types of pipelines and empirically demonstrate the effectiveness of this approach.	Bibtex @inproceedings{ karlas2024data, title={Data Debugging with Shapley Importance over Machine Learning Pipelines}, author={Bojan Karla{\v{s}} and David Dao and Matteo Interlandi and Sebastian Schelter and Wentao Wu and Ce Zhang}, booktitle={The Twelfth International Conference on Learning Representations}, year={2024}, url={https://openreview.net/forum?id=qxGXjWxabq} }	💻	🎥

Data markets and society

Economics of Data

Nonrivalry and the Economics of Data

Charles I. Jones, Christopher Tonetti

2019

Bibtex

@article{10.1257/aer.20191330,
  Author = {Jones, Charles I. and Tonetti, Christopher},
  Title = {Nonrivalry and the Economics of Data},
  Journal = {American Economic Review},
  Volume = {110},
  Number = {9},
  Year = {2020},
  Month = {September},
  Pages = {2819-58},
  DOI = {10.1257/aer.20191330},
  URL = {https://www.aeaweb.org/articles?id=10.1257/aer.20191330}
}

Data Dignity

Chapter 5: Data as Labor, Radical Markets

Eric A. Posner and E Glen Weyl

2019

Bibtex

@book{posner2019radical,
  title={Radical Markets},
  author={Posner, Eric A and Weyl, E Glen},
  year={2019},
  publisher={Princeton University Press}
}

Should We Treat Data as Labor? Moving beyond "Free"

Imanol Arrieta-Ibarra, Leonard Goff, Diego Jiménez-Hernández, Jaron Lanier, E. Glen Weyl

2018

Bibtex

@article{10.1257/pandp.20181003,
  Author = {Arrieta-Ibarra, Imanol and Goff, Leonard and Jiménez-Hernández, Diego and Lanier, Jaron and Weyl, E. Glen},
  Title = {Should We Treat Data as Labor? Moving beyond "Free"},
  Journal = {AEA Papers and Proceedings},
  Volume = {108},
  Year = {2018},
  Month = {May},
  Pages = {38-42},
  DOI = {10.1257/pandp.20181003},
  URL = {https://www.aeaweb.org/articles?id=10.1257/pandp.20181003}
}

Strategic adaptation

Performative prediction


Performative Prediction	Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, Moritz Hardt	2020	Summary Perdomo et al. (2020) introduce the concept of "performative prediction" dealing with predictions that influence the target they aim to predict, e.g. through taking actions based on the predictions, causing a distribution shift. The authors develop a risk minimization framework for performative prediction and introduce the equilibrium notion of performative stability where predictions are calibrated against future outcomes that manifest from acting on the prediction.	Bibtex @inproceedings{perdomo2020performative, title={Performative prediction}, author={Perdomo, Juan and Zrnic, Tijana and Mendler-D{\"u}nner, Celestine and Hardt, Moritz}, booktitle={International Conference on Machine Learning}, pages={7599--7609}, year={2020}, organization={PMLR} }
Stochastic Optimization for Performative Prediction	Celestine Mendler-Dünner, Juan Perdomo, Tijana Zrnic, Moritz Hardt	2020	Summary Mendler-Dünner et al. (2020) look at stochastic optimization for performative prediction and prove convergence rates for greedily deploying models after each stochastic update (which may cause distribution shift affecting convergence to a stability point) or lazily deploying the model after several updates.	Bibtex @article{mendler2020stochastic, title={Stochastic optimization for performative prediction}, author={Mendler-D{\"u}nner, Celestine and Perdomo, Juan and Zrnic, Tijana and Hardt, Moritz}, journal={Advances in Neural Information Processing Systems}, volume={33}, pages={4929--4939}, year={2020} }
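
The notion of performative stability summarized above can be illustrated with a toy sketch. It is not code from either paper; the setup is an assumption chosen for simplicity: deploying a parameter `theta` shifts the data distribution to one with mean `mu + eps * theta`, and the loss is squared error, so the risk minimizer on the induced distribution is just that mean. The names `repeated_risk_minimization` and `stable_point`, and the parameters `mu` and `eps`, are hypothetical.

```python
def repeated_risk_minimization(mu, eps, theta0=0.0, steps=50):
    """Retrain repeatedly on the distribution induced by the last deployed model.

    Assumed toy setup: deploying theta yields data with mean mu + eps * theta.
    Under squared loss the population risk minimizer is that mean, so one
    retraining step is theta <- mu + eps * theta.
    """
    theta = theta0
    for _ in range(steps):
        theta = mu + eps * theta  # exact risk minimizer on the induced distribution
    return theta


def stable_point(mu, eps):
    """Fixed point of the retraining map: theta = mu + eps * theta."""
    if eps == 1.0:
        raise ValueError("no unique fixed point when eps == 1")
    return mu / (1.0 - eps)


if __name__ == "__main__":
    # With |eps| < 1 the retraining map is a contraction and the iterates
    # approach the fixed point.
    print(repeated_risk_minimization(mu=1.0, eps=0.5))
    print(stable_point(mu=1.0, eps=0.5))
```

In this sketch the iterates converge only when `|eps| < 1`, i.e. when the distribution reacts weakly enough to the deployed model; for larger `eps` the same loop diverges.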

Strategic classification


Strategic Classification is Causal Modeling in Disguise	John Miller, Smitha Milli, Moritz Hardt	2020	Summary Miller et al. (2020) argue that strategic classification involves causal modelling and that designing incentives for improvement requires solving a non-trivial causal inference problem. The authors distinguish between gaming and improvement and provide a causal framework for strategic adaptation.	Bibtex @inproceedings{miller2020strategic, title={Strategic classification is causal modeling in disguise}, author={Miller, John and Milli, Smitha and Hardt, Moritz}, booktitle={International Conference on Machine Learning}, pages={6917--6926}, year={2020}, organization={PMLR} }
Alternative Microfoundations for Strategic Classification	Meena Jagadeesan, Celestine Mendler-Dünner, Moritz Hardt	2021	Summary Jagadeesan et al. (2021) show that standard microfoundations in strategic classification, which typically use individual-level behaviour to deduce aggregate-level responses, can lead to degenerate behaviour in aggregate: discontinuities in the aggregate response, stable points ceasing to exist, and maximizing social burden. The authors introduce a noisy response model inspired by performative prediction that mitigates these limitations for binary classification.	Bibtex @inproceedings{jagadeesan2021alternative, title={Alternative microfoundations for strategic classification}, author={Jagadeesan, Meena and Mendler-D{\"u}nner, Celestine and Hardt, Moritz}, booktitle={International Conference on Machine Learning}, pages={4687--4697}, year={2021}, organization={PMLR} }

Data Valuation Researchers

Name	Institute	h-index
Costas Spanos	University of California, Berkeley	61
Jinsung Yoon	Google Cloud AI	33
Tomas Pfister	Google Cloud AI	39
Amirata Ghorbani	Stanford	18
James Zou	Stanford	64
Nektaria Tryfona	Virginia Tech	27
Rachael Hwee Ling Sim	National University of Singapore	4
Bryan Kian Hsiang Low	National University of Singapore	38
Dawn Song	University of California, Berkeley	142
Zhaoxuan Wu	National University of Singapore	4
Xinyi Xu	National University of Singapore	8
Tianhao Wang	University of Virginia	18
José González Cabañas	UC3M-Santander Big Data Institute	7
Ruben Cuevas Rumin	Universidad Carlos III de Madrid	26
Jiachen T. Wang	Princeton University	9
Bohong Wang	Tsinghua University	6
Yongchan Kwon	Columbia University	10
Siyi Tang	Artera	8
Li Xiong	Emory University	52
Jessica Vitak	University of Maryland	49
Katie Chamberlain Kritikos	University of Illinois at Urbana-Champaign	6
Zhenan Fan	Huawei Technologies Canada	6
Shuyue Wei	Beihang University	4
Hannah Stein	Saarland University	3
Wolfgang Maass	Saarland University	26
Mohammad Mohammadi Amiri	Rensselaer Polytechnic Institute	18
Ramesh Raskar	MIT	103
Konstantin D. Pandl	Karlsruhe Institute of Technology	6
Ali Sunyaev	Karlsruhe Institute of Technology	43
Ludovico Boratto	University of Cagliari	25
Han Xiao		70
Junjie Wu	Center for High Pressure Science & Technology Advanced Research	55
Xiao Tian	National University of Singapore	1
Kean Birch	Institute for Technoscience & Society	40
Callum Ward	Uppsala University	10
Praveer Singh	University of Colorado School of Medicine	19
Anran Xu	Shanghai Jiao Tong University	2
Guihai Chen		67
Andre Esteva	Co-Founder & CEO, Artera	23
Prateek Mittal	Princeton University	55
Hyeontaek Oh	Institute for IT Convergence	9
Lingjiao Chen	Stanford	13
Xiangyu Chang	Xi'an Jiaotong University	17
Hoang Anh Just	Virginia Tech	3
David Dao	ETH	13
Mark Mazumder	Harvard	12
Vijay Janapa Reddi	Harvard	46
Sabri Eyuboglu	Stanford	6
Wenqian Li	National University of Singapore	2
Bojan Karlaš	Harvard	14
