Home
- What Is The Data Artifacts Glossary?
- What Is The Data Artifacts Glossary's Ultimate Goal?
- What Problem Does The Data Artifacts Glossary Solve?
- What Design Principles Underlie The Data Artifacts Glossary?
- How Does The Data Artifacts Glossary Accomplish Its Goals?
- The Data Artifacts Glossary Initial Categories
The Data Artifacts Glossary is a dynamic, collaborative platform designed to continuously document and update biases within healthcare datasets. Unlike traditional frameworks that offer only static views of data characteristics, this living document serves as both a repository and an interactive forum, evolving with new medical data and insights. It provides detailed documentation of each dataset, including its origin and any updates to its biases, ensuring transparency and traceability. By inviting contributions from a diverse community of researchers, clinicians, and AI developers, the Data Artifacts Glossary not only aids in understanding both the general attributes and the specific biases of datasets but also promotes a critical approach to their analysis and use. As a valuable educational resource, it fosters a community-driven effort to identify and resolve biases, thereby enhancing accountability and ethical responsibility in AI development in healthcare.
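Concretely, one way to picture an entry in such a repository is as a small, versionable record that pairs a dataset's provenance with a running list of documented biases. The sketch below is a hypothetical schema, not the Glossary's actual format; every field name is an illustrative assumption.

```python
# Hypothetical sketch of a glossary entry: a dataset's provenance plus
# a running, dated list of documented biases. Field names are
# illustrative assumptions, not the project's actual schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class BiasEntry:
    category: str      # e.g., "Validity of data points"
    description: str   # what the bias is and whom it affects
    evidence: str      # citation or link supporting the entry
    reported: date     # when the bias was documented

@dataclass
class DatasetEntry:
    dataset: str       # e.g., "MIMIC-IV"
    origin: str        # collecting institution and time period
    biases: list[BiasEntry] = field(default_factory=list)

    def add_bias(self, bias: BiasEntry) -> None:
        """Append a newly documented bias, keeping the entry current."""
        self.biases.append(bias)
```

Because each addition is a discrete, dated record, an entry can grow as new biases are reported without erasing its history, which is what distinguishes a living document from a static data card.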
The ultimate goal of the Data Artifacts Glossary is to enhance the integrity and efficacy of AI applications in healthcare by providing a resource that helps mitigate the risk of bias from the outset. By equipping stakeholders with detailed, up-to-date information on dataset biases, the Glossary aids in the development of more accurate and fair AI algorithms. It also supports the broader objective of ethical AI use, aligning with international efforts to ensure that AI systems are safe, secure, and trustworthy. As AI continues to permeate healthcare, the Data Artifacts Glossary will be indispensable in promoting a more equitable healthcare system, where decisions supported by AI are as unbiased and inclusive as possible.
Clinical bias is a well-documented issue in the medical field, rooted in unconscious cognitive shortcuts and perpetuated through societal inequities, leading to significant disparities in patient care. These biases affect various aspects of healthcare, such as pain management and disease screening, where minority groups often receive substandard care due to prejudiced beliefs and practices. Furthermore, when AI is trained on data embedded with such biases, it not only replicates these issues but also risks amplifying them. This includes a broad range of biases in AI applications, from sex-based disparities in cardiovascular risk predictions to ethnic disparities in disease detection, which can affect any individual whose data deviates from the typical profiles seen in training datasets.
Recognizing the critical need for transparency and accountability in AI, global entities such as the European Parliament and the White House have taken significant steps to mitigate bias in AI applications. Despite these efforts, many current initiatives, such as Data Cards and Model Cards, focus predominantly on general dataset characteristics and cannot dynamically track new biases as they emerge. The Data Artifacts Glossary addresses this gap by providing a dynamic, community-driven platform that not only documents biases in healthcare datasets but is also updated continuously as new data and insights become available. This approach allows for a more nuanced understanding and proactive mitigation of biases, enhancing the integrity and applicability of AI in healthcare.
Like successful open-source projects such as Linux and Python, the Data Artifacts Glossary follows several key software development practices that have contributed to those projects' success and widespread adoption. These include:
- Version Control: the glossary embraces a transparent process to manage changes to the repository using GitHub, a platform widely recognized for its robust version control and collaborative features. This allows multiple community members, including researchers, clinicians, AI developers, and other stakeholders, to work on the project simultaneously, tracking changes and managing versions efficiently.
- Code Review and Pull Requests: changes are submitted as pull requests, which are essentially proposals for revisions or additions to the glossary, made publicly available for review. Each submission is rigorously evaluated through a peer review process managed by the project maintainers, who are carefully selected for their expertise and dedication to promoting unbiased AI in healthcare. This peer review ensures that each contribution meets the project's standards for quality and accuracy; a minimal sketch of an automated submission check appears after this list.
- Documentation: good documentation is crucial for open-source projects. It not only helps new users understand how to use the tool but also assists new contributors in understanding the codebase and the project's architecture.
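To make the review workflow concrete, the sketch below shows the kind of automated check a pull request could trigger before human review. It assumes, purely for illustration, that proposed entries are submitted as JSON files with a handful of required fields; neither the format nor the field names are prescribed by the project.

```python
# Hypothetical pull-request check: verify that each proposed glossary
# entry (assumed here to be a JSON file) carries the fields reviewers
# need. The required fields are assumptions for illustration.
import json
import sys

REQUIRED_FIELDS = {"dataset", "category", "description", "evidence"}

def validate_entry(path: str) -> list[str]:
    """Return a list of problems found in one submitted entry."""
    with open(path) as f:
        entry = json.load(f)
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - entry.keys())]
    if not entry.get("evidence", "").strip():
        problems.append("evidence must cite a source")
    return problems

if __name__ == "__main__":
    # Paths of the changed entry files are passed on the command line.
    results = {path: validate_entry(path) for path in sys.argv[1:]}
    for path, problems in results.items():
        for problem in problems:
            print(f"{path}: {problem}")
    sys.exit(1 if any(results.values()) else 0)
```

Automated checks like this handle mechanical completeness, so maintainers and reviewers can spend their attention on the substance of a claimed bias and the evidence behind it.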
This methodology fosters an ongoing, dynamic update process while maintaining a high level of academic rigor. The open-source model promotes the inclusivity and collective responsibility essential for addressing the multifaceted nature of biases in healthcare datasets, mirroring the openness, peer review, and community engagement that rigorous scholarship and successful open-source projects share.
Of note, we do not aim to provide the final structure of the Data Artifacts Glossary, nor do we claim that the following categories are exhaustive. Instead, we suggest one potential structure as a starting point for populating the Data Artifacts Glossary, consisting of four initial categories: Participants not missing at random, Validity of data points, Data not missing at random, and Miscellaneous.
The first category, Participants not missing at random, captures bias stemming from the absence or underrepresentation of specific patient groups within the dataset, encompassing not only demographic factors but also clinical conditions, socioeconomic status, and accessibility variables that may skew research outcomes and subsequent clinical applications. Under this category, the Data Artifacts Glossary aims to illuminate hidden disparities by documenting the absence of certain groups due to selection biases or data collection constraints. This awareness is critical: it allows researchers and clinicians to evaluate a dataset's applicability to the general population, ensuring that medical interventions developed from AI models do not inadvertently perpetuate health inequities.
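As an illustration of the kind of check this category invites, the sketch below compares each group's share of a cohort against its share of a reference population and flags shortfalls beyond a tolerance. The column name, the reference proportions, and the threshold are all assumptions for illustration, not project requirements.

```python
# Sketch: flag patient groups whose share of the cohort falls short of
# their share of a reference population. The column name and tolerance
# are illustrative assumptions.
import pandas as pd

def representation_gaps(cohort: pd.DataFrame,
                        reference: dict[str, float],
                        column: str = "ethnicity",
                        tolerance: float = 0.05) -> dict[str, float]:
    """Return {group: shortfall} for groups underrepresented by more
    than `tolerance` relative to the reference population."""
    shares = cohort[column].value_counts(normalize=True)
    return {group: expected - shares.get(group, 0.0)
            for group, expected in reference.items()
            if expected - shares.get(group, 0.0) > tolerance}
```

A flagged group signals that findings derived from the cohort may not generalize to that group, which is precisely the disparity this category asks contributors to document.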
The second category, Validity of data points, examines the integrity of the data collected, focusing on potential biases introduced through the use of various medical devices and data recording methodologies. This category is pivotal because it questions the foundational accuracy of the dataset itself: whether the data points reflect true patient states or are distorted by technological and procedural variances. By cataloging these potential sources of error, the Data Artifacts Glossary promotes a more nuanced understanding of the data, which is essential for developing reliable AI models.
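For instance, pulse oximeters have been reported to overestimate oxygen saturation in patients with darker skin, so recorded values may not reflect a patient's true state. The sketch below shows one simple way to quantify such a device-related discrepancy, assuming (hypothetically) a table of paired pulse-oximeter (SpO2) and arterial blood gas (SaO2) readings with the column names shown.

```python
# Sketch: quantify device-related measurement error by patient group,
# using paired pulse-oximeter (spo2) and arterial blood gas (sao2)
# readings. Column names are illustrative assumptions.
import pandas as pd

def mean_device_error(paired: pd.DataFrame) -> pd.Series:
    """Average SpO2 - SaO2 discrepancy per group; a consistently
    positive value suggests the device overestimates saturation."""
    return (paired["spo2"] - paired["sao2"]).groupby(paired["group"]).mean()
```

A discrepancy that is systematically larger for one group than another is exactly the kind of validity problem this category is meant to record.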
The third category, Data not missing at random, investigates the uneven data collection practices that may occur across patient groups due to factors such as race, socioeconomic status, geographical location, and other demographic or contextual influences. It underscores the necessity of meticulously examining the consistency and fairness of data collection protocols and their execution across diverse patient populations. This scrutiny is crucial for identifying the systemic errors and biases that could detrimentally affect clinical research and the training of AI algorithms.
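One simple way to surface such unevenness, sketched below under the assumption that records sit in a single table with a group column, is to stratify per-variable missingness rates by patient group and look for large spreads.

```python
# Sketch: fraction of missing values per variable, stratified by patient
# group. A variable whose missingness differs sharply across groups is a
# candidate for a "data not missing at random" entry. The column name
# "group" is an illustrative assumption.
import pandas as pd

def missingness_by_group(df: pd.DataFrame, group_col: str = "group") -> pd.DataFrame:
    """Rows: groups; columns: variables; values: fraction missing."""
    return df.drop(columns=[group_col]).isna().groupby(df[group_col]).mean()
```

A lab value that is measured routinely for one group but only on clinical suspicion for another will show up here as a missingness gap between the corresponding rows.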
The fourth category, Miscellaneous, encompasses a broad range of biases that do not fit neatly into the other categories but are nonetheless crucial for understanding and using the dataset responsibly. These might include biases related to the geographic location of data collection, time-period-specific healthcare practices, or administrative biases in how data are recorded and processed. This section will be populated with examples that highlight less obvious but impactful biases affecting data interpretation and application in AI systems.
Learn more about Add-a-New-Example-of-Bias-to-the-MIMIC-IV-Bias-Glossary.
An MIT Critical Data Original Production