Commit 1da2ec0
Updated schemas and documentation with consensus terms and deduplicated file path instructions.
mmcdermott committed Jul 30, 2024
1 parent 5985629 commit 1da2ec0
Showing 2 changed files with 132 additions and 79 deletions.
162 changes: 115 additions & 47 deletions README.md
@@ -16,7 +16,7 @@ up MEDS, we will define some key terminology that we use in this standard.
a hospital admission rather than an individual.
2. A _code_ is the categorical descriptor of what is being observed in any given observation of a patient.
In particular, in almost all structured, longitudinal datasets, a measurement can be described as
consisting of a tuple containing a `patient_id` (who this measurement is about); a `time` (when this
measurement happened); some categorical qualifier describing what was measured, which we will call a
`code`; a value of a given type, such as a `numerical_value`, a `text_value`, or a `categorical_value`;
and possibly one or more additional measurement properties that describe the measurement in a
@@ -28,7 +28,7 @@ MEDS consists of four main data components/schemas:
1. A _data schema_. This schema describes the underlying medical data, organized as sequences of patient
observations, in the dataset.
2. A _patient subsequence label schema_. This schema describes labels that may be predicted about a patient
at a given time in the patient record.
3. A _code metadata schema_. This schema contains metadata describing the codes used to categorize the
observed measurements in the dataset.
4. A _dataset metadata schema_. This schema contains metadata about the MEDS dataset itself, such as when it
@@ -43,7 +43,7 @@ found in the following subfolders:
series of possibly nested sharded dataframes stored in `parquet` files. In particular, the file glob
`glob("$MEDS_ROOT/data/**/*.parquet")` will capture all sharded data files of the raw MEDS data, all
organized into _data schema_ files, sharded by patient and sorted, for each patient, by
time.
- `$MEDS_ROOT/metadata/codes.parquet`: This file contains per-code metadata in the _code metadata schema_
about the MEDS dataset. As this dataset describes all codes observed in the full MEDS dataset, it is _not_
sharded. Note that some pre-processing operations may, at times, produce sharded code metadata files, but
@@ -69,67 +69,135 @@ does not exist.

### Schemas

#### The data schema.
MEDS data also must satisfy two important properties:
1. Data about a single patient cannot be split across parquet files. If a patient is in a dataset it must be
in one and only one parquet file.
2. Data about a single patient must be contiguous within a particular parquet file and sorted by time.

The data schema has four mandatory fields:
1. `patient_id`: The ID of the patient this event is about.
2. `time`: The time of the event. This field is nullable for static events.
3. `code`: The code of the event.
4. `numeric_value`: The numeric value of the event. This field is nullable for non-numeric events.

In addition, it can contain any number of custom properties to further enrich observations. The Python
function below generates a pyarrow schema for a given set of custom properties.

```python
def data_schema(custom_properties=[]):
    return pa.schema(
        [
            ("patient_id", pa.int64()),
            ("time", pa.timestamp("us")),  # Static events will have a null timestamp
            ("code", pa.string()),
            ("numeric_value", pa.float32()),
        ] + custom_properties
    )
```
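
As a purely illustrative sketch, the snippet below constructs a tiny, schema-conformant shard and writes it
to parquet. The example codes, rows, and file path are hypothetical, and the `data_schema` function above is
assumed to be in scope:

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Rows for a patient must be stored contiguously and sorted by time; the static (null-time) event is
# listed first here.
rows = [
    {"patient_id": 123, "time": None, "code": "GENDER//FEMALE", "numeric_value": None},
    {
        "patient_id": 123,
        "time": datetime.datetime(2020, 1, 1, 12, 0),
        "code": "LAB//HEMOGLOBIN",
        "numeric_value": 11.4,
    },
]

table = pa.Table.from_pylist(rows, schema=data_schema())
pq.write_table(table, "data/shard_0.parquet")  # lands under $MEDS_ROOT/data/, per the layout above
```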

#### The label schema.
A label gives a value to be predicted about a patient as of a given prediction time. Models, when predicting
a label, are allowed to use all data about that patient up to and including the prediction time. Exclusive
prediction times are not currently supported, but if you have a use case for them, please open a GitHub
issue.

```python
label = pa.schema(
    [
        ("patient_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("integer_value", pa.int64()),
        ("float_value", pa.float64()),
        ("categorical_value", pa.string()),
    ]
)

Label = TypedDict("Label", {
    "patient_id": int,
    "prediction_time": datetime.datetime,
    "boolean_value": Optional[bool],
    "integer_value": Optional[int],
    "float_value": Optional[float],
    "categorical_value": Optional[str],
}, total=False)
```
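
As a small, hypothetical example of this schema in use, the following builds a one-row label table for a
boolean prediction target; value columns that are not used are simply left null:

```python
import datetime

import pyarrow as pa

# `label` is the pyarrow schema defined above.
labels = pa.Table.from_pylist(
    [
        {
            "patient_id": 123,
            "prediction_time": datetime.datetime(2020, 1, 2),
            "boolean_value": True,
        }
    ],
    schema=label,
)
```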

#### The patient split schema.

Three sentinel split names are defined for convenience and shared processing:
1. A training split, named `train`, used for ML model training.
2. A tuning split, named `tuning`, used for hyperparameter tuning. This is sometimes also called a
"validation" split or a "dev" split. In many cases, standardizing on a tuning split is not necessary and
users should feel free to merge this split with the training split if desired.
3. A held-out split, named `held_out`, used for final model evaluation. In many cases, this is also called a
"test" split. When performing benchmarking, this split should not be used at all for model selection,
training, or for any purposes up to final validation.

Additional split names can be used by the user as desired.

```python
train_split = "train"
tuning_split = "tuning"
held_out_split = "held_out"

patient_split = pa.schema(
    [
        ("patient_id", pa.int64()),
        ("split", pa.string()),
    ]
)

PatientSplit = TypedDict("PatientSplit", {
    "patient_id": int,
    "split": str,
}, total=True)
```
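
A hypothetical sketch of how these splits might be applied, assuming split assignments are stored in a
parquet file following the `patient_split` schema above (both file paths here are illustrative):

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

splits = pq.read_table("metadata/patient_splits.parquet")  # illustrative path
data_table = pq.read_table("data/shard_0.parquet")  # illustrative path

# Collect the patient IDs assigned to the training split, then filter the data table down to them.
train_ids = splits.filter(pc.equal(splits["split"], train_split))["patient_id"]
train_data = data_table.filter(
    pc.is_in(data_table["patient_id"], value_set=train_ids.combine_chunks())
)
```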

#### The dataset metadata schema.

```python

dataset_metadata = {
    "type": "object",
    "properties": {
        "dataset_name": {"type": "string"},
        "dataset_version": {"type": "string"},
        "etl_name": {"type": "string"},
        "etl_version": {"type": "string"},
        "meds_version": {"type": "string"},
    },
}

# Python type for the above schema

DatasetMetadata = TypedDict(
    "DatasetMetadata",
    {
        "dataset_name": NotRequired[str],
        "dataset_version": NotRequired[str],
        "etl_name": NotRequired[str],
        "etl_version": NotRequired[str],
        "meds_version": NotRequired[str],
    },
    total=False,
)
```
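
Because this is a JSON schema (not a parquet one), a dataset's metadata can be checked with the third-party
`jsonschema` package. A minimal sketch, assuming the metadata lives in a JSON file whose name below is
illustrative:

```python
import json

from jsonschema import validate

with open("metadata/dataset.json") as f:  # illustrative path
    metadata = json.load(f)

validate(instance=metadata, schema=dataset_metadata)  # raises jsonschema.ValidationError on mismatch
```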

#### The code metadata schema.

```python
def code_metadata_schema(custom_per_code_properties=[]):
    code_metadata = pa.schema(
        [
            ("code", pa.string()),
            ("description", pa.string()),
            ("parent_codes", pa.list_(pa.string())),
        ] + custom_per_code_properties
    )

    return code_metadata

# Python type for the above schema

CodeMetadata = TypedDict("CodeMetadata", {"code": str, "description": str, "parent_codes": List[str]}, total=False)
```
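
As a hypothetical usage sketch, a custom per-code column can be threaded through
`custom_per_code_properties`; the `vocabulary` property and the example code below are invented for
illustration:

```python
import pyarrow as pa

schema_with_vocab = code_metadata_schema([("vocabulary", pa.string())])

codes = pa.Table.from_pylist(
    [
        {
            "code": "LAB//HEMOGLOBIN",
            "description": "Hemoglobin measurement",
            "parent_codes": ["LOINC/718-7"],
            "vocabulary": "LOINC",
        }
    ],
    schema=schema_with_vocab,
)
```
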
49 changes: 17 additions & 32 deletions src/meds/schema.py
@@ -1,38 +1,23 @@
"""The core schemas for the MEDS format.
Please see the README for more information, including expected file organization on disk, more details on what
each schema should capture, etc.
"""
import datetime
from typing import Any, List, Mapping, Optional

import pyarrow as pa
from typing_extensions import NotRequired, TypedDict

############################################################

# The data schema.
#
# MEDS data also must satisfy two important properties:
#
# 1. Data about a single patient cannot be split across parquet files. If a patient is in a dataset it must be in one and only one parquet file.
# 2. Data about a single patient must be contiguous within a particular parquet file and sorted by time.

# Both of these restrictions allow the stream rolling processing (see https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rolling.html),
# which vastly simplifies many data analysis pipelines.
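#
# For example (an illustrative sketch, not part of the original file), with a MEDS shard loaded into a
# polars DataFrame `df` (and `import polars as pl`), a per-patient 30-day rolling mean might look like:
#
#     df.rolling(index_column="time", period="30d", group_by="patient_id").agg(
#         pl.col("numeric_value").mean()
#     )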
@@ -41,22 +26,24 @@
birth_code = "MEDS_BIRTH"
death_code = "MEDS_DEATH"

def data_schema(custom_properties=[]):
    return pa.schema(
        [
            ("patient_id", pa.int64()),
            ("time", pa.timestamp("us")),  # Static events will have a null timestamp
            ("code", pa.string()),
            ("numeric_value", pa.float32()),
        ] + custom_properties
    )

# No python type is provided because Python tools for processing MEDS data will often provide their own types.
# See https://github.com/EthanSteinberg/meds_reader/blob/0.0.6/src/meds_reader/__init__.pyi#L55 for example.

############################################################

# The label schema. Models, when predicting this label, are allowed to use all data about a patient up to and
# including the prediction time. Exclusive prediction times are not currently supported, but if you have a use
# case for them please add a GitHub issue.

label = pa.schema(
    [
@@ -85,9 +72,9 @@ def patient_events_schema(custom_per_event_properties=[]):

# The patient split schema.

train_split = "train" # For ML training.
tuning_split = "tuning" # For ML hyperparameter tuning. Also often called "validation" or "dev".
held_out_split = "held_out" # For final ML evaluation. Also often called "test".

patient_split = pa.schema(
    [
@@ -105,7 +92,6 @@ def patient_events_schema(custom_per_event_properties=[]):

# The dataset metadata schema.
# This is a JSON schema.


dataset_metadata = {
@@ -137,7 +123,6 @@ def patient_events_schema(custom_per_event_properties=[]):

# The code metadata schema.
# This is a parquet schema.

def code_metadata_schema(custom_per_code_properties=[]):
    code_metadata = pa.schema(