Update synthetic_news_data.Rmd #88

Open: wants to merge 3 commits into base `main`
151 changes: 90 additions & 61 deletions vignettes/synthetic_news_data.Rmd
@@ -1,6 +1,6 @@
---
title: "Synthetic NEWS Data"
author: "Dr Muhammad Faisal, Gary Hutson, and Professor Mohammed A Mohammed"
author: "Jason Pott"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Synthetic NEWS Data}
@@ -15,95 +15,124 @@ knitr::opts_chunk$set(
)
```

## What is Synthetic data?
## Loading the dataset from NHSRdatasets

This dataset is available from the [NHSRdatasets](https://CRAN.R-project.org/package=NHSRdatasets) package, and similar comparisons can be made with the above. These examples can be used for data wrangling and data visualisation.

```{r}
library(NHSRdatasets)

The goal is to generate a data set that contains no real units, and is therefore safe for public release, while retaining the structure of the data.
NEWS_var <- NHSRdatasets::synthetic_news_data

In other words, one can say that synthetic data contains all the characteristics of original data minus the sensitive content.
```

Synthetic data is generally made to validate mathematical models. This data is used to compare the behaviour of the real data against the one generated by the model.
For more information, see this introduction to the [synthpop](http://gradientdescending.com/generating-synthetic-data-sets-with-synthpop-in-r/) package.

## What is NEWS?

## How do we generate synthetic data?
NEWS is short for the National Early Warning Score. [NHS England have provided a detailed introduction here](https://www.england.nhs.uk/ourwork/clinical-policy/sepsis/nationalearlywarningscore/).

The principle is to observe real-world statistical distributions from the original data and reproduce fake data by drawing simple numbers.
The latest iteration of the NEWS score is NEWS2.

Consider a data set with $p$ variables. In a nutshell, synthesis follows these steps:
The premise of NEWS is that physiological measures such as heart rate (pulse), respiration rate, and level of consciousness (GCS or AVPU) are all routinely measured.

1. Take a simple random sample of $x_{1,obs}$ and set as $x_{1,syn}$
2. Fit model $f(x_{2,obs}|x_{1,obs})$ and draw $x_{2,syn}$ from $f(x_{2,syn}|x_{1,syn})$
3. Fit model $f(x_{3,obs}|x_{1,obs},x_{2,obs})$ and draw $x_{3,syn}$ from $f(x_{3,syn}|x_{1,syn},x_{2,syn})$
4. And so on, until $f(x_{p,syn}|x_{1,syn},x_{2,syn},...,x_{p-1,syn})$
- GCS: Glasgow Coma Score, a categorical score (3-15) measuring eye, verbal and motor responses.
- AVPU: a categorical description of how conscious a patient is: A (Alert), V (responds to voice), P (responds to painful stimuli), U (unresponsive).

This amounts to fitting statistical models to the original data and generating completely new records for public release.
The joint distribution $f(x_1,x_2,x_3,\ldots,x_p)$ is approximated by a sequence of conditional distributions, $f(x_1)\,f(x_2|x_1)\cdots f(x_p|x_1,\ldots,x_{p-1})$.
However, a range of professional groups use these measurements, and it can be challenging to recognise a deteriorating patient from the raw measurements alone, especially for those who do not often work with acutely unwell patients.

NEWS(2) provides categorical classifications for distinct ranges of physiology. Each category is scored 0-3.

## Synthetic data generation - National early warning score (NEWS) utilising real data
The more abnormal a measure of physiology, the greater the categorical score attributed. The score is supposed to be calculated at the time the physiology is measured. In a hospital, this is often when the nurse or healthcare assistant completes their observation rounds.

The data this is based on is the [NEWS](https://www.rcplondon.ac.uk/projects/outputs/national-early-warning-score-news-2) Score devised by the Royal College of Physicians.
The categorical NEWS score is then linked to distinct actions that should be followed. These actions will typically be localised by organisations depending on the level of resource that is available to support medical emergencies.
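
As a rough illustration of how an aggregate score can map to a response level, the sketch below uses the trigger thresholds published in the Royal College of Physicians' NEWS2 guidance (0-4 low, 5-6 medium, 7 or more high, with any single sub-score of 3 raising a low score to low-medium). The function name `news_risk_band()` and its arguments are hypothetical, and local escalation policies may use different cut-offs.

```{r}
library(dplyr)

# Hypothetical helper: map an aggregate NEWS and the largest single sub-score
# to a clinical risk band. Thresholds are assumed from the RCP NEWS2 guidance.
news_risk_band <- function(total, max_sub_score) {
  case_when(
    is.na(total) ~ NA_character_,      # no score, no band
    total >= 7 ~ "high",
    total >= 5 ~ "medium",
    max_sub_score == 3 ~ "low-medium", # a single extreme parameter
    TRUE ~ "low"
  )
}

news_risk_band(total = c(2, 3, 5, 9), max_sub_score = c(1, 3, 2, 3))
# "low" "low-medium" "medium" "high"
```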

Synthetic data can be generated from new data, utilising the above methodology, on the real observed data:
### Criticisms of NEWS

```{r observed_data_generation}
library(readr)
library(dplyr)
df <- suppressWarnings(read_csv("https://raw.githubusercontent.com/StatsGary/SyntheticNEWSData/main/observed_news_data.csv") %>%
dplyr::select(everything(), -X1))
Some criticisms of the original NEWS were addressed by NEWS2. Normal ranges of oxygen saturation (SpO2) are not universal, which often meant over-escalation of "normal" abnormal physiology in patients with respiratory diseases such as COPD. NEWS2 addressed this through adjusted ranges for SpO2.

glimpse(df)
```
There have also been concerns that in some cases NEWS has been introduced (often as a mandatory measure) in settings where it has not been validated. The score was developed by the Royal College of Physicians, who often represent clinical specialties working in in-patient medicine. As such, the data used to develop the score came from patients who were typically out of the acute phase of their illness, so abnormal physiology was measured after therapeutic interventions. In most cases NEWS has been shown to be robust to these criticisms.

This reads in the observed NEWS data from the GitHub repository. Now, we will utilise the `synthpop` package to create a synthetically generated dataset.
NEWS is more work for (typically nursing) staff to complete. NEWS is also not validated as an incomplete score, for example where only heart rate, blood pressure and SpO2 are recorded, which is a common set of measurements in most outpatient settings.

## Generating synthetic NEWS dataset using synthpop package
### Here are some code chunks for the calculation of NEWS sub scores:

As stated, now we will use the real observed data and generate a synthetic set, utilising the equations and process mapped out in the preceding sections:
#### Systolic Blood pressure (column `syst`)

```{r synth}
library(synthpop)
syn_df <- syn(df, seed = 4321)
#### synthetic data
synthetic_news_data <- syn_df$syn
glimpse(synthetic_news_data)
```{r}
library(NHSRdatasets)
library(dplyr)

# Score systolic blood pressure; missing readings stay NA rather than scoring 0
sbp_news <- NEWS_var |>
  mutate(sbp = as.numeric(syst)) |>
  mutate(news = case_when(
    is.na(sbp) ~ NA_real_,
    sbp <= 90 | sbp >= 220 ~ 3,
    sbp <= 100 ~ 2,  # 91-100
    sbp <= 110 ~ 1,  # 101-110
    TRUE ~ 0         # 111-219
  ))
```
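
Cut-off values are a common source of error in banding logic like this. A minimal sketch, assuming a hypothetical helper `score_sbp()` that applies the same systolic bands to a plain numeric vector, lets the boundaries be checked without the dataset:

```{r}
library(dplyr)

# Hypothetical helper: apply the systolic blood pressure bands to a numeric vector.
score_sbp <- function(sbp) {
  case_when(
    is.na(sbp) ~ NA_real_,          # missing readings stay missing
    sbp <= 90 | sbp >= 220 ~ 3,
    sbp <= 100 ~ 2,                 # 91-100
    sbp <= 110 ~ 1,                 # 101-110
    TRUE ~ 0                        # 111-219
  )
}

# Values either side of each cut-off, plus a missing reading
score_sbp(c(90, 91, 100, 101, 110, 111, 219, 220, NA))
# 3 2 2 1 1 0 0 3 NA
```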

```{r visuals}
library(ggplot2)
# Create temperature tibbles to compare observed vs synthetically generated labels
obs <- tibble(label = "observed_data", value = df$temp)
synth <- tibble(label = "synthetic_data", value = synthetic_news_data$temp)

# Merge the frames together to get a comparison
merged <- obs %>%
bind_rows(synth)

# Create the plot
plot <- merged %>%
ggplot(aes(value, fill = label)) +
geom_histogram(alpha = 0.9, position = "identity") +
theme_minimal() +
scale_fill_manual(values = c("#BCBDC1", "#2061AC")) +
labs(
title = "Observed vs Synthetically Generated NEWS values",
subtitle = "Based on NEWS Temperature score",
x = "NEWS Temperature Score", y = "Score frequency"
) +
theme(legend.position = "none")

print(plot)
#### Heart Rate (column `pulse`)

```{r}
# Score heart rate; range comparisons also handle non-integer values
hr_news <- NEWS_var |>
  mutate(pulse = as.numeric(pulse)) |>
  mutate(news = case_when(
    is.na(pulse) ~ NA_real_,
    pulse <= 40 | pulse >= 131 ~ 3,
    pulse >= 111 ~ 2,                # 111-130
    pulse <= 50 | pulse >= 91 ~ 1,   # 41-50 or 91-110
    TRUE ~ 0                         # 51-90
  ))
```

#### Resp Rate (column `resp`)

```{r}
# Score respiratory rate; missing readings stay NA
rr_news <- NEWS_var |>
  mutate(resp_rate = as.numeric(resp)) |>
  mutate(news = case_when(
    is.na(resp_rate) ~ NA_real_,
    resp_rate <= 8 | resp_rate >= 25 ~ 3,
    resp_rate >= 21 ~ 2,  # 21-24
    resp_rate <= 11 ~ 1,  # 9-11
    TRUE ~ 0              # 12-20
  ))
```

## Loading the dataset from NHSRdatasets
#### SpO2 Oxygen Saturation (column `sat`)

```{r}
# Score SpO2 on the standard scale; missing readings stay NA
sat_news <- NEWS_var |>
  mutate(news = case_when(
    is.na(sat) ~ NA_real_,
    sat <= 91 ~ 3,
    sat <= 93 ~ 2,  # 92-93
    sat <= 95 ~ 1,  # 94-95
    TRUE ~ 0        # 96 and above
  ))
```

This dataset is available from the [NHSRdatasets](https://CRAN.R-project.org/package=NHSRdatasets) package, and similar comparisons can be made with the above. These examples can be used for data wrangling and data visualisation.
#### Temperature (column `temp`)

```{r}
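# Score temperature with range comparisons: `38.1:39` expands to 38.1 only,
# so `%in%` would miss most values in the 38.1-39.0 and 35.1-36.0 bands
temp_news <- NEWS_var |>
  mutate(news = case_when(
    is.na(temp) ~ NA_real_,
    temp <= 35 ~ 3,
    temp >= 39.1 ~ 2,
    temp >= 38.1 | temp <= 36 ~ 1,  # 38.1-39.0 or 35.1-36.0
    TRUE ~ 0                        # 36.1-38.0
  ))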
```
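
Decimal measurements such as temperature are awkward to match with `%in%` against a sequence, because `:` steps in whole units (for example, `38.1:39` expands to the single value `38.1`). One alternative, sketched here with base R's `cut()`, bands a numeric vector on right-closed intervals. The helper name `score_temp()` is hypothetical.

```{r}
# Hypothetical helper: score temperature (degrees Celsius) using right-closed bands:
# <=35.0 -> 3, 35.1-36.0 -> 1, 36.1-38.0 -> 0, 38.1-39.0 -> 1, >=39.1 -> 2
score_temp <- function(temp) {
  bands <- cut(
    temp,
    breaks = c(-Inf, 35, 36, 38, 39, Inf),
    labels = FALSE,  # return the band index (1 to 5) rather than a factor
    right = TRUE     # intervals are (lower, upper]
  )
  c(3, 1, 0, 1, 2)[bands]  # look up the sub-score for each band; NA stays NA
}

score_temp(c(34.8, 35.5, 37.0, 38.4, 39.5, NA))
# 3 1 0 1 2 NA
```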

For more information, see this introduction to the [synthpop](http://gradientdescending.com/generating-synthetic-data-sets-with-synthpop-in-r/) package.
In addition, NEWS2 has altered SpO2 ranges for patients with known respiratory diseases. Implementing these needs additional logic on a per-patient basis.
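
As one possible shape for that per-patient logic, the sketch below scores SpO2 on the NEWS2 alternative scale (Scale 2) for patients flagged as needing it, and on the standard scale otherwise. The function `score_spo2()` and its `scale2` and `on_oxygen` flags are hypothetical, and the Scale 2 bands are taken from the published NEWS2 chart, so they should be checked against local guidance before use.

```{r}
library(dplyr)

# Hypothetical helper: SpO2 sub-score with an optional per-patient Scale 2 flag.
# `scale2` marks patients with a prescribed 88-92% target range;
# `on_oxygen` marks readings taken on supplemental oxygen.
score_spo2 <- function(sat, scale2 = FALSE, on_oxygen = FALSE) {
  case_when(
    is.na(sat) ~ NA_real_,
    # Standard scale (Scale 1)
    !scale2 & sat <= 91 ~ 3,
    !scale2 & sat <= 93 ~ 2,
    !scale2 & sat <= 95 ~ 1,
    !scale2 ~ 0,
    # Scale 2: low saturations
    sat <= 83 ~ 3,
    sat <= 85 ~ 2,
    sat <= 87 ~ 1,
    # Scale 2: in target range, or above it while breathing air
    sat <= 92 | !on_oxygen ~ 0,
    # Scale 2: above target range while on oxygen
    sat <= 94 ~ 1,
    sat <= 96 ~ 2,
    TRUE ~ 3
  )
}

score_spo2(
  sat       = c(90, 90, 95, 95),
  scale2    = c(FALSE, TRUE, TRUE, TRUE),
  on_oxygen = c(FALSE, FALSE, FALSE, TRUE)
)
# 3 0 0 2
```

The same reading of 90% scores 3 on the standard scale but 0 on Scale 2, which is the over-escalation the adjusted ranges are meant to avoid.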

## Summary

In many ways, synthetic data reflects George Boxs observation that all models are wrong, but some are useful while providing a useful approximation [of] those found in the real world,”
In many ways, synthetic data reflects George Box's observation that "all models are wrong, but some are useful" while providing a "useful approximation [of] those found in the real world".

The connection between the clinical outcomes of a patient visit and its costs rarely exists in practice, so being able to assess these trade-offs in synthetic data allows for measurement and enhancement of the value of care (cost divided by outcomes).
