---
title: "NHANES Blood Pressure-Based Mortality Risk - Appendix"
author: "Rscripts by Hamish Patten, DW Bester and David Steinsaltz"
date: "03/08/2024"
output:
  bookdown::pdf_document2:
    keep_tex: true
    toc: true
    toc_depth: 3
    number_sections: true
    table_caption: true
    latex_engine: xelatex # Change the LaTeX engine to xelatex
header-includes:
  - "\\usepackage{lscape}" # Use lscape package to change page orientation
bibliography: nhanesBP.bib
---
```{r setup, include=FALSE,message=FALSE,warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(tinytex.verbose = TRUE)
library(tidyverse)
library(bookdown)
library(formatR)
library(magrittr)
library(knitr)
library(tinytex)
library(kableExtra)
library(VennDiagram)
library(lattice)
library(ggtern)
library(parallel)
library(survival)
c14<-c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A", "#FF7F00", "gold1", "skyblue2", "gray70", "maroon", "orchid1", "darkturquoise", "darkorange4", "brown", "black")
```
# Appendix A -- The data
```{r data-cleaning, include=FALSE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Load data
nhanes_old=read.csv('Data_raw/nh3bpdat290716.csv')
# Update mortality with 2017 public release
nhanes_update=read.csv('Data_raw/NHANES_ML_update.csv',colClasses = 'integer') %>%
mutate(yrsfuExam = round(permth_exm/12,2), yrsfuHome = round(permth_int/12, 2)) %>%
filter(eligstat == 1)
nhanes_old %<>% mutate(yrsfuExam = round(yrsfuExam, 2), yrsfuHome = round(yrsfuHome,2), permth_exm = round(yrsfuExam *12), permth_int = round(yrsfuHome*12))
# Compare old and new mortality
old_dead <- sort(nhanes_old$SEQN[nhanes_old$dead==1])
new_dead <- sort(nhanes_update$seqn[nhanes_update$mortstat==1])
conflict_dead <- old_dead[ !(old_dead %in% new_dead)] # Tested: No resurrections
missing_update <- nhanes_old$SEQN[!(nhanes_old$SEQN %in% nhanes_update$seqn)] #Tested: No one from old data set missing from followup
missing_old <- !(nhanes_update$seqn %in% nhanes_old$SEQN) # 6 subjects in the updated data set with mortality data (all alive) who were not in the old data set.
nhanes_update %<>% filter(!missing_old) # remove them
identical(nhanes_old$SEQN,nhanes_update$seqn) # Sequence of ids now identical
fu_diff <- nhanes_update$permth_int - nhanes_update$permth_exm
fu_diff_update <- nhanes_update$permth_int - nhanes_old$permth_int
identical(nhanes_old$UCOD_LEADING[!is.na(nhanes_old$UCOD_LEADING)], nhanes_update$ucod_leading[!is.na(nhanes_old$UCOD_LEADING)])
#TRUE, so cause-of-death codes for all individuals who were dead in the first data set are identical
conflict_old <- subset(nhanes_old, nhanes_old$SEQN %in% conflict_dead)
conflict_new <- subset(nhanes_update, nhanes_update$seqn %in% conflict_dead)
nhanes <- nhanes_old
nhanes$UCOD_LEADING <- nhanes_update$ucod_leading
nhanes$yrsfuExam <- nhanes_update$yrsfuExam
nhanes$yrsfuHome <- nhanes_update$yrsfuHome
nhanes$dead <- nhanes_update$mortstat
nhanes$mrtHrt <- as.integer(nhanes$UCOD_LEADING==1) # 1 codes for Heart
nhanes$mrtNeo <- as.integer(nhanes$UCOD_LEADING==2) # 2 codes for Neoplasm
nhanes$mrtInj <- as.integer(nhanes$UCOD_LEADING==4) # 4 codes for Injury
nhanes$mrtOtherCVD <- as.integer(nhanes$UCOD_LEADING==5) # 5 codes for other CVD
# Change NAs to 0
nhanes$mrtHrt[is.na(nhanes$mrtHrt)] <- 0
nhanes$mrtNeo[is.na(nhanes$mrtNeo)] <- 0
nhanes$mrtInj[is.na(nhanes$mrtInj)] <- 0
nhanes$mrtOtherCVD[is.na(nhanes$mrtOtherCVD)] <- 0
nhanes$eventhrt <- with(nhanes, mrtOtherCVD+mrtHrt)
nhanes$eventother <- with(nhanes, dead-eventhrt)
nhanes$yrsfu=pmin(nhanes_update$yrsfuHome,nhanes$yrsfuExam, na.rm = TRUE)
# Note: The most recent examination (home or clinic) is a left truncation time,
# so follow-up is the minimum. For some the clinic is missing.
# We call the follow-up the home follow-up time, but it doesn't matter,
# since those individuals will be excluded from analysis.
whichsys=match(c('systolicA','systolicB','systolicC'),names(nhanes))
whichdias=match(c('diastolicA','diastolicB','diastolicC'),names(nhanes))
whichsyshome=match(c('systolicAhome','systolicBhome','systolicChome'),names(nhanes))
whichdiashome=match(c('diastolicAhome','diastolicBhome','diastolicChome'),names(nhanes))
whichBP=c(whichsys,whichdias)
whichBPhome=c(whichsyshome,whichdiashome)
allsys=c(whichsys,whichsyshome)
alldias=c(whichdias,whichdiashome)
### Means and variances
sys=nhanes[,whichsys]
dias=nhanes[,whichdias]
sysH=nhanes[,whichsyshome]
diasH=nhanes[,whichdiashome]
nhanes$meandiasH=apply(diasH,1,mean)
nhanes$meansysH=apply(sysH,1,mean)
nhanes$meandiasC=apply(dias,1,mean)
nhanes$meansysC=apply(sys,1,mean)
# Overall means of home and clinic measurements, matched to their own BP type
nhanes$meandias=(apply(diasH,1,mean)+apply(dias,1,mean))/2
nhanes$meansys=(apply(sysH,1,mean)+apply(sys,1,mean))/2
nhanes$sysDel=(apply(sysH,1,mean)-apply(sys,1,mean))/2
nhanes$diasDel=(apply(diasH,1,mean)-apply(dias,1,mean))/2
nhanes$vardiasC=apply(dias,1,var)
nhanes$varsysC=apply(sys,1,var)
nhanes$vardiasH=apply(diasH,1,var)
nhanes$varsysH=apply(sysH,1,var)
nhanes$sddiasC=sqrt(nhanes$vardiasC)
nhanes$sdsysC=sqrt(nhanes$varsysC)
nhanes$sddiasH=sqrt(nhanes$vardiasH)
nhanes$sdsysH=sqrt(nhanes$varsysH)
nhanes$precdiasC=1/(nhanes$vardiasC+1/3)
nhanes$precsysC=1/(nhanes$varsysC+1/3)
nhanes$precdiasH=1/(nhanes$vardiasH+1/3)
nhanes$precsysH=1/(nhanes$varsysH+1/3)
race <- with(nhanes, ifelse(white==1, 2, ifelse(black==1, 1,ifelse(mexican==1,3, 0))))
nhanes$race <- as.factor(race)
type=factor(1+race+3*(1-nhanes$female)*(race>0)+6*(race==0)) # other race all type 7
levels(type)=c('female, black','female, white','female, Mex','male, black','male, white','male, Mex','other')
nhanes$type=type
nhanesna=!(is.na(apply(nhanes[,c(whichBP,whichBPhome)],1,prod)))
diaslow <- apply(nhanes[,alldias],1,min)<40
diashigh <- apply(nhanes[,alldias],1,max)>140
syslow <- apply(nhanes[,allsys],1,min)<60
syshigh <- apply(nhanes[,allsys],1,max)>250
nhanessysrange=(syslow | syshigh)
nhanesdiasrange=(diaslow | diashigh)
dias0 <- apply(nhanes[,alldias],1,min) == 0
sys0 <- apply(nhanes[,allsys],1,min) == 0
nhanesgood=(nhanesna&(nhanes$yrsfu>0)&nhanes$other==0)
n_original <- dim(nhanes)[1]
n_no_other <- sum(nhanes$other==0)
n_followup <- sum(nhanes$yrsfu>0 & nhanesna & nhanes$other==0)
n_no_followup <- sum(nhanes$yrsfu==0 & nhanesna & nhanes$other==0)
frs=read.csv('Data_raw/FRS.csv')
whichfrs <- '1998' # 'ATP' or '1998'
#Choose versions of FRS
if(whichfrs=='ATP'){nhanes$FRS=frs$ATP.FRS[match(nhanes$SEQN,frs$SEQN)]} else{nhanes$FRS=frs$X1998.FRS[match(nhanes$SEQN,frs$SEQN)]}
#################################################
#
# Add in observer data
#
#################################################
exm=read.csv('Data_raw/Examiners.csv')
nhanes$Exam=exm$PEPTECH[match(nhanes$SEQN,exm$SEQN)]
nhanes$Exam[is.na(nhanes$Exam)]=0
nhanes$Exam[nhanes$Exam == 88888] = 0
## Combine the 88888 examiner with 0
#nhanesA=subset(nhanesA,Exam>0)
examname=sort(unique(nhanes$Exam))
examlist=lapply(examname,function(en) subset(nhanes,Exam==en)$nunq.dias)
examlists=lapply(examname,function(en) subset(nhanes,Exam==en)$nunq.sys)
nhanesA=nhanes[nhanesgood & !nhanessysrange&!nhanesdiasrange, ]
nhanesA$type=factor(nhanesA$type,exclude=7)
N=dim(nhanesA)[1]
k=3
BP_type_names <- c('Systolic','Diastolic')
BP_place_names <- c('Home','Clinic')
whichsys=match(c('systolicA','systolicB','systolicC'),names(nhanesA))
whichdias=match(c('diastolicA','diastolicB','diastolicC'),names(nhanesA))
whichsyshome=match(c('systolicAhome','systolicBhome','systolicChome'),names(nhanesA))
whichdiashome=match(c('diastolicAhome','diastolicBhome','diastolicChome'),names(nhanesA))
whichBP=c(whichsys,whichdias)
whichBPhome=c(whichsyshome,whichdiashome)
# Make BP measures into array
sys=data.matrix(nhanesA[,whichsys])
dias=data.matrix(nhanesA[,whichdias])
sysH=data.matrix(nhanesA[,whichsyshome])
diasH=data.matrix(nhanesA[,whichdiashome])
allBP <- list(Systolic = list(Home = sysH, Clinic = sys), Diastolic = list(Home = diasH, Clinic = dias))
L=length(sys)
gamma_dimnames <- list( c('alpha','theta','beta') , BP_place_names)
norm_dimnames <- c('m_M','m_Delta', 'sigma2_M', 'sigma2_Delta')
```
```{r correlations, include=TRUE, echo=FALSE, message=FALSE, warning=FALSE}
demog.data <- data.frame(Ethnicity = droplevels(nhanesA$race) %>% fct_recode(Black = '1',White = '2',Mexican = '3'),
Sex = factor(nhanesA$female) %>% fct_recode(Male = '0', Female = '1') ,
Exam = as.factor(nhanesA$Exam) ) %>% mutate(Demog = factor(paste(Ethnicity,Sex)))
levels(demog.data$Exam)[levels(demog.data$Exam) == '0'] <- 'Unknown'
# Make a table of means for systolic home BP by ethnicity and sex
bp_table <- list()
bp.data <- data.frame()
which_diff <- function(x) { # for a vector with 3 entries, with two the same, return the one that is different
if (length(unique(x)) == 1 | length(unique(x))==3) {return(0)}
if (x[1] == x[2]) {return(3)}
if (x[1] == x[3]) {return(2)}
return(1)
}
for (BPtype in BP_type_names){
for (BPplace in BP_place_names){
bp.data %<>% rbind(cbind(demog.data, BPtype = factor(BPtype, levels = BP_type_names) , BPplace = factor(BPplace), Mean = apply(allBP[[BPtype]][[BPplace]],1,mean), SD = apply(allBP[[BPtype]][[BPplace]],1,sd),
TotalMean = (apply(allBP[[BPtype]][['Home']],1,mean)+apply(allBP[[BPtype]][['Clinic']],1,mean))/2,
Delta = abs(apply(allBP[[BPtype]][['Home']],1,mean)-apply(allBP[[BPtype]][['Clinic']],1,mean))/2,
# Note: TotalMean and Delta are identical for home and clinic
Number = apply(allBP[[BPtype]][[BPplace]],1,function(x) length(unique(x))),
Which_Diff = apply(allBP[[BPtype]][[BPplace]],1,which_diff)))
}
}
mean_sd_summary <- bp.data %>%
group_by(BPplace,BPtype, Sex, Ethnicity) %>%
summarise(
Mean_of_Mean = mean(Mean, na.rm = TRUE),
Mean_of_SD = mean(SD, na.rm = TRUE)
)
bp.data.cor <- bp.data %>%
group_by(BPtype, BPplace) %>%
summarize(correlationSD = cor(Mean, SD, use = "complete.obs"),correlationDelta = cor(TotalMean, Delta, use = "complete.obs"))
bp.data.cor2 <- bp.data %>%
group_by(BPtype,BPplace,Demog) %>%
summarize(correlationSD = cor(Mean, SD, use = "complete.obs"),correlationDelta = cor(TotalMean, Delta, use = "complete.obs"))
# Make a table of last digit fractions for bp by type and place
bp_last_digit <- matrix(0,nrow=0,ncol=5)
for (BPtype in BP_type_names){
for (BPplace in BP_place_names){
digit_table <- unname(table(allBP[[BPtype]][[BPplace]]%%10))
digits=digit_table/sum(digit_table)
bp_last_digit %<>% rbind(digits)
}
}
bp_last_digit %<>% as_tibble
names(bp_last_digit) <- 2*(0:4)
bp_last_digit$Place <- rep(BP_place_names,each=2)
bp_last_digit$Type <- rep(BP_type_names,2)
bp_last_digit %<>% select(Place,Type,everything())
```
```{r Venn1, echo=FALSE, message=FALSE, warning=FALSE, fig.width=4, fig.height=4, fig.align='center', fig.cap='Venn diagram of subjects excluded from the analysis.'}
bprange <- nhanessysrange | nhanesdiasrange
bprange[is.na(bprange)] <- FALSE
v <- draw.quad.venn(
area1=sum(!nhanesna), # Missing BP
area2=sum(nhanes$yrsfu==0), # No follow-up
area3=sum(nhanes$other==1), # Other ethnicity
area4 = sum(bprange), # BP out of range
n12=sum(!nhanesna & nhanes$yrsfu==0),
n23=sum(nhanes$yrsfu==0 & nhanes$other==1),
n13=sum(!nhanesna & nhanes$other==1),
n14 = sum(!nhanesna & bprange),
n24 = sum(nhanes$yrsfu==0 & bprange),
n34 = sum(nhanes$other==1 & bprange),
n124 = sum(!nhanesna & nhanes$yrsfu==0 & bprange),
n134 = sum(!nhanesna & nhanes$other==1 & bprange),
n123=sum(!nhanesna & nhanes$yrsfu==0 & nhanes$other==1),
n234=sum(nhanes$yrsfu==0 & nhanes$other==1 & bprange),
n1234=sum(!nhanesna & nhanes$yrsfu==0 & nhanes$other==1 & bprange),
category = c("Missing", "Follow-up", "Other Ethnicity","BP out of range"),
fill = c("skyblue", "pink1", "mediumorchid", "orange"),
lty = "blank",
cat.pos = c(-20, 20,0,0),
cex = 1
)
```
## Exclusions
There were `r dim(nhanes)[1]` subjects in the initial data set.
Of these `r sum(!nhanesgood)` were excluded because they had missing data or were not followed up, or belonged to the "Other" ethnic group.
This left `r sum(nhanesgood)` subjects for further consideration.
A small number of subjects were excluded because their blood pressure measurements were outside the normal range, as described below in section \ref{sec:BPrange}.
As our method depends on estimating the mortality rates for each demographic group (ethnicity and sex),
we removed the small number of subjects whose ethnic group was given as "Other" (n=`r sum(nhanes$other)`).
(The three included ethnic groups were Mexican American (n=`r sum(nhanes$mexican)`), Black (n=`r sum(nhanes$black)`), and White (n=`r sum(nhanes$white)`).)
In the end there were `r N` subjects in the analysis data set; we will refer to this as the "full population".
A Venn diagram of the different causes of exclusion is given in Figure \ref{fig:Venn1}.
Of these, `r sum(!is.na(nhanesA$FRS))` had a computable FRS score.
We call this the ``FRS population''.
## Exploratory data analysis
The empirical means of the home and clinic measures in the full population are tabulated in Table \ref{tab:summaries}. We note that the home measures are systematically higher than the clinic measures, within every demographic group, with greater differences for subjects who are white or Mexican, and female.
The average difference is about `r sprintf('%.1f',mean(diasH)-mean(dias))` for diastolic and `r sprintf('%.1f', mean(sysH) - mean(sys))` for systolic, which is small compared with the general range of the differences, which have SD of
`r sprintf('%.1f (diastolic) and %.1f (systolic)', sd(diasH-dias),sd(sysH-sys))`.
```{r summaries, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Printing the table using kable
cat(kable(mean_sd_summary,format="latex", escape = F,booktabs = T, digits=1,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias", "Sex", "Ethnicity", 'Mean', 'SD'),caption = 'Summary data for blood pressure') %>% kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
### Correlations between measurements {#sec:correlations}
In Figure \ref{fig:SD-mean} we see that there is relatively little correlation between the empirical SD and the empirical mean for the different BP types and places. This is reassuring, as it avoids the possibility of a collinearity effect confounding the sampling of mean and SD, which are treated as independent covariates in the model.
```{r SD-mean,fig.pos='H', fig.cap='Scatterplot of individual mean BP against individual SD of BP', echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Scatterplot of mean BP against SD of BP from bp.data
ggplot(bp.data, aes(x = Mean, y = SD)) +
geom_point(alpha=.03, color = 'navyblue' ) +
labs(title = "Scatterplot of Mean against SD",
x = "Mean",y = "SD") + ylim(0,15)+
geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line
facet_grid(BPtype ~ BPplace, scales = 'free_x') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(data = bp.data.cor, aes(label = sprintf("Cor: %.2f", correlationSD), x = Inf, y = Inf),
vjust = "top", hjust = "right", inherit.aes = FALSE)
```
<!--#```{r Delta-mean,fig.pos='H', fig.cap='Scatterplot of individual mean BP against individual absolute difference between Clinic and Home mean', echo=FALSE,results='asis',message=FALSE,warning=FALSE}
#bp.sub <- data.frame(Mean=bp.data$TotalMean[bp.data$BPplace=='Clinic'],Delta=bp.data$Delta[bp.data$BPplace=='Clinic'], #BPtype=bp.data$BPtype[bp.data$BPplace=='Clinic']) %>% # Home and clinic are identical
# group_by(BPtype)
# bp.data.cor <- bp.sub %>%
# summarize(correlationDelta = cor(Mean, Delta, use = "complete.obs"))
# # Scatterplot of mean BP against SD of BP from bp.data
# ggplot(bp.sub %>% group_by(BPtype), aes(x = Mean, y = Delta)) +
# geom_point(alpha=.03, color = 'navyblue' ) +
# labs(title = "Scatterplot of Mean against |Delta|",
# x = "Mean",y = "|Delta|") + ylim(0,20)+
# geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line
# facet_grid(~ BPtype, scales = 'free_x') +
# theme(plot.title = element_text(hjust = 0.5)) +
# geom_text(data = bp.data.cor2, aes(label = sprintf("Cor: %.2f", correlationDelta), x = Inf, y = Inf),
# vjust = "top", hjust = "right", inherit.aes = FALSE)
-->
```{r Delta-mean,include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
DeltaSD <- abs(cbind(nhanesA$sysDel,nhanesA$diasDel))  # absolute semi-differences between home and clinic
MeanSD <- cbind(nhanesA$meansys,nhanesA$meandias)      # overall means
colnames(DeltaSD) <- c('SysDelta','DiasDelta')
colnames(MeanSD) <- c('SysMean','DiasMean')
cor_table <- cor(DeltaSD,MeanSD,use='complete.obs')
cat(kable(cor_table,
caption = 'Correlation between mean and Delta. Rows correspond to type of Delta, columns to type of mean.',
format = "latex",
escape = FALSE,
booktabs = TRUE,
digits = 3))
cat('\n')
```
```{r SD-mean2,include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
DiasSDMean <- cbind(nhanesA$sddiasC,nhanesA$meandiasC,nhanesA$sddiasH,nhanesA$meandiasH)
SysSDMean <- cbind(nhanesA$sdsysC,nhanesA$meansysC,nhanesA$sdsysH,nhanesA$meansysH)
allSD <- with(nhanesA,cbind(sdsysC,sdsysH,sddiasC,sddiasH))
allMean <- with(nhanesA,cbind(meansysC,meansysH,meandiasC,meandiasH))
colnames(allSD) <- c('Clinic Sys SD','Home Sys SD','Clinic Dias SD','Home Dias SD')
colnames(allMean) <- c('Clinic Sys Mean','Home Sys Mean','Clinic Dias Mean','Home Dias Mean')
cor_table2 <- cor(allSD,allMean,use='complete.obs')
cat(kable(cor_table2,
caption = 'Correlation between mean and SD. Rows correspond to type and location of SD, columns to type and location of mean.',
format = "latex",
escape = FALSE,
booktabs = TRUE,
digits = 3) %>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
In Table \ref{tab:Delta-mean} we show the correlations between overall mean and absolute difference ($|\Delta|$) between clinic and home measurements.
The results are given as a $2\times2$ table, showing correlations within systolic and diastolic BP, and between the two.
The only moderately high correlation is between Systolic mean and Diastolic absolute Delta, which would correspond to a Variance Inflation Factor of `r round(1/(1-cor_table["DiasDelta","SysMean"]^2),2)`.
While a VIF is not directly relevant to the present Bayesian methodology, its modest size suggests that this correlation should not substantially affect the estimation of the model coefficients.
In Table \ref{tab:SD-mean2} we show the correlations between the per-subject mean and standard deviation of the three repeated BP measurements, considering all combinations of (Clinic, Home) and (Systolic, Diastolic).
Finally, Table \ref{tab:SD-mean3} shows the correlations between systolic and diastolic, ranging over (Clinic,Home) and (Mean,SD).
(Some of the numbers here of course duplicate those in Table \ref{tab:SD-mean2}.)
Again, the correlations are too low to require any special treatment.
```{r SD-mean3,include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
DiasSDMean <- cbind(nhanesA$sddiasC,nhanesA$meandiasC,nhanesA$sddiasH,nhanesA$meandiasH)
SysSDMean <- cbind(nhanesA$sdsysC,nhanesA$meansysC,nhanesA$sdsysH,nhanesA$meansysH)
colnames(DiasSDMean) <- c('Clinic Dias SD','Clinic Dias Mean','Home Dias SD','Home Dias Mean')
colnames(SysSDMean) <- c('Clinic Sys SD','Clinic Sys Mean','Home Sys SD','Home Sys Mean')
cor_table3 <- cor(DiasSDMean,SysSDMean,use='complete.obs')
cat(kable(cor_table3,
caption = 'Correlation between diastolic and systolic summary statistics. Rows correspond to variables and locations for diastolic, columns to variables and locations for systolic.',
format = "latex",
escape = FALSE,
booktabs = TRUE,
digits = 3) %>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
## Errors in blood pressure measurement or recording {#sec:errors}
Errors in blood pressure measurement or recording were found particularly in the home measurements.
While these did not destroy the usefulness of the home measurements, they did require some attention and decisions about how to work with the defects.
We also consider them inherently interesting, and worth registering for future researchers working on these or similar data.
In particular, the problem we have called “dependent replication” was entirely unexpected, although not unprecedented, and is of particular concern to researchers trying to estimate individual variation in clinically relevant measures.
### Last-digit preference {#sec:lastdigit}
A mild tendency for observers to prefer certain last digits in reporting BP measurements has been reported in other studies, though an analysis of the 1999 wave of NHANES reported no last-digit preference [@ostchega2003national].
The last-digit preference in NHANES III, on the other hand, is substantial, with about `r sprintf('%.1f', 100*bp_last_digit$'0'[3])`% of all the clinic-measured systolic BP measurements ending in 0, but only about `r sprintf('%.1f', 100*bp_last_digit$'4'[3]+100*bp_last_digit$'6'[3])`% ending in 4 or 6. Because the shifts due to last-digit preference are presumably small, we expect them to have little influence on the main effects examined in this paper; they do, however, increase the probability of two measurements being rounded to the same value, which needs to be taken into account in examining the problem of dependent replication.
```{r digit-summary, include=TRUE, echo=FALSE,results='asis'}
# Printing the table using kable
cat(kable(bp_last_digit,format="latex", escape = F,booktabs = T, digits=3,col.names=c("Place","Sys/Dias",2*(0:4)),caption = 'Summary data for BP end digits') %>%
kable_styling(latex_options = c("hold_position","striped")))
cat('\n')
```
### Dependent replicates {#sec:pseudorep}
While the protocol calls for each subject to have three independent BP measures taken, it is possible that the observers were influenced by one measure when recording the next.
This could happen in either direction: later measurements could be pulled closer to the first, or there could be an inclination to avoid repeated measures.
This is relevant, because erroneously repeated measures would artificially decrease the variance of the three measurements, and avoiding repeated measures would have the opposite effect.
The end-digit bias may be expected to have an effect here, since it influences the probability of two measurements being rounded to the same value.
We begin by noting the standard deviations for measurements of individual subjects as given in the column 'Mean of SD' in Table \ref{tab:sd-summary}.
The column 'Prob all rep' gives the theoretical probability that all three of the measurements for a subject would have the same value, if the measurements were independent and normally distributed with the given standard deviation (adjusted for the rounding), and assuming that rounding to particular last digits occurs in proportion to the fractions listed in Table \ref{tab:digit-summary}.
The column 'Prob 2 rep' gives the probability that exactly two of the three measurements would have the same value, under the same conditions.
The column 'Frac all rep' gives the observed fraction of subjects for whom all three measurements were equal, and 'Frac 2 rep' gives the fraction for whom two of the three measurements were equal.
The observed fractions for three equal measurements are all very close to the theoretical probabilities, but the observed fractions for two equal measurements are substantially lower than the theoretical probabilities.
(For comparison, a 95\% probability range for the fraction of subjects with two equal measurements is about $\pm 0.008$.)
```{r pseudorep, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Calculate expected number of all with same last digit given 3 observations of sd = S
prob_repeated3 <- function(S, intervalwidth=2){
# Assume overall mean is uniform
# so average over the interval [0,10]
# then probability of same is sum_j (Phi(10j+intervalwidth-x)-Phi(10j-x))^3
  sum(sapply(seq(-1,1), function(j)
    .1*integrate(function(x)
      (pnorm(10*j+intervalwidth-x,sd=S)-pnorm(10*j-x,sd=S))^3, -5, 5)$value))
}
prob_repeated2 <- function(S, intervalwidth=2){
# Assume overall mean is uniform
# so average over the interval [0,10]
# then probability of same is sum_j 3*(Phi(10j+intervalwidth-x)-Phi(10j-x))^2 - 2*prob_repeated3
sum(sapply(seq(-1,1),function(j)
.1*integrate(function(x)
3*(pnorm(10*j+intervalwidth-x,sd=S)-pnorm(10*j-x,sd=S))^2,-5,5)$value) ) - 2*prob_repeated3(S,intervalwidth)
}
sd_summary <- bp.data %>%
group_by(BPplace,BPtype) %>%
summarise(
Mean_of_SD = mean(SD, na.rm = TRUE),
# fraction with all three measurements equal
Fraction3 = mean(Number==1, na.rm = TRUE),
Fraction2 = mean(Number==2, na.rm = TRUE)
)
# Calculate probability of three measurements being equal given unequal interval lengths
prob_repeated3_intervals <- function(S,intervals){
sum(sapply(intervals, function(intervalwidth) prob_repeated3(S,intervalwidth)))
}
prob_repeated2_intervals <- function(S,intervals){
sum(sapply(intervals, function(intervalwidth) prob_repeated2(S,intervalwidth)))
}
sd_summary$Prob_repeated3 <- sapply(1:4, function(j) prob_repeated3_intervals(sqrt(sd_summary$Mean_of_SD[j]^ 2+1/3),10*unlist(bp_last_digit[j,3:7])))
sd_summary$Prob_repeated2 <- sapply(1:4, function(j) prob_repeated2_intervals(sqrt(sd_summary$Mean_of_SD[j]^2+1/3),10*unlist(bp_last_digit[j,3:7])))
sd_summary %<>% relocate(Fraction2, .after = Prob_repeated3)
```
```{r sd-summary, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
#Print table of sd_summary
cat(kable(sd_summary,format="latex", escape = F,booktabs = T, digits=3,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias","Mean of SD","Frac all rep", "Prob all rep", "Frac 2 rep", "Prob 2 rep"),caption = 'Summary data for repeated measures') %>% kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
We show the fraction of subjects with two equal measurements in Figure \ref{fig:examinerPlot}, split by examiner, blocked by place and type.
We see that the fraction of subjects with two equal measurements varies substantially by examiner, and that the variation is greater for the systolic than for the diastolic measurements.
```{r examinerCalc, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# 2 by 2 grid of columns of number of subjects with 2 equal measurements
# by examiner, blocked by place and type
# First, calculate the number of subjects with 2 equal measurements for each examiner
examiner2 <- bp.data %>%
group_by(BPplace,BPtype,Exam) %>%
summarise(
Number2 = sum(Number==2, na.rm = TRUE),
Number3 = sum(Number==1, na.rm = TRUE),
Total = n(),
)
examiner2 %<>% mutate(Frac2 = Number2/Total,
Frac3 = Number3/Total,
SD2 = sqrt(Frac2*(1-Frac2)/Total),
SD3 = sqrt(Frac3*(1-Frac3)/Total))
```
```{r examinerPlot, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE,fig.cap="Fraction of subjects with 2 equal measurements by examiner, blocked by place and type. Red band shows 95% probability range. Vertical green dashed line shows expected fraction; blue dotted line shows observed fraction over all examiners."}
ggplot(examiner2, aes(x=Number2/Total,y=Exam)) +
geom_point() +
geom_errorbarh(aes(xmin=Frac2 - 2*SD2,xmax=Frac2 + 2*SD2,col='red',height=.3)) +
facet_grid(BPplace~BPtype) +
labs(x="Fraction of subjects with 2 equal measurements",y="Examiner") +
theme_bw() +
# add dashed vertical line at corresponding Prob_repeated2 for that place and type
geom_vline(data= sd_summary, aes(xintercept=Prob_repeated2),linetype="dashed", col= 'darkolivegreen') +
geom_vline(data= sd_summary, aes(xintercept=Fraction2),linetype="dotted", col= 'navyblue') +
theme(axis.text.y = element_text(size=10),
axis.text.x = element_text(size=10),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14),
strip.text.y = element_text(size=12),
strip.text.x = element_text(size=12),
strip.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
legend.position = "none"
)
```
```{r examinerPlot3, include=TRUE, echo=FALSE,message=FALSE,warning=FALSE,fig.cap="Fraction of subjects with 3 equal measurements by examiner, blocked by place and type. Red band shows 95% probability range. Vertical green dashed line shows expected fraction; blue dotted line shows observed fraction over all examiners."}
# do the same for 3 equal measurements
stripchart3 <- ggplot(examiner2,
aes(x=Number3/Total,y=Exam)) +
geom_point() +
geom_errorbarh(aes(xmin=pmax(0,Frac3 - 2*SD3),xmax=Frac3 + 2*SD3,col='red',height=.3)) +
facet_grid(BPplace~BPtype) +
labs(x="Fraction of subjects with 3 equal measurements",y="Examiner") +
theme_bw() +
# add dashed vertical line at corresponding Prob_repeated3 for that place and type
geom_vline(data= sd_summary, aes(xintercept=Prob_repeated3),linetype="dashed", col= 'darkolivegreen') +
geom_vline(data= sd_summary, aes(xintercept=Fraction3),linetype="dotted", col= 'navyblue') +
theme(axis.text.y = element_text(size=10),
axis.text.x = element_text(size=10),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14),
strip.text.y = element_text(size=12),
strip.text.x = element_text(size=12),
strip.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
legend.position = "none"
)
stripchart3
```
In Figure \ref{fig:examinerPlot3}, we show the fraction of subjects with three equal measurements, by examiner, blocked by place and type.
Relative to the expected random fluctuations, we see that there is even more variation among the examiners.
One examiner (3001) produced consistently excessive numbers of triple repeats in Home measurements, and a deficit of triple repeats in Clinic measurements.
One further point to explore is the position of the two equal measures in a group of three.
If there are three independent measures, with two equal, each of the three has equal probability of being the odd one out.
On the other hand, if there is a trend in the measurements, then the second is least likely to be the odd one out.
In fact, what we observe is that it is the third measurement that is least likely to differ from the other two, while the first is most likely.
This is what we would expect if examiners sometimes either intentionally copied the second measurement into the space for the third, or unintentionally allowed themselves to be influenced into observing the same number.
The proportions are listed in Table \ref{tab:proportionChisq}, together with chi-squared tests for difference from the expected equal proportions for each site and type.
Neither the equal-probability pattern nor the trend pattern is what we observe.
We see that there is a huge deviation from the expected proportions in the Home measurements, but less in the Clinic measurements, and more deviation in Systolic than in Diastolic measurements.
```{r proportionChisq, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# make a table of bp.data$Which_Diff blocked by BPplace and BPtype
b0 <- subset(bp.data,Which_Diff>0)
table1 <- as.data.frame(table(b0$Which_Diff,b0$BPplace,b0$BPtype,b0$Exam))
names(table1) <-c("Which_Diff","BPplace","BPtype","Exam","Freq")
# add a column for normalized frequencies
table1$Normalized <- ave(table1$Freq,table1$BPplace,table1$BPtype,table1$Exam,FUN=function(x) x/sum(x))
# reshape to wide format (one row per place/type/examiner) and append a reference 'Centre' row with equal proportions
table1 %<>% pivot_wider(names_from = Which_Diff, values_from = c(Normalized,Freq),id_cols = c("BPplace","BPtype","Exam")) %>%
rbind(data.frame(BPplace=rep(c("Home","Clinic"),2),BPtype=rep(c("Systolic","Diastolic"),each=2),Exam='Centre',Normalized_1=1/3,Normalized_2=1/3,Normalized_3=1/3,Freq_1=1,Freq_2=1,Freq_3=1))
# Perform chi-square test for difference between observed proportions of position of unequal measurement and expect 1/3,1/3,1/3
# Data are in columns Freq_1,Freq_2,Freq_3 of table1
# Expected values are 1/3, 1/3, 1/3
table1 %<>% rowwise() %>% mutate( chisq.val = { observed=c(Freq_1,Freq_2,Freq_3)
expected=c(1/3,1/3,1/3)
unname(chisq.test(x=observed, p=expected)$statistic)},
chisq.p = pchisq( chisq.val, df=2, lower.tail=FALSE))
# make a table of chi-squared for total counts of unequal measurements by place and type
chisq1 <- table1 %>% subset(Exam != 'Centre') %>%
group_by(BPplace,BPtype) %>%
summarise(Freq_1=sum(Freq_1),Freq_2=sum(Freq_2),Freq_3=sum(Freq_3)) %>% rowwise() %>%
mutate(chisq.val = { observed=c(Freq_1,Freq_2,Freq_3)
expected=c(1/3,1/3,1/3)
signif(unname(chisq.test(x=observed, p=expected)$statistic),3)},
chisq.p = format( pchisq( chisq.val, df=2, lower.tail=FALSE), scientific = TRUE, digits = 3) )
cat(kable(chisq1, format="latex", escape = F,booktabs = T,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias","Freq1","Freq2","Freq3","ChiSq","p-value"),
caption="Chi-square test for difference between observed proportions (all examiners), stratified by place and type")%>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
To explore this further, we can look at the proportions of first, second and third measurements from each examiner that are different from the other two.
The results of a chi-squared test for each examiner (stratified by site and type of BP) for difference from the expected equal proportions are shown in Figure \ref{fig:proportionChisq3}.
The dashed line represents a p-value of 0.001.
Here we see that the Home measurements are extremely variable, while the Clinic measurements are quite consistent with the expected proportions, with the single exception of examiner 3004, who is far from the expected equal proportions in all categories of measurement.
```{r proportionChisq3,include=TRUE,echo=FALSE,results='asis',message=FALSE,warning=FALSE,fig.height=7,fig.cap="Proportions of first, second and third measurements from each examiner that are different from the other two, by place and type. Chi-squared value for difference from expected proportions. Dashed line represents p-value 0.001."}
ggplot(subset(table1,Exam != 'Centre'),
aes(y=chisq.val,x=Exam))+
geom_point() +
facet_grid(BPplace~BPtype) +
labs(y="Chi-square value for difference from expected proportions",x="Examiner") +
theme_bw() +
# add dashed horizontal line corresponding to p=0.001
geom_hline(yintercept=qchisq(0.999,df=2),linetype="dashed",col='navyblue')+
theme(axis.text.y = element_text(size=10),
axis.text.x = element_text(size=10, angle=90, vjust =.5, hjust =1),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14),
strip.text.y = element_text(size=12),
strip.text.x = element_text(size=12),
panel.spacing = unit(0.5, "lines"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(color="black", fill=NA, size=0.5),
panel.background = element_blank(),
legend.position = "none")
```
Given that the position of the differing measure clearly differs from the expected equal proportions, we might ask whether the examiners agree on a common proportion, suggesting that there might be some underlying systematic (observer-independent) reason for the differing measurements.
In Table \ref{tab:ternaryChi2} we show the results of a chi-squared test for equality of observed proportions among the examiners, stratified by place and type.
Interestingly, we see here that the examiners are fairly consistent in their proportions for the Home measures, but not for the Clinic measures.
```{r ternaryChi2, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Chi-square test for difference between observed proportions and expected proportions for the table stratified by place and type
chisq2 <- table1 %>% subset(Exam != 'Centre') %>%   # exclude the artificial 'Centre' reference row
  group_by(BPplace,BPtype) %>% summarise(ChiSq=signif(chisq.test(cbind(Freq_1,Freq_2,Freq_3))$statistic,3),
  p.value=format(chisq.test(cbind(Freq_1,Freq_2,Freq_3))$p.value,scientific=TRUE,digits=3))
# Print table of chisq2
cat(kable(chisq2, format="latex", escape = F,booktabs = T,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias","ChiSq","p-value"),
caption="Chi-square test for difference between observed proportions among the examiners, stratified by place and type")%>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
```{r ternaryPlot, include=TRUE,fig.height=8,fig.width=8, echo=FALSE,results='asis',message=FALSE,warning=FALSE,fig.cap="Ternary plot of the position of the measurement that is unique, among subjects with 2 equal measurements. V1 is the fraction with the first distinct, V2 is the fraction with the second distinct, V3 is the fraction with the third distinct."}
# make a ternary plot of the normalized frequencies
size_values <- setNames(rep(1.5, length(levels(table1$Exam))), levels(table1$Exam))
size_values["Centre"] <- 3 # make the centre bigger
size_values['3004'] <- 3 # make the outlier bigger
color_values <- setNames(c14, levels(table1$Exam)) # set the colours
ggtern(data=table1, aes(x=Normalized_1, y=Normalized_2, z=Normalized_3)) +
geom_point(aes(fill=Exam, size=Exam),alpha=.75, shape=21) +
theme_bw() +
facet_grid(BPplace~BPtype) +
theme(tern.axis.text.L = element_text(size=10),
tern.axis.text.R = element_text(size=10),
tern.axis.text.T = element_text(size=10),
tern.axis.title.L = element_text(size=14, vjust=1,hjust=.5,face='bold'),
tern.axis.title.R = element_text(size=14, vjust=1,hjust=.5,face='bold'),
tern.axis.title.T = element_text(size=14, vjust=1,hjust=.5,face='bold'),
strip.text.y = element_text(size=14),
strip.text.x = element_text(size=14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
) + labs(L="V1", T="V2", R="V3") +
theme_showgrid()+ theme(legend.position ='bottom') +
scale_size_manual(values=size_values, name='Exam') +
scale_fill_manual(values=color_values, name='Exam') +
guides(size=guide_legend(override.aes=list(size=5)))
```
Looking at the ternary plot in Figure \ref{fig:ternaryPlot} of the proportions from the 13 different examiners, we see very clearly the bias toward having the last two measures agree for almost all examiners, with examiner 3004 (marked larger) standing out as a clear outlier.
Overall, we can only conclude that there are clearly some irregularities in the BP measurement process, but we cannot identify a specific structure to them, or propose a remedy.
As the irregularities are not very large, we will proceed with the analysis without attempting to correct for them.
### Missing or implausible measurements {#sec:BPrange}
Some of the reported measures were extremely implausible, particularly for diastolic BP. `r sum(dias0)` subjects had at least one diastolic BP measure recorded as 0, in addition to the `r sum(!nhanesna)` subjects who were missing at least one measurement. We excluded all of these subjects, and indeed any subject who had at least one measurement recorded outside the ranges (40,140) for diastolic and (60,250) for systolic BP, as recommended by the CDC [@CDCBP].
There was just one subject with systolic BP measures that were too low, but `r sum(nhanesdiasrange)` subjects with low diastolic BP (in addition to those with measures recorded as 0).
One subject was excluded for diastolic BP 156, and three were excluded for systolic BP that was too high, with the maximum being 264.
\newpage
<!-- ## Exploratory data analysis -->
<!-- The empirical means of the home and clinic measures in population B are tabulated in Table 1. We note that the home measures are systematically higher than the clinic measures, within every demographic group, with greater differences for subjects who are white or Mexican, and female. -->
<!-- The average difference is about `r sprintf('%.1f',mean(diasH)-mean(dias))` for diastolic and `r sprintf('%.1f', mean(sysH) - mean(sys))` for systolic, which is small compared with the general range of the differences, which have SD of -->
<!-- `r sprintf('%.1f (diastolic) and %.1f (systolic)', sd(diasH-dias),sd(sysH-sys))`. -->
<!-- ```{r summaries, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE} -->
<!-- # Printing the table using kable -->
<!-- cat(kable(mean_sd_summary,format="latex", escape = F,booktabs = T, digits=1, -->
<!-- linesep = rep(c(rep("",5),"\\addlinespace"),4), -->
<!-- col.names=c("Sys/Dias", "Place","Sex", "Ethnicity", 'Mean', 'SD'),caption = 'Summary data for blood pressure') %>% kable_styling(latex_options = "hold_position") ) -->
<!-- cat('\n') -->
<!-- ``` -->
<!-- ### Correlations between measurements {#sec:correlations} -->
<!-- We see that there is relatively little correlation between empirical SD and empirical mean SD for the different BP types and places. This is reassuring, as it avoids the possibility of a collinearity effect confounding the sampling of mean and SD, which are being treated as independent covariates in the model. -->
<!-- ```{r SD-mean, dpi = 1,fig.pos='H', fig.width=8,fig.height=6, fig.cap='2D density plot of individual mean BP against individual SD of BP', echo=FALSE,results='asis',message=FALSE,warning=FALSE} -->
<!-- # 2D density plot of mean BP against SD of BP from bp.data -->
<!-- ggplot(bp.data, aes(x = Mean, y = SD)) + -->
<!-- geom_bin2d() + -->
<!-- labs(title = "2D Density Plot of Mean against SD", -->
<!-- x = "Mean",y = "SD") + ylim(0,15)+ -->
<!-- scale_fill_continuous(type = "viridis") + -->
<!-- theme_bw()+ -->
<!-- geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line -->
<!-- facet_grid(BPtype ~ BPplace, scales = 'free_x') + -->
<!-- theme(plot.title = element_text(hjust = 0.5)) + -->
<!-- geom_text(data = bp.data.cor, aes(label = sprintf("Cor: %.2f", correlation), x = Inf, y = Inf), -->
<!-- vjust = "top", hjust = "right", inherit.aes = FALSE) -->
<!-- ``` -->
# Appendix B -- Model details
This appendix adds more detail about the numerical modelling than was provided in the article, to ensure that the research methods are transparent and fully reproducible. The numerical modelling presented in this paper was performed using R combined with RStan. We provide more detail here about the model and the specific methodology used to parameterize it, together with results that were not included in the main text.
## The Statistical Model
The model used in this research is built on the theory of joint modelling of longitudinal and time-to-event data, described in detail later in this section. In brief, it allows the simultaneous modelling of the longitudinal observations (here, the blood pressure measurements) and the time-to-event outcome.
In this research the event of interest is either death from any cause, or death specifically from cardiovascular or cerebrovascular causes; we will henceforth refer to the latter as CVD mortality.
In the latter case, death from a different cause is treated as a noninformative censoring event.
### Survival Analysis (Time-to-Event)
The basic survival model is a Gompertz hazard rate with proportional hazards influences of the blood pressure covariates.
The Gompertz equation
\begin{equation}\label{gompertz}
h_0(t)=B\exp{\left(\theta(x+T)\right)},
\end{equation}
describes the baseline hazard of the population for a particular risk, which in this article is either CVD mortality specifically or all-cause mortality. Here $x\in\mathbb{N}^N$ is the age of the individual at the initial interview, with $N$ the number of individuals, and $T\in\mathbb{R}^{+,N}$ is the time since the individual entered the survey.
Note that both $B$ and $\theta$ take 6 different values, depending on the sex reported at the initial interview --- female or male --- and the ethnic group --- Black, White or Mexican American.
(Subjects whose ethnicity was reported as "Other" were excluded from the analysis, as described in Appendix A.)
The log-linear proportional hazards model links the covariates of the model (mean systolic blood pressure, variance in the diastolic blood pressure, etc) to the survival outcome of the individual via the equation
\begin{equation}\label{prophaz}
h(t)=h_0(t)\exp{\left(\boldsymbol{\beta}\cdot(\boldsymbol{X}-\hat{\boldsymbol{X}})\right)},
\end{equation}
where $\boldsymbol{X}\in\mathbb{R}^{+,N\times d}$ is the matrix of blood pressure summary statistics that serve as individual covariates in our model, $\hat{\boldsymbol{X}}\in\mathbb{R}^{+,d}$ is a centering vector chosen so that $\sum_{i=1}^N \boldsymbol{\beta}\cdot(\boldsymbol{X}_i-\hat{\boldsymbol{X}})\approx 0$ (more on this later), and $\boldsymbol{\beta}\in\mathbb{R}^d$ gives the strength of the influence of each covariate on the mortality risk.
For the majority of subjects the death time is censored --- not yet observed at the close of follow-up --- with the censoring indicator denoted $\delta\in \{0,1\}$.
When CVD mortality is the event being analysed, deaths due to other causes are treated as noninformative censoring events.
In this study, we explored the following covariates:
```{r, include=FALSE,message=FALSE,warning=FALSE}
DF<-data.frame(varnames=c("$FRS-1998$",
"$FRS-ATP$",
"$M_S$",
"$M_D$",
"$\\Delta_S$",
"$\\Delta_D$",
"$\\sigma_{\\{S,H\\}}$",
"$\\sigma_{\\{D,H\\}}$",
"$\\sigma_{\\{S,C\\}}$",
"$\\sigma_{\\{D,C\\}}$",
"$\\tau_{\\{S,H\\}}$",
"$\\tau_{\\{D,H\\}}$",
"$\\tau_{\\{S,C\\}}$",
"$\\tau_{\\{D,C\\}}$"),
support=c("$R^N$",
"$R^N$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$"),
desc=c("1998 version of the FRS score",
"ATP version of the FRS score",
"Mean systolic blood pressure",
"Mean diastolic blood pressure",
"Semi-difference between Home and Clinic mean systolic blood pressure",
"Semi-difference between Home and Clinic mean diastolic blood pressure",
"Standard deviation of the systolic blood pressure taken at home",
"Standard deviation of the diastolic blood pressure taken at home",
"Standard deviation of the systolic blood pressure taken at the clinic",
"Standard deviation of the diastolic blood pressure taken at the clinic",
"Precision of the systolic blood pressure taken at home",
"Precision of the diastolic blood pressure taken at home",
"Precision of the systolic blood pressure taken at the clinic",
"Precision of the diastolic blood pressure taken at the clinic"))
```
```{r, echo=F, label= "runnumers",message=FALSE,warning=FALSE,results='asis'}
cat(kable(DF,col.names = c("Variable Name", "Support", "Description"),
format = "latex",
caption="Explanations of the different models simulated in this work, according to run number.",
escape = FALSE,
booktabs = TRUE,
linesep = "",
digits = 3) %>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
Please note that the last four elements of this list, the precision values, were included only to check that the model gives consistent results when precision is used in place of standard deviation.
Note as well that the $\Delta$ covariates, representing the medium-term variability, enter into the log relative risk sum as **absolute values**.
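As a concrete (non-executed) illustration of how these covariates relate to the per-subject summaries computed in Appendix A, the following sketch assembles a covariate matrix from the `nhanesA` columns defined in the data-cleaning chunk; the centering by column means is our assumption here, standing in for the centering $\hat{\boldsymbol{X}}$ described below.
```{r covariate-sketch, eval=FALSE}
# Sketch only (not the production code): assemble the covariate matrix X for the
# mean-BP models, entering the Delta covariates as absolute values, and centre it.
X <- with(nhanesA, cbind(
  M_S      = meansys,       # mean systolic BP
  M_D      = meandias,      # mean diastolic BP
  Delta_S  = abs(sysDel),   # |semi-difference| of home vs clinic, systolic
  Delta_D  = abs(diasDel),  # |semi-difference| of home vs clinic, diastolic
  sigma_SH = sdsysH, sigma_DH = sddiasH,   # home SDs
  sigma_SC = sdsysC, sigma_DC = sddiasC))  # clinic SDs
X_hat <- colMeans(X, na.rm = TRUE)  # centering vector, playing the role of X-hat
X_c   <- sweep(X, 2, X_hat)         # centred covariates used in the linear predictor
```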
For the parametrization of this model, we assume that the Gompertz parameters and the parameters in the linear predictor term have prior distributions as follows:
\begin{equation}\label{priorsS}
\begin{aligned}
\boldsymbol{B}\sim\mathcal{N}(\mu_B,\sigma_B),\\
\boldsymbol{\theta}\sim\mathcal{N}(\mu_\theta,\sigma_\theta),\\
\boldsymbol{\beta}\sim \mathcal{N}(\mu_\beta,\sigma_\beta),
\end{aligned}
\end{equation}
The likelihood for this Gompertz proportional hazards model, over all individuals in the census, is as follows:
\begin{equation}\label{likesurv}
L_S(\boldsymbol{v},\boldsymbol{\delta})=\prod_{i=1}^N f(v_i,\delta_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}})=\left(\prod_{i=1}^N h(v_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}})^{\delta_i}\right) \exp{\left( -\sum_{i=1}^N H(v_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}}) \right)},
\end{equation}
with $H(v)=\int_0^v h(w) \mathrm{d}w$ the cumulative hazard.
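To make equations \ref{gompertz}--\ref{likesurv} concrete, here is a minimal R sketch of the Gompertz proportional-hazards log-likelihood for a single demographic group; the function names and vectorization are ours for illustration, not the Stan code used for the actual fits.
```{r gompertz-sketch, eval=FALSE}
# Baseline Gompertz hazard and its cumulative hazard, for age-at-interview x
# and time-in-study t: h0(t) = B exp(theta (x + t)).
gompertz_haz <- function(t, x, B, theta) B * exp(theta * (x + t))
gompertz_cumhaz <- function(t, x, B, theta)
  (B / theta) * (exp(theta * (x + t)) - exp(theta * x))
# Log-likelihood of equation (likesurv): v = follow-up time, delta = event
# indicator, lp = linear predictor beta . (X_i - X_hat).
loglik_surv <- function(v, delta, x, lp, B, theta) {
  h <- gompertz_haz(v, x, B, theta) * exp(lp)     # hazard at the observed time
  H <- gompertz_cumhaz(v, x, B, theta) * exp(lp)  # cumulative hazard to that time
  sum(delta * log(h) - H)
}
```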
### Longitudinal Modelling
The mortality hazard rates are assumed to be influenced by individual-level blood pressure means and variability characteristics.
These characteristics are not directly observed, but are inferred from their influence on the individual blood pressure measurements, which have been observed.
Let $Y_i(t_j)$ be the observed blood pressure for subject $i$ at time $t_j$, for individuals $i\in \{1,2,...,N\}$ and measurements $j\in \{1,2,...,k\}$ per individual. Because the blood pressure measurements were taken both at home and at the clinic (written using subscripts H and C, respectively), with approximately 6 months between the two visits, we model the diastolic $Y_{i}^D$ and systolic $Y_{i}^S$ blood pressure as Gaussian-distributed:
\begin{equation}\label{bp}
\begin{aligned}
(Y_{i}^D)_{H} \sim \mathcal{N}(M_i^D+\Delta_i^D,(\sigma_i^D)_H),\\
(Y_{i}^D)_{C} \sim \mathcal{N}(M_i^D-\Delta_i^D,(\sigma_i^D)_C),\\
(Y_{i}^S)_{H} \sim \mathcal{N}(M_i^S+\Delta_i^S,(\sigma_i^S)_H),\\
(Y_{i}^S)_{C} \sim \mathcal{N}(M_i^S-\Delta_i^S,(\sigma_i^S)_C),
\end{aligned}
\end{equation}
where superscripts $D$ and $S$ refer to diastolic and systolic blood pressure, respectively.
The blood pressure characteristics --- the individual-level parameters --- are themselves distributed according to a hierarchical model, determined by population-level parameters (also called ``hyperparameters''):
\begin{equation}\label{priorsL}
\begin{aligned}
M_i^{\{D,S\}}\sim \mathcal{N}(\mu_M^{\{D,S\}},\sigma_M^{\{D,S\}}),\\
\Delta_i^{\{D,S\}}\sim \mathcal{N}(\mu_D^{\{D,S\}},\sigma_D^{\{D,S\}}),\\
\sigma_{i,C}^{\{D,S\}}\sim \Gamma(r_C^{\{D,S\}},\lambda_C^{\{D,S\}}),\\
\sigma_{i,H}^{\{D,S\}}\sim \Gamma(r_H^{\{D,S\}},\lambda_H^{\{D,S\}}).
\end{aligned}
\end{equation}
The longitudinal outcome modelling therefore aims to infer these hyperparameters
\begin{equation}
\Theta=\left\{\mu_M^{\{D,S\}},\mu_D^{\{D,S\}},\sigma_M^{\{D,S\}},\sigma_D^{\{D,S\}},r_C^{\{D,S\}},\lambda_C^{\{D,S\}},r_H^{\{D,S\}},\lambda_H^{\{D,S\}}\right\},
\end{equation}
and to use the implied uncertainty about the individual-level parameters to inform the inference about the survival parameters.
The likelihood for the longitudinal measurements is therefore (combining the systolic and diastolic into a single parameter for simplicity):
\begin{equation}\label{likelong}
L_L(\Theta|Y)=\prod_{i=1}^N\left(\prod_{j=1}^{k}f(y_{ij}|M_i,\Delta_i,\sigma_i)\right)f(M_i|\mu_M,\sigma_M)f(\Delta_i|\mu_D,\sigma_D)f(\tau_{i,C}|r_C,\lambda_C)f(\tau_{i,H}|r_H,\lambda_H)
\end{equation}
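As a sanity check on the notation in equations \ref{bp} and \ref{priorsL}, the following sketch simulates one subject's three home and three clinic systolic readings from the hierarchical model; all hyperparameter values below are arbitrary placeholders, not fitted values.
```{r longitudinal-sketch, eval=FALSE}
# Simulate one subject's systolic readings under the hierarchical model.
set.seed(1)
k <- 3                                      # measurements per visit
M_i      <- rnorm(1, mean = 125, sd = 18)   # individual mean level (mu_M, sigma_M)
Delta_i  <- rnorm(1, mean = 2,   sd = 5)    # home/clinic semi-difference (mu_D, sigma_D)
sigma_iH <- rgamma(1, shape = 4, rate = 1)  # home within-visit SD (r_H, lambda_H)
sigma_iC <- rgamma(1, shape = 4, rate = 1)  # clinic within-visit SD (r_C, lambda_C)
Y_home   <- rnorm(k, mean = M_i + Delta_i, sd = sigma_iH)
Y_clinic <- rnorm(k, mean = M_i - Delta_i, sd = sigma_iC)
```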
### Combined Hierarchical Model
Combining the longitudinal outcome and time-to-event partial likelihoods, and for a given parameter space value of $\Omega=\{\beta,B,\theta\}\cup \Theta$, the joint likelihood is
\begin{equation}
\begin{split}
L(\Omega|Y)=\prod_{i=1}^N\left(\prod_{j=1}^{k}f(y_{ij}|M_i,\Delta_i,\sigma_i)\right)f&(v_i,\delta_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}})f(M_i|\mu_M,\sigma_M)\\
&f(\Delta_i|\mu_D,\sigma_D)f(\tau_{i,C}|r_C,\lambda_C)f(\tau_{i,H}|r_H,\lambda_H).
\end{split}
\end{equation}
One approach to estimating the complete set of hyperparameters
\begin{equation}
\Omega_H=\{\mu_B,\sigma_B,\mu_\theta,\sigma_\theta,\mu_\beta,\sigma_\beta,\mu_M^{\{D,S\}},\sigma_M^{\{D,S\}},\mu_D^{\{D,S\}},\sigma_D^{\{D,S\}},r_C^{\{D,S\}},\lambda_C^{\{D,S\}},r_H^{\{D,S\}},\lambda_H^{\{D,S\}}\}
\end{equation}
is to impose a higher-level prior distribution, and use the machinery of Bayesian inference to produce posteriors for everything.
This approach runs into computational difficulties, which have led us to a two-stage `empirical Bayes' approach, where the hyperparameters for the longitudinal model are first fixed by a maximum-likelihood calculation, after which the remaining hyperparameters and individual-level parameters can be estimated with Bayesian machinery.
For the time-to-event parameters we choose flat hyperpriors, selecting the hyperparameters $\mu_B=\mu_\theta=\mu_\beta=0$, $\sigma_B=\sigma_\theta=2$, and $\sigma_\beta=100$.
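For illustration, here is a sketch of what the first-stage maximum-likelihood step could look like for one block of longitudinal hyperparameters (the Gamma law for the clinic systolic SDs); `MASS::fitdistr` is used here only as an example routine and is not necessarily what the production pipeline calls.
```{r eb-sketch, eval=FALSE}
# First-stage (empirical Bayes) sketch: fit the Gamma hyperparameters for the
# clinic systolic SDs by maximum likelihood, using the per-subject SDs from Appendix A.
sd_obs <- nhanesA$sdsysC[nhanesA$sdsysC > 0]  # Gamma support excludes exact zeros
fit <- MASS::fitdistr(sd_obs, densfun = "gamma")
fit$estimate  # shape (r_C) and rate (lambda_C) for sigma_{S,C}
```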
### The modelling variants
In this article, we investigated 16 variants of the model-fitting problem, but focussed mainly on 8 of them.
The 8 main models use the standard deviation, $\sigma$, as the measure of the influence of blood-pressure variability on mortality.
We also produced the same 8 models but using precision, $\tau=1/\sigma^2$, as the measure of the influence of blood-pressure variability on mortality.
However, these were fitted only to confirm that the choice of one measure over the other made no material difference.
Throughout the remainder of this appendix, we refer to the 8 main models using the following run numbers:
\begin{enumerate}
\item All participants (14,654), using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as death specifically from CVD.
\item All participants (14,654), using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as all-causes of death.
\item Only participants who had data from which FRS values could be computed (N=9,008) --- the ``FRS population'' --- but using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as death specifically from CVD.
\item FRS population, but using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as all-causes of death.
\item FRS population, and using the FRS ATP-III value in the linear predictor term, with the outcome data as death specifically from CVD.
\item FRS population, and using the FRS ATP-III value in the linear predictor term, with the outcome data as all-causes of death.
\item FRS population, and using the FRS 1998-version value in the linear predictor term, with the outcome data as death specifically from CVD.
\item FRS population, and using the FRS 1998-version value in the linear predictor term, with the outcome data as all-causes of death.
\label{tab:runnums}
\end{enumerate}
We also include Directed Acyclical Graph (DAG) sketches to help visualize the different models, as shown in figures \ref{fig:DAGmean} and \ref{fig:DAGFRS}.
In order to read the DAGs, note that each square background layer appearing as a stack of layers represents repeated measured outcomes from the first wave of the survey.
Measured outcome variables are represented by square text boxes, and model parameters by circular text boxes. If a square or circular text box is placed on top of a stacked rectangular layer, this means that multiple values of that variable (as many as there are layers in the stack) are either measured (for outcome variables) or simulated (for model parameters). The number of layers in a stack is written in an unframed text box displayed on top of the stack it describes, for example $i=1,...,N$.
Finally, the direction of the arrows indicates the causal structure assumed in the model.
The distributions of the blood pressure parameters in the population are derived from the model, and are summarised together with other model outputs in Table \ref{tab:Mean-SD} of Appendix C.
![An illustration of the DAG of the mean blood pressure-based model presented in this article. ](./DAG_Mean2.png){#fig:DAGmean}
![An illustration of the DAG of the FRS-based model presented in this article. ](./DAG_FRS2.png){#fig:DAGFRS}
<!--
For convenience, we provide an overview of the different blood pressure values in the full and FRS populations in tables \ref{tab:BloodFull} and \ref{tab:BloodFRS}.
```{r BloodFull,echo=F,message=FALSE,warning=FALSE}
Blood<-read_csv("./Results/BloodVals.csv",show_col_types = FALSE)
kable(Blood[Blood$Population=="Full",-1] ,
format="latex", escape = F,booktabs = T, caption = "Parameters for distribution of blood pressure, for the full population") %>%
row_spec(4, extra_latex_after = "\\addlinespace")
```
```{r BloodFRS,echo=F,message=FALSE,warning=FALSE}
Blood<-read_csv("./Results/BloodVals.csv",show_col_types = FALSE)
kable(Blood[Blood$Population=="FRS",-1],
format="latex", escape = F,booktabs = T, caption = "Parameters for distribution of blood pressure, for the FRS population") %>%
row_spec(4, extra_latex_after = "\\addlinespace")
```