---
title: "Nhanes 2021 Blood Pressure-Based Mortality Risk - Appendix A"
author: "Rscripts by Hamish Patten, DW Bester and David Steinsaltz"
date: "21/07/2022"
---
- Do all the plots with FRS 1998 and not ATP
- include effective sample size and Rhat value
- ROC curves:
  - get rid of the second plot on the right
  - on the y-axis, make plots of BOTH TPR:FPR and (TPR-FPR)/TPR
  - focus on CVDHrt and all, with each having: one with just demographics (all betas are zero), one with FRS, one with deltas, and one with both FRS and deltas (so using run numbers 5 & 6 and 7 & 8)
  - x-axis is the number of years since the study started, y-axis is the age at the end of the survey
  - demographics only, then FRS, deltas and both
  - also note how many people per parallelogram and how many deaths
  - x-axis as time T since the start of the census, y-axis as age at the end of the survey
  - intensity of colour represents the number of people, colour represents AUC
  - USE THE ORIGINAL DATA AND NOT THE POSTERIOR DISTRIBUTION SAMPLES
- Push back on adding the FRS 1998 values as they correlated better with mortality risk for both CVDHrt and all
This appendix adds more detail about the numerical modelling than was provided in the article, to ensure that the research methods are transparent and fully reproducible. The numerical modelling required to build and parameterize the blood-pressure-related mortality risk model presented in this paper was performed in R, together with RStan. More detail is provided here about the model and the specific methodology used to parameterize it, along with additional results that were not included in the main body of the article.
The model used in this research is built on the theory of joint modelling of longitudinal and time-to-event data. This is described in detail later in this section; in brief, it allows the simultaneous modelling of longitudinal observation data (here, blood pressure measurements) and the time-to-event outcome (here, death from any cause or death specifically from Cardio-Vascular Disease (CVD) or heart attack).
The mathematical model applied in this research combines the Gompertz equation with the Cox proportional hazards model. The Gompertz equation
\begin{equation}\label{gompertz}
h_0(t)=\boldsymbol{B}\exp{\left(\boldsymbol{\theta}(x+T)\right)},
\end{equation}
describes the baseline hazard of the population for a particular risk; in this article we investigate mortality from CVD and heart attack specifically, as well as mortality risk in general.
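To make the combined model concrete, the following is a minimal R sketch of the hazard as we read it from the equations in this appendix: the Gompertz baseline multiplied by a Cox-type relative-risk term $\exp(\boldsymbol{\beta}\cdot(\boldsymbol{X}-\hat{X}))$. The interpretation of $x$ as age at entry and $T$ as time since the start of the survey, and the parameter values in the example call, are illustrative assumptions rather than values from the paper.

```r
# Hedged sketch: Gompertz baseline hazard scaled by a Cox-type relative risk.
# B, theta - Gompertz parameters; x - age at entry; t - time since entry;
# X, X_hat - covariate vector and its centering value; beta - coefficients.
hazard <- function(t, x, B, theta, X, X_hat, beta) {
  baseline <- B * exp(theta * (x + t))        # Gompertz baseline hazard
  baseline * exp(sum(beta * (X - X_hat)))     # proportional-hazards scaling
}

# Illustrative call: hazard five years into the study for a 60-year-old at entry
hazard(t = 5, x = 60, B = 1e-5, theta = 0.09,
       X = c(0.4, 1.2), X_hat = c(0, 0), beta = c(0.3, 0.1))
```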
The blood-pressure and FRS covariates used in the model are defined as follows (for the $N$ participants):

- $FRS.1998\in\mathbb{R}^{+,N}$ - FRS score based on the 1998 version,
- $FRS.ATP\in\mathbb{R}^{+,N}$ - FRS score based on the ATP version,
- $M_S\in\mathbb{R}^{+,N}$ - Mean systolic blood pressure,
- $M_D\in\mathbb{R}^{+,N}$ - Mean diastolic blood pressure,
- $\Delta_S\in\mathbb{R}^{+,N}$ - Absolute value of the home-clinic difference in systolic blood pressure,
- $\Delta_D\in\mathbb{R}^{+,N}$ - Absolute value of the home-clinic difference in diastolic blood pressure,
- $\sigma_{S,H}\in\mathbb{R}^{+,N}$ - Standard deviation of the systolic blood pressure taken at home,
- $\sigma_{D,H}\in\mathbb{R}^{+,N}$ - Standard deviation of the diastolic blood pressure taken at home,
- $\sigma_{S,C}\in\mathbb{R}^{+,N}$ - Standard deviation of the systolic blood pressure taken at the clinic,
- $\sigma_{D,C}\in\mathbb{R}^{+,N}$ - Standard deviation of the diastolic blood pressure taken at the clinic,
- $\tau_{S,H}\in\mathbb{R}^{+,N}$ - Precision of the systolic blood pressure taken at home,
- $\tau_{D,H}\in\mathbb{R}^{+,N}$ - Precision of the diastolic blood pressure taken at home,
- $\tau_{S,C}\in\mathbb{R}^{+,N}$ - Precision of the systolic blood pressure taken at the clinic,
- $\tau_{D,C}\in\mathbb{R}^{+,N}$ - Precision of the diastolic blood pressure taken at the clinic.
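As an illustration of how these per-participant summaries might be derived from repeated home and clinic readings, a hedged R sketch follows. The input matrices (`home_sys`, `clinic_sys`, `home_dia`, `clinic_dia`, one row per participant) are hypothetical names, and the exact definitions (for example, whether the means pool home and clinic readings) follow `Dataclean2021.R` rather than this sketch.

```r
# Hedged sketch: deriving the blood-pressure covariates listed above from
# repeated readings. Input matrices (participants x readings) are hypothetical.
bp_covariates <- function(home_sys, clinic_sys, home_dia, clinic_dia) {
  data.frame(
    M_S      = rowMeans(cbind(home_sys, clinic_sys)),            # mean systolic BP
    M_D      = rowMeans(cbind(home_dia, clinic_dia)),            # mean diastolic BP
    Delta_S  = abs(rowMeans(home_sys) - rowMeans(clinic_sys)),   # |home - clinic| systolic
    Delta_D  = abs(rowMeans(home_dia) - rowMeans(clinic_dia)),   # |home - clinic| diastolic
    sigma_SH = apply(home_sys,   1, sd),                         # sd systolic, home
    sigma_DH = apply(home_dia,   1, sd),                         # sd diastolic, home
    sigma_SC = apply(clinic_sys, 1, sd),                         # sd systolic, clinic
    sigma_DC = apply(clinic_dia, 1, sd),                         # sd diastolic, clinic
    tau_SH   = 1 / apply(home_sys,   1, sd),                     # precision = 1/sd, as defined here
    tau_DH   = 1 / apply(home_dia,   1, sd),
    tau_SC   = 1 / apply(clinic_sys, 1, sd),
    tau_DC   = 1 / apply(clinic_dia, 1, sd)
  )
}
```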
Please note that the last four elements of this list, the precision values, were included only to check the model's consistency against the standard-deviation parameterisation. For the parameterisation of this model, we assume that the Gompertz parameters and the parameters in the linear predictor term are distributed as follows:
\begin{equation}\label{priorsS}
\begin{aligned}
\boldsymbol{B}_i&\sim\mathbb{C}(\mu_B,\sigma_B),\\
\boldsymbol{\theta}_i&\sim\mathbb{N}(\mu_\theta,\sigma_\theta),\\
\boldsymbol{\beta}_i&\sim\mathbb{N}(\mu_\beta,\sigma_\beta),
\end{aligned}
\end{equation}
noting that
Alongside the time-to-event prediction, we simultaneously estimate the blood pressure outcomes of each individual throughout the census time. Let
Combining the longitudinal outcome and time-to-event partial likelihoods, and for a given parameter space value of
In this article, we investigated 16 models, focusing on 8 of them. The 8 main models use the standard deviation, $\sigma$, rather than the precision, and are as follows:
\begin{enumerate}\label{runnums}
\item All participants (15,295), using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as death specifically from CVD or heart attack.
\item All participants (15,295), using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as all causes of death.
\item Only participants that had FRS values (9,418), but using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as death specifically from CVD or heart attack.
\item Only participants that had FRS values (9,418), but using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as all causes of death.
\item Only participants that had FRS values (9,418), and using the FRS ATP-III value in the linear predictor term, with the outcome data as death specifically from CVD or heart attack.
\item Only participants that had FRS values (9,418), and using the FRS ATP-III value in the linear predictor term, with the outcome data as all causes of death.
\item Only participants that had FRS values (9,418), and using the FRS 1998-version value in the linear predictor term, with the outcome data as death specifically from CVD or heart attack.
\item Only participants that had FRS values (9,418), and using the FRS 1998-version value in the linear predictor term, with the outcome data as all causes of death.
\end{enumerate}
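For reference when reading the plots and results below, the eight run numbers can be summarised as a small lookup table (a convenience sketch for the reader, not an object from the original scripts):

```r
# Hedged reference table of the eight main (standard-deviation-based) runs.
run_table <- data.frame(
  run       = 1:8,
  subset    = c(rep("all participants (15,295)", 2),
                rep("participants with FRS values (9,418)", 6)),
  covariate = c(rep("mean systolic/diastolic BP", 4),
                rep("FRS ATP-III", 2), rep("FRS 1998", 2)),
  outcome   = rep(c("CVD/heart-attack death", "all-cause death"), 4)
)
run_table
```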
We also include Directed Acyclic Graph (DAG) plots to help visualise the different models, as shown in figures A.1 and A.2. To read the DAGs, note that each stacked square background layer represents repeated measured outcomes made during the census. Outcome variables are represented by square text boxes, and model parameters by circular text boxes. If a square or circular text box is placed on top of a stacked rectangular layer, then multiple values of that variable (as many as there are layers in the stack) are either measured (for outcome variables) or produced (for model parameters). Please note that the number of layers in the stack is written in the unframed text box displayed on top of the stacked layer it represents. For example,
The methodology for this research can be split into three main sections: 1) calculating the empirical Bayes parameters, 2) parameterizing the model using Hamiltonian Monte Carlo (HMC), and 3) re-centering the variables in the linear predictor equation.
We begin by assuming
- \frac{k-1}{2} \sum_{i=1}^n \log s_i^l -\left(\alpha+\frac{k-1}{2}\right) \sum_{i=1}^n \log \left(s_i^l+\frac\alpha\theta\right). \end{equation}
The partial Fisher information has entries
\begin{align*}
-\frac{\partial^2 \ell}{\partial \alpha^2} &= n\psi_1\left(\alpha\right) - n\psi_1\left(\alpha+\frac{k-1}{2}\right) - \frac{n}{\alpha} +\sum_{i=1}^n \frac{2\theta s_i^l + \alpha-(k-1)/2}{(\theta s_i^l + \alpha)^2}, \\
-\frac{\partial^2 \ell}{\partial \theta^2} &= -\frac{n \alpha}{\theta^2} +\frac{\alpha}{\theta^2}\left(\alpha+\frac{k-1}{2}\right)\sum_{i=1}^n \frac{2\theta s_i^l + \alpha}{(\theta s_i^l + \alpha)^2}, \\
-\frac{\partial^2 \ell}{\partial \theta\partial\alpha} &= \frac{n}{\theta}- \frac{1}{\theta} \sum_{i=1}^n \frac{\alpha^2+2\alpha\theta s_i^l+\frac{k-1}{2}\theta s_i^l}{(\theta s_i^l + \alpha)^2},
\end{align*}
where $\psi_1$ is the trigamma function.
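For concreteness, a hedged R sketch of these observed-information entries, evaluated numerically with the base-R `trigamma` function, is given below; the arguments `alpha`, `theta`, `s` and `k` correspond to $\alpha$, $\theta$, the vector of $s_i^l$ values and $k$ above.

```r
# Hedged sketch: numerical evaluation of the partial Fisher information
# entries written out above (alpha, theta: gamma hyperparameters; s: the
# per-individual statistics s_i^l; k: number of repeated measurements).
fisher_info <- function(alpha, theta, s, k) {
  n     <- length(s)
  denom <- (theta * s + alpha)^2
  I_aa <- n * trigamma(alpha) - n * trigamma(alpha + (k - 1) / 2) -
    n / alpha + sum((2 * theta * s + alpha - (k - 1) / 2) / denom)
  I_tt <- -n * alpha / theta^2 +
    (alpha / theta^2) * (alpha + (k - 1) / 2) *
      sum((2 * theta * s + alpha) / denom)
  I_ta <- n / theta -
    (1 / theta) * sum((alpha^2 + 2 * alpha * theta * s +
                         (k - 1) / 2 * theta * s) / denom)
  matrix(c(I_aa, I_ta, I_ta, I_tt), nrow = 2,
         dimnames = list(c("alpha", "theta"), c("alpha", "theta")))
}
```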
Let
For a parameter like
To compute the residuals, we define the deviance for an individual
The model, as described in the article, is a Bayesian hierarchical model. To parameterize such an intricate model, traditional maximum likelihood estimation methods can no longer be applied; we therefore apply the Hamiltonian Monte Carlo (HMC) method. HMC is a Markov chain Monte Carlo (MCMC) method that samples candidate values in the parameter space of the model and then directly evaluates the likelihood function for that choice of parameters. The derivative of the likelihood function,
The tuning parameters
During the MCMC simulations, the centering values play a non-negligible role in shaping the model parameterization. If the centering parameters are held constant throughout all of the MCMC simulations, then the equation
The code can be found at https://github.com/hamishwp/Nhanes2021. The numerical code has been built in multiple stages. Below, we explain the principal files required to replicate the entire analysis presented in the article. The code falls into five main groups:
- Data cleaning scripts
- Main file
- Stan files for HMC
- Centering recalculation scripts
- Post-processing analysis
We provide a brief description of each of these below.
This is found in the file `Dataclean2021.R`. Given the raw NHANES dataset (in CSV format), it extracts all the data required for the simulations and stores it in a structure that can be read directly into the main file of this research (`MCMC_DiasSyst_v3.R`).
The main file is `MCMC_DiasSyst_v3.R`. It reads in the cleaned NHANES data and the specific choice of simulation parameters (for example, whether to use the FRS number or the mean systolic and diastolic blood pressure), and runs the correct RStan scripts for that specific selection of simulation parameters. This script is intended for use on computing clusters.
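For illustration, a hedged sketch of how the main script might invoke one of the Stan models through `rstan` is given below. The data-list element names (`N`, `t`, `d`, `X`, `Xhat`) and the objects `bp_data`, `covariate_cols` and `centering_values` are assumptions made for this sketch only; the actual names expected by the `.stan` files may differ.

```r
# Hedged sketch of an rstan call of the kind MCMC_DiasSyst_v3.R makes.
# Data-list names and 'bp_data' are illustrative assumptions only.
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

stan_data <- list(
  N    = nrow(bp_data),                         # number of participants
  t    = bp_data$time,                          # follow-up time
  d    = bp_data$event,                         # event indicator (CVD/heart attack or all-cause)
  X    = as.matrix(bp_data[, covariate_cols]),  # covariates in the linear predictor
  Xhat = centering_values                       # centering values from the recalculation scripts
)

fit <- stan(
  file    = "mystanmodel_DS_sigma_v2_autopred.stan",  # one of the eight Stan files
  data    = stan_data,
  chains  = 4,
  iter    = 4000,
  warmup  = 2000,
  control = list(adapt_delta = 0.95)
)
```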
There are eight Stan files: `mystanmodel_DS_sigma_v2_autopred.stan`, `mystanmodel_DS_tau_v2_autopred.stan`, `mystanmodelFRS_DS_sigma_v2_autopred.stan`, `mystanmodelFRS_DS_tau_v2_autopred.stan`, `mystanmodel_DS_sigma_v2.stan`, `mystanmodel_DS_tau_v2.stan`, `mystanmodelFRS_DS_sigma_v2.stan` and `mystanmodelFRS_DS_tau_v2.stan`. These correspond to the following alternative simulation parameters:
- For the blood-pressure variability, choosing to use the standard deviation $\sigma$ or the precision $\tau=1/\sigma$
- Using the FRS score or the mean diastolic and systolic blood pressure as a covariate in the analysis
- Whether the centering parameters, $\hat{X}$, in the linear predictor term are automatically calculated to satisfy $\sum_i^N \exp{(\boldsymbol{\beta}\cdot(\boldsymbol{X}-\hat{X}))}=0$ for every MCMC iteration, or whether the centering is held constant across all iterations
The centering of the linear predictors, which is required as input to every MCMC simulation iteration, is recalculated in the files `AutoPred_Recalc.R` and `ManPred_Recalc.R`. These centering values are then passed to the main script, `MCMC_DiasSyst_v3.R`, which provides them to the Stan code for the MCMC simulations.
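As an illustration of the recentering idea only (the exact constraint enforced in `AutoPred_Recalc.R` and `ManPred_Recalc.R` may differ), one simple choice is to centre each covariate at its sample mean, so that the linear predictor averages to zero across participants for a given coefficient vector:

```r
# Hedged sketch of a simple recentering rule: centre each covariate at its
# sample mean so the average linear predictor over participants is zero.
# This illustrates the idea only, not the exact rule used in the scripts.
recentre <- function(X, beta) {
  X_hat <- colMeans(X)                              # one centering value per covariate
  lp    <- as.vector(sweep(X, 2, X_hat) %*% beta)   # centred linear predictor
  stopifnot(abs(mean(lp)) < 1e-8)                   # averages to (numerically) zero
  list(X_hat = X_hat, lp = lp)
}
```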
The file `gamma_fits.Rmd` contains all the routines necessary to replicate the calculation of the empirical Bayes priors for the hyperparameters of the model.
### Post-processing
The post-processing script is `PostProcessing.R`, which relies heavily on the `Functions.R` script containing all the functions needed to analyse the data. The post-processing script generates many useful plots of the MCMC posterior distribution for the user, including Bayes factors, violin plots of the normalised beta and Gompertz posteriors, and more.
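Below is a hedged R sketch of the kind of step `PostProcessing.R` performs: extracting the beta posteriors from a fitted `rstan` object and drawing violin plots. The object name `fit` and the parameter name `"beta"` are illustrative assumptions, not necessarily the names used in the actual scripts.

```r
# Hedged sketch: violin plots of the beta posteriors from a fitted rstan model.
# 'fit' and the parameter name "beta" are illustrative assumptions.
library(rstan)
library(ggplot2)

beta_draws <- as.data.frame(rstan::extract(fit, pars = "beta")$beta)
names(beta_draws) <- paste0("beta[", seq_along(beta_draws), "]")

beta_long <- stack(beta_draws)   # columns: values, ind
ggplot(beta_long, aes(x = ind, y = values)) +
  geom_violin() +
  labs(x = "Coefficient", y = "Posterior draws")
```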
In this section, we add further detail to the results presented in the article. Extra information is given to explain how convergence of the simulations was ensured, and more visualisations of the converged model parameterizations are included. The authors feel that this is particularly useful for providing confidence in the model parameterization and its predictions.
Parameterizing the model presented in this work requires convergence of the MCMC simulations performed by Stan, as well as convergence of the centering values, which requires repeating the Stan calculations several times. Convergence of the latter is shown in figure A.3. The upper plot in figure A.3 illustrates convergence in the average Root Mean-Squared Error (RMSE) of the model predictions of the survival outcomes in the MCMC simulations. The lower plot illustrates convergence in the average sum of the linear predictor terms over all MCMC chain iterations.
With respect to convergence of the MCMC simulations, defining convergence first involves discarding the burn-in period of the simulations. When the time-evolution marker chain has a large number of samples, sequence thinning is used to reduce data storage: after convergence, only every kth value of the simulations is kept (after having discarded the burn-in phase values) and the rest are discarded. One measure of convergence is to bin similar markers and check, for each bin, that the variation of an individual marker's movement over a few time steps is larger than the variation between the markers of the ensemble. Other convergence checks are stationarity and mixing: the former ensures that the gradients of the chain movements in time are in the same direction, while the latter ensures that the amplitudes of the chain movements are similar. To calculate the mixing and stationarity, one can do the following:
\begin{itemize}
\item Take the putatively converged marker population, consisting of $N$ markers in total, each of chain length $\tau$, which is split into $k$ sub-sequences;
\item Stationarity: compute the inter-marker variance (the between-sequence variance $B$):
\begin{equation}
B = \frac{\tau}{k(kN-1)}\sum_{j=1}^{kN}\left(\bar{\psi}_{\cdot j}-\bar{\psi}_{\cdot\cdot}\right)^2
\end{equation}
\item Mixing: compute the variance along each marker's chain (the within-sequence variance $W$):
\begin{equation}
W = \frac{1}{N(\tau-k)}\sum_{j=1}^{kN}\sum_{i=1}^{\tau/k}\left(\psi_{i,j}-\bar{\psi}_{\cdot j}\right)^2
\end{equation}
\item Therefore, to estimate the marginal posterior variance of $p(\psi|y)$, we use a weighted average of $W$ and $B$:
\begin{equation}
\hat{\text{Var}}^+(\psi|y)=\frac{\tau-k}{\tau}W+\frac{k}{\tau}B
\end{equation}
Note that this quantity overestimates the marginal posterior variance, but it is unbiased under stationarity: this can be used to infer convergence. When the variation in
\begin{equation}
\hat{R}=\sqrt{\frac{\hat{\text{Var}}^+(\psi|y)}{W}}
\end{equation}
should approach 1 for converged simulations.
\end{itemize}
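A minimal R sketch of the split-$\hat{R}$ computation described above is given below for a single scalar quantity $\psi$; `draws` is assumed to be an (iterations $\times$ chains) matrix of post-burn-in samples, and each chain of length $\tau$ is split into $k$ sub-sequences.

```r
# Hedged sketch of the split-Rhat calculation described above.
# 'draws' is an (iterations x chains) matrix of post-burn-in samples.
split_rhat <- function(draws, k = 2) {
  tau <- nrow(draws); N <- ncol(draws)
  len <- tau %/% k                                  # length of each sub-sequence (tau / k)
  # split every chain into k consecutive sub-sequences
  seqs <- do.call(cbind, lapply(seq_len(N), function(j)
    matrix(draws[seq_len(len * k), j], nrow = len)))
  m <- ncol(seqs)                                   # m = k * N sub-sequences
  seq_means <- colMeans(seqs)
  B <- len / (m - 1) * sum((seq_means - mean(seq_means))^2)  # between-sequence variance
  W <- mean(apply(seqs, 2, var))                             # within-sequence variance
  var_plus <- (len - 1) / len * W + B / len                  # weighted average
  sqrt(var_plus / W)                                         # Rhat
}
```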
Another convergence parameter is the number of effective independent marker draws. Upon convergence, the time evolution of each marker should be uncorrelated with and independent of previous time steps. To find the average time-correlation over all markers, we use the variogram $V_{\tilde{t}}$:
\begin{equation}
V_{\tilde{t}}=\frac{1}{Nk(\tau/k-\tilde{t})}\sum_{j=1}^{kN}\sum_{i=\tilde{t}+1}^{\tau/k}\left(\psi_{i,j}-\psi_{i-\tilde{t},j}\right)^2,
\end{equation}
where $\tilde{t}$ denotes the lag.
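A minimal sketch of how this variogram can be turned into an effective number of independent draws is shown below, following the standard construction (autocorrelation estimated as $\hat{\rho}_{\tilde{t}} = 1 - V_{\tilde{t}} / (2\,\hat{\text{Var}}^+)$ and summed over lags); the truncation rule used here, stopping at the first negative autocorrelation estimate, is a simple illustrative choice rather than necessarily the one used in the paper's scripts. The matrix `seqs` holds the split sub-sequences in its columns, as in the split-$\hat{R}$ sketch above.

```r
# Hedged sketch: effective number of independent draws from the variogram.
# 'seqs' is a (tau/k x kN) matrix whose columns are the split sub-sequences.
effective_draws <- function(seqs) {
  len <- nrow(seqs); m <- ncol(seqs)
  W        <- mean(apply(seqs, 2, var))             # within-sequence variance
  B        <- len * var(colMeans(seqs))             # between-sequence variance
  var_plus <- (len - 1) / len * W + B / len         # marginal posterior variance estimate
  rho <- sapply(seq_len(len - 1), function(t) {
    V_t <- mean((seqs[(t + 1):len, , drop = FALSE] -
                 seqs[1:(len - t), , drop = FALSE])^2)   # variogram at lag t
    1 - V_t / (2 * var_plus)                             # autocorrelation estimate
  })
  cutoff <- which(rho < 0)[1]                       # simple illustrative truncation
  if (is.na(cutoff)) cutoff <- length(rho) + 1
  m * len / (1 + 2 * sum(rho[seq_len(cutoff - 1)]))
}
```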
We remind the reader of the numbering of the different models explored in this research, provided in list \ref{runnums}. These numbers, referred to as run numbers, are used in the following plots. One of the most important sets of parameters of the model is the vector
With respect to the time-independent Gompertz parameter, described using
Figure A.6 reflects the same level of consistency for the Gompertz parameter that governs the temporal evolution of the mortality risk. It is worth noting that figures A.5 and A.6 show inverse trends between the values of $B$ and $\theta$ across the demographic groups. This makes it difficult to judge, from these two plots alone, what the mortality risk is at different ages across demographics, yet it is evident that the shape of the mortality risk curve over time differs between demographic groups. Women are observed to have lower initial values of risk, but their mortality risk later in life increases much faster than for men. Additionally, Hispanic populations are shown to have a larger initial mortality risk than black populations, who in turn have a larger initial mortality risk than white populations in the USA. However, mortality risk increases at a faster rate for white populations than for black populations, for whom it increases faster than for Hispanic populations in the USA.
To measure the performance of the model to predict the survival outcome of individuals in the population, figure A.7 shows, ordered by individual age, the cumulative hazard
Finally, the AUC values were 0.72, 0.7, 0.69, 0.68, 0.73, 0.69, 0.73 and 0.69 for run numbers 1-8, respectively. This illustrates that, with respect to the ability to predict mortality risk in the population, the models that included the FRS score as a covariate in the proportional hazards model performed best. Furthermore, training the model specifically on CVD and heart-attack mortality data also increased model performance.
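For completeness, a hedged sketch of an AUC calculation of the kind summarised above: the rank-based (Mann-Whitney) estimate comparing predicted mortality risk between observed deaths and survivors. The inputs `risk` and `died` are illustrative names; the paper's post-processing scripts may compute the AUC differently.

```r
# Hedged sketch: rank-based (Mann-Whitney) AUC for predicted mortality risk.
# 'risk' is a vector of predicted risks, 'died' a 0/1 outcome indicator.
auc_rank <- function(risk, died) {
  r  <- rank(risk)
  n1 <- sum(died == 1)                 # number of deaths
  n0 <- sum(died == 0)                 # number of survivors
  (sum(r[died == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
```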