---
title: "NHANES Blood Pressure-Based Mortality Risk - Appendix"
author: "Rscripts by Hamish Patten, DW Bester and David Steinsaltz"
date: "03/08/2024"
output:
  bookdown::pdf_document2:
    keep_tex: true
    toc: true
    toc_depth: 3
    number_sections: true
    table_caption: true
    latex_engine: xelatex # Change the LaTeX engine to xelatex
header-includes:
  - "\\usepackage{lscape}" # Use lscape package to change page orientation
bibliography: nhanesBP.bib
---
```{r setup, include=FALSE,message=FALSE,warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(tinytex.verbose = TRUE)
library(tidyverse)
library(bookdown)
library(formatR)
library(magrittr)
library(knitr)
library(tinytex)
library(kableExtra)
library(VennDiagram)
library(lattice)
library(ggtern)
library(parallel)
library(survival)
c14<-c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A", "#FF7F00", "gold1", "skyblue2", "gray70", "maroon", "orchid1", "darkturquoise", "darkorange4", "brown", "black")
```
# Appendix A -- The data
```{r data-cleaning, include=FALSE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Load data
nhanes_old=read.csv('Data_raw/nh3bpdat290716.csv')
# Update mortality with 2017 public release
nhanes_update=read.csv('Data_raw/NHANES_ML_update.csv',colClasses = 'integer') %>%
mutate(yrsfuExam = round(permth_exm/12,2), yrsfuHome = round(permth_int/12, 2)) %>%
filter(eligstat == 1)
nhanes_old %<>% mutate(yrsfuExam = round(yrsfuExam, 2), yrsfuHome = round(yrsfuHome,2), permth_exm = round(yrsfuExam *12), permth_int = round(yrsfuHome*12))
# Compare old and new mortality
old_dead <- sort(nhanes_old$SEQN[nhanes_old$dead==1])
new_dead <- sort(nhanes_update$seqn[nhanes_update$mortstat==1])
conflict_dead <- old_dead[ !(old_dead %in% new_dead)] # Tested: No resurrections
missing_update <- nhanes_old$SEQN[!(nhanes_old$SEQN %in% nhanes_update$seqn)] #Tested: No one from old data set missing from followup
missing_old <- !(nhanes_update$seqn %in% nhanes_old$SEQN) # 6 subjects in the updated data set with mortality data (all alive) who were not in the old data set.
nhanes_update %<>% filter(!missing_old) # remove them
identical(nhanes_old$SEQN,nhanes_update$seqn) # Sequence of ids now identical
fu_diff <- nhanes_update$permth_int - nhanes_update$permth_exm
fu_diff_update <- nhanes_update$permth_int - nhanes_old$permth_int
identical(nhanes_old$UCOD_LEADING[!is.na(nhanes_old$UCOD_LEADING)], nhanes_update$ucod_leading[!is.na(nhanes_old$UCOD_LEADING)])
#TRUE, so cause-of-death codes for all individuals who were dead in the first data set are identical
conflict_old <- subset(nhanes_old, nhanes_old$SEQN %in% conflict_dead)
conflict_new <- subset(nhanes_update, nhanes_update$seqn %in% conflict_dead)
nhanes <- nhanes_old
nhanes$UCOD_LEADING <- nhanes_update$ucod_leading
nhanes$yrsfuExam <- nhanes_update$yrsfuExam
nhanes$yrsfuHome <- nhanes_update$yrsfuHome
nhanes$dead <- nhanes_update$mortstat
nhanes$mrtHrt <- as.integer(nhanes$UCOD_LEADING==1) # 1 codes for Heart
nhanes$mrtNeo <- as.integer(nhanes$UCOD_LEADING==2) # 2 codes for Neoplasm
nhanes$mrtInj <- as.integer(nhanes$UCOD_LEADING==4) # 4 codes for Injury
nhanes$mrtOtherCVD <- as.integer(nhanes$UCOD_LEADING==5) # 5 codes for other CVD
# Change NAs to 0
nhanes$mrtHrt[is.na(nhanes$mrtHrt)] <- 0
nhanes$mrtNeo[is.na(nhanes$mrtNeo)] <- 0
nhanes$mrtInj[is.na(nhanes$mrtInj)] <- 0
nhanes$mrtOtherCVD[is.na(nhanes$mrtOtherCVD)] <- 0
nhanes$eventhrt <- with(nhanes, mrtOtherCVD+mrtHrt)
nhanes$eventother <- with(nhanes, dead-eventhrt)
nhanes$yrsfu=pmin(nhanes_update$yrsfuHome,nhanes$yrsfuExam, na.rm = TRUE)
# Note: The most recent examination (home or clinic) is a left truncation time,
# so follow-up is the minimum. For some the clinic is missing.
# We call the follow-up the home follow-up time, but it doesn't matter,
# since those individuals will be excluded from analysis.
whichsys=match(c('systolicA','systolicB','systolicC'),names(nhanes))
whichdias=match(c('diastolicA','diastolicB','diastolicC'),names(nhanes))
whichsyshome=match(c('systolicAhome','systolicBhome','systolicChome'),names(nhanes))
whichdiashome=match(c('diastolicAhome','diastolicBhome','diastolicChome'),names(nhanes))
whichBP=c(whichsys,whichdias)
whichBPhome=c(whichsyshome,whichdiashome)
allsys=c(whichsys,whichsyshome)
alldias=c(whichdias,whichdiashome)
### Means and variances
sys=nhanes[,whichsys]
dias=nhanes[,whichdias]
sysH=nhanes[,whichsyshome]
diasH=nhanes[,whichdiashome]
nhanes$meandiasH=apply(diasH,1,mean)
nhanes$meansysH=apply(sysH,1,mean)
nhanes$meandiasC=apply(dias,1,mean)
nhanes$meansysC=apply(sys,1,mean)
# Overall means of home and clinic measurements, matched to their own BP type
nhanes$meandias=(apply(diasH,1,mean)+apply(dias,1,mean))/2
nhanes$meansys=(apply(sysH,1,mean)+apply(sys,1,mean))/2
nhanes$sysDel=(apply(sysH,1,mean)-apply(sys,1,mean))/2
nhanes$diasDel=(apply(diasH,1,mean)-apply(dias,1,mean))/2
nhanes$vardiasC=apply(dias,1,var)
nhanes$varsysC=apply(sys,1,var)
nhanes$vardiasH=apply(diasH,1,var)
nhanes$varsysH=apply(sysH,1,var)
nhanes$sddiasC=sqrt(nhanes$vardiasC)
nhanes$sdsysC=sqrt(nhanes$varsysC)
nhanes$sddiasH=sqrt(nhanes$vardiasH)
nhanes$sdsysH=sqrt(nhanes$varsysH)
nhanes$precdiasC=1/(nhanes$vardiasC+1/3)
nhanes$precsysC=1/(nhanes$varsysC+1/3)
nhanes$precdiasH=1/(nhanes$vardiasH+1/3)
nhanes$precsysH=1/(nhanes$varsysH+1/3)
race <- with(nhanes, ifelse(white==1, 2, ifelse(black==1, 1,ifelse(mexican==1,3, 0))))
nhanes$race <- as.factor(race)
type=factor(1+race+3*(1-nhanes$female)*(race>0)+6*(race==0)) # other race all type 7
levels(type)=c('female, black','female, white','female, Mex','male, black','male, white','male, Mex','other')
nhanes$type=type
nhanesna=!(is.na(apply(nhanes[,c(whichBP,whichBPhome)],1,prod)))
diaslow <- apply(nhanes[,alldias],1,min)<40
diashigh <- apply(nhanes[,alldias],1,max)>140
syslow <- apply(nhanes[,allsys],1,min)<60
syshigh <- apply(nhanes[,allsys],1,max)>250
nhanessysrange=(syslow | syshigh)
nhanesdiasrange=(diaslow | diashigh)
dias0 <- apply(nhanes[,alldias],1,min) == 0
sys0 <- apply(nhanes[,allsys],1,min) == 0
nhanesgood=(nhanesna&(nhanes$yrsfu>0)&nhanes$other==0)
n_original <- dim(nhanes)[1]
n_no_other <- sum(nhanes$other==0)
n_followup <- sum(nhanes$yrsfu>0 & nhanesna & nhanes$other==0)
n_no_followup <- sum(nhanes$yrsfu==0 & nhanesna & nhanes$other==0)
frs=read.csv('Data_raw/FRS.csv')
whichfrs <- '1998' # 'ATP' or '1998'
#Choose versions of FRS
if(whichfrs=='ATP'){nhanes$FRS=frs$ATP.FRS[match(nhanes$SEQN,frs$SEQN)]} else{nhanes$FRS=frs$X1998.FRS[match(nhanes$SEQN,frs$SEQN)]}
#################################################
#
# Add in observer data
#
#################################################
exm=read.csv('Data_raw/Examiners.csv')
nhanes$Exam=exm$PEPTECH[match(nhanes$SEQN,exm$SEQN)]
nhanes$Exam[is.na(nhanes$Exam)]=0
nhanes$Exam[nhanes$Exam == 88888] = 0
## Combine the 88888 examiner with 0
#nhanesA=subset(nhanesA,Exam>0)
examname=sort(unique(nhanes$Exam))
examlist=lapply(examname,function(en) subset(nhanes,Exam==en)$nunq.dias)
examlists=lapply(examname,function(en) subset(nhanes,Exam==en)$nunq.sys)
nhanesA=nhanes[nhanesgood & !nhanessysrange&!nhanesdiasrange, ]
nhanesA$type=factor(nhanesA$type,exclude=7)
N=dim(nhanesA)[1]
k=3
BP_type_names <- c('Systolic','Diastolic')
BP_place_names <- c('Home','Clinic')
whichsys=match(c('systolicA','systolicB','systolicC'),names(nhanesA))
whichdias=match(c('diastolicA','diastolicB','diastolicC'),names(nhanesA))
whichsyshome=match(c('systolicAhome','systolicBhome','systolicChome'),names(nhanesA))
whichdiashome=match(c('diastolicAhome','diastolicBhome','diastolicChome'),names(nhanesA))
whichBP=c(whichsys,whichdias)
whichBPhome=c(whichsyshome,whichdiashome)
# Make BP measures into array
sys=data.matrix(nhanesA[,whichsys])
dias=data.matrix(nhanesA[,whichdias])
sysH=data.matrix(nhanesA[,whichsyshome])
diasH=data.matrix(nhanesA[,whichdiashome])
allBP <- list(Systolic = list(Home = sysH, Clinic = sys), Diastolic = list(Home = diasH, Clinic = dias))
L=length(sys)
gamma_dimnames <- list( c('alpha','theta','beta') , BP_place_names)
norm_dimnames <- c('m_M','m_Delta', 'sigma2_M', 'sigma2_Delta')
```
```{r correlations, include=TRUE, echo=FALSE, message=FALSE, warning=FALSE}
demog.data <- data.frame(Ethnicity = droplevels(nhanesA$race) %>% fct_recode(Black = '1',White = '2',Mexican = '3'),
Sex = factor(nhanesA$female) %>% fct_recode(Male = '0', Female = '1') ,
Exam = as.factor(nhanesA$Exam) ) %>% mutate(Demog = factor(paste(Ethnicity,Sex)))
levels(demog.data$Exam)[levels(demog.data$Exam) == '0'] <- 'Unknown'
# Make a table of means for systolic home BP by ethnicity and sex
bp_table <- list()
bp.data <- data.frame()
which_diff <- function(x) { # for a vector with 3 entries, with two the same, return the one that is different
if (length(unique(x)) == 1 | length(unique(x))==3) {return(0)}
if (x[1] == x[2]) {return(3)}
if (x[1] == x[3]) {return(2)}
return(1)
}
for (BPtype in BP_type_names){
for (BPplace in BP_place_names){
bp.data %<>% rbind(cbind(demog.data, BPtype = factor(BPtype, levels = BP_type_names) , BPplace = factor(BPplace), Mean = apply(allBP[[BPtype]][[BPplace]],1,mean), SD = apply(allBP[[BPtype]][[BPplace]],1,sd),
TotalMean = (apply(allBP[[BPtype]][['Home']],1,mean)+apply(allBP[[BPtype]][['Clinic']],1,mean))/2,
Delta = abs(apply(allBP[[BPtype]][['Home']],1,mean)-apply(allBP[[BPtype]][['Clinic']],1,mean))/2,
# Note: TotalMean and Delta are identical for home and clinic
Number = apply(allBP[[BPtype]][[BPplace]],1,function(x) length(unique(x))),
Which_Diff = apply(allBP[[BPtype]][[BPplace]],1,which_diff)))
}
}
mean_sd_summary <- bp.data %>%
group_by(BPplace,BPtype, Sex, Ethnicity) %>%
summarise(
Mean_of_Mean = mean(Mean, na.rm = TRUE),
Mean_of_SD = mean(SD, na.rm = TRUE)
)
bp.data.cor <- bp.data %>%
group_by(BPtype, BPplace) %>%
summarize(correlationSD = cor(Mean, SD, use = "complete.obs"),correlationDelta = cor(TotalMean, Delta, use = "complete.obs"))
bp.data.cor2 <- bp.data %>%
group_by(BPtype,BPplace,Demog) %>%
summarize(correlationSD = cor(Mean, SD, use = "complete.obs"),correlationDelta = cor(TotalMean, Delta, use = "complete.obs"))
# Make a table of last digit fractions for bp by type and place
bp_last_digit <- matrix(0,nrow=0,ncol=5)
for (BPtype in BP_type_names){
for (BPplace in BP_place_names){
digit_table <- unname(table(allBP[[BPtype]][[BPplace]]%%10))
digits=digit_table/sum(digit_table)
bp_last_digit %<>% rbind(digits)
}
}
bp_last_digit %<>% as_tibble
names(bp_last_digit) <- 2*(0:4)
bp_last_digit$Place <- rep(BP_place_names,each=2)
bp_last_digit$Type <- rep(BP_type_names,2)
bp_last_digit %<>% select(Place,Type,everything())
```
```{r Venn1, echo=FALSE, message=FALSE, warning=FALSE, fig.width=4, fig.height=4, fig.align='center', fig.cap='Venn diagram of subjects excluded from the analysis.'}
bprange <- nhanessysrange | nhanesdiasrange
bprange[is.na(bprange)] <- FALSE
v <- draw.quad.venn(
area1=sum(!nhanesna), # Missing BP
area2=sum(nhanes$yrsfu==0), # No follow-up
area3=sum(nhanes$other==1), # Other ethnicity
area4 = sum(bprange), # BP out of range
n12=sum(!nhanesna & nhanes$yrsfu==0),
n23=sum(nhanes$yrsfu==0 & nhanes$other==1),
n13=sum(!nhanesna & nhanes$other==1),
n14 = sum(!nhanesna & bprange),
n24 = sum(nhanes$yrsfu==0 & bprange),
n34 = sum(nhanes$other==1 & bprange),
n124 = sum(!nhanesna & nhanes$yrsfu==0 & bprange),
n134 = sum(!nhanesna & nhanes$other==1 & bprange),
n123=sum(!nhanesna & nhanes$yrsfu==0 & nhanes$other==1),
n234=sum(nhanes$yrsfu==0 & nhanes$other==1 & bprange),
n1234=sum(!nhanesna & nhanes$yrsfu==0 & nhanes$other==1 & bprange),
category = c("Missing", "Follow-up", "Other Ethnicity","BP out of range"),
fill = c("skyblue", "pink1", "mediumorchid", "orange"),
lty = "blank",
cat.pos = c(-20, 20,0,0),
cex = 1
)
```
## Exclusions
There were `r dim(nhanes)[1]` subjects in the initial data set.
Of these `r sum(!nhanesgood)` were excluded because they had missing data or were not followed up, or belonged to the "Other" ethnic group.
This left `r sum(nhanesgood)` subjects for further consideration.
A small number of subjects were excluded because their blood pressure measurements were outside the normal range, as described below in section \ref{sec:BPrange}.
As our method depends on estimating the mortality rates for each demographic group (ethnicity and sex),
we removed the small number of subjects whose ethnic group was given as "Other" (n=`r sum(nhanes$other)`).
(The three included ethnic groups were Mexican American (n=`r sum(nhanes$mexican)`), Black (n=`r sum(nhanes$black)`), and White (n=`r sum(nhanes$white)`).)
In the end there were `r N` subjects in the analysis data set; we will refer to this as the "full population".
A Venn diagram of the different causes of exclusion is given in Figure \ref{fig:Venn1}.
Of these, `r sum(!is.na(nhanesA$FRS))` had a computable FRS score.
We call this the ``FRS population''.
## Exploratory data analysis
The empirical means of the home and clinic measures in the full population are tabulated in Table \ref{tab:summaries}. We note that the home measures are systematically higher than the clinic measures, within every demographic group, with greater differences for subjects who are white or Mexican, and female.
The average difference is about `r sprintf('%.1f',mean(diasH)-mean(dias))` for diastolic and `r sprintf('%.1f', mean(sysH) - mean(sys))` for systolic, which is small compared with the general range of the differences, which have SD of
`r sprintf('%.1f (diastolic) and %.1f (systolic)', sd(diasH-dias),sd(sysH-sys))`.
```{r summaries, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Printing the table using kable
cat(kable(mean_sd_summary,format="latex", escape = F,booktabs = T, digits=1,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias", "Sex", "Ethnicity", 'Mean', 'SD'),caption = 'Summary data for blood pressure') %>% kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
### Correlations between measurements {#sec:correlations}
In Figure \ref{fig:SD-mean} we see that there is relatively little correlation between the empirical SD and the empirical mean for the different BP types and places. This is reassuring, as it avoids the possibility of a collinearity effect confounding the sampling of mean and SD, which are treated as independent covariates in the model.
```{r SD-mean,fig.pos='H', fig.cap='Scatterplot of individual mean BP against individual SD of BP', echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Scatterplot of mean BP against SD of BP from bp.data
ggplot(bp.data, aes(x = Mean, y = SD)) +
geom_point(alpha=.03, color = 'navyblue' ) +
labs(title = "Scatterplot of Mean against SD",
x = "Mean",y = "SD") + ylim(0,15)+
geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line
facet_grid(BPtype ~ BPplace, scales = 'free_x') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(data = bp.data.cor, aes(label = sprintf("Cor: %.2f", correlationSD), x = Inf, y = Inf),
vjust = "top", hjust = "right", inherit.aes = FALSE)
```
<!--#```{r Delta-mean,fig.pos='H', fig.cap='Scatterplot of individual mean BP against individual absolute difference between Clinic and Home mean', echo=FALSE,results='asis',message=FALSE,warning=FALSE}
#bp.sub <- data.frame(Mean=bp.data$TotalMean[bp.data$BPplace=='Clinic'],Delta=bp.data$Delta[bp.data$BPplace=='Clinic'], #BPtype=bp.data$BPtype[bp.data$BPplace=='Clinic']) %>% # Home and clinic are identical
# group_by(BPtype)
# bp.data.cor <- bp.sub %>%
# summarize(correlationDelta = cor(Mean, Delta, use = "complete.obs"))
# # Scatterplot of mean BP against SD of BP from bp.data
# ggplot(bp.sub %>% group_by(BPtype), aes(x = Mean, y = Delta)) +
# geom_point(alpha=.03, color = 'navyblue' ) +
# labs(title = "Scatterplot of Mean against |Delta|",
# x = "Mean",y = "|Delta|") + ylim(0,20)+
# geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line
# facet_grid(~ BPtype, scales = 'free_x') +
# theme(plot.title = element_text(hjust = 0.5)) +
# geom_text(data = bp.data.cor2, aes(label = sprintf("Cor: %.2f", correlationDelta), x = Inf, y = Inf),
# vjust = "top", hjust = "right", inherit.aes = FALSE)
-->
```{r Delta-mean,include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
DeltaSD <- abs(cbind(nhanesA$sysDel,nhanesA$diasDel))  # absolute semi-differences between home and clinic
MeanSD <- cbind(nhanesA$meansys,nhanesA$meandias)      # overall means
colnames(DeltaSD) <- c('SysDelta','DiasDelta')
colnames(MeanSD) <- c('SysMean','DiasMean')
cor_table <- cor(DeltaSD,MeanSD,use='complete.obs')
cat(kable(cor_table,
caption = 'Correlation between mean and Delta. Rows correspond to type of Delta, columns to type of mean.',
format = "latex",
escape = FALSE,
booktabs = TRUE,
digits = 3))
cat('\n')
```
```{r SD-mean2,include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
DiasSDMean <- cbind(nhanesA$sddiasC,nhanesA$meandiasC,nhanesA$sddiasH,nhanesA$meandiasH)
SysSDMean <- cbind(nhanesA$sdsysC,nhanesA$meansysC,nhanesA$sdsysH,nhanesA$meansysH)
allSD <- with(nhanesA,cbind(sdsysC,sdsysH,sddiasC,sddiasH))
allMean <- with(nhanesA,cbind(meansysC,meansysH,meandiasC,meandiasH))
colnames(allSD) <- c('Clinic Sys SD','Home Sys SD','Clinic Dias SD','Home Dias SD')
colnames(allMean) <- c('Clinic Sys Mean','Home Sys Mean','Clinic Dias Mean','Home Dias Mean')
cor_table2 <- cor(allSD,allMean,use='complete.obs')
cat(kable(cor_table2,
caption = 'Correlation between mean and SD. Rows correspond to type and location of SD, columns to type and location of mean.',
format = "latex",
escape = FALSE,
booktabs = TRUE,
digits = 3) %>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
In Table \ref{tab:Delta-mean} we show the correlations between overall mean and absolute difference ($|\Delta|$) between clinic and home measurements.
The results are given as a $2\times2$ table, showing correlations within systolic and diastolic BP, and between the two.
The only moderately high correlation is between Systolic mean and Diastolic absolute Delta, which would correspond to a Variance Inflation Factor of `r round(1/(1-cor_table["DiasDelta","SysMean"]^2),2)`.
While a VIF is not directly relevant to the present Bayesian methodology, its modest size suggests that this correlation should not substantially affect the estimation of the model coefficients.
In Table \ref{tab:SD-mean2} we show the correlations between the per-subject mean and standard deviation of the three repeated BP measurements, considering all combinations of (Clinic, Home) and (Systolic, Diastolic).
Finally, Table \ref{tab:SD-mean3} shows the correlations between systolic and diastolic, ranging over (Clinic,Home) and (Mean,SD).
(Some of the numbers here of course duplicate those in Table \ref{tab:SD-mean2}.)
Again, the correlations are too low to require any special treatment.
```{r SD-mean3,include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
DiasSDMean <- cbind(nhanesA$sddiasC,nhanesA$meandiasC,nhanesA$sddiasH,nhanesA$meandiasH)
SysSDMean <- cbind(nhanesA$sdsysC,nhanesA$meansysC,nhanesA$sdsysH,nhanesA$meansysH)
colnames(DiasSDMean) <- c('Clinic Dias SD','Clinic Dias Mean','Home Dias SD','Home Dias Mean')
colnames(SysSDMean) <- c('Clinic Sys SD','Clinic Sys Mean','Home Sys SD','Home Sys Mean')
cor_table3 <- cor(DiasSDMean,SysSDMean,use='complete.obs')
cat(kable(cor_table3,
caption = 'Correlation between diastolic and systolic summary statistics. Rows correspond to variables and locations for diastolic, columns to variables and locations for systolic.',
format = "latex",
escape = FALSE,
booktabs = TRUE,
digits = 3) %>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
## Errors in blood pressure measurement or recording {#sec:errors}
Errors in blood pressure measurement or recording were found particularly in the home measurements.
While these did not destroy the usefulness of the home measurements, they did require some attention and decisions about how to work with the defects.
We also consider them inherently interesting, and worth registering for future researchers working on these or similar data.
In particular, the problem we have called “dependent replication” was entirely unexpected, although not unprecedented, and is of particular concern to researchers trying to estimate individual variation in clinically relevant measures.
### Last-digit preference {#sec:lastdigit}
A mild tendency for observers to prefer certain last digits in reporting BP measurements has been reported in other studies, though an analysis of the 1999 wave of NHANES reported no last-digit preference [@ostchega2003national].
The last-digit preference in NHANES III, on the other hand, is substantial, with about `r sprintf('%.1f', 100*bp_last_digit$'0'[3])`% of all the clinic-measured systolic BP measurements ending in 0, but only about `r sprintf('%.1f', 100*bp_last_digit$'4'[3]+100*bp_last_digit$'6'[3])`% ending in 4 or 6. Because the shifts due to last-digit preference are presumably small, we expect them to have little influence on the main effects examined in this paper; they do, however, increase the probability of two measurements being rounded to the same value, which needs to be taken into account in examining the problem of dependent replication.
```{r digit-summary, include=TRUE, echo=FALSE,results='asis'}
# Printing the table using kable
cat(kable(bp_last_digit,format="latex", escape = F,booktabs = T, digits=3,col.names=c("Place","Sys/Dias",2*(0:4)),caption = 'Summary data for BP end digits') %>%
kable_styling(latex_options = c("hold_position","striped")))
cat('\n')
```
### Dependent replicates {#sec:pseudorep}
While the protocol calls for each subject to have three independent BP measures taken, it is possible that the observers were influenced by one measure when recording the next.
This could happen in either direction: later measurements could be pulled closer to the first, or there could be an inclination to avoid repeated measures.
This is relevant, because erroneously repeated measures would artificially decrease the variance of the three measurements, and avoiding repeated measures would have the opposite effect.
The end-digit bias may be expected to have an effect here, since it influences the probability of two measurements being rounded to the same value.
We begin by noting the standard deviations for measurements of individual subjects as given in the column 'Mean of SD' in Table \ref{tab:sd-summary}.
The column 'Prob all rep' gives the theoretical probability that all three of the measurements for a subject would have the same value, if the measurements were independent and normally distributed with the given standard deviation (adjusted for the rounding), and assuming that rounding to particular last digits occurs in proportion to the fractions listed in Table \ref{tab:digit-summary}.
The column 'Prob 2 rep' gives the probability that exactly two of the three measurements would have the same value, under the same conditions.
The column 'Frac all rep' gives the observed fraction of subjects for whom all three measurements were equal, and 'Frac 2 rep' gives the fraction for whom two of the three measurements were equal.
The observed fractions for three equal measurements are all very close to the theoretical probabilities, but the observed fractions for two equal measurements are substantially lower than the theoretical probabilities.
(For comparison, a 95\% probability range for the fraction of subjects with two equal measurements is about $\pm 0.008$.)
```{r pseudorep, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Calculate expected number of all with same last digit given 3 observations of sd = S
prob_repeated3 <- function(S, intervalwidth=2){
# Assume overall mean is uniform
# so average over the interval [0,10]
# then probability of same is sum_j (Phi(10j+intervalwidth-x)-Phi(10j-x))^3
  sum(sapply(seq(-1,1), function(j)
    .1*integrate(function(x)
      (pnorm(10*j+intervalwidth-x,sd=S)-pnorm(10*j-x,sd=S))^3, -5, 5)$value))
}
prob_repeated2 <- function(S, intervalwidth=2){
# Assume overall mean is uniform
# so average over the interval [0,10]
# then probability of same is sum_j 3*(Phi(10j+intervalwidth-x)-Phi(10j-x))^2 - 2*prob_repeated3
sum(sapply(seq(-1,1),function(j)
.1*integrate(function(x)
3*(pnorm(10*j+intervalwidth-x,sd=S)-pnorm(10*j-x,sd=S))^2,-5,5)$value) ) - 2*prob_repeated3(S,intervalwidth)
}
sd_summary <- bp.data %>%
group_by(BPplace,BPtype) %>%
summarise(
Mean_of_SD = mean(SD, na.rm = TRUE),
# fraction with all three measurements equal
Fraction3 = mean(Number==1, na.rm = TRUE),
Fraction2 = mean(Number==2, na.rm = TRUE)
)
# Calculate probability of three measurements being equal given unequal interval lengths
prob_repeated3_intervals <- function(S,intervals){
sum(sapply(intervals, function(intervalwidth) prob_repeated3(S,intervalwidth)))
}
prob_repeated2_intervals <- function(S,intervals){
sum(sapply(intervals, function(intervalwidth) prob_repeated2(S,intervalwidth)))
}
sd_summary$Prob_repeated3 <- sapply(1:4, function(j) prob_repeated3_intervals(sqrt(sd_summary$Mean_of_SD[j]^ 2+1/3),10*unlist(bp_last_digit[j,3:7])))
sd_summary$Prob_repeated2 <- sapply(1:4, function(j) prob_repeated2_intervals(sqrt(sd_summary$Mean_of_SD[j]^2+1/3),10*unlist(bp_last_digit[j,3:7])))
sd_summary %<>% relocate(Fraction2, .after = Prob_repeated3)
```
```{r sd-summary, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
#Print table of sd_summary
cat(kable(sd_summary,format="latex", escape = F,booktabs = T, digits=3,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias","Mean of SD","Frac all rep", "Prob all rep", "Frac 2 rep", "Prob 2 rep"),caption = 'Summary data for repeated measures') %>% kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
We show the fraction of subjects with two equal measurements in Figure \ref{fig:examinerPlot}, split by examiner, blocked by place and type.
We see that the fraction of subjects with two equal measurements varies substantially by examiner, and that the variation is greater for the systolic than for the diastolic measurements.
```{r examinerCalc, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# 2 by 2 grid of columns of number of subjects with 2 equal measurements
# by examiner, blocked by place and type
# First, calculate the number of subjects with 2 equal measurements for each examiner
examiner2 <- bp.data %>%
group_by(BPplace,BPtype,Exam) %>%
summarise(
Number2 = sum(Number==2, na.rm = TRUE),
Number3 = sum(Number==1, na.rm = TRUE),
Total = n(),
)
examiner2 %<>% mutate(Frac2 = Number2/Total,
Frac3 = Number3/Total,
SD2 = sqrt(Frac2*(1-Frac2)/Total),
SD3 = sqrt(Frac3*(1-Frac3)/Total))
```
```{r examinerPlot, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE,fig.cap="Fraction of subjects with 2 equal measurements by examiner, blocked by place and type. Red band shows 95% probability range. Vertical green dashed line shows expected fraction; blue dotted line shows observed fraction over all examiners."}
ggplot(examiner2, aes(x=Number2/Total,y=Exam)) +
geom_point() +
geom_errorbarh(aes(xmin=Frac2 - 2*SD2,xmax=Frac2 + 2*SD2,col='red',height=.3)) +
facet_grid(BPplace~BPtype) +
labs(x="Fraction of subjects with 2 equal measurements",y="Examiner") +
theme_bw() +
# add dashed vertical line at corresponding Prob_repeated2 for that place and type
geom_vline(data= sd_summary, aes(xintercept=Prob_repeated2),linetype="dashed", col= 'darkolivegreen') +
geom_vline(data= sd_summary, aes(xintercept=Fraction2),linetype="dotted", col= 'navyblue') +
theme(axis.text.y = element_text(size=10),
axis.text.x = element_text(size=10),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14),
strip.text.y = element_text(size=12),
strip.text.x = element_text(size=12),
strip.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
legend.position = "none"
)
```
```{r examinerPlot3, include=TRUE, echo=FALSE,message=FALSE,warning=FALSE,fig.cap="Fraction of subjects with 3 equal measurements by examiner, blocked by place and type. Red band shows 95% probability range. Vertical green dashed line shows expected fraction; blue dotted line shows observed fraction over all examiners."}
# do the same for 3 equal measurements
stripchart3 <- ggplot(examiner2,
aes(x=Number3/Total,y=Exam)) +
geom_point() +
geom_errorbarh(aes(xmin=pmax(0,Frac3 - 2*SD3),xmax=Frac3 + 2*SD3,col='red',height=.3)) +
facet_grid(BPplace~BPtype) +
labs(x="Fraction of subjects with 3 equal measurements",y="Examiner") +
theme_bw() +
# add dashed vertical line at corresponding Prob_repeated3 for that place and type
geom_vline(data= sd_summary, aes(xintercept=Prob_repeated3),linetype="dashed", col= 'darkolivegreen') +
geom_vline(data= sd_summary, aes(xintercept=Fraction3),linetype="dotted", col= 'navyblue') +
theme(axis.text.y = element_text(size=10),
axis.text.x = element_text(size=10),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14),
strip.text.y = element_text(size=12),
strip.text.x = element_text(size=12),
strip.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
legend.position = "none"
)
stripchart3
```
In Figure \ref{fig:examinerPlot3}, we show the fraction of subjects with three equal measurements, by examiner, blocked by place and type.
Relative to the expected random fluctuations, we see that there is even more variation among the examiners.
One examiner (3001) produced consistently excessive numbers of triple repeats in Home measurements, and a deficit of triple repeats in Clinic measurements.
One further point to explore is the position of the two equal measures in a group of three.
If there are three independent measures, with two equal, each of the three has equal probability of being the odd one out.
On the other hand, if there is a trend in the measurements, then the second is least likely to be the odd one out.
In fact, what we observe is that it is the third measurement that is least likely to differ from the other two, while the first is most likely.
This is what we would expect if examiners sometimes either intentionally copied the second measurement into the space for the third, or unintentionally allowed themselves to be influenced into observing the same number.
The proportions are listed in Table \ref{tab:proportionChisq}, together with chi-squared tests for difference from the expected equal proportions for each site and type.
Neither the equal-probability pattern nor the trend pattern is what we observe.
We see that there is a huge deviation from the expected proportions in the Home measurements, but less in the Clinic measurements, and more deviation in Systolic than in Diastolic measurements.
```{r proportionChisq, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# make a table of bp.data$Which_Diff blocked by BPplace and BPtype
b0 <- subset(bp.data,Which_Diff>0)
table1 <- as.data.frame(table(b0$Which_Diff,b0$BPplace,b0$BPtype,b0$Exam))
names(table1) <-c("Which_Diff","BPplace","BPtype","Exam","Freq")
# add a column for normalized frequencies
table1$Normalized <- ave(table1$Freq,table1$BPplace,table1$BPtype,table1$Exam,FUN=function(x) x/sum(x))
# reshape to wide format (one row per place/type/examiner) and append a reference 'Centre' row with equal proportions
table1 %<>% pivot_wider(names_from = Which_Diff, values_from = c(Normalized,Freq),id_cols = c("BPplace","BPtype","Exam")) %>%
rbind(data.frame(BPplace=rep(c("Home","Clinic"),2),BPtype=rep(c("Systolic","Diastolic"),each=2),Exam='Centre',Normalized_1=1/3,Normalized_2=1/3,Normalized_3=1/3,Freq_1=1,Freq_2=1,Freq_3=1))
# Perform chi-square test for difference between observed proportions of position of unequal measurement and expect 1/3,1/3,1/3
# Data are in columns Freq_1,Freq_2,Freq_3 of table1
# Expected values are 1/3, 1/3, 1/3
table1 %<>% rowwise() %>% mutate( chisq.val = { observed=c(Freq_1,Freq_2,Freq_3)
expected=c(1/3,1/3,1/3)
unname(chisq.test(x=observed, p=expected)$statistic)},
chisq.p = pchisq( chisq.val, df=2, lower.tail=FALSE))
# make a table of chi-squared for total counts of unequal measurements by place and type
chisq1 <- table1 %>% subset(Exam != 'Centre') %>%
group_by(BPplace,BPtype) %>%
summarise(Freq_1=sum(Freq_1),Freq_2=sum(Freq_2),Freq_3=sum(Freq_3)) %>% rowwise() %>%
mutate(chisq.val = { observed=c(Freq_1,Freq_2,Freq_3)
expected=c(1/3,1/3,1/3)
signif(unname(chisq.test(x=observed, p=expected)$statistic),3)},
chisq.p = format( pchisq( chisq.val, df=2, lower.tail=FALSE), scientific = TRUE, digits = 3) )
cat(kable(chisq1, format="latex", escape = F,booktabs = T,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias","Freq1","Freq2","Freq3","ChiSq","p-value"),
caption="Chi-square test for difference between observed proportions (all examiners), stratified by place and type")%>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
To explore this further, we can look at the proportions of first, second and third measurements from each examiner that are different from the other two.
The results of a chi-squared test for each examiner (stratified by site and type of BP) for difference from the expected equal proportions are shown in Figure \ref{fig:proportionChisq3}.
The dashed line represents a p-value of 0.001.
Here we see that the Home measurements are extremely variable, while the Clinic measurements are quite consistent with the expected proportions, with the single exception of examiner 3004, who is far from the expected equal proportions in all categories of measurement.
```{r proportionChisq3,include=TRUE,echo=FALSE,results='asis',message=FALSE,warning=FALSE,fig.height=7,fig.cap="Proportions of first, second and third measurements from each examiner that are different from the other two, by place and type. Chi-squared value for difference from expected proportions. Dashed line represents p-value 0.001."}
ggplot(subset(table1,Exam != 'Centre'),
aes(y=chisq.val,x=Exam))+
geom_point() +
facet_grid(BPplace~BPtype) +
labs(y="Chi-square value for difference from expected proportions",x="Examiner") +
theme_bw() +
# add dashed horizontal line corresponding to p=0.001
geom_hline(yintercept=qchisq(0.999,df=2),linetype="dashed",col='navyblue')+
theme(axis.text.y = element_text(size=10),
axis.text.x = element_text(size=10, angle=90, vjust =.5, hjust =1),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14),
strip.text.y = element_text(size=12),
strip.text.x = element_text(size=12),
panel.spacing = unit(0.5, "lines"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(color="black", fill=NA, size=0.5),
panel.background = element_blank(),
legend.position = "none")
```
Given that the position of the differing measure clearly differs from the expected equal proportions, we might ask whether the examiners agree on a common proportion, suggesting that there might be some underlying systematic (observer-independent) reason for the differing measurements.
In Table \ref{tab:ternaryChi2} we show the results of a chi-squared test for equality of observed proportions among the examiners, stratified by place and type.
Interestingly, we see here that the examiners are fairly consistent in their proportions for the Home measures, but not for the Clinic measures.
```{r ternaryChi2, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE}
# Chi-square test for difference between observed proportions and expected proportions for the table stratified by place and type
chisq2 <- table1 %>% subset(Exam != 'Centre') %>%   # exclude the artificial 'Centre' reference row
  group_by(BPplace,BPtype) %>% summarise(ChiSq=signif(chisq.test(cbind(Freq_1,Freq_2,Freq_3))$statistic,3),
  p.value=format(chisq.test(cbind(Freq_1,Freq_2,Freq_3))$p.value,scientific=TRUE,digits=3))
# Print table of chisq2
cat(kable(chisq2, format="latex", escape = F,booktabs = T,
linesep = rep(c(rep("",5),"\\addlinespace"),4),
col.names=c("Place","Sys/Dias","ChiSq","p-value"),
caption="Chi-square test for difference between observed proportions among the examiners, stratified by place and type")%>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
```{r ternaryPlot, include=TRUE,fig.height=8,fig.width=8, echo=FALSE,results='asis',message=FALSE,warning=FALSE,fig.cap="Ternary plot of the position of the measurement that is unique, among subjects with 2 equal measurements. V1 is the fraction with the first distinct, V2 is the fraction with the second distinct, V3 is the fraction with the third distinct."}
# make a ternary plot of the normalized frequencies
size_values <- setNames(rep(1.5, length(levels(table1$Exam))), levels(table1$Exam))
size_values["Centre"] <- 3 # make the centre bigger
size_values['3004'] <- 3 # make the outlier bigger
color_values <- setNames(c14, levels(table1$Exam)) # set the colours
ggtern(data=table1, aes(x=Normalized_1, y=Normalized_2, z=Normalized_3)) +
geom_point(aes(fill=Exam, size=Exam),alpha=.75, shape=21) +
theme_bw() +
facet_grid(BPplace~BPtype) +
theme(tern.axis.text.L = element_text(size=10),
tern.axis.text.R = element_text(size=10),
tern.axis.text.T = element_text(size=10),
tern.axis.title.L = element_text(size=14, vjust=1,hjust=.5,face='bold'),
tern.axis.title.R = element_text(size=14, vjust=1,hjust=.5,face='bold'),
tern.axis.title.T = element_text(size=14, vjust=1,hjust=.5,face='bold'),
strip.text.y = element_text(size=14),
strip.text.x = element_text(size=14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
) + labs(L="V1", T="V2", R="V3") +
theme_showgrid()+ theme(legend.position ='bottom') +
scale_size_manual(values=size_values, name='Exam') +
scale_fill_manual(values=color_values, name='Exam') +
guides(size=guide_legend(override.aes=list(size=5)))
```
Looking at the ternary plot in Figure \ref{fig:ternaryPlot} of the proportions from the 13 different examiners, we see very clearly the bias toward having the last two measures agree for almost all examiners, with examiner 3004 (marked larger) standing out as a clear outlier.
Overall, we can only conclude that there are clearly some irregularities in the BP measurement process, but we cannot identify a specific structure to them, or propose a remedy.
As the irregularities are not very large, we will proceed with the analysis without attempting to correct for them.
### Missing or implausible measurements {#sec:BPrange}
Some of the reported measures were extremely implausible, particularly for diastolic BP. `r sum(dias0)` subjects had at least one diastolic BP measure recorded as 0, in addition to the `r sum(!nhanesna)` subjects who were missing at least one measurement. We excluded all of these subjects, and indeed any subject who had at least one measurement recorded outside the ranges (40,140) for diastolic and (60,250) for systolic BP, as recommended by the CDC [@CDCBP].
There was just one subject with systolic BP measures that were too low, but `r sum(nhanesdiasrange)` subjects with low diastolic BP (in addition to those with measures recorded as 0).
One subject was excluded for diastolic BP 156, and three were excluded for systolic BP that was too high, with the maximum being 264.
\newpage
<!-- ## Exploratory data analysis -->
<!-- The empirical means of the home and clinic measures in population B are tabulated in Table 1. We note that the home measures are systematically higher than the clinic measures, within every demographic group, with greater differences for subjects who are white or Mexican, and female. -->
<!-- The average difference is about `r sprintf('%.1f',mean(diasH)-mean(dias))` for diastolic and `r sprintf('%.1f', mean(sysH) - mean(sys))` for systolic, which is small compared with the general range of the differences, which have SD of -->
<!-- `r sprintf('%.1f (diastolic) and %.1f (systolic)', sd(diasH-dias),sd(sysH-sys))`. -->
<!-- ```{r summaries, include=TRUE, echo=FALSE,results='asis',message=FALSE,warning=FALSE} -->
<!-- # Printing the table using kable -->
<!-- cat(kable(mean_sd_summary,format="latex", escape = F,booktabs = T, digits=1, -->
<!-- linesep = rep(c(rep("",5),"\\addlinespace"),4), -->
<!-- col.names=c("Sys/Dias", "Place","Sex", "Ethnicity", 'Mean', 'SD'),caption = 'Summary data for blood pressure') %>% kable_styling(latex_options = "hold_position") ) -->
<!-- cat('\n') -->
<!-- ``` -->
<!-- ### Correlations between measurements {#sec:correlations} -->
<!-- We see that there is relatively little correlation between empirical SD and empirical mean SD for the different BP types and places. This is reassuring, as it avoids the possibility of a collinearity effect confounding the sampling of mean and SD, which are being treated as independent covariates in the model. -->
<!-- ```{r SD-mean, dpi = 1,fig.pos='H', fig.width=8,fig.height=6, fig.cap='2D density plot of individual mean BP against individual SD of BP', echo=FALSE,results='asis',message=FALSE,warning=FALSE} -->
<!-- # 2D density plot of mean BP against SD of BP from bp.data -->
<!-- ggplot(bp.data, aes(x = Mean, y = SD)) + -->
<!-- geom_bin2d() + -->
<!-- labs(title = "2D Density Plot of Mean against SD", -->
<!-- x = "Mean",y = "SD") + ylim(0,15)+ -->
<!-- scale_fill_continuous(type = "viridis") + -->
<!-- theme_bw()+ -->
<!-- geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line -->
<!-- facet_grid(BPtype ~ BPplace, scales = 'free_x') + -->
<!-- theme(plot.title = element_text(hjust = 0.5)) + -->
<!-- geom_text(data = bp.data.cor, aes(label = sprintf("Cor: %.2f", correlation), x = Inf, y = Inf), -->
<!-- vjust = "top", hjust = "right", inherit.aes = FALSE) -->
<!-- ``` -->
# Appendix B -- Model details
This appendix adds more detail about the numerical modelling than was provided in the article, to ensure that the research methods are transparent and fully reproducible. The numerical modelling presented in this paper was performed using R combined with RStan. We provide more detail here about the model and the specific methodology used to parameterize it, together with results that were not included in the main text.
## The Statistical Model
The model used in this research is built on the theory of joint modelling of longitudinal and time-to-event data, described in detail later in this section. In brief, it allows the simultaneous modelling of the longitudinal observations (here, the blood pressure measurements) and the time-to-event outcome.
In this research the event of interest is either death from any cause, or death specifically from cardiovascular or cerebrovascular causes; we will henceforth refer to the latter as CVD mortality.
In the latter case, death from a different cause is treated as a noninformative censoring event.
### Survival Analysis (Time-to-Event)
The basic survival model is a Gompertz hazard rate with proportional hazards influences of the blood pressure covariates.
The Gompertz equation
\begin{equation}\label{gompertz}
h_0(t)=B\exp{\left(\theta(x+T)\right)},
\end{equation}
describes the baseline hazard of the population for a particular risk, which in this article is either CVD mortality specifically or all-cause mortality. Here $x\in\mathbb{N}^N$ is the age of the individual at the initial interview, with $N$ the number of individuals, and $T\in\mathbb{R}^{+,N}$ is the time since the individual entered the survey.
Note that both $B$ and $\theta$ take 6 different values, depending on the sex reported at the initial interview --- female or male --- and the ethnic group --- Black, White or Mexican American.
(Subjects whose ethnicity was reported as "Other" were excluded from the analysis, as described in Appendix A.)
The log-linear proportional hazards model links the covariates of the model (mean systolic blood pressure, variance in the diastolic blood pressure, etc) to the survival outcome of the individual via the equation
\begin{equation}\label{prophaz}
h(t)=h_0(t)\exp{\left(\boldsymbol{\beta}\cdot(\boldsymbol{X}-\hat{\boldsymbol{X}})\right)},
\end{equation}
where $\boldsymbol{X}\in\mathbb{R}^{+,N\times d}$ is the matrix of blood pressure summary statistics that serve as individual covariates in our model, $\hat{\boldsymbol{X}}\in\mathbb{R}^{+,d}$ is a centering vector chosen so that $\sum_{i=1}^N \boldsymbol{\beta}\cdot(\boldsymbol{X}_i-\hat{\boldsymbol{X}})\approx 0$ (more on this later), and $\boldsymbol{\beta}\in\mathbb{R}^d$ gives the strength of the influence of each covariate on the mortality risk.
For the majority of subjects the death time is censored --- not yet observed at the close of follow-up --- with the censoring indicator denoted $\delta\in \{0,1\}$.
When CVD mortality is the event being analysed, deaths due to other causes are treated as noninformative censoring events.
In this study, we explored the following covariates:
```{r, include=FALSE,message=FALSE,warning=FALSE}
DF<-data.frame(varnames=c("$FRS-1998$",
"$FRS-ATP$",
"$M_S$",
"$M_D$",
"$\\Delta_S$",
"$\\Delta_D$",
"$\\sigma_{\\{S,H\\}}$",
"$\\sigma_{\\{D,H\\}}$",
"$\\sigma_{\\{S,C\\}}$",
"$\\sigma_{\\{D,C\\}}$",
"$\\tau_{\\{S,H\\}}$",
"$\\tau_{\\{D,H\\}}$",
"$\\tau_{\\{S,C\\}}$",
"$\\tau_{\\{D,C\\}}$"),
support=c("$R^N$",
"$R^N$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$",
"$R^{+,N}$"),
desc=c("1998 version of the FRS score",
"ATP version of the FRS score",
"Mean systolic blood pressure",
"Mean diastolic blood pressure",
"Semi-difference between Home and Clinic mean systolic blood pressure",
"Semi-difference between Home and Clinic mean diastolic blood pressure",
"Standard deviation of the systolic blood pressure taken at home",
"Standard deviation of the diastolic blood pressure taken at home",
"Standard deviation of the systolic blood pressure taken at the clinic",
"Standard deviation of the diastolic blood pressure taken at the clinic",
"Precision of the systolic blood pressure taken at home",
"Precision of the diastolic blood pressure taken at home",
"Precision of the systolic blood pressure taken at the clinic",
"Precision of the diastolic blood pressure taken at the clinic"))
```
```{r, echo=F, label= "runnumers",message=FALSE,warning=FALSE,results='asis'}
cat(kable(DF,col.names = c("Variable Name", "Support", "Description"),
format = "latex",
caption="Explanations of the different models simulated in this work, according to run number.",
escape = FALSE,
booktabs = TRUE,
linesep = "",
digits = 3) %>%
kable_styling(latex_options = c("hold_position","striped")) )
cat('\n')
```
Please note that the last four elements of this list, the precision values, were included only to check that the model gives consistent results when precision is used in place of standard deviation.
Note as well that the $\Delta$ covariates, representing the medium-term variability, enter into the log relative risk sum as **absolute values**.
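As a concrete (non-executed) illustration of how these covariates relate to the per-subject summaries computed in Appendix A, the following sketch assembles a covariate matrix from the `nhanesA` columns defined in the data-cleaning chunk; the centering by column means is our assumption here, standing in for the centering $\hat{\boldsymbol{X}}$ described below.
```{r covariate-sketch, eval=FALSE}
# Sketch only (not the production code): assemble the covariate matrix X for the
# mean-BP models, entering the Delta covariates as absolute values, and centre it.
X <- with(nhanesA, cbind(
  M_S      = meansys,       # mean systolic BP
  M_D      = meandias,      # mean diastolic BP
  Delta_S  = abs(sysDel),   # |semi-difference| of home vs clinic, systolic
  Delta_D  = abs(diasDel),  # |semi-difference| of home vs clinic, diastolic
  sigma_SH = sdsysH, sigma_DH = sddiasH,   # home SDs
  sigma_SC = sdsysC, sigma_DC = sddiasC))  # clinic SDs
X_hat <- colMeans(X, na.rm = TRUE)  # centering vector, playing the role of X-hat
X_c   <- sweep(X, 2, X_hat)         # centred covariates used in the linear predictor
```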
For the parametrization of this model, we assume that the Gompertz parameters and the parameters in the linear predictor term have prior distributions as follows:
\begin{equation}\label{priorsS}
\begin{aligned}
\boldsymbol{B}\sim\mathcal{N}(\mu_B,\sigma_B),\\
\boldsymbol{\theta}\sim\mathcal{N}(\mu_\theta,\sigma_\theta),\\
\boldsymbol{\beta}\sim \mathcal{N}(\mu_\beta,\sigma_\beta),
\end{aligned}
\end{equation}
The likelihood for this Gompertz proportional hazards model, over all individuals in the census, is as follows:
\begin{equation}\label{likesurv}
L_S(\boldsymbol{v},\boldsymbol{\delta})=\prod_{i=1}^N f(v_i,\delta_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}})=\left(\prod_{i=1}^N h(v_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}})^{\delta_i}\right) \exp{\left( -\sum_{i=1}^N H(v_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}}) \right)},
\end{equation}
with $H(v)=\int_0^v h(w) \mathrm{d}w$ the cumulative hazard.
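To make equations \ref{gompertz}--\ref{likesurv} concrete, here is a minimal R sketch of the Gompertz proportional-hazards log-likelihood for a single demographic group; the function names and vectorization are ours for illustration, not the Stan code used for the actual fits.
```{r gompertz-sketch, eval=FALSE}
# Baseline Gompertz hazard and its cumulative hazard, for age-at-interview x
# and time-in-study t: h0(t) = B exp(theta (x + t)).
gompertz_haz <- function(t, x, B, theta) B * exp(theta * (x + t))
gompertz_cumhaz <- function(t, x, B, theta)
  (B / theta) * (exp(theta * (x + t)) - exp(theta * x))
# Log-likelihood of equation (likesurv): v = follow-up time, delta = event
# indicator, lp = linear predictor beta . (X_i - X_hat).
loglik_surv <- function(v, delta, x, lp, B, theta) {
  h <- gompertz_haz(v, x, B, theta) * exp(lp)     # hazard at the observed time
  H <- gompertz_cumhaz(v, x, B, theta) * exp(lp)  # cumulative hazard to that time
  sum(delta * log(h) - H)
}
```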
### Longitudinal Modelling
The mortality hazard rates are assumed to be influenced by individual-level blood pressure means and variability characteristics.
These characteristics are not directly observed, but are inferred from their influence on the individual blood pressure measurements, which have been observed.
Let $Y_i(t_j)$ be the observed blood pressure for subject $i$ at time $t_j$, for individuals $i\in \{1,2,...,N\}$ and measurements $j\in \{1,2,...,k\}$ per individual. Because the blood pressure measurements were taken both at home and at the clinic (written using subscripts H and C, respectively), with approximately 6 months between the two visits, we model the diastolic $Y_{i}^D$ and systolic $Y_{i}^S$ blood pressure as Gaussian-distributed:
\begin{equation}\label{bp}
\begin{aligned}
(Y_{i}^D)_{H} \sim \mathcal{N}(M_i^D+\Delta_i^D,(\sigma_i^D)_H),\\
(Y_{i}^D)_{C} \sim \mathcal{N}(M_i^D-\Delta_i^D,(\sigma_i^D)_C),\\
(Y_{i}^S)_{H} \sim \mathcal{N}(M_i^S+\Delta_i^S,(\sigma_i^S)_H),\\
(Y_{i}^S)_{C} \sim \mathcal{N}(M_i^S-\Delta_i^S,(\sigma_i^S)_C),
\end{aligned}
\end{equation}
where superscripts $D$ and $S$ refer to diastolic and systolic blood pressure, respectively.
The blood pressure characteristics --- the individual-level parameters --- are themselves distributed according to a hierarchical model, determined by population-level parameters (also called ``hyperparameters''):
\begin{equation}\label{priorsL}
\begin{aligned}
M_i^{\{D,S\}}\sim \mathcal{N}(\mu_M^{\{D,S\}},\sigma_M^{\{D,S\}}),\\
\Delta_i^{\{D,S\}}\sim \mathcal{N}(\mu_D^{\{D,S\}},\sigma_D^{\{D,S\}}),\\
\sigma_{i,C}^{\{D,S\}}\sim \Gamma(r_C^{\{D,S\}},\lambda_C^{\{D,S\}}),\\
\sigma_{i,H}^{\{D,S\}}\sim \Gamma(r_H^{\{D,S\}},\lambda_H^{\{D,S\}}).
\end{aligned}
\end{equation}
The longitudinal outcome modelling therefore aims to infer these hyperparameters
\begin{equation}
\Theta=\left\{\mu_M^{\{D,S\}},\mu_D^{\{D,S\}},\sigma_M^{\{D,S\}},\sigma_D^{\{D,S\}},r_C^{\{D,S\}},\lambda_C^{\{D,S\}},r_H^{\{D,S\}},\lambda_H^{\{D,S\}}\right\},
\end{equation}
and to use the implied uncertainty about the individual-level parameters to inform the inference about the survival parameters.
The likelihood for the longitudinal measurements is therefore (combining the systolic and diastolic into a single parameter for simplicity):
\begin{equation}\label{likelong}
L_L(\Theta|Y)=\prod_{i=1}^N\left(\prod_{j=1}^{k}f(y_{ij}|M_i,\Delta_i,\sigma_i)\right)f(M_i|\mu_M,\sigma_M)f(\Delta_i|\mu_D,\sigma_D)f(\tau_{i,C}|r_C,\lambda_C)f(\tau_{i,H}|r_H,\lambda_H)
\end{equation}
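As a sanity check on the notation in equations \ref{bp} and \ref{priorsL}, the following sketch simulates one subject's three home and three clinic systolic readings from the hierarchical model; all hyperparameter values below are arbitrary placeholders, not fitted values.
```{r longitudinal-sketch, eval=FALSE}
# Simulate one subject's systolic readings under the hierarchical model.
set.seed(1)
k <- 3                                      # measurements per visit
M_i      <- rnorm(1, mean = 125, sd = 18)   # individual mean level (mu_M, sigma_M)
Delta_i  <- rnorm(1, mean = 2,   sd = 5)    # home/clinic semi-difference (mu_D, sigma_D)
sigma_iH <- rgamma(1, shape = 4, rate = 1)  # home within-visit SD (r_H, lambda_H)
sigma_iC <- rgamma(1, shape = 4, rate = 1)  # clinic within-visit SD (r_C, lambda_C)
Y_home   <- rnorm(k, mean = M_i + Delta_i, sd = sigma_iH)
Y_clinic <- rnorm(k, mean = M_i - Delta_i, sd = sigma_iC)
```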
### Combined Hierarchical Model
Combining the longitudinal outcome and time-to-event partial likelihoods, and for a given parameter space value of $\Omega=\{\beta,B,\theta\}\cup \Theta$, the joint likelihood is
\begin{equation}
\begin{split}
L(\Omega|Y)=\prod_{i=1}^N\left(\prod_{j=1}^{k}f(y_{ij}|M_i,\Delta_i,\sigma_i)\right)f&(v_i,\delta_i|B_i,\theta_i,\beta_i,\boldsymbol{X},\hat{\boldsymbol{X}})f(M_i|\mu_M,\sigma_M)\\
&f(\Delta_i|\mu_D,\sigma_D)f(\tau_{i,C}|r_C,\lambda_C)f(\tau_{i,H}|r_H,\lambda_H).
\end{split}
\end{equation}
One approach to estimating the complete set of hyperparameters
\begin{equation}
\Omega_H=\{\mu_B,\sigma_B,\mu_\theta,\sigma_\theta,\mu_\beta,\sigma_\beta,\mu_M^{\{D,S\}},\sigma_M^{\{D,S\}},\mu_D^{\{D,S\}},\sigma_D^{\{D,S\}},r_C^{\{D,S\}},\lambda_C^{\{D,S\}},r_H^{\{D,S\}},\lambda_H^{\{D,S\}}\}
\end{equation}
is to impose a higher-level prior distribution, and use the machinery of Bayesian inference to produce posteriors for everything.
This approach runs into computational difficulties, which have led us to a two-stage `empirical Bayes' approach, where the hyperparameters for the longitudinal model are first fixed by a maximum-likelihood calculation, after which the remaining hyperparameters and individual-level parameters can be estimated with Bayesian machinery.
For the time-to-event parameters we choose flat hyperpriors, selecting the hyperparameters $\mu_B=\mu_\theta=\mu_\beta=0$, $\sigma_B=\sigma_\theta=2$, and $\sigma_\beta=100$.
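For illustration, here is a sketch of what the first-stage maximum-likelihood step could look like for one block of longitudinal hyperparameters (the Gamma law for the clinic systolic SDs); `MASS::fitdistr` is used here only as an example routine and is not necessarily what the production pipeline calls.
```{r eb-sketch, eval=FALSE}
# First-stage (empirical Bayes) sketch: fit the Gamma hyperparameters for the
# clinic systolic SDs by maximum likelihood, using the per-subject SDs from Appendix A.
sd_obs <- nhanesA$sdsysC[nhanesA$sdsysC > 0]  # Gamma support excludes exact zeros
fit <- MASS::fitdistr(sd_obs, densfun = "gamma")
fit$estimate  # shape (r_C) and rate (lambda_C) for sigma_{S,C}
```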
### The modelling variants
In this article, we investigated 16 variants of the model-fitting problem, but focussed mainly on 8 of them.
The 8 main models use the standard deviation, $\sigma$, as the measure of the influence of blood-pressure variability on mortality.
We also produced the same 8 models but using precision, $\tau=1/\sigma^2$, as the measure of the influence of blood-pressure variability on mortality.
However, these were fitted only to confirm that the choice of one measure over the other made no material difference.
Throughout the remainder of this appendix, we refer to the 8 main models using the following run numbers:
\begin{enumerate}
\item All participants (14,654), using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as death specifically from CVD.
\item All participants (14,654), using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as all-causes of death.
\item Only participants who had data from which FRS values could be computed (N=9,008) --- the ``FRS population'' --- but using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as death specifically from CVD.
\item FRS population, but using mean systolic and diastolic blood pressure (not FRS) in the linear predictor term, with the outcome data as all-causes of death.
\item FRS population, and using the FRS ATP-III value in the linear predictor term, with the outcome data as death specifically from CVD.
\item FRS population, and using the FRS ATP-III value in the linear predictor term, with the outcome data as all-causes of death.
\item FRS population, and using the FRS 1998-version value in the linear predictor term, with the outcome data as death specifically from CVD.
\item FRS population, and using the FRS 1998-version value in the linear predictor term, with the outcome data as all-causes of death.
\label{tab:runnums}
\end{enumerate}
We also include Directed Acyclical Graph (DAG) sketches to help visualize the different models, as shown in figures \ref{fig:DAGmean} and \ref{fig:DAGFRS}.
In order to read the DAGs, note that each square background layer appearing as a stack of layers represents repeated measured outcomes from the first wave of the survey.
Measured outcome variables are represented by square text boxes, and model parameters by circular text boxes. If a square or circular text box is placed on top of a stacked rectangular layer, this means that multiple values of that variable (as many as there are layers in the stack) are either measured (for outcome variables) or simulated (for model parameters). The number of layers in a stack is written in an unframed text box displayed on top of the stack it describes, for example $i=1,...,N$.
Finally, the direction of the arrows indicates the causal structure assumed in the model.
The distributions of the blood pressure parameters in the population are derived from the model, and are summarised together with other model outputs in Table \ref{tab:Mean-SD} of Appendix C.
![An illustration of the DAG of the mean blood pressure-based model presented in this article. ](./DAG_Mean2.png){#fig:DAGmean}
![An illustration of the DAG of the FRS-based model presented in this article. ](./DAG_FRS2.png){#fig:DAGFRS}
<!--
For convenience, we provide an overview of the different blood pressure values in the full and FRS populations in tables \ref{tab:BloodFull} and \ref{tab:BloodFRS}.
```{r BloodFull,echo=F,message=FALSE,warning=FALSE}
Blood<-read_csv("./Results/BloodVals.csv",show_col_types = FALSE)
kable(Blood[Blood$Population=="Full",-1] ,
format="latex", escape = F,booktabs = T, caption = "Parameters for distribution of blood pressure, for the full population") %>%
row_spec(4, extra_latex_after = "\\addlinespace")
```
```{r BloodFRS,echo=F,message=FALSE,warning=FALSE}
Blood<-read_csv("./Results/BloodVals.csv",show_col_types = FALSE)
kable(Blood[Blood$Population=="FRS",-1],
format="latex", escape = F,booktabs = T, caption = "Parameters for distribution of blood pressure, for the FRS population") %>%
row_spec(4, extra_latex_after = "\\addlinespace")
```