Now, you are given a dataset of age- and sex-matched twin cohort with two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this
-GitHub Repo.
+
Hands-on exercise : Association test
+
Now, you are given a dataset of age- and sex-matched twin cohort with
+two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data
+set and template notebook are available on Moodle (recommended) and also
+on this GitHub
+Repo.
The information for columns:
- zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin
-T1QTL_A[1-5]
and T2QTL_A[1-5]
: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively
+T1QTL_A[1-5]
and T2QTL_A[1-5]
: 5
+quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2
+(T2) respectively
- The same 5 QTL (D1-D5) in dominance coding for T1 and T2
-- Phenotype scores of T1 and T2 for the two quantitative cardiovascular traits
+- Phenotype scores of T1 and T2 for the two quantitative
+cardiovascular traits
-
Download the data dataTwin.dat
to your working directory. Start the RStudio program and set the working directory.
-
dataTwin <- read.table("dataTwin2023.dat",h=T)
+
Download the data dataTwin.dat
to your working
+directory. Start the RStudio program and set the working
+directory.
+
dataTwin <- read.table("dataTwin2023.dat",h=T)
Exploratory analysis
-- A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
-- D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA).
+- A1-5: The QTLs are biallelic with two alleles A and a. The genotypes
+aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
+- D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0
+(AA).
+
+
Q1. How many MZ and DZ volunteers are there?
+
+
+Answers:
+
+
-Q1. How many MZ and DZ volunteers are there?
-Q2. How are the genotypes represented?
-Q3. Are the QTL independent of each other?
-Q4. Are there outliers in phenotypes?
-table(dataTwin$zygosity) # Q1: shows number of MZ and DZ twin pairs
-table(dataTwin$T1QTL_A1) # Q2: shows the distribution of QTL_A1
-table(dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_D1
-table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
-cor(dataTwin[,2:11]) # Q3: shows the correlation between QTL_As
-cor(dataTwin[,2:11])>0.2
-apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) }) # Q4: any outlier >4 SD from the mean for the two quantitative phenotypes
+
+
Q2. How are the genotypes represented?
+
+
+Answers:
+
+
+- Dosage/Count of non-reference alleles : 0, 1, and 2
+
+
+
Q3. Are the QTL independent of each other?
+
+
+Answers:
+
+
+- Yes. The pairwise correlation is low (<0.2).
+
+
+
Q4. Are there outliers in phenotypes?
+
+
+Answers:
+
+
+- Yes. T2 individual 1303 has phenotype score (-4.x) below 4 SD from
+the mean.
+
+
+
table(dataTwin$zygosity) # Q1: shows number of MZ and DZ twin pairs
+
##
+## 1 2
+## 1000 1000
+
table(dataTwin$T1QTL_A1) # Q2: shows the distribution of QTL_A1
+
##
+## 0 1 2
+## 474 1021 505
+
table(dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_D1
+
##
+## 0 1
+## 979 1021
+
table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
+
##
+## 0 1
+## 0 474 0
+## 1 0 1021
+## 2 505 0
+
cor(dataTwin[,2:11]) # Q3: shows the correlation between QTL_As
+
## T1QTL_A1 T1QTL_A2 T1QTL_A3 T1QTL_A4 T1QTL_A5
+## T1QTL_A1 1.00000000 -0.005470340 0.021705688 0.01940408 0.016278190
+## T1QTL_A2 -0.00547034 1.000000000 0.017344822 -0.01421677 -0.008678746
+## T1QTL_A3 0.02170569 0.017344822 1.000000000 0.01335711 -0.036751338
+## T1QTL_A4 0.01940408 -0.014216767 0.013357109 1.00000000 0.074899996
+## T1QTL_A5 0.01627819 -0.008678746 -0.036751338 0.07490000 1.000000000
+## T2QTL_A1 0.53243815 0.004201635 -0.013909013 0.03252724 0.020081970
+## T2QTL_A2 -0.04561174 0.464131160 -0.005044127 0.01324172 -0.003012277
+## T2QTL_A3 0.03316574 -0.003552831 0.521253656 0.02045423 0.009081830
+## T2QTL_A4 0.03271254 -0.033419904 0.020422583 0.48641289 0.019247531
+## T2QTL_A5 -0.01285323 0.030413269 -0.045121964 0.08288145 0.457962222
+## T2QTL_A1 T2QTL_A2 T2QTL_A3 T2QTL_A4 T2QTL_A5
+## T1QTL_A1 0.532438150 -0.045611740 0.033165736 0.032712539 -0.01285323
+## T1QTL_A2 0.004201635 0.464131160 -0.003552831 -0.033419904 0.03041327
+## T1QTL_A3 -0.013909013 -0.005044127 0.521253656 0.020422583 -0.04512196
+## T1QTL_A4 0.032527239 0.013241725 0.020454234 0.486412895 0.08288145
+## T1QTL_A5 0.020081970 -0.003012277 0.009081830 0.019247531 0.45796222
+## T2QTL_A1 1.000000000 0.006179257 -0.013129314 0.048294183 -0.01325839
+## T2QTL_A2 0.006179257 1.000000000 -0.020860987 0.002164782 -0.01131418
+## T2QTL_A3 -0.013129314 -0.020860987 1.000000000 -0.010583797 -0.02101270
+## T2QTL_A4 0.048294183 0.002164782 -0.010583797 1.000000000 0.04350925
+## T2QTL_A5 -0.013258394 -0.011314179 -0.021012699 0.043509251 1.00000000
+
cor(dataTwin[,2:11])>0.2
+
## T1QTL_A1 T1QTL_A2 T1QTL_A3 T1QTL_A4 T1QTL_A5 T2QTL_A1 T2QTL_A2
+## T1QTL_A1 TRUE FALSE FALSE FALSE FALSE TRUE FALSE
+## T1QTL_A2 FALSE TRUE FALSE FALSE FALSE FALSE TRUE
+## T1QTL_A3 FALSE FALSE TRUE FALSE FALSE FALSE FALSE
+## T1QTL_A4 FALSE FALSE FALSE TRUE FALSE FALSE FALSE
+## T1QTL_A5 FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## T2QTL_A1 TRUE FALSE FALSE FALSE FALSE TRUE FALSE
+## T2QTL_A2 FALSE TRUE FALSE FALSE FALSE FALSE TRUE
+## T2QTL_A3 FALSE FALSE TRUE FALSE FALSE FALSE FALSE
+## T2QTL_A4 FALSE FALSE FALSE TRUE FALSE FALSE FALSE
+## T2QTL_A5 FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## T2QTL_A3 T2QTL_A4 T2QTL_A5
+## T1QTL_A1 FALSE FALSE FALSE
+## T1QTL_A2 FALSE FALSE FALSE
+## T1QTL_A3 TRUE FALSE FALSE
+## T1QTL_A4 FALSE TRUE FALSE
+## T1QTL_A5 FALSE FALSE TRUE
+## T2QTL_A1 FALSE FALSE FALSE
+## T2QTL_A2 FALSE FALSE FALSE
+## T2QTL_A3 TRUE FALSE FALSE
+## T2QTL_A4 FALSE TRUE FALSE
+## T2QTL_A5 FALSE FALSE TRUE
+
apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) }) # Q4: any outlier < 4 SD from the mean for the two quantitative phenotypes
+
## pheno1_T1 pheno1_T2 pheno2_T1 pheno2_T2
+## FALSE FALSE FALSE TRUE
+
apply(dataTwin[22:25],2,function(x){ any(x > (mean(x) + 4*sd(x))) }) # Q4: any outlier > 4 SD from the mean for the two quantitative phenotypes
+
## pheno1_T1 pheno1_T2 pheno2_T1 pheno2_T2
+## FALSE FALSE FALSE FALSE
+
# remove the phenotype score of the outlier (T2) for the phenotype 2 (pheno2_T2)
+outlier<- which(dataTwin$pheno2_T2 < (mean(dataTwin$pheno2_T2) - 4*sd(dataTwin$pheno2_T2) ))
+outlier
+
## [1] 1303
+
dataTwin$pheno2_T2[outlier]
+
## [1] -4.21
+
dataTwin$pheno2_T2[outlier] <- NA
Association test
Test for association between QTL and pheno1 for T1
-- Regress
pheno1_T1
on T1QTL_A1
to estimate the proportion of variance explained (R2).
+- Regress
pheno1_T1
on T1QTL_A1
to estimate
+the proportion of variance explained (R2).
- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + e
-- Calculate the conditional mean of phenotype (i.e. phenotypic mean conditional genotype)
+- Calculate the conditional mean of phenotype (i.e. phenotypic mean
+conditional genotype)
-
If the relationship between the QTL and the phenotype is perfectly linear, the regression line should pass through the conditional means (c_means), and the differences between the conditional means should
-be about equal.
-
Q5. What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)?
-
Q6. What is the proportion of phenotypic variance explained?
-
linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
-summary(linA1)
-summary(linA1)$r.squared # proportion of explained variance by additive component
-
-c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
-plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
-lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
-lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
-
To test this “linearity”, we can use the dominance coding of the QTL and add the dominance term to the regression model.
+
If the relationship between the QTL and the phenotype is perfectly
+linear, the regression line should pass through the conditional means
+(c_means), and the differences between the conditional means should be
+about equal.
+
Q5. What are the values of b0, b1? Is QTL1
+significant associated with the phenotype at alpha<0.01 (multiple
+testing of 5 loci)?
+
Q6. What is the proportion of phenotypic variance
+explained?
+
linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
+summary(linA1)
+summary(linA1)$r.squared # proportion of explained variance by additive component
+
+c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
+plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
+lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
+lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
+
To test this “linearity”, we can use the dominance coding of the QTL
+and add the dominance term to the regression model.
- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e
- Repeat for T2.
Q7. Why can’t we analyse T1 and T2 together?
Q8. Is there a dominance effect?
-
linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
-summary(linA1) # results lm(phenoT1~T1QTL_A1)
-summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
-
-plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
-abline(linA1, lwd=3)
-lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
-lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
-lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
-
Q9. Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype T1
-
allQTL_A_T1 <- 2:6
-cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
-## Additive
-cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
-## Dominance
-cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
-
-#Q9: QTL3 shows the strongest association with P=7.771588e-25
-linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
-summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
-
If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q9) and interpret the result.
-
Q10. What are the odds ratio, p-value, and 95% confidence interval?
-
quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05))
-dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20])
-dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20])
-logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial"))
-exp(logisticAD1$coefficients[2,1]) # odds ratio
-exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2]) # lower 95% confidence interval
-exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2]) # upper 95% confidence interval
+
linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
+summary(linA1) # results lm(phenoT1~T1QTL_A1)
+summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
+
+plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
+abline(linA1, lwd=3)
+lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
+lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
+lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
+
Q9. Repeat for the other 4 QTL and determine which
+QTL shows strongest association with the phenotype T1
+
allQTL_A_T1 <- 2:6
+cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
+## Additive
+cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
+## Dominance
+cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
+
+#Q9: QTL3 shows the strongest association with P=7.771588e-25
+linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
+summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
+
If the subjects with top 5% of the phenotype score are considered as
+cases, perform case-control association test for most significant SNP
+(from Q9) and interpret the result.
+
Q10. What are the odds ratio, p-value, and 95%
+confidence interval?
+
quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05))
+dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20])
+dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20])
+logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial"))
+exp(logisticAD1$coefficients[2,1]) # odds ratio
+exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2]) # lower 95% confidence interval
+exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2]) # upper 95% confidence interval