From 90e4f6f15b21aeaacf1495b7df0ef8ae59497403 Mon Sep 17 00:00:00 2001 From: claratsm Date: Wed, 22 Nov 2023 18:41:18 +0800 Subject: [PATCH] Add files via upload --- .../module4-genetics/M4-pop-genetics.Rmd | 263 ++++++++++-------- 1 file changed, 142 insertions(+), 121 deletions(-) diff --git a/notebooks/module4-genetics/M4-pop-genetics.Rmd b/notebooks/module4-genetics/M4-pop-genetics.Rmd index 4ff6863..18f990e 100644 --- a/notebooks/module4-genetics/M4-pop-genetics.Rmd +++ b/notebooks/module4-genetics/M4-pop-genetics.Rmd @@ -5,7 +5,7 @@ ### Part 1 -**Scenario:** You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits.! +**Scenario:** You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits. #### Questions for Discussion @@ -13,9 +13,9 @@ **Q1.** Besides the clinical measurements, what data do you need to collect from the subjects?
**Answers:** - * Sex -* Age -* Other confounding factors, e.g. BMI, blood pressure, smoking status, etc. + - Sex +- Age +- Other confounding factors, e.g. BMI, blood pressure, smoking status, etc.
**Q2.** How is genotype data represented for statistical genetic analysis? @@ -39,26 +39,26 @@ #### Hands-on exercise : Association test -Now, you are given a dataset of the twin cohort with a cardiovascular phenotype and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this -[GitHub Repo](https://github.com/StatBiomed/BMDS-book/blob/main/notebooks/module4-genetics/dataTwin.dat). +Now, you are given a dataset of age- and sex-matched twin cohort with two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this +[GitHub Repo](https://github.com/StatBiomed/BMDS-book/blob/main/notebooks/module4-genetics/dataTwin2023.dat). **The information for columns:** -* zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin -* T1QTL_A[1-5] and T2QTL_A[1-5]: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively -* The same 5 QTL (D1-D5) in dominance coding for T1 and T2 -* Phenotype scores of T1 and T2 +- zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin +- `T1QTL_A[1-5]` and `T2QTL_A[1-5]`: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively +- The same 5 QTL (D1-D5) in dominance coding for T1 and T2 +- Phenotype scores of T1 and T2 for the two quantitative cardiovascular traits **Download the data `dataTwin.dat` to your working directory.** Start the RStudio program and set the working directory. -```{r readDataR} -dataTwin <- read.table("dataTwin.dat",h=T) +```{r readDataR, eval=F} +dataTwin <- read.table("dataTwin2023.dat",h=T) ``` **Exploratory analysis** -* A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA). -* D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA). +- A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA). +- D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA). **Q1.** How many MZ and DZ volunteers are there? @@ -68,112 +68,86 @@ dataTwin <- read.table("dataTwin.dat",h=T) **Q4.** Are there outliers in phenotypes? -```{r genotypedataExploration, eval=F} -?table -# Write your codes here - -# Any outlier >4SD? -``` -```{r genotypedataExplorationR, echo=F, eval=F} +```{r genotypedataExplorationR, eval=F} table(dataTwin$zygosity) # Q1: shows number of MZ and DZ twin pairs table(dataTwin$T1QTL_A1) # Q2: shows the distribution of QTL_A1 table(dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_D1 -table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1) # Q3: shows the distribution of QTL_A1 in relation to QTL_D1 -any(dataTwin$phenoT1 < (mean(dataTwin$phenoT1) - 4*sd(dataTwin$phenoT1))) # Q4: any outlier >4 SD from the mean for T1 -any(dataTwin$phenoT1 > (mean(dataTwin$phenoT1) + 4*sd(dataTwin$phenoT1))) # Q4: any outlier >4 SD from the mean for T1 -cor(dataTwin[,2:11]) +table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_A1 in relation to QTL_D1 +cor(dataTwin[,2:11]) # Q3: shows the correlation between QTL_As +cor(dataTwin[,2:11])>0.2 +apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) }) # Q4: any outlier >4 SD from the mean for the two quantitative phenotypes ``` -**Q5.** Compute the genotype frequencies for each variant. Compute the allele frequencies. +**Association test** -```{r Freq} -# Write your codes here -``` -```{r FreqR, echo=F, eval=F} -T1_genoF <- apply(dataTwin[,2:6],2,table) # shows the genotype count of QTLs of T1 -T2_genoF <- apply(dataTwin[,7:11],2,table) # shows the genotype count of QTLs of T2 -T1_genoF/nrow(dataTwin) # shows the genotype frequency of QTLs of T1 -T2_genoF/nrow(dataTwin) # shows the genotype frequency of QTLs of T2 -(0.5*T1_genoF[2,]+T1_genoF[3,])/nrow(dataTwin) # shows the allele frequency of QTLs of T1 -(0.5*T2_genoF[2,]+T2_genoF[3,])/nrow(dataTwin) # shows the allele frequency of QTLs of T2 -``` +Test for association between QTL and pheno1 for T1 -Test for association between variants and phenotype for T1 -* Regress "phenoT1" on T1QTL_A1 to estimate the proportion of variance explained (R2). -* Model: phenoT1 = b0 + b1* T1QTL_A1 + e -* Calculate the conditonal mean of phenotype (i.e. phenotypic mean conditional genotype) +- Regress `pheno1_T1` on `T1QTL_A1` to estimate the proportion of variance explained (R2). +- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + e +- Calculate the conditional mean of phenotype (i.e. phenotypic mean conditional genotype) -If the relationship between the QTL and the phenotype is perfectly linear, the regression line -should pass through the conditional means (c_means), and the differences between the conditional means should +If the relationship between the QTL and the phenotype is perfectly linear, the regression line should pass through the conditional means (c_means), and the differences between the conditional means should be about equal. -**Q6.** What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)? - -**Q7.** What is the proportion of phenotypic variance explained? - -```{r lmA} -?lm -?glm -# Write your codes here +**Q5.** What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)? -# compute the conditional mean (c_means) +**Q6.** What is the proportion of phenotypic variance explained? -# visualize the fitting - -``` -```{r lmAR, echo=FALSE, eval=F} -linA1 <- lm(phenoT1~T1QTL_A1, data=dataTwin) +```{r lmAR, eval=F} +linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin) summary(linA1) summary(linA1)$r.squared # proportion of explained variance by additive component -c_means <- by(dataTwin$phenoT1,dataTwin$T1QTL_A1,mean) -plot(dataTwin$phenoT1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7)) -abline(linA1, lwd=3) +c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean) +plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7)) lines(c(0,1,2), c_means, type="p", col=6, lwd=8) +lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3) ``` -To test this "linearity", we can use the dominant coding of the QTL and add the dominance term to the regression model. -* Model: phenoT1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e -* Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype -* Repeat for T2. +To test this "linearity", we can use the dominance coding of the QTL and add the dominance term to the regression model. -**Q8.** Why can't we analyse T1 and T2 together? +- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e +- Repeat for T2. -**Q9.** Is there a dominance effect? +**Q7.** Why can't we analyse T1 and T2 together? -```{r lmAD} -# Write your codes here +**Q8.** Is there a dominance effect? -# visualize the fitting - -``` -```{r lmADR, echo=FALSE, eval=F} -linAD1 <- lm(phenoT1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin) +```{r lmADR, eval=F} +linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin) summary(linA1) # results lm(phenoT1~T1QTL_A1) summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1) -plot(dataTwin$phenoT1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7)) +plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7)) abline(linA1, lwd=3) lines(c(0,1,2), c_means, type='p', col=6, lwd=8) -lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col=3, lwd=3) +lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3) +lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3) ``` -If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q3) and interpret the result. +**Q9.** Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype T1 -**Q10.** What are the odds ratio, p-value, and 95% confidence interval? - -```{r logisticAD} -?quantile -?seq +```{r lmAllQTLR, eval=F} +allQTL_A_T1 <- 2:6 +cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1") +## Additive +cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) })) +## Dominance +cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) })) -# odds ratio -# p-value -# 95% confidence interval? +#Q9: QTL3 shows the strongest association with P=7.771588e-25 +linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin) +summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1) ``` -```{r logisticADR, echo=FALSE, eval=F} -quant05 <- quantile(c(dataTwin$phenoT1,dataTwin$phenoT2),seq(0,1,0.05)) -dataTwin$CaseT1 <- as.numeric(dataTwin$phenoT1>quant05[20]) -dataTwin$CaseT2 <- as.numeric(dataTwin$phenoT2>quant05[20]) + +If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q9) and interpret the result. + +**Q10.** What are the odds ratio, p-value, and 95% confidence interval? + +```{r logisticADR, eval=F} +quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05)) +dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20]) +dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20]) logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial")) exp(logisticAD1$coefficients[2,1]) # odds ratio exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2]) # lower 95% confidence interval @@ -182,34 +156,66 @@ exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2]) # upper 9 ### Part 2 -**Scenario:** You are asked to estimate the additive and dominance genetic variance explained using regression-based method and a classical twin design. +**Scenario:** You are asked to estimate the additive genetic variance, dominance genetic variance and/or shared environmental variance using regression-based method and a classical twin design. + +\begin{align*} + +\text{For ADE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{D} + \sigma^{2}_{E}\\ -Given standardization of phenotype and the ADE model, we assume that the phenotypic variance equals +\text{For ACE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{C} + \sigma^{2}_{E}, \quad \text{where} \\ -$\sigma^{2}_{P}$ = $\sigma^{2}_{A}$ + $\sigma^{2}_{D}$ + $\sigma^{2}_{E}$, where +\sigma^{2}_{P} & \text{ is the phenotypic variance}, \\ -$\sigma^{2}_{P}$ is the phenotypic variance, +\sigma^{2}_{A} & \text{ is additive genetic variance}, \\ -$\sigma^{2}_{A}$ is additive genetic variance, +\sigma^{2}_{D} & \text{ is dominance genetic variance}, \\ -$\sigma^{2}_{D}$ is dominance genetic variance, and +\sigma^{2}_{C} & \text{ is shared environmental variance, and} \\ -$\sigma^{2}_{E}$ is unshared environmental variance. +\sigma^{2}_{E} & \text{ is unshared environmental variance.} + +\end{align*} +For ADE model, - cov(MZ) = cor(MZ) = rMZ= $\sigma^{2}_{A}$ + $\sigma^{2}_{D}$ +\begin{align*} - cov(DZ) = cor(DZ) = rDZ = 0.5 * $\sigma^{2}_{A}$ + 0.25 * $\sigma^{2}_{D}$, where + cov(MZ) = cor(MZ) & = rMZ = \sigma^{2}_{A} + \sigma^{2}_{D} \\ + cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + 0.25 * \sigma^{2}_{D} \quad \text{ , where} \\ +\end{align*} the coefficients 1/2 and 1/4 are based on quantitative genetic theory (Mather & Jinks, 1971). By solving the unknowns, the Falconer's equations for the ADE model: -$\sigma^{2}_{A}$ = 4*rDZ - rMZ +\begin{align*} + +\sigma^{2}_{A} & = 4*rDZ - rMZ \\ + +\sigma^{2}_{D} & = 2*rMZ - 4*rDZ \\ +\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{D} \\ -$\sigma^{2}_{D}$ = 2*rMZ - 4*rDZ +\end{align*} + +For ACE model, + +\begin{align*} + + cov(MZ) = cor(MZ) & = rMZ= \sigma^{2}_{A} + \sigma^{2}_{C} \\ + cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + \sigma^{2}_{C} \quad \text{ , where} \\ + +\end{align*} + +By solving the unknowns, the Falconer's equations for the ACE model: + +\begin{align*} + +\sigma^{2}_{A} & = 2*(rMZ - rDZ) \\ +\sigma^{2}_{C} & = 2*rDZ - rMZ \\ +\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{C} = 1 - rMZ + +\end{align*} -$\sigma^{2}_{E}$ = 1 - $\sigma^{2}_{A}$ - $\sigma^{2}_{D}$ #### Questions for discussions : @@ -220,9 +226,9 @@ $\sigma^{2}_{E}$ = 1 - $\sigma^{2}_{A}$ - $\sigma^{2}_{D}$ * Suggested reading: * Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009 ;461(7265):747-753. doi:10.1038/nature08494 -#### Hands-on exercise : variance explained +#### Hands-on exercise : variance explained using regression-based method -**Q1.** What is the variance of the phenotype? +**Q1.** What is the variance of the phenotype? **Q2.** Compute the explained variance attributable to the additive genetic component of the QTL with strongest association in Part 1. @@ -238,8 +244,9 @@ R2 from the regression represents the proportion of phenotypic variance explaine * Total genetic: 0.03658*15.102 = 0.552 * Additive genetic: 0.02732*15.102 = 0.412 * Dominance genetic: 0.00926*15.102 = 0.140 + ```{r varianceR, eval=F} -var_pheno <- var(dataTwin$phenoT1) # the variance of the phenotype +var_pheno <- var(dataTwin$pheno1_T1) # the variance of the phenotype var_pheno summary(linAD1)$r.squared # proportion of explained variance by total genetic component summary(linA1)$r.squared # proportion of explained variance by additive component @@ -251,34 +258,29 @@ summary(linA1)$r.squared*var_pheno # (raw) variance component of additive genet **Q4.** Estimate the variance explained by all the QTL using linear regression. -```{r varianceAllQTLR, echo=FALSE, eval=F} +```{r varianceAllQTLR, eval=F} # compute for all 5 QTL -linAD5=(lm(phenoT1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 + - T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5, +linAD5=(lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 + + T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5, data=dataTwin)) -linA5=(lm(phenoT1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5, - data=dataTwin)) - -linD5=(lm(phenoT1 ~ T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5, - data=dataTwin)) - -summary(linAD5)$r.squared -summary(linA5)$r.squared -summary(linD5)$r.squared +summary(linAD5)$r.squared # proportion of explained variance by total genetic component +summary(linAD5)$r.squared*var_pheno # (raw) variance component of total genetic component ``` -Based on our regression results, we have estimates of the total genetic variance as well as the A and D components. In practice, this is impossible as we do not know all the variants associated with any polygenic trait. +#### Hands-on exercise : variance explained using a classical twin design. -Now, let's estimate the A (additive genetic) and D (dominance) variance with the classical twin design using Falconer's equations based on ADE model. +Based on our regression results, we have estimates of the total genetic variance as well as the A and D components for phenotype 1. In practice, it is impossible to know all the variants associated with any polygenic trait. + +Given `rMZ > 2*rDZ`, we can use Falconer's formula based on ADE model to estimate the A (additive genetic) and D (dominance) variance with the classical twin design for phenotype 1 without genotypes. **Q5.** Compute rMZ and rDZ. -**Q6.** Estimate the additive and dominance genetic variances using the Falconer's equations for the ADE model. +**Q6.** Estimate the proportion of additive and dominance genetic variances using the Falconer's equations for the ADE model. -```{r Falconer, echo=FALSE, eval=F} -dataMZ = dataTwin[dataTwin$zygosity==1, c('phenoT1', 'phenoT2')] # MZ data frame -dataDZ = dataTwin[dataTwin$zygosity==2, c('phenoT1', 'phenoT2')] # DZ data frame +```{r FalconerADE, eval=F} +dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno1_T1', 'pheno1_T2')] # MZ data frame +dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno1_T1', 'pheno1_T2')] # DZ data frame rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix @@ -287,4 +289,23 @@ sA2 = 4*rDZ - rMZ sD2 = 2*rMZ - 4*rDZ sE2 = 1 - sA2 - sD2 print(c(sA2, sD2, sE2)) +``` + +Similarly, for phenotype 2, we can estimate the proportion of additive and/or dominance genetic variances as well as shared environmental variance using the Falconer's formula. + +**Q7.** Which model (ACE or ADE) should be considered for phenotype 2? + +**Q8.** Estimate the proportion of A, C/D and E variance components for phenotype 2. + +```{r FalconerACE, eval=F} +dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno2_T1', 'pheno2_T2')] # MZ data frame +dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno2_T1', 'pheno2_T2')] # DZ data frame + +rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix +rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix + +sA2 = 2*(rMZ - rDZ) +sC2 = 2*rDZ - rMZ +sE2 = 1 - rMZ +print(c(sA2, sC2, sE2)) ``` \ No newline at end of file