Skip to content

Commit

Permalink
Add files via upload
Browse files Browse the repository at this point in the history
  • Loading branch information
claratsm authored Nov 22, 2023
1 parent fe20d46 commit 90e4f6f
Showing 1 changed file with 142 additions and 121 deletions.
263 changes: 142 additions & 121 deletions notebooks/module4-genetics/M4-pop-genetics.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -5,17 +5,17 @@

### Part 1

**Scenario:** You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits.!
**Scenario:** You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits.


#### Questions for Discussion

**Q1.** Besides the clinical measurements, what data do you need to collect from the subjects?
<details>
<summary>**Answers:** </summary>
* Sex
* Age
* Other confounding factors, e.g. BMI, blood pressure, smoking status, etc.
- Sex
- Age
- Other confounding factors, e.g. BMI, blood pressure, smoking status, etc.
</details>

**Q2.** How is genotype data represented for statistical genetic analysis?
Expand All @@ -39,26 +39,26 @@

#### Hands-on exercise : Association test

Now, you are given a dataset of the twin cohort with a cardiovascular phenotype and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this
[GitHub Repo](https://github.com/StatBiomed/BMDS-book/blob/main/notebooks/module4-genetics/dataTwin.dat).
Now, you are given a dataset of age- and sex-matched twin cohort with two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this
[GitHub Repo](https://github.com/StatBiomed/BMDS-book/blob/main/notebooks/module4-genetics/dataTwin2023.dat).

**The information for columns:**

* zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin
* T1QTL_A[1-5] and T2QTL_A[1-5]: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively
* The same 5 QTL (D1-D5) in dominance coding for T1 and T2
* Phenotype scores of T1 and T2
- zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin
- `T1QTL_A[1-5]` and `T2QTL_A[1-5]`: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively
- The same 5 QTL (D1-D5) in dominance coding for T1 and T2
- Phenotype scores of T1 and T2 for the two quantitative cardiovascular traits

**Download the data `dataTwin.dat` to your working directory.** Start the RStudio program and set the working directory.

```{r readDataR}
dataTwin <- read.table("dataTwin.dat",h=T)
```{r readDataR, eval=F}
dataTwin <- read.table("dataTwin2023.dat",h=T)
```

**Exploratory analysis**

* A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
* D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA).
- A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
- D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA).

**Q1.** How many MZ and DZ volunteers are there?

Expand All @@ -68,112 +68,86 @@ dataTwin <- read.table("dataTwin.dat",h=T)

**Q4.** Are there outliers in phenotypes?

```{r genotypedataExploration, eval=F}
?table
# Write your codes here
# Any outlier >4SD?
```
```{r genotypedataExplorationR, echo=F, eval=F}
```{r genotypedataExplorationR, eval=F}
table(dataTwin$zygosity) # Q1: shows number of MZ and DZ twin pairs
table(dataTwin$T1QTL_A1) # Q2: shows the distribution of QTL_A1
table(dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_D1
table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1) # Q3: shows the distribution of QTL_A1 in relation to QTL_D1
any(dataTwin$phenoT1 < (mean(dataTwin$phenoT1) - 4*sd(dataTwin$phenoT1))) # Q4: any outlier >4 SD from the mean for T1
any(dataTwin$phenoT1 > (mean(dataTwin$phenoT1) + 4*sd(dataTwin$phenoT1))) # Q4: any outlier >4 SD from the mean for T1
cor(dataTwin[,2:11])
table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1) # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
cor(dataTwin[,2:11]) # Q3: shows the correlation between QTL_As
cor(dataTwin[,2:11])>0.2
apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) }) # Q4: any outlier >4 SD from the mean for the two quantitative phenotypes
```

**Q5.** Compute the genotype frequencies for each variant. Compute the allele frequencies.
**Association test**

```{r Freq}
# Write your codes here
```
```{r FreqR, echo=F, eval=F}
T1_genoF <- apply(dataTwin[,2:6],2,table) # shows the genotype count of QTLs of T1
T2_genoF <- apply(dataTwin[,7:11],2,table) # shows the genotype count of QTLs of T2
T1_genoF/nrow(dataTwin) # shows the genotype frequency of QTLs of T1
T2_genoF/nrow(dataTwin) # shows the genotype frequency of QTLs of T2
(0.5*T1_genoF[2,]+T1_genoF[3,])/nrow(dataTwin) # shows the allele frequency of QTLs of T1
(0.5*T2_genoF[2,]+T2_genoF[3,])/nrow(dataTwin) # shows the allele frequency of QTLs of T2
```
Test for association between QTL and pheno1 for T1

Test for association between variants and phenotype for T1
* Regress "phenoT1" on T1QTL_A1 to estimate the proportion of variance explained (R2).
* Model: phenoT1 = b0 + b1* T1QTL_A1 + e
* Calculate the conditonal mean of phenotype (i.e. phenotypic mean conditional genotype)
- Regress `pheno1_T1` on `T1QTL_A1` to estimate the proportion of variance explained (R2).
- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + e
- Calculate the conditional mean of phenotype (i.e. phenotypic mean conditional genotype)

If the relationship between the QTL and the phenotype is perfectly linear, the regression line
should pass through the conditional means (c_means), and the differences between the conditional means should
If the relationship between the QTL and the phenotype is perfectly linear, the regression line should pass through the conditional means (c_means), and the differences between the conditional means should
be about equal.

**Q6.** What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)?

**Q7.** What is the proportion of phenotypic variance explained?

```{r lmA}
?lm
?glm
# Write your codes here
**Q5.** What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)?

# compute the conditional mean (c_means)
**Q6.** What is the proportion of phenotypic variance explained?

# visualize the fitting
```
```{r lmAR, echo=FALSE, eval=F}
linA1 <- lm(phenoT1~T1QTL_A1, data=dataTwin)
```{r lmAR, eval=F}
linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
summary(linA1)
summary(linA1)$r.squared # proportion of explained variance by additive component
c_means <- by(dataTwin$phenoT1,dataTwin$T1QTL_A1,mean)
plot(dataTwin$phenoT1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
abline(linA1, lwd=3)
c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
```

To test this "linearity", we can use the dominant coding of the QTL and add the dominance term to the regression model.
* Model: phenoT1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e
* Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype
* Repeat for T2.
To test this "linearity", we can use the dominance coding of the QTL and add the dominance term to the regression model.

**Q8.** Why can't we analyse T1 and T2 together?
- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e
- Repeat for T2.

**Q9.** Is there a dominance effect?
**Q7.** Why can't we analyse T1 and T2 together?

```{r lmAD}
# Write your codes here
**Q8.** Is there a dominance effect?

# visualize the fitting
```
```{r lmADR, echo=FALSE, eval=F}
linAD1 <- lm(phenoT1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
```{r lmADR, eval=F}
linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
summary(linA1) # results lm(phenoT1~T1QTL_A1)
summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
plot(dataTwin$phenoT1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
abline(linA1, lwd=3)
lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col=3, lwd=3)
lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
```

If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q3) and interpret the result.
**Q9.** Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype T1

**Q10.** What are the odds ratio, p-value, and 95% confidence interval?

```{r logisticAD}
?quantile
?seq
```{r lmAllQTLR, eval=F}
allQTL_A_T1 <- 2:6
cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
## Additive
cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
## Dominance
cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic; pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
# odds ratio
# p-value
# 95% confidence interval?
#Q9: QTL3 shows the strongest association with P=7.771588e-25
linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
```
```{r logisticADR, echo=FALSE, eval=F}
quant05 <- quantile(c(dataTwin$phenoT1,dataTwin$phenoT2),seq(0,1,0.05))
dataTwin$CaseT1 <- as.numeric(dataTwin$phenoT1>quant05[20])
dataTwin$CaseT2 <- as.numeric(dataTwin$phenoT2>quant05[20])

If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q9) and interpret the result.

**Q10.** What are the odds ratio, p-value, and 95% confidence interval?

```{r logisticADR, eval=F}
quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05))
dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20])
dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20])
logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial"))
exp(logisticAD1$coefficients[2,1]) # odds ratio
exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2]) # lower 95% confidence interval
Expand All @@ -182,34 +156,66 @@ exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2]) # upper 9

### Part 2

**Scenario:** You are asked to estimate the additive and dominance genetic variance explained using regression-based method and a classical twin design.
**Scenario:** You are asked to estimate the additive genetic variance, dominance genetic variance and/or shared environmental variance using regression-based method and a classical twin design.

\begin{align*}

\text{For ADE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{D} + \sigma^{2}_{E}\\

Given standardization of phenotype and the ADE model, we assume that the phenotypic variance equals
\text{For ACE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{C} + \sigma^{2}_{E}, \quad \text{where} \\

$\sigma^{2}_{P}$ = $\sigma^{2}_{A}$ + $\sigma^{2}_{D}$ + $\sigma^{2}_{E}$, where
\sigma^{2}_{P} & \text{ is the phenotypic variance}, \\

$\sigma^{2}_{P}$ is the phenotypic variance,
\sigma^{2}_{A} & \text{ is additive genetic variance}, \\

$\sigma^{2}_{A}$ is additive genetic variance,
\sigma^{2}_{D} & \text{ is dominance genetic variance}, \\

$\sigma^{2}_{D}$ is dominance genetic variance, and
\sigma^{2}_{C} & \text{ is shared environmental variance, and} \\

$\sigma^{2}_{E}$ is unshared environmental variance.
\sigma^{2}_{E} & \text{ is unshared environmental variance.}

\end{align*}

For ADE model,

cov(MZ) = cor(MZ) = rMZ= $\sigma^{2}_{A}$ + $\sigma^{2}_{D}$
\begin{align*}

cov(DZ) = cor(DZ) = rDZ = 0.5 * $\sigma^{2}_{A}$ + 0.25 * $\sigma^{2}_{D}$, where
cov(MZ) = cor(MZ) & = rMZ = \sigma^{2}_{A} + \sigma^{2}_{D} \\
cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + 0.25 * \sigma^{2}_{D} \quad \text{ , where} \\

\end{align*}
the coefficients 1/2 and 1/4 are based on quantitative genetic theory (Mather & Jinks, 1971).

By solving the unknowns, the Falconer's equations for the ADE model:

$\sigma^{2}_{A}$ = 4*rDZ - rMZ
\begin{align*}

\sigma^{2}_{A} & = 4*rDZ - rMZ \\

\sigma^{2}_{D} & = 2*rMZ - 4*rDZ \\
\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{D} \\

$\sigma^{2}_{D}$ = 2*rMZ - 4*rDZ
\end{align*}

For ACE model,

\begin{align*}

cov(MZ) = cor(MZ) & = rMZ= \sigma^{2}_{A} + \sigma^{2}_{C} \\
cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + \sigma^{2}_{C} \quad \text{ , where} \\

\end{align*}

By solving the unknowns, the Falconer's equations for the ACE model:

\begin{align*}

\sigma^{2}_{A} & = 2*(rMZ - rDZ) \\
\sigma^{2}_{C} & = 2*rDZ - rMZ \\
\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{C} = 1 - rMZ

\end{align*}

$\sigma^{2}_{E}$ = 1 - $\sigma^{2}_{A}$ - $\sigma^{2}_{D}$

#### Questions for discussions :

Expand All @@ -220,9 +226,9 @@ $\sigma^{2}_{E}$ = 1 - $\sigma^{2}_{A}$ - $\sigma^{2}_{D}$
* Suggested reading:
* Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009 ;461(7265):747-753. doi:10.1038/nature08494

#### Hands-on exercise : variance explained
#### Hands-on exercise : variance explained using regression-based method

**Q1.** What is the variance of the phenotype?
**Q1.** What is the variance of the phenotype?

**Q2.** Compute the explained variance attributable to the additive genetic component of the QTL with strongest association in Part 1.

Expand All @@ -238,8 +244,9 @@ R2 from the regression represents the proportion of phenotypic variance explaine
* Total genetic: 0.03658*15.102 = 0.552
* Additive genetic: 0.02732*15.102 = 0.412
* Dominance genetic: 0.00926*15.102 = 0.140

```{r varianceR, eval=F}
var_pheno <- var(dataTwin$phenoT1) # the variance of the phenotype
var_pheno <- var(dataTwin$pheno1_T1) # the variance of the phenotype
var_pheno
summary(linAD1)$r.squared # proportion of explained variance by total genetic component
summary(linA1)$r.squared # proportion of explained variance by additive component
Expand All @@ -251,34 +258,29 @@ summary(linA1)$r.squared*var_pheno # (raw) variance component of additive genet
**Q4.** Estimate the variance explained by all the QTL using linear regression.


```{r varianceAllQTLR, echo=FALSE, eval=F}
```{r varianceAllQTLR, eval=F}
# compute for all 5 QTL
linAD5=(lm(phenoT1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 +
T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
linAD5=(lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 +
T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
data=dataTwin))
linA5=(lm(phenoT1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5,
data=dataTwin))
linD5=(lm(phenoT1 ~ T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
data=dataTwin))
summary(linAD5)$r.squared
summary(linA5)$r.squared
summary(linD5)$r.squared
summary(linAD5)$r.squared # proportion of explained variance by total genetic component
summary(linAD5)$r.squared*var_pheno # (raw) variance component of total genetic component
```

Based on our regression results, we have estimates of the total genetic variance as well as the A and D components. In practice, this is impossible as we do not know all the variants associated with any polygenic trait.
#### Hands-on exercise : variance explained using a classical twin design.

Now, let's estimate the A (additive genetic) and D (dominance) variance with the classical twin design using Falconer's equations based on ADE model.
Based on our regression results, we have estimates of the total genetic variance as well as the A and D components for phenotype 1. In practice, it is impossible to know all the variants associated with any polygenic trait.

Given `rMZ > 2*rDZ`, we can use Falconer's formula based on ADE model to estimate the A (additive genetic) and D (dominance) variance with the classical twin design for phenotype 1 without genotypes.

**Q5.** Compute rMZ and rDZ.

**Q6.** Estimate the additive and dominance genetic variances using the Falconer's equations for the ADE model.
**Q6.** Estimate the proportion of additive and dominance genetic variances using the Falconer's equations for the ADE model.

```{r Falconer, echo=FALSE, eval=F}
dataMZ = dataTwin[dataTwin$zygosity==1, c('phenoT1', 'phenoT2')] # MZ data frame
dataDZ = dataTwin[dataTwin$zygosity==2, c('phenoT1', 'phenoT2')] # DZ data frame
```{r FalconerADE, eval=F}
dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno1_T1', 'pheno1_T2')] # MZ data frame
dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno1_T1', 'pheno1_T2')] # DZ data frame
rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
Expand All @@ -287,4 +289,23 @@ sA2 = 4*rDZ - rMZ
sD2 = 2*rMZ - 4*rDZ
sE2 = 1 - sA2 - sD2
print(c(sA2, sD2, sE2))
```

Similarly, for phenotype 2, we can estimate the proportion of additive and/or dominance genetic variances as well as shared environmental variance using the Falconer's formula.

**Q7.** Which model (ACE or ADE) should be considered for phenotype 2?

**Q8.** Estimate the proportion of A, C/D and E variance components for phenotype 2.

```{r FalconerACE, eval=F}
dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno2_T1', 'pheno2_T2')] # MZ data frame
dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno2_T1', 'pheno2_T2')] # DZ data frame
rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
sA2 = 2*(rMZ - rDZ)
sC2 = 2*rDZ - rMZ
sE2 = 1 - rMZ
print(c(sA2, sC2, sE2))
```

0 comments on commit 90e4f6f

Please sign in to comment.