From da112ae2e154d20167d60fee84ceee7efb72c38f Mon Sep 17 00:00:00 2001
From: claratsm <claratang@hku.hk>
Date: Mon, 18 Nov 2024 19:17:56 +0800
Subject: [PATCH] Nov 2024 update

---
 ...-genetics.Rmd => M4-pop-genetics_2024.Rmd} | 803 +++++++++---------
 .../{dataTwin2023.dat => dataTwin2024.dat}    |   0
 2 files changed, 393 insertions(+), 410 deletions(-)
 rename notebooks/module4-genetics/{M4-pop-genetics.Rmd => M4-pop-genetics_2024.Rmd} (83%)
 rename notebooks/module4-genetics/{dataTwin2023.dat => dataTwin2024.dat} (100%)
diff --git a/notebooks/module4-genetics/M4-pop-genetics.Rmd b/notebooks/module4-genetics/M4-pop-genetics_2024.Rmd
similarity index 83%
rename from notebooks/module4-genetics/M4-pop-genetics.Rmd
rename to notebooks/module4-genetics/M4-pop-genetics_2024.Rmd
index c340535..3400c31 100644
--- a/notebooks/module4-genetics/M4-pop-genetics.Rmd
+++ b/notebooks/module4-genetics/M4-pop-genetics_2024.Rmd
@@ -1,411 +1,394 @@
-
-# Population Genetics and Diseases {#pop-genetics}
-
-## Case study 1: Heritability and human traits
-
-### Part 1
-
-**Scenario:** You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits.
-
-
-#### Questions for Discussion
-
-**Q1.** Besides the clinical measurements, what data do you need to collect from the subjects?
-<details>
-  <summary>**Answers:** </summary>
-  - Sex
-- Age
-- Other confounding factors, e.g. BMI, blood pressure, smoking status, etc.
-</details>
-
-**Q2.** How is genotype data represented for statistical genetic analysis?  
-<details>
-  <summary>**Answers:** </summary>
-  - Allele: 0/1, 1/2, A/C, etc
-- Genotype: 0 0, 0 1, 1 0, 1 1
-- Genotype probabilities: P(0/0)=0, P(0/1)=1, P(1/1)=0
-- Genotype dosage: 0/1/2, 0.678 (continuous from 0-1 or 0-2)
-</details>
-
-**Q3.** How can you test for association between genotypes and phenotypes (binary and quantitative)? 
-<details>
-  <summary>**Answers:** </summary>
-  - Allelic chi-square test
-- Fisher's exact test
-- Linear/Logistic regression
-- Linear mixed model 
-</details>
-
-
-#### Hands-on exercise : Association test
-
-Now, you are given a dataset of age- and sex-matched twin cohort with two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this 
-[GitHub Repo](https://github.com/StatBiomed/BMDS-book/blob/main/notebooks/module4-genetics/dataTwin2023.dat).
-
-**The information for columns:**
-
-- zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin
-- `T1QTL_A[1-5]` and `T2QTL_A[1-5]`: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively
-- The same 5 QTL (D1-D5) in dominance coding for T1 and T2
-- Phenotype scores of T1 and T2 for the two quantitative cardiovascular traits
-
-**Download the data `dataTwin.dat` to your working directory.** Start the RStudio program and set the working directory.
-
-```{r readDataR}
-dataTwin <- read.table("dataTwin2023.dat",h=T)
-```
-
-**Exploratory analysis**
-
-- A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
-- D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA).
-
-**Q1.** How many MZ and DZ volunteers are there?
-<details>
-  <summary>**Answers:** </summary>
-  - 1000 MZ and 1000 DZ
-</details>
-
-**Q2.** How are the genotypes represented?
-<details>
-  <summary>**Answers:** </summary>
-  - Dosage/Count of non-reference allele : 0, 1, and 2 
-</details>
-
-**Q3.** Are the QTL independent of each other?
-<details>
-  <summary>**Answers:** </summary>
-  - Yes. The pairwise correlations are low (<0.2). 
-</details>
-
-**Q4.** Are there outliers in phenotypes?
-<details>
-  <summary>**Answers:** </summary>
-  - Yes. T2 individual 1303 has phenotype score (-4.21) being 4 SD below the mean. 
-</details>
-
-```{r genotypedataExplorationR}
-table(dataTwin$zygosity)        # Q1: shows number of MZ and DZ twin pairs
-table(dataTwin$T1QTL_A1)        # Q2: shows the distribution of QTL_A1
-table(dataTwin$T1QTL_D1)        # Q2: shows the distribution of QTL_D1
-table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1)     # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
-cor(dataTwin[,2:11])                            # Q3: shows the correlation between QTL_As
-cor(dataTwin[,2:11])>0.2
-apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) })  # Q4: any outlier < 4 SD from the mean for the two quantitative phenotypes
-apply(dataTwin[22:25],2,function(x){ any(x > (mean(x) + 4*sd(x))) })  # Q4: any outlier > 4 SD from the mean for the two quantitative phenotypes
-# remove the phenotype score of the outlier (T2) for the phenotype 2 (pheno2_T2)
-outlier<- which(dataTwin$pheno2_T2 < (mean(dataTwin$pheno2_T2) - 4*sd(dataTwin$pheno2_T2) )) 
-outlier
-dataTwin$pheno2_T2[outlier]
-dataTwin$pheno2_T2[outlier] <- NA
-```
-
-**Association test**
-
-Test for association between QTL and pheno1 for T1
-
-- Regress `pheno1_T1` on `T1QTL_A1` to estimate the proportion of variance explained (R2).
-- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + e
-- Calculate the conditional mean of phenotype (i.e. phenotypic mean conditional genotype)
-
-If the relationship between the QTL and the phenotype is perfectly linear, the regression line should pass through the conditional means (c_means), and the differences between the conditional means should be about equal.
-
-**Q5.** What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)?
-<details>
-  <summary>**Answers:** </summary>
-  - b0 = 4.1464
-- b1 = 0.9180
-- QTL1 is significantly associated with the phenotype with $P = 1.02\times 10^{-13}$
-</details>
-
-**Q6.** What is the proportion of phenotypic variance explained?
-<details>
-  <summary>**Answers:** </summary>
-  - Proportion of phenotypic variance explained = 0.027
-</details>
-
-```{r lmAR}
-linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
-summary(linA1)
-summary(linA1)$r.squared  # proportion of explained variance by additive component
-
-c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
-plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
-lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
-lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
-```
-
-To test for the non-linearity, we can use the dominance coding of the QTL and add the dominance term to the regression model. 
-
-- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e
-- Repeat for T2. 
-
-**Q7.** Why can't we analyse T1 and T2 together?
-<details>
-  <summary>**Answers:** </summary>
-  - As T1 and T2 are biologically related as MZ or DZ twins, the genotypes of the QTLs are not independent. Treating the genotypes of T1 and T2 as independent observations will introduce bias.
-</details>
-
-**Q8.** Is there a dominance effect?
-<details>
-  <summary>**Answers:** </summary>
-  - Yes. The model with dominance provides a better goodness of fit (lower p-value)
-</details>
-
-```{r lmADR}
-linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
-summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
-
-plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
-abline(linA1, lwd=3)
-lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
-lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
-lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
-```
-
-**Q9.** Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype T1
-<details>
-  <summary>**Answers:** </summary>
-  - QTL3 with $P = 7.77 \times 10^{-25}$ for model with dominance
-</details>
-
-```{r lmAllQTLR}
-allQTL_A_T1 <- 2:6
-cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
-## Additive
-cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
-## Dominance
-cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
-
-#Q9: QTL3 shows the strongest association with P=7.771588e-25
-linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
-summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
-```
-
-If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q9) and interpret the result.
-
-**Q10.** What are the odds ratio, p-value, and 95% confidence interval (CI)?
-<details>
-  <summary>**Answers:** </summary>
-  - Odds ratio is 3.40 and the 95% CI is (1.65, 7.03)
-</details>
-
-```{r logisticADR}
-quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05))
-dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20]) 
-dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20])
-logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial"))
-exp(logisticAD1$coefficients[2,1])                                     # odds ratio
-exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2])  # lower 95% confidence interval
-exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2])  # upper 95% confidence interval
-```
-
-### Part 2
-
-**Scenario:** You are asked to estimate the additive genetic variance, dominance genetic variance and/or shared environmental variance using regression-based method and a classical twin design. 
-
-\begin{align*} 
-
-\text{For ADE model :  }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{D} + \sigma^{2}_{E}\\ 
-
-\text{For ACE model :  }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{C} + \sigma^{2}_{E}, \quad \text{where} \\
-
-\sigma^{2}_{P} & \text{ is the phenotypic variance}, \\
-
-\sigma^{2}_{A} & \text{ is additive genetic variance}, \\
-
-\sigma^{2}_{D} & \text{ is dominance genetic variance}, \\
-
-\sigma^{2}_{C} & \text{ is shared environmental variance, and} \\
-
-\sigma^{2}_{E} & \text{ is unshared environmental variance.} 
- 
-\end{align*}
-
-For ADE model, 
-
-\begin{align*} 
-
-  cov(MZ) = cor(MZ) & = rMZ = \sigma^{2}_{A} + \sigma^{2}_{D} \\
-  cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + 0.25 * \sigma^{2}_{D} \quad \text{ , where} \\
-
-\end{align*}
-the coefficients 1/2 and 1/4 are based on quantitative genetic theory (Mather & Jinks, 1971).
-
-By solving the unknowns, the Falconer's equations for the ADE model:
-
-\begin{align*} 
-
-\sigma^{2}_{A} & = 4*rDZ - rMZ \\
-
-\sigma^{2}_{D} & = 2*rMZ - 4*rDZ \\
-\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{D} \\
-
-\end{align*}
-
-For ACE model, 
-
-\begin{align*} 
-
-  cov(MZ) = cor(MZ) & = rMZ= \sigma^{2}_{A} + \sigma^{2}_{C} \\
-  cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + \sigma^{2}_{C} \quad \text{ , where} \\
-
-\end{align*}
-
-By solving the unknowns, the Falconer's equations for the ACE model:
-
-\begin{align*} 
-
-\sigma^{2}_{A} & = 2*(rMZ - rDZ) \\
-\sigma^{2}_{C} & = 2*rDZ - rMZ \\
-\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{C} = 1 - rMZ 
-
-\end{align*}
-
-
-#### Questions for discussions :
-
-**Q1.** What is missing heritability of common traits in the era of genome-wide association analysis (GWAS)?
-<details>
-  <summary>**Answers:** </summary>
-  - Missing heritability refers to the discrepancy between twin/family-estimated heritability and the amount of variance explained by disease/trait-associated loci identified in GWAS 
-</details>
-
-**Q2.** What are the potential sources of missing heritability?
-<details>
-  <summary>**Answers:** </summary>
-    - Large number of variants with small effect not reaching GWAS significance due to inadequate power of GWAS 
-  - Poor detection of rarer disease/trait-associated variants by genotyping arrays
-  - Structural variants poorly captured by genotyping arrays
-  - Low power to detect gene–gene interactions
-  - Underestimation of shared environment among relatives in twin/family-based studies
-</details>
-
-* Suggested reading:
-  * Manolio TA, Collins FS, Cox NJ, et al. Nature. 2009 ;461(7265):747-753. doi:10.1038/nature08494
-
-#### Hands-on exercise : variance explained using regression-based method
-
-**Q1.** What is the variance of the phenotype? 
-
-**Q2.** Compute the explained variance attributable to the additive genetic component of the QTL with strongest association in Part 1.
-
-**Q3.** Compute the explained variance attributable to the dominance genetic component of the QTL with strongest association in Part 1.
-
-R2 from the regression represents the proportion of phenotypic variance explained; thus the raw explained variance component is R2 times the variance of the phenotype (var_pheno).
-
-<details>
-  <summary>**Answers** </summary>
-  * The proportion of explained variance are 0.0273 (additive) and 0.0541 (total: additive + dominance).
-* As the predictors are uncorrelated, the proportion of explained variance by dominance =  0.0541 - 0.0273 = 0.0267
-* Given the phenotypic variance of 15.102, then 
-  * Total genetic: 0.0541*15.102 = 0.8168
-  * Additive genetic: 0.0273*15.102 = 0.4128
-  * Dominance genetic: 0.0267*15.102 = 0.4040
-</details>
-```{r varianceR}
-var_pheno <- var(dataTwin$pheno1_T1)  # the variance of the phenotype
-var_pheno
-
-linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
-linA3 <- lm(pheno1_T1 ~ T1QTL_A3, data=dataTwin)
-
-summary(linAD3)$r.squared           # proportion of explained variance by total genetic component
-summary(linA3)$r.squared            # proportion of explained variance by additive component
-summary(linAD3)$r.squared*var_pheno # (raw) variance component of total genetic component
-summary(linA3)$r.squared*var_pheno  # (raw) variance component of additive genetic component
-(summary(linAD3)$r.squared-summary(linA3)$r.squared)*var_pheno  # (raw) variance component of dominance genetic component
-```
-</details>
-
-**Q4.**	Estimate the variance explained by all the QTL using linear regression.
-
-<details>
-  <summary>**Answers** </summary>
-  - Proportion of variance explained by all 5 QTLs with dominance = 0.23 and the total variance explained = 3.52.
-</details>
-
-```{r varianceAllQTLR}
-# compute for all 5 QTL
-linAD5=(lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 +
-                       T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
-            data=dataTwin))
-
-summary(linAD5)$r.squared              # proportion of explained variance by total genetic component
-summary(linAD5)$r.squared*var_pheno    # (raw) variance component of total genetic component
-```
-
-#### Hands-on exercise : variance explained using a classical twin design. 
-
-Based on our regression results, we have estimates of the total genetic variance as well as the A and D components for phenotype 1. In practice, it is impossible to know all the variants associated with any polygenic trait. 
-
-Given `rMZ > 2*rDZ`, we can use Falconer's formula based on ADE model to estimate the A (additive genetic) and D (dominance) variance with the classical twin design for phenotype 1 without genotypes.
-
-**Q5.** Compute rMZ and rDZ.
-<details>
-  <summary>**Answers** </summary>
-  - rMZ = 0.5434
-- rDZ = 0.1904
-</details>
-
-**Q6.**	Estimate the proportion of additive and dominance genetic variances using the Falconer's equations for the ADE model.
-<details>
-  <summary>**Answers** </summary>
-  - $\sigma^{2}_{A} = 0.2181$
-- $\sigma^{2}_{D} = 0.3253$
-- $\sigma^{2}_{E} = 0.4566$
-</details>
-
-```{r FalconerADE, eval=F}
-dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno1_T1', 'pheno1_T2')] # MZ data frame
-dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno1_T1', 'pheno1_T2')] # DZ data frame
-
-rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
-rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
-rMZ
-rDZ
-
-sA2 = 4*rDZ - rMZ
-sD2 = 2*rMZ - 4*rDZ
-sE2 = 1 - sA2 - sD2
-print(c(sA2, sD2, sE2))
-```
-
-Similarly, for phenotype 2, we can estimate the proportion of additive and/or dominance genetic variances as well as shared environmental variance using the Falconer's formula.
-
-**Q7.**	Which model (ACE or ADE) should be considered for phenotype 2?
-<details>
-  <summary>**Answers** </summary>
-  - ACE as rMZ < 2*rDZ
-</details>
-
-**Q8.** Estimate the proportion of A, C/D and E variance components for phenotype 2.
-<details>
-  <summary>**Answers** </summary>
-  - $\sigma^{2}_{A} = 0.3526$
-- $\sigma^{2}_{C} = 0.1610$
-- $\sigma^{2}_{E} = 0.4864$
-</details>
-```{r FalconerACE, eval=F}
-dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno2_T1', 'pheno2_T2')] # MZ data frame
-dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno2_T1', 'pheno2_T2')] # DZ data frame
-
-rMZ=cor(dataMZ, use="complete.obs")[2,1] # element 2,1 in the MZ correlation matrix
-rDZ=cor(dataDZ, use="complete.obs")[2,1] # element 2,1 in the DZ correlation matrix
-rMZ
-rDZ
-
-sA2 = 2*(rMZ - rDZ)
-sC2 = 2*rDZ - rMZ
-sE2 = 1 - rMZ
-print(c(sA2, sC2, sE2))
-```
-
-###  References
-1. Evans DM, Gillespie NA, Martin NG. Biometrical genetics. Biol Psychol. 2002 Oct;61(1-2):33-51. doi: 10.1016/s0301-0511(02)00051-0. PMID: 12385668. [Review article]
-
-2. Falconer, D.S. and Mackay, T.F.C. (1996) Introduction to Quantitative Genetics. 4th Edition, Addison Wesley Longman, Harlow. [Most classical; a lot of online version]
-
-3. Neale, B., Ferreira, M., Medland, S., & Posthuma, D. (Eds.). (2007). Statistical Genetics: Gene Mapping Through Linkage and Association (1st ed.). Taylor & Francis. https://doi.org/10.1201/9780203967201 [chapter on biometrical genetics; can be borrowed from HKU lib]
-
+
+# Population Genetics and Diseases {#pop-genetics}
+
+## Case study 1: Heritability and human traits
+
+### Part 1
+
+**Scenario:** You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits.
+
+
+#### Questions for Discussion
+
+**Q1.** Besides the clinical measurements, what data do you need to collect from the subjects?
+<details>
+  <summary>**Answers:** </summary>
+  - Sex
+- Age
+- Other confounding factors, e.g. BMI, blood pressure, smoking status, etc.
+</details>
+
+**Q2.** How is genotype data represented for statistical genetic analysis?  
+<details>
+  <summary>**Answers:** </summary>
+  - Allele: 0/1, 1/2, A/C, etc
+- Genotype: 0 0, 0 1, 1 0, 1 1
+- Genotype probabilities: P(0/0)=0, P(0/1)=1, P(1/1)=0
+- Genotype dosage: 0/1/2, 0.678 (continuous from 0-1 or 0-2)
+</details>
+
+**Q3.** How can you test for association between genotypes and phenotypes (binary and quantitative)? 
+<details>
+  <summary>**Answers:** </summary>
+  - Allelic chi-square test
+- Fisher's exact test
+- Linear/Logistic regression
+- Linear mixed model 
+</details>
+
+
+#### Hands-on exercise : Association test
+
+Now, you are given a dataset of age- and sex-matched twin cohort with two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this 
+[GitHub Repo](https://github.com/StatBiomed/BMDS-book/blob/main/notebooks/module4-genetics/dataTwin2024.dat).
+
+**The information for columns:**
+
+- zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin
+- `T1QTL_A[1-5]` and `T2QTL_A[1-5]`: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively
+- The same 5 QTL (D1-D5) in dominance coding for T1 and T2
+- Phenotype scores of T1 and T2 for the two quantitative cardiovascular traits
+
+**Download the data `dataTwin2024.dat` to your working directory.** Start the RStudio program and set the working directory.
+
+```{r readDataR}
+dataTwin <- read.table("dataTwin2024.dat",h=T)
+```
+
+**Exploratory analysis**
+
+**Q1.** How many MZ and DZ volunteers are there?
+<details>
+  <summary>**Answers:** </summary>
+  - 1000 MZ and 1000 DZ
+</details>
+
+**Q2.** How are the genotypes represented?
+<details>
+  <summary>**Answers:** </summary>
+  - Dosage/Count of non-reference allele : 0, 1, and 2 
+    - A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
+    - D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA).
+</details>
+
+**Q3.** Are the QTL independent of each other?
+<details>
+  <summary>**Answers:** </summary>
+  - Yes. The pairwise correlations are low (<0.2). 
+</details>
+
+**Q4.** Are there outliers in phenotypes?
+<details>
+  <summary>**Answers:** </summary>
+  - Yes. T2 individual 1303 has phenotype score (-4.21) being 4 SD below the mean. 
+</details>
+
+```{r genotypedataExplorationR}
+table(dataTwin$zygosity)        # Q1: shows number of MZ and DZ twin pairs
+table(dataTwin$T1QTL_A1)        # Q2: shows the distribution of QTL_A1
+table(dataTwin$T1QTL_D1)        # Q2: shows the distribution of QTL_D1
+table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1)     # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
+cor(dataTwin[,2:11])                            # Q3: shows the correlation between QTL_As
+cor(dataTwin[,2:11])>0.2
+apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) })  # Q4: any outlier < 4 SD from the mean for the two quantitative phenotypes
+apply(dataTwin[22:25],2,function(x){ any(x > (mean(x) + 4*sd(x))) })  # Q4: any outlier > 4 SD from the mean for the two quantitative phenotypes
+# remove the phenotype score of the outlier (T2) for the phenotype 2 (pheno2_T2)
+outlier<- which(dataTwin$pheno2_T2 < (mean(dataTwin$pheno2_T2) - 4*sd(dataTwin$pheno2_T2) )) 
+outlier
+dataTwin$pheno2_T2[outlier]
+dataTwin$pheno2_T2[outlier] <- NA
+```
+
+**Association test**
+
+Test for association between QTL and pheno1 for T1
+
+- Regress `pheno1_T1` on `T1QTL_A1` to estimate the proportion of variance explained (R2).
+- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + e
+- Calculate the conditional mean of phenotype (i.e. phenotypic mean conditional genotype)
+
+If the relationship between the QTL and the phenotype is perfectly linear, the regression line should pass through the conditional means (c_means), and the differences between the conditional means should be about equal.
+
+**Q5.** What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)?
+<details>
+  <summary>**Answers:** </summary>
+  - b0 = 4.1464
+- b1 = 0.9180
+- QTL1 is significantly associated with the phenotype with $P = 1.02\times 10^{-13}$
+</details>
+
+**Q6.** What is the proportion of phenotypic variance explained?
+<details>
+  <summary>**Answers:** </summary>
+  - Proportion of phenotypic variance explained = 0.027
+</details>
+
+```{r lmAR}
+linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
+summary(linA1)
+summary(linA1)$r.squared  # proportion of explained variance by additive component
+
+c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
+plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
+lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
+lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
+```
+
+To test for the non-linearity, we can use the dominance coding of the QTL and add the dominance term to the regression model. 
+
+- Model: pheno1_T1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e
+- Repeat for T2. 
+
+**Q7.** Why can't we analyse T1 and T2 together?
+<details>
+  <summary>**Answers:** </summary>
+  - As T1 and T2 are biologically related as MZ or DZ twins, the genotypes of the QTLs are not independent. Treating the genotypes of T1 and T2 as independent observations will introduce bias.
+</details>
+
+**Q8.** Is there a dominance effect?
+<details>
+  <summary>**Answers:** </summary>
+  - Yes. The model with dominance provides a better goodness of fit (lower p-value)
+</details>
+
+```{r lmADR}
+linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
+summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
+
+plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
+abline(linA1, lwd=3)
+lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
+lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
+lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
+```
+
+**Q9.** Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype T1
+<details>
+  <summary>**Answers:** </summary>
+  - QTL3 with $P = 7.77 \times 10^{-25}$ for model with dominance
+</details>
+
+```{r lmAllQTLR}
+allQTL_A_T1 <- 2:6
+cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
+## Additive
+cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
+## Dominance
+cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
+
+#Q9: QTL3 shows the strongest association with P=7.771588e-25
+linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
+summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
+```
+
+If the subjects with top 10% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q9) and interpret the result.
+
+**Q10.** What are the odds ratio, p-value, and 95% confidence interval (CI)?
+<details>
+  <summary>**Answers:** </summary>
+  - Odds ratio is 4.28 and the 95% CI is (2.54, 7.20)
+</details>
+
+```{r logisticADR}
+quant10 <- quantile(c(dataTwin$pheno1_T1),seq(0,1,0.1))
+dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant10[10])
+
+dataTwin$T1QTL_AD3 <- (dataTwin$T1QTL_A3 + dataTwin$T1QTL_D3)/2
+
+logisticAD3 <- summary(glm(CaseT1 ~ T1QTL_AD3, data=dataTwin, family="binomial"))
+exp(logisticAD3$coefficients[2,1])                                     # odds ratio
+exp(logisticAD3$coefficients[2,1]-1.96*logisticAD3$coefficients[2,2])  # lower 95% confidence interval
+exp(logisticAD3$coefficients[2,1]+1.96*logisticAD3$coefficients[2,2])  # upper 95% confidence interval
+```
+
+### Part 2
+
+**Scenario:** You are asked to estimate the additive genetic variance, dominance genetic variance and/or shared environmental variance using regression-based method and a classical twin design. 
+
+\begin{align*} 
+
+\text{For ADE model :  }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{D} + \sigma^{2}_{E}\\ 
+
+\text{For ACE model :  }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{C} + \sigma^{2}_{E}, \quad \text{where} \\
+
+\sigma^{2}_{P} & \text{ is the phenotypic variance}, \\
+
+\sigma^{2}_{A} & \text{ is additive genetic variance}, \\
+
+\sigma^{2}_{D} & \text{ is dominance genetic variance}, \\
+
+\sigma^{2}_{C} & \text{ is shared environmental variance, and} \\
+
+\sigma^{2}_{E} & \text{ is unshared environmental variance.} 
+ 
+\end{align*}
+
+For ADE model, given the standardization of the phenotype, 
+
+\begin{align*} 
+
+  cov(MZ) = cor(MZ) & = rMZ = \sigma^{2}_{A} + \sigma^{2}_{D} \\
+  cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + 0.25 * \sigma^{2}_{D} \quad \text{ , where} \\
+
+\end{align*}
+the coefficients 1/2 and 1/4 are based on quantitative genetic theory (Mather & Jinks, 1971).
+
+By solving the unknowns, the variance explained by different components for the ADE model:
+
+\begin{align*} 
+
+\sigma^{2}_{A} & = 4*rDZ - rMZ \\
+
+\sigma^{2}_{D} & = 2*rMZ - 4*rDZ \\
+\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{D} \\
+
+\end{align*}
+
+For ACE model, 
+
+\begin{align*} 
+
+  cov(MZ) = cor(MZ) & = rMZ= \sigma^{2}_{A} + \sigma^{2}_{C} \\
+  cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + \sigma^{2}_{C} \quad \text{ , where} \\
+
+\end{align*}
+
+By solving the unknowns, the variance explained by different components for the ACE model:
+
+\begin{align*} 
+
+\sigma^{2}_{A} & = 2*(rMZ - rDZ) \\
+\sigma^{2}_{C} & = 2*rDZ - rMZ \\
+\sigma^{2}_{E} & = 1 - \sigma^{2}_{A} - \sigma^{2}_{C} = 1 - rMZ 
+
+\end{align*}
+
+
+#### Hands-on exercise : variance explained using regression-based method
+
+**Q1.** What is the variance of the phenotype? 
+
+**Q2.** Compute the explained variance attributable to the additive genetic component of the QTL with strongest association in Part 1.
+
+**Q3.** Compute the explained variance attributable to the dominance genetic component of the QTL with strongest association in Part 1.
+
+R2 from the regression represents the proportion of phenotypic variance explained; thus the raw explained variance component is R2 times the variance of the phenotype (var_pheno).
+
+<details>
+  <summary>**Answers** </summary>
+  * The proportion of explained variance are 0.0273 (additive) and 0.0541 (total: additive + dominance).
+* As the predictors are uncorrelated, the proportion of explained variance by dominance =  0.0541 - 0.0273 = 0.0267
+* Given the phenotypic variance of 15.102, then 
+  * Total genetic: 0.0541*15.102 = 0.8168
+  * Additive genetic: 0.0273*15.102 = 0.4128
+  * Dominance genetic: 0.0267*15.102 = 0.4040
+</details>
+```{r varianceR}
+var_pheno <- var(dataTwin$pheno1_T1)  # the variance of the phenotype
+var_pheno
+
+linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
+linA3 <- lm(pheno1_T1 ~ T1QTL_A3, data=dataTwin)
+
+summary(linAD3)$r.squared           # proportion of explained variance by total genetic component
+summary(linA3)$r.squared            # proportion of explained variance by additive component
+summary(linAD3)$r.squared*var_pheno # (raw) variance component of total genetic component
+summary(linA3)$r.squared*var_pheno  # (raw) variance component of additive genetic component
+(summary(linAD3)$r.squared-summary(linA3)$r.squared)*var_pheno  # (raw) variance component of dominance genetic component
+```
+
+**Q4.**	Estimate the variance explained by all the QTL using linear regression.
+
+<details>
+  <summary>**Answers** </summary>
+  - Proportion of variance explained by all 5 QTLs with dominance = 0.23 and the total variance explained = 3.52.
+</details>
+
+```{r varianceAllQTLR}
+# compute for all 5 QTL
+linAD5=(lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 +
+                       T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
+            data=dataTwin))
+
+summary(linAD5)$r.squared              # proportion of explained variance by total genetic component
+summary(linAD5)$r.squared*var_pheno    # (raw) variance component of total genetic component
+```
+#### Hands-on exercise : variance explained using a classical twin design. 
+
+Based on our regression results, we have estimates of the total genetic variance as well as the A and D components for phenotype 1 explained by the QTLs. In practice, it is impossible to know all the variants associated with any polygenic trait. 
+
+Alternatively, we can use ADE or ACE models to estimate the A (additive genetic) and D (dominance)/C (shared environmental) variance with the classical twin design for phenotype 1 without genotypes.
+
+**Q5.** Which model should be used? ACE or ADE? Let's first compute rMZ and rDZ.
+<details>
+  <summary>**Answers** </summary>
+  - rMZ = 0.5434
+- rDZ = 0.1904
+- As $rMZ > 2*rDZ$, ADE model should be used.
+</details>
+
+```{r MDZ, eval=F}
+dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno1_T1', 'pheno1_T2')] # MZ data frame
+dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno1_T1', 'pheno1_T2')] # DZ data frame
+
+rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
+rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
+rMZ
+rDZ
+```
+
+**Q6.**	Estimate the proportion of additive and dominance genetic variances using the ADE model.
+<details>
+  <summary>**Answers** </summary>
+  - $\sigma^{2}_{A} = 0.2181$
+- $\sigma^{2}_{D} = 0.3253$
+- $\sigma^{2}_{E} = 0.4566$
+</details>
+
+
+```{r FalconerADE, eval=F}
+sA2 = 4*rDZ - rMZ
+sD2 = 2*rMZ - 4*rDZ
+sE2 = 1 - sA2 - sD2
+print(c(sA2, sD2, sE2))
+```
+
+Similarly, for phenotype 2, we can estimate the proportion of additive and/or dominance genetic variances as well as shared environmental variance using the classical twin design.
+
+**Q7.**	Which model (ACE or ADE) should be considered for phenotype 2?
+<details>
+  <summary>**Answers** </summary>
+  
+  - ACE as $rMZ < 2*rDZ$
+</details>
+
+**Q8.** Estimate the proportion of A, C/D and E variance components for phenotype 2.
+<details>
+  <summary>**Answers** </summary>
+  - $\sigma^{2}_{A} = 0.3526$
+- $\sigma^{2}_{C} = 0.1610$
+- $\sigma^{2}_{E} = 0.4864$
+</details>
+```{r FalconerACE, eval=F}
+dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno2_T1', 'pheno2_T2')] # MZ data frame
+dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno2_T1', 'pheno2_T2')] # DZ data frame
+
+rMZ=cor(dataMZ, use="complete.obs")[2,1] # element 2,1 in the MZ correlation matrix
+rDZ=cor(dataDZ, use="complete.obs")[2,1] # element 2,1 in the DZ correlation matrix
+rMZ
+rDZ
+
+sA2 = 2*(rMZ - rDZ)
+sC2 = 2*rDZ - rMZ
+sE2 = 1 - rMZ
+print(c(sA2, sC2, sE2))
+```
+
+###  References
+1. Evans DM, Gillespie NA, Martin NG. Biometrical genetics. Biol Psychol. 2002 Oct;61(1-2):33-51. doi: 10.1016/s0301-0511(02)00051-0. PMID: 12385668. [Review article]
+
+2. Falconer, D.S. and Mackay, T.F.C. (1996) Introduction to Quantitative Genetics. 4th Edition, Addison Wesley Longman, Harlow. [Most classical; a lot of online version]
+
+3. Neale, B., Ferreira, M., Medland, S., & Posthuma, D. (Eds.). (2007). Statistical Genetics: Gene Mapping Through Linkage and Association (1st ed.). Taylor & Francis. https://doi.org/10.1201/9780203967201 [chapter on biometrical genetics; can be borrowed from HKU lib]
+
 5. https://ibg.colorado.edu/cdrom2020/dolan/biometricalGenetics/biom_gen_2020.pdf [Course material of the Boulder IBG workshop co-organized by top statistical geneticists]
\ No newline at end of file
diff --git a/notebooks/module4-genetics/dataTwin2023.dat b/notebooks/module4-genetics/dataTwin2024.dat
similarity index 100%
rename from notebooks/module4-genetics/dataTwin2023.dat
rename to notebooks/module4-genetics/dataTwin2024.dat