-
- -
-
-

Chapter 8 Population Genetics and Diseases

-
-

8.1 Case study 1: Heritability and human traits

-
-

8.1.1 Part 1

-

Scenario: You are a researcher working on a twin study on cardiovascular traits to assess the genetic and environmental contribution relevant to metabolism and cardiovascular disease risk. You have recruited a cohort of volunteer adult twins of the same ancestry. The volunteers have undergone a series of baseline clinical evaluations and performed genotyping on a panel of single nucleotide polymorphisms that may be associated with the traits.

-
-

8.1.1.1 Questions for Discussion

-Q1. Besides the clinical measurements, what data do you need to collect from the subjects? + +
+ + +
+

Population Genetics and Diseases

+
+

Case study 1: Heritability and human traits

+
+

Part 1

+

Scenario: You are a researcher running a twin +study of cardiovascular traits to assess the genetic and environmental +contributions to metabolism and cardiovascular disease risk. You +have recruited a cohort of volunteer adult twins of the same ancestry. +The volunteers have undergone a series of baseline clinical evaluations +and have been genotyped on a panel of single nucleotide polymorphisms +that may be associated with the traits.

+
+

Questions for Discussion

+Q1. Besides the clinical measurements, what data do you +need to collect from the subjects?
Answers: @@ -377,10 +376,12 @@

8.1.1.1 Questions for Discussion<
  • Sex
  • Age
  • -
  • Other confounding factors, e.g. BMI, blood pressure, smoking status, etc.
  • +
  • Other confounding factors, e.g. BMI, blood pressure, smoking status, +etc.

-Q2. How is genotype data represented for statistical genetic analysis?
+Q2. How is genotype data represented for statistical +genetic analysis?
@@ -393,7 +394,8 @@

8.1.1.1 Questions for Discussion<
  • Genotype dosage: 0/1/2, 0.678 (continuous from 0-1 or 0-2)
  • -Q3. How can you test for association between genotypes and phenotypes (binary and quantitative)? +Q3. How can you test for association between genotypes +and phenotypes (binary and quantitative)?
    Answers: @@ -406,100 +408,229 @@

    8.1.1.1 Questions for Discussion<

    -
    -

    8.1.1.2 Hands-on exercise : Association test

    -

    Now, you are given a dataset of age- and sex-matched twin cohort with two cardiovascular phenotypes and 5 quantitative trait loci (QTL). Data set and template notebook are available on Moodle (recommended) and also on this -GitHub Repo.

    +
    +

    Hands-on exercise : Association test

    +

    You are now given a dataset from an age- and sex-matched twin cohort with +two cardiovascular phenotypes and 5 quantitative trait loci (QTL). The data +set and a template notebook are available on Moodle (recommended) and also +on this GitHub +Repo.

    The information for columns:

    • zygosity: 1 for monozygotic (MZ) and 2 for dizygotic (DZ) twin
    • -
    • T1QTL_A[1-5] and T2QTL_A[1-5]: 5 quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 (T2) respectively
    • +
    • T1QTL_A[1-5] and T2QTL_A[1-5]: 5 +quantitative loci (A1-A5) in additive coding for Twin 1 (T1) and Twin 2 +(T2) respectively
    • The same 5 QTL (D1-D5) in dominance coding for T1 and T2
    • -
    • Phenotype scores of T1 and T2 for the two quantitative cardiovascular traits
    • +
    • Phenotype scores of T1 and T2 for the two quantitative +cardiovascular traits
    -

    Download the data dataTwin.dat to your working directory. Start the RStudio program and set the working directory.

    -
    dataTwin <- read.table("dataTwin2023.dat",h=T)
    +

    Download the data file dataTwin2023.dat (the name used in the read.table call below) to your working +directory. Start RStudio and set the working +directory to that folder.

    +
    dataTwin <- read.table("dataTwin2023.dat",h=T)

    Exploratory analysis

      -
    • A1-5: The QTLs are biallelic with two alleles A and a. The genotypes aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
    • -
    • D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 (AA).
    • +
    • A1-5: The QTLs are biallelic with two alleles A and a. The genotypes +aa, Aa, and AA are coded additively as 0 (aa), 1 (Aa) and 2 (AA).
    • +
    • D1-5: The genotypes aa, Aa, and AA are coded as 0 (aa), 1 (Aa) and 0 +(AA).
    • +
    +Q1. How many MZ and DZ volunteers are there? +
    + +Answers: + +
      +
    • 1000 MZ and 1000 DZ twin pairs (one row per pair)
    -

    Q1. How many MZ and DZ volunteers are there?

    -

    Q2. How are the genotypes represented?

    -

    Q3. Are the QTL independent of each other?

    -

    Q4. Are there outliers in phenotypes?

    -
    table(dataTwin$zygosity)        # Q1: shows number of MZ and DZ twin pairs
    -table(dataTwin$T1QTL_A1)        # Q2: shows the distribution of QTL_A1
    -table(dataTwin$T1QTL_D1)        # Q2: shows the distribution of QTL_D1
    -table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1)     # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
    -cor(dataTwin[,2:11])                            # Q3: shows the correlation between QTL_As
    -cor(dataTwin[,2:11])>0.2
    -apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) })  # Q4: any outlier >4 SD from the mean for the two quantitative phenotypes
    +
    +Q2. How are the genotypes represented? +
    + +Answers: + +
      +
    • Dosage/Count of non-reference alleles : 0, 1, and 2
    • +
    +
    +Q3. Are the QTL independent of each other? +
    + +Answers: + +
      +
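    • Yes. The pairwise correlations between different QTL are low (<0.2). Only the same QTL in T1 and T2 is correlated (about 0.46 to 0.53), because co-twins share alleles.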
    • +
    +
    +Q4. Are there outliers in phenotypes? +
    + +Answers: + +
      +
    • Yes. Twin 2 of pair 1303 has a phenotype 2 score (-4.21) more than 4 SD below +the mean.
    • +
    +
    +
    table(dataTwin$zygosity)        # Q1: shows number of MZ and DZ twin pairs
    +
    ## 
    +##    1    2 
    +## 1000 1000
    +
    table(dataTwin$T1QTL_A1)        # Q2: shows the distribution of QTL_A1
    +
    ## 
    +##    0    1    2 
    +##  474 1021  505
    +
    table(dataTwin$T1QTL_D1)        # Q2: shows the distribution of QTL_D1
    +
    ## 
    +##    0    1 
    +##  979 1021
    +
    table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1)     # Q2: shows the distribution of QTL_A1 in relation to QTL_D1
    +
    ##    
    +##        0    1
    +##   0  474    0
    +##   1    0 1021
    +##   2  505    0
    +
    cor(dataTwin[,2:11])                            # Q3: shows the correlation between QTL_As
    +
    ##             T1QTL_A1     T1QTL_A2     T1QTL_A3    T1QTL_A4     T1QTL_A5
    +## T1QTL_A1  1.00000000 -0.005470340  0.021705688  0.01940408  0.016278190
    +## T1QTL_A2 -0.00547034  1.000000000  0.017344822 -0.01421677 -0.008678746
    +## T1QTL_A3  0.02170569  0.017344822  1.000000000  0.01335711 -0.036751338
    +## T1QTL_A4  0.01940408 -0.014216767  0.013357109  1.00000000  0.074899996
    +## T1QTL_A5  0.01627819 -0.008678746 -0.036751338  0.07490000  1.000000000
    +## T2QTL_A1  0.53243815  0.004201635 -0.013909013  0.03252724  0.020081970
    +## T2QTL_A2 -0.04561174  0.464131160 -0.005044127  0.01324172 -0.003012277
    +## T2QTL_A3  0.03316574 -0.003552831  0.521253656  0.02045423  0.009081830
    +## T2QTL_A4  0.03271254 -0.033419904  0.020422583  0.48641289  0.019247531
    +## T2QTL_A5 -0.01285323  0.030413269 -0.045121964  0.08288145  0.457962222
    +##              T2QTL_A1     T2QTL_A2     T2QTL_A3     T2QTL_A4    T2QTL_A5
    +## T1QTL_A1  0.532438150 -0.045611740  0.033165736  0.032712539 -0.01285323
    +## T1QTL_A2  0.004201635  0.464131160 -0.003552831 -0.033419904  0.03041327
    +## T1QTL_A3 -0.013909013 -0.005044127  0.521253656  0.020422583 -0.04512196
    +## T1QTL_A4  0.032527239  0.013241725  0.020454234  0.486412895  0.08288145
    +## T1QTL_A5  0.020081970 -0.003012277  0.009081830  0.019247531  0.45796222
    +## T2QTL_A1  1.000000000  0.006179257 -0.013129314  0.048294183 -0.01325839
    +## T2QTL_A2  0.006179257  1.000000000 -0.020860987  0.002164782 -0.01131418
    +## T2QTL_A3 -0.013129314 -0.020860987  1.000000000 -0.010583797 -0.02101270
    +## T2QTL_A4  0.048294183  0.002164782 -0.010583797  1.000000000  0.04350925
    +## T2QTL_A5 -0.013258394 -0.011314179 -0.021012699  0.043509251  1.00000000
    +
    cor(dataTwin[,2:11])>0.2
    +
    ##          T1QTL_A1 T1QTL_A2 T1QTL_A3 T1QTL_A4 T1QTL_A5 T2QTL_A1 T2QTL_A2
    +## T1QTL_A1     TRUE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE
    +## T1QTL_A2    FALSE     TRUE    FALSE    FALSE    FALSE    FALSE     TRUE
    +## T1QTL_A3    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE    FALSE
    +## T1QTL_A4    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE
    +## T1QTL_A5    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
    +## T2QTL_A1     TRUE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE
    +## T2QTL_A2    FALSE     TRUE    FALSE    FALSE    FALSE    FALSE     TRUE
    +## T2QTL_A3    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE    FALSE
    +## T2QTL_A4    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE
    +## T2QTL_A5    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
    +##          T2QTL_A3 T2QTL_A4 T2QTL_A5
    +## T1QTL_A1    FALSE    FALSE    FALSE
    +## T1QTL_A2    FALSE    FALSE    FALSE
    +## T1QTL_A3     TRUE    FALSE    FALSE
    +## T1QTL_A4    FALSE     TRUE    FALSE
    +## T1QTL_A5    FALSE    FALSE     TRUE
    +## T2QTL_A1    FALSE    FALSE    FALSE
    +## T2QTL_A2    FALSE    FALSE    FALSE
    +## T2QTL_A3     TRUE    FALSE    FALSE
    +## T2QTL_A4    FALSE     TRUE    FALSE
    +## T2QTL_A5    FALSE    FALSE     TRUE
    +
    apply(dataTwin[22:25],2,function(x){ any(x < (mean(x) - 4*sd(x))) })  # Q4: any value more than 4 SD below the mean for the two quantitative phenotypes
    +
    ## pheno1_T1 pheno1_T2 pheno2_T1 pheno2_T2 
    +##     FALSE     FALSE     FALSE      TRUE
    +
    apply(dataTwin[22:25],2,function(x){ any(x > (mean(x) + 4*sd(x))) })  # Q4: any value more than 4 SD above the mean for the two quantitative phenotypes
    +
    ## pheno1_T1 pheno1_T2 pheno2_T1 pheno2_T2 
    +##     FALSE     FALSE     FALSE     FALSE
    +
    # remove the phenotype score of the outlier (T2) for the phenotype 2 (pheno2_T2)
    +outlier<- which(dataTwin$pheno2_T2 < (mean(dataTwin$pheno2_T2) - 4*sd(dataTwin$pheno2_T2) )) 
    +outlier
    +
    ## [1] 1303
    +
    dataTwin$pheno2_T2[outlier]
    +
    ## [1] -4.21
    +
    dataTwin$pheno2_T2[outlier] <- NA
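The relationship between the additive and dominance codings described above can be checked on simulated genotypes. This is a minimal sketch that does not use the course data: the sample size `n`, the allele frequency `p`, and the names `qtl_A` and `qtl_D` are assumptions chosen for illustration.

```r
# Simulated biallelic locus under Hardy-Weinberg proportions (illustrative values only)
set.seed(1)
n <- 2000          # hypothetical sample size
p <- 0.5           # hypothetical frequency of allele A

# Additive coding: number of A alleles carried (0 = aa, 1 = Aa, 2 = AA)
qtl_A <- rbinom(n, size = 2, prob = p)

# Dominance coding: 1 for heterozygotes (Aa), 0 for both homozygotes
qtl_D <- as.integer(qtl_A == 1)

# Cross-tabulation: only the Aa row falls in the dominance = 1 column
table(additive = qtl_A, dominance = qtl_D)
```

The cross-tabulation has the same shape as the `table(dataTwin$T1QTL_A1, dataTwin$T1QTL_D1)` output: the dominance code carries no information beyond marking heterozygotes.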

    Association test

    Test for association between QTL and pheno1 for T1

      -
    • Regress pheno1_T1 on T1QTL_A1 to estimate the proportion of variance explained (R2).
    • +
    • Regress pheno1_T1 on T1QTL_A1 to estimate +the proportion of variance explained (R2).
    • Model: pheno1_T1 = b0 + b1* T1QTL_A1 + e
    • -
    • Calculate the conditional mean of phenotype (i.e. phenotypic mean conditional genotype)
    • +
    • Calculate the conditional means of the phenotype (i.e. the phenotypic mean +conditional on genotype)
    -

    If the relationship between the QTL and the phenotype is perfectly linear, the regression line should pass through the conditional means (c_means), and the differences between the conditional means should -be about equal.

    -

    Q5. What are the values of b0, b1? Is QTL1 significant associated with the phenotype at alpha<0.01 (multiple testing of 5 loci)?

    -

    Q6. What is the proportion of phenotypic variance explained?

    -
    linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
    -summary(linA1)
    -summary(linA1)$r.squared            # proportion of explained variance by additive component
    -
    -c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
    -plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
    -lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
    -lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
    -

    To test this “linearity”, we can use the dominance coding of the QTL and add the dominance term to the regression model.

    +

    If the relationship between the QTL and the phenotype is perfectly +linear, the regression line should pass through the conditional means +(c_means), and the differences between the conditional means should be +about equal.
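To make the linearity statement concrete, the sketch below simulates a purely additive QTL and compares the conditional means with the fitted regression line. The intercept (5), slope (0.4), residual SD, and sample size are hypothetical values, not estimates from the twin data.

```r
# Simulated example of a purely additive QTL (all numeric values are assumptions)
set.seed(2)
n <- 3000
geno <- rbinom(n, size = 2, prob = 0.5)          # additive coding 0/1/2
pheno <- 5 + 0.4 * geno + rnorm(n, sd = 1)       # hypothetical b0 = 5, b1 = 0.4

fit <- lm(pheno ~ geno)
c_means <- tapply(pheno, geno, mean)             # phenotypic mean per genotype

# Under additivity the two steps between conditional means are about equal
diff(c_means)

# Fitted values at genotypes 0, 1, 2, for comparison with the conditional means
predict(fit, newdata = data.frame(geno = 0:2))
```

If the heterozygote mean sat clearly off the line joining the two homozygote means, the two steps printed by `diff(c_means)` would differ, which is what a dominance term is meant to capture.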

    +

    Q5. What are the values of b0 and b1? Is QTL1 +significantly associated with the phenotype at alpha = 0.01 (i.e. 0.05/5, correcting for multiple +testing of 5 loci)?

    +

    Q6. What is the proportion of phenotypic variance +explained?

    +
    linA1 <- lm(pheno1_T1~T1QTL_A1, data=dataTwin)
    +summary(linA1)
    +summary(linA1)$r.squared            # proportion of explained variance by additive component
    +
    +c_means <- by(dataTwin$pheno1_T1,dataTwin$T1QTL_A1,mean)
    +plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
    +lines(c(0,1,2), c_means, type="p", col=6, lwd=8)
    +lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
    +

    To test this “linearity”, we can use the dominance coding of the QTL +and add the dominance term to the regression model.

    • Model: pheno1_T1 = b0 + b1* T1QTL_A1 + b2* T1QTL_D1 + e
    • Repeat for T2.

    Q7. Why can’t we analyse T1 and T2 together?

    Q8. Is there a dominance effect?

    -
    linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
    -summary(linA1)  # results lm(phenoT1~T1QTL_A1)
    -summary(linAD1) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
    -
    -plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
    -abline(linA1, lwd=3)
    -lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
    -lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
    -lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
    -

    Q9. Repeat for the other 4 QTL and determine which QTL shows strongest association with the phenotype T1

    -
    allQTL_A_T1 <- 2:6
    -cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
    -## Additive
    -cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
    -## Dominance
    -cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
    -
    -#Q9: QTL3 shows the strongest association with P=7.771588e-25
    -linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
    -summary(linAD3) # results lm(phenoT1~T1QTL_A1+T1QTL_D1)
    -

    If the subjects with top 5% of the phenotype score are considered as cases, perform case-control association test for most significant SNP (from Q9) and interpret the result.

    -

    Q10. What are the odds ratio, p-value, and 95% confidence interval?

    -
    quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05))
    -dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20]) 
    -dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20])
    -logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial"))
    -exp(logisticAD1$coefficients[2,1])                                     # odds ratio
    -exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2])  # lower 95% confidence interval
    -exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2])  # upper 95% confidence interval
    +
    linAD1 <- lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_D1, data=dataTwin)
    +summary(linA1)  # results lm(pheno1_T1~T1QTL_A1)
    +summary(linAD1) # results lm(pheno1_T1~T1QTL_A1+T1QTL_D1)
    +
    +plot(dataTwin$pheno1_T1 ~ dataTwin$T1QTL_A1, col='grey', ylim=c(3,7))
    +abline(linA1, lwd=3)
    +lines(c(0,1,2), c_means, type='p', col=6, lwd=8)
    +lines(sort(dataTwin$T1QTL_A1),sort(linA1$fitted.values), type='b', col="dark green", lwd=3)
    +lines(sort(dataTwin$T1QTL_A1),sort(linAD1$fitted.values), type='b', col="blue", lwd=3)
    +

    Q9. Repeat for the other 4 QTL and determine which +QTL shows the strongest association with phenotype 1 in T1.

    +
    allQTL_A_T1 <- 2:6
    +cpheno1_T1 <- which(colnames(dataTwin)=="pheno1_T1")
    +## Additive
    +cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
    +## Dominance
    +cbind(lapply(allQTL_A_T1,function(x){ fstat<- summary(lm(pheno1_T1 ~ ., data=dataTwin[,c(x,x+10,cpheno1_T1)]))$fstatistic;  pf(fstat[1],fstat[2],fstat[3],lower.tail = F) }))
    +
    +#Q9: QTL3 shows the strongest association with P=7.771588e-25
    +linAD3 <- lm(pheno1_T1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin)
    +summary(linAD3) # results lm(pheno1_T1~T1QTL_A3+T1QTL_D3)
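A per-locus scan with a multiple-testing correction can also be written with named p-values, which may be easier to read than the `cbind(lapply(...))` form. The sketch below runs on simulated data: the five loci, the effect size of 0.5 on the third locus, and all object names are assumptions for illustration.

```r
# Simulated data: five independent QTL, only the third affects the phenotype
set.seed(3)
n <- 2000
geno <- as.data.frame(replicate(5, rbinom(n, size = 2, prob = 0.5)))
names(geno) <- paste0("QTL_A", 1:5)
pheno <- 5 + 0.5 * geno$QTL_A3 + rnorm(n)        # hypothetical effect on QTL_A3

# p-value of the additive term from one single-QTL regression per locus
p_raw <- sapply(names(geno), function(q) {
  fit <- lm(pheno ~ geno[[q]])
  coef(summary(fit))[2, "Pr(>|t|)"]
})

# Bonferroni adjustment; equivalent to comparing p_raw against 0.05 / 5 = 0.01
p_bonf <- p.adjust(p_raw, method = "bonferroni")

names(which.min(p_raw))      # locus with the strongest association
p_bonf < 0.05                # loci significant after correction
```

`p.adjust` is part of base R (package stats). Bonferroni is used here only because it matches the 0.05/5 threshold in Q5.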
    +

    Treating the subjects in the top 5% of phenotype scores as +cases, perform a case-control association test for the most significant SNP +(from Q9) and interpret the result.

    +

    Q10. What are the odds ratio, p-value, and 95% +confidence interval?

    +
    quant05 <- quantile(c(dataTwin$pheno1_T1,dataTwin$pheno1_T2),seq(0,1,0.05))
    +dataTwin$CaseT1 <- as.numeric(dataTwin$pheno1_T1>quant05[20]) 
    +dataTwin$CaseT2 <- as.numeric(dataTwin$pheno1_T2>quant05[20])
    +logisticAD1 <- summary(glm(CaseT1 ~ T1QTL_A3 + T1QTL_D3, data=dataTwin, family="binomial"))
    +exp(logisticAD1$coefficients[2,1])                                     # odds ratio
    +exp(logisticAD1$coefficients[2,1]-1.96*logisticAD1$coefficients[2,2])  # lower 95% confidence interval
    +exp(logisticAD1$coefficients[2,1]+1.96*logisticAD1$coefficients[2,2])  # upper 95% confidence interval
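Q10 also asks for the p-value, which sits in the fourth column of the coefficient table. The sketch below collects the odds ratio, a Wald 95% confidence interval, and the p-value in one place. It uses simulated data and an additive-only model for brevity; the effect size, sample size, and object names are assumptions.

```r
# Simulated case-control example (all numeric values are assumptions)
set.seed(4)
n <- 4000
geno <- rbinom(n, size = 2, prob = 0.5)              # additive coding
pheno <- 0.5 * geno + rnorm(n)                       # hypothetical quantitative trait
case <- as.integer(pheno > quantile(pheno, 0.95))    # top 5% defined as cases

fit <- glm(case ~ geno, family = binomial)
est <- coef(summary(fit))["geno", ]                  # Estimate, Std. Error, z value, Pr(>|z|)

or      <- exp(est["Estimate"])                      # odds ratio per copy of allele A
ci_wald <- exp(est["Estimate"] + c(-1, 1) * qnorm(0.975) * est["Std. Error"])
p_value <- est["Pr(>|z|)"]

c(OR = unname(or), lower = ci_wald[1], upper = ci_wald[2], p = unname(p_value))
```

The interval is computed on the log-odds scale and then exponentiated, the same approach as the `1.96` lines above; `qnorm(0.975)` just replaces the rounded constant.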
    -
    -

    8.1.2 Part 2

    -

    Scenario: You are asked to estimate the additive genetic variance, dominance genetic variance and/or shared environmental variance using regression-based method and a classical twin design.

    +
    +

    Part 2

    +

    Scenario: You are asked to estimate the additive +genetic variance, dominance genetic variance and/or shared environmental +variance using a regression-based method and a classical twin design.

    \[\begin{align*} -\text{For ADE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{D} + \sigma^{2}_{E}\\ +\text{For ADE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + +\sigma^{2}_{D} + \sigma^{2}_{E}\\ -\text{For ACE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + \sigma^{2}_{C} + \sigma^{2}_{E}, \quad \text{where} \\ +\text{For ACE model : }~ & \sigma^{2}_{P} = \sigma^{2}_{A} + +\sigma^{2}_{C} + \sigma^{2}_{E}, \quad \text{where} \\ \sigma^{2}_{P} & \text{ is the phenotypic variance}, \\ @@ -515,11 +646,13 @@

    8.1.2 Part 2\[\begin{align*} cov(MZ) = cor(MZ) & = rMZ = \sigma^{2}_{A} + \sigma^{2}_{D} \\ - cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + 0.25 * \sigma^{2}_{D} \quad \text{ , where} \\ + cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + 0.25 * +\sigma^{2}_{D} \quad \text{ , where} \\ -\end{align*}\] -the coefficients 1/2 and 1/4 are based on quantitative genetic theory (Mather & Jinks, 1971).

    -

    By solving the unknowns, the Falconer’s equations for the ADE model:

    +\end{align*}\] the coefficients 1/2 and 1/4 are based on +quantitative genetic theory (Mather & Jinks, 1971).

    +

    Solving for the unknowns gives Falconer’s equations for the ADE +model:

    \[\begin{align*} \sigma^{2}_{A} & = 4*rDZ - rMZ \\ @@ -532,10 +665,12 @@

    8.1.2 Part 2\[\begin{align*} cov(MZ) = cor(MZ) & = rMZ= \sigma^{2}_{A} + \sigma^{2}_{C} \\ - cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + \sigma^{2}_{C} \quad \text{ , where} \\ + cov(DZ) = cor(DZ) & = rDZ = 0.5 * \sigma^{2}_{A} + \sigma^{2}_{C} +\quad \text{ , where} \\ \end{align*}\]

    -

    By solving the unknowns, the Falconer’s equations for the ACE model:

    +

    Solving for the unknowns gives Falconer’s equations for the ACE +model:

    \[\begin{align*} \sigma^{2}_{A} & = 2*(rMZ - rDZ) \\ @@ -543,30 +678,42 @@

    8.1.2 Part 2 -

    8.1.2.1 Questions for discussions :

    -

    Q1. What is missing heritability of common traits in the era of genome-wide association analysis (GWAS)?

    -

    Q2. What are the potential sources of missing heritability?

    +
    +

    Questions for discussions :

    +

    Q1. What is missing heritability of common traits in +the era of genome-wide association analysis (GWAS)?

    +

    Q2. What are the potential sources of missing +heritability?

    • Suggested reading:
        -
      • Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009 ;461(7265):747-753. doi:10.1038/nature08494
      • +
      • Manolio TA, Collins FS, Cox NJ, et al. Finding the missing +heritability of complex diseases. Nature. 2009 ;461(7265):747-753. doi:10.1038/nature08494
    -
    -

    8.1.2.2 Hands-on exercise : variance explained using regression-based method

    +
    +

    Hands-on exercise : variance explained using regression-based +method

    Q1. What is the variance of the phenotype?

    -

    Q2. Compute the explained variance attributable to the additive genetic component of the QTL with strongest association in Part 1.

    -

    Q3. Compute the explained variance attributable to the dominance genetic component of the QTL with strongest association in Part 1.

    -

    R2 from the regression represents the proportion of phenotypic variance explained; thus the raw explained variance component is R2 times the variance of the phenotype (var_pheno).

    +

    Q2. Compute the explained variance attributable to +the additive genetic component of the QTL with strongest association in +Part 1.

    +

    Q3. Compute the explained variance attributable to +the dominance genetic component of the QTL with strongest association in +Part 1.

    +

    R2 from the regression represents the proportion of phenotypic +variance explained; thus the raw explained variance component is R2 +times the variance of the phenotype (var_pheno).

    Example for T1 on QTL1
      -
    • The proportion of explained variance are 0.02732 (additive) and 0.03658 (total: additive + dominance).
    • -
    • As the predictors are uncorrelated, the proportion of explained variance by dominance = 0.03658 - 0.02732 = 0.00926
    • +
    • The proportions of explained variance are 0.02732 (additive) and +0.03658 (total: additive + dominance).
    • +
    • As the predictors are uncorrelated, the proportion of explained +variance by dominance = 0.03658 - 0.02732 = 0.00926
    • Given the phenotypic variance of 15.102, then
      • Total genetic: 0.03658*15.102 = 0.552
      • @@ -574,136 +721,133 @@

        8.1.2.2 Hands-on exercise : varia
      • Dominance genetic: 0.00926*15.102 = 0.140
    -
    var_pheno <- var(dataTwin$pheno1_T1)  # the variance of the phenotype
    -var_pheno
    -summary(linAD1)$r.squared           # proportion of explained variance by total genetic component
    -summary(linA1)$r.squared            # proportion of explained variance by additive component
    -summary(linAD1)$r.squared*var_pheno # (raw) variance component of total genetic component
    -summary(linA1)$r.squared*var_pheno  # (raw) variance component of additive genetic component
    -(summary(linAD1)$r.squared-summary(linA1)$r.squared)*var_pheno  # (raw) variance component of dominance genetic component
    +
    var_pheno <- var(dataTwin$pheno1_T1)  # the variance of the phenotype
    +var_pheno
    +summary(linAD1)$r.squared           # proportion of explained variance by total genetic component
    +summary(linA1)$r.squared            # proportion of explained variance by additive component
    +summary(linAD1)$r.squared*var_pheno # (raw) variance component of total genetic component
    +summary(linA1)$r.squared*var_pheno  # (raw) variance component of additive genetic component
    +(summary(linAD1)$r.squared-summary(linA1)$r.squared)*var_pheno  # (raw) variance component of dominance genetic component
    -

    Q4. Estimate the variance explained by all the QTL using linear regression.

    -
    # compute for all 5 QTL
    -linAD5=(lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 +
    -                       T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
    -            data=dataTwin))
    -
    -summary(linAD5)$r.squared              # proportion of explained variance by total genetic component
    -summary(linAD5)$r.squared*var_pheno    # (raw) variance component of total genetic component
    +

    Q4. Estimate the variance explained by all the QTL +using linear regression.

    +
    # compute for all 5 QTL
    +linAD5=(lm(pheno1_T1 ~ T1QTL_A1 + T1QTL_A2 + T1QTL_A3 + T1QTL_A4 + T1QTL_A5 +
    +                       T1QTL_D1 + T1QTL_D2 + T1QTL_D3 + T1QTL_D4 + T1QTL_D5,
    +            data=dataTwin))
    +
    +summary(linAD5)$r.squared              # proportion of explained variance by total genetic component
    +summary(linAD5)$r.squared*var_pheno    # (raw) variance component of total genetic component
    -
    -

    8.1.2.3 Hands-on exercise : variance explained using a classical twin design.

    -

    Based on our regression results, we have estimates of the total genetic variance as well as the A and D components for phenotype 1. In practice, it is impossible to know all the variants associated with any polygenic trait.

    -

    Given rMZ > 2*rDZ, we can use Falconer’s formula based on ADE model to estimate the A (additive genetic) and D (dominance) variance with the classical twin design for phenotype 1 without genotypes.

    +
    +

    Hands-on exercise : variance explained using a classical twin +design.

    +

    Based on our regression results, we have estimates of the total +genetic variance as well as the A and D components for phenotype 1. In +practice, it is impossible to know all the variants associated with any +polygenic trait.

    +

    Given rMZ > 2*rDZ, we can use Falconer’s formula +based on the ADE model to estimate the A (additive genetic) and D +(dominance) variance with the classical twin design for phenotype 1 +without genotypes.
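The condition rMZ > 2*rDZ follows from the two twin covariance equations. The short derivation below is a sketch that assumes standardized variances, so that the phenotypic variance equals 1 and covariances equal correlations.

\[\begin{align*}
rMZ - 2*rDZ & = (\sigma^{2}_{A} + \sigma^{2}_{D}) - 2*(0.5 * \sigma^{2}_{A} + 0.25 * \sigma^{2}_{D}) = 0.5 * \sigma^{2}_{D} \\
\sigma^{2}_{D} & = 2*rMZ - 4*rDZ \\
\sigma^{2}_{A} & = rMZ - \sigma^{2}_{D} = 4*rDZ - rMZ \\
\sigma^{2}_{E} & = 1 - rMZ
\end{align*}\]

A positive dominance variance therefore requires rMZ > 2*rDZ. Under the ACE model the analogous quantity is \(\sigma^{2}_{C} = 2*rDZ - rMZ\), which is positive only when rMZ < 2*rDZ, so comparing rMZ with 2*rDZ indicates which of the two models gives non-negative estimates.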

    Q5. Compute rMZ and rDZ.

    -

    Q6. Estimate the proportion of additive and dominance genetic variances using the Falconer’s equations for the ADE model.

    -
    dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno1_T1', 'pheno1_T2')] # MZ data frame
    -dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno1_T1', 'pheno1_T2')] # DZ data frame
    -
    -rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
    -rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
    -
    -sA2 = 4*rDZ - rMZ
    -sD2 = 2*rMZ - 4*rDZ
    -sE2 = 1 - sA2 - sD2
    -print(c(sA2, sD2, sE2))
    -

    Similarly, for phenotype 2, we can estimate the proportion of additive and/or dominance genetic variances as well as shared environmental variance using the Falconer’s formula.

    -

    Q7. Which model (ACE or ADE) should be considered for phenotype 2?

    -

    Q8. Estimate the proportion of A, C/D and E variance components for phenotype 2.

    -
    dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno2_T1', 'pheno2_T2')] # MZ data frame
    -dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno2_T1', 'pheno2_T2')] # DZ data frame
    -
    -rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
    -rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
    -
    -sA2 = 2*(rMZ - rDZ)
    -sC2 = 2*rDZ - rMZ
    -sE2 = 1 - rMZ
    -print(c(sA2, sC2, sE2))
    - - +

    Q6. Estimate the proportions of additive and +dominance genetic variance using Falconer’s equations for the ADE +model.

    +
    dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno1_T1', 'pheno1_T2')] # MZ data frame
    +dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno1_T1', 'pheno1_T2')] # DZ data frame
    +
    +rMZ=cor(dataMZ)[2,1] # element 2,1 in the MZ correlation matrix
    +rDZ=cor(dataDZ)[2,1] # element 2,1 in the DZ correlation matrix
    +
    +sA2 = 4*rDZ - rMZ
    +sD2 = 2*rMZ - 4*rDZ
    +sE2 = 1 - sA2 - sD2
    +print(c(sA2, sD2, sE2))
    +

    Similarly, for phenotype 2, we can estimate the proportions of +additive and/or dominance genetic variance as well as shared +environmental variance using Falconer’s formula.

    +

    Q7. Which model (ACE or ADE) should be considered +for phenotype 2?

    +

    Q8. Estimate the proportion of A, C/D and E variance +components for phenotype 2.

    +
    dataMZ = dataTwin[dataTwin$zygosity==1, c('pheno2_T1', 'pheno2_T2')] # MZ data frame
    +dataDZ = dataTwin[dataTwin$zygosity==2, c('pheno2_T1', 'pheno2_T2')] # DZ data frame
    +
    +rMZ=cor(dataMZ, use="complete.obs")[2,1] # element 2,1 in the MZ correlation matrix
    +rDZ=cor(dataDZ, use="complete.obs")[2,1] # element 2,1 in the DZ correlation matrix
    +
    +sA2 = 2*(rMZ - rDZ)
    +sC2 = 2*rDZ - rMZ
    +sE2 = 1 - rMZ
    +print(c(sA2, sC2, sE2))
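The model choice and the two sets of Falconer equations can be wrapped in one helper. This is a minimal sketch under the same assumption of standardized variances; the function name `falconer` and the example correlations are hypothetical and not part of the course material.

```r
# Falconer-type estimates from twin correlations (sketch; assumes standardized variances)
falconer <- function(rMZ, rDZ) {
  if (rMZ > 2 * rDZ) {
    # ADE model: rMZ = A + D, rDZ = 0.5*A + 0.25*D
    est <- c(A = 4 * rDZ - rMZ, D = 2 * rMZ - 4 * rDZ, E = 1 - rMZ)
    list(model = "ADE", estimates = est)
  } else {
    # ACE model: rMZ = A + C, rDZ = 0.5*A + C
    est <- c(A = 2 * (rMZ - rDZ), C = 2 * rDZ - rMZ, E = 1 - rMZ)
    list(model = "ACE", estimates = est)
  }
}

falconer(rMZ = 0.60, rDZ = 0.20)   # hypothetical correlations, ADE branch
falconer(rMZ = 0.60, rDZ = 0.40)   # hypothetical correlations, ACE branch
```

With sampling error the estimates can fall outside the range 0 to 1 (for example, A becomes negative when rDZ is below rMZ/4), so the output is better read as a rough guide than as a fitted model.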
    +
    +

    References

    +
      +
    1. Evans DM, Gillespie NA, Martin NG. Biometrical genetics. Biol +Psychol. 2002 Oct;61(1-2):33-51. doi: 10.1016/s0301-0511(02)00051-0. +PMID: 12385668. [Review article]

    +
    2. Falconer, D.S. and Mackay, T.F.C. (1996) Introduction to +Quantitative Genetics. 4th Edition, Addison Wesley Longman, Harlow. +[A classic text; many online versions are available]

    +
    3. Neale, B., Ferreira, M., Medland, S., & Posthuma, D. (Eds.). +(2007). Statistical Genetics: Gene Mapping Through Linkage and +Association (1st ed.). Taylor & Francis. https://doi.org/10.1201/9780203967201 [chapter on +biometrical genetics; can be borrowed from the HKU library]

    +
    4. https://ibg.colorado.edu/cdrom2020/dolan/biometricalGenetics/biom_gen_2020.pdf +[Course material of the Boulder IBG workshop, co-organized by leading +statistical geneticists]

    +
    +

    + -
    - -
    -