“Generalization” is the replication of an association between a genetic variant and a trait, discovered in one population, to another population.
- Most genetic association studies were performed in populations of European Ancestry (EA)
- These are often detected in very large GWAS (e.g. 100,000 individuals)
- To know, whether associations that were discovered in one populations exists in another.
- This may not always be true...
- To gain power by limiting the number of variants tested for associations to those already previously reported.
- Because we need to perform analysis, but we do not have access to an independent study with the same type of population and/or the same trait.
- An intuitive approach to generalization analysis:
- Take the list of SNP associations reported in a paper
- Test the same SNPs with the same trait in your data
- Report the significant associations.
- What should be the p-value threshold to report associations?
Wait for it...
- We developed a generalization testing framework that originated in the replication analysis literature.
- We combine test results (p-values) from both the discovery study, and our study (the follow-up)
- and calculate an r-value.
- (for every SNP).
- These r-values take into account multiple testing (of both studies),
- And are used like p-values.
- Since they are already adjusted for multiple testing, an association is generalized if the r-value<0.05.
- The generalization framework also takes into account the direction of associations.
- If the estimated association is negative in one study, and positive in the other, the association will not generalize.
- Here, the cells in gray represent generalized associations.
- Suppose that we ran a GWAS of platelet count in the HCHS/SOL.
- The results are displayed in the Manhattan plot:
- The platelet GWAS discovered 5 new associations
- that were then replicated in independent studies.
- There was another association that did not replicate.
- And there were a few additional known associations that were statistically significant.
- What about 55 other associations that were previously reported in other papers, reporting GWAS in other populations?
- Generalization analysis!
- The generalization R package have an example from the HCHS/SOL platelet count paper.
- We first load this package. (Install it if you haven't already!)
#library(devtools)
#install_github("tamartsi/generalize@Package_update",
# subdir = "generalize")
require(generalize)
- The generalization R package has an example data set.
- It has results reported by Geiger et al., 2011, and matched association results from the HCHS/SOL.
- Generalization analysis is done for one study at a time.
# load the data set from the package
data("dat")
# look at the column names:
matrix(colnames(dat), ncol = 3)
## [,1] [,2] [,3]
## [1,] "rsID" "study1.beta" "study2.alleleB"
## [2,] "chromosome" "study1.se" "study2.beta"
## [3,] "position" "study1.pval" "study2.se"
## [4,] "study1.alleleA" "study1.n.test" "study2.pval"
## [5,] "study1.alleleB" "study2.alleleA" "Ref"
- The data.frame with the example provides all information we need for generalization analysis.
head(dat)
## rsID chromosome position study1.alleleA study1.alleleB
## 1 rs2336384 1 12046062 G T
## 2 rs10914144 1 171949749 T C
## 3 rs1668871 1 205237136 C T
## 4 rs7550918 1 247675558 T C
## 5 rs3811444 1 248039450 C T
## 6 rs1260326 2 27730939 T C
## study1.beta study1.se study1.pval study1.n.test study2.alleleA
## 1 2.172 0.382 1.25e-08 2710000 G
## 2 3.417 0.487 2.22e-12 2710000 T
## 3 2.804 0.368 2.59e-14 2710000 T
## 4 3.133 0.471 2.91e-11 2710000 C
## 5 3.346 0.574 5.60e-09 2710000 C
## 6 2.334 0.381 9.12e-10 2710000 T
## study2.alleleB study2.beta study2.se study2.pval Ref
## 1 T 1.1164496 0.8084368 0.1672795709 Gieger,2011
## 2 C 1.9402873 0.9881444 0.0495803692 Gieger,2011
## 3 C 0.4107451 0.9386512 0.6616829698 Gieger,2011
## 4 T -0.9727501 0.8973005 0.2783270717 Gieger,2011
## 5 T 3.4528058 0.8908264 0.0001062059 Gieger,2011
## 6 C 2.5336998 0.8571613 0.0031173839 Gieger,2011
dat.matched <- matchEffectAllele(dat$rsID,
study2.effect = dat$study2.beta,
study1.alleleA = dat$study1.alleleA,
study2.alleleA = dat$study2.alleleA,
study1.alleleB = dat$study1.alleleB,
study2.alleleB = dat$study2.alleleB)
## passed data entry checks, orienting the effects of study2 to the correct effect allele. Assuming same strand.
head(dat.matched)
## snpID study2.effect study1.alleleA flip strand.ambiguous
## 1 rs2336384 1.1164496 G FALSE FALSE
## 2 rs10914144 1.9402873 T FALSE FALSE
## 3 rs1668871 -0.4107451 C TRUE FALSE
## 4 rs7550918 0.9727501 T TRUE FALSE
## 5 rs3811444 3.4528058 C FALSE FALSE
## 6 rs1260326 2.5336998 T FALSE FALSE
dat$study2.beta <- dat.matched$study2.effect
dat$alleleA <- dat$study1.alleleA
dat$alleleB <- dat$study1.alleleB
dat$study1.alleleA <- dat$study1.alleleB <-
dat$study2.alleleA <- dat$study2.alleleB <- NULL
head(dat)
## rsID chromosome position study1.beta study1.se study1.pval
## 1 rs2336384 1 12046062 2.172 0.382 1.25e-08
## 2 rs10914144 1 171949749 3.417 0.487 2.22e-12
## 3 rs1668871 1 205237136 2.804 0.368 2.59e-14
## 4 rs7550918 1 247675558 3.133 0.471 2.91e-11
## 5 rs3811444 1 248039450 3.346 0.574 5.60e-09
## 6 rs1260326 2 27730939 2.334 0.381 9.12e-10
## study1.n.test study2.beta study2.se study2.pval Ref alleleA
## 1 2710000 1.1164496 0.8084368 0.1672795709 Gieger,2011 G
## 2 2710000 1.9402873 0.9881444 0.0495803692 Gieger,2011 T
## 3 2710000 -0.4107451 0.9386512 0.6616829698 Gieger,2011 C
## 4 2710000 0.9727501 0.8973005 0.2783270717 Gieger,2011 T
## 5 2710000 3.4528058 0.8908264 0.0001062059 Gieger,2011 C
## 6 2710000 2.5336998 0.8571613 0.0031173839 Gieger,2011 T
## alleleB
## 1 T
## 2 C
## 3 T
## 4 C
## 5 T
## 6 C
- Test for generalization:
gen.res <- testGeneralization(dat$rsID, dat$study1.pval,
dat$study2.pval, dat$study1.n.test[1],
study1.effect = dat$study1.beta,
study2.effect = dat$study2.beta,
directional.control = TRUE,
control.measure = "FDR" )
## Controlling FDRat the 0.05 level
## Generating one-sided p-values guided by study1's directions of effects...
## Calcluating FDR r-values...
head(gen.res)
## snpID gen.rvals generalized
## 1 rs2336384 0.2422669647 FALSE
## 2 rs10914144 0.0867656461 FALSE
## 3 rs1668871 1.0000000000 FALSE
## 4 rs7550918 0.3542344549 FALSE
## 5 rs3811444 0.0005575808 TRUE
## 6 rs1260326 0.0093521516 TRUE
- Create a figure:
require(ggplot2,quietly = TRUE)
require(gridExtra,quietly = TRUE)
require(RColorBrewer,quietly = TRUE)
figure.out <- paste0(getwd(),
"/Generalization_example.pdf")
prepareGenResFigure(dat$rsID, dat$study1.beta,
dat$study1.se, dat$study2.beta, dat$study2.se,
gen.res$generalized, gen.res$gen.rvals,
dat$study1.n.test[1],
output.file = figure.out,
study1.name = "Study1",
study2.name = "Study2")
- Coverage of the confidence intervals... depends on the number of tests!
- e.g. (1 − α/10)×100% for 10 tests in a study for Bonferroni-type coverage.
- There are other options, controlling "False coverage rate", more complicated.
- Generalization of only "lead SNPs" compared to all SNPs with p-value below some threshold.
- Lead SNP in EA GWAS may be correlated with the causal SNP in EA, but not with Hispanics/Latinos!
- Non-generalization due to lack of power.
- Summarize information across non-generalized associations, e.g.:
- Test consistency of direction of associations between the discovery study and HCHS/SOL;
- Test trait association with Genetic Risk Score (GRS) - GRS can be generated as the sum of reported trait-increasing alleles. Test a GRS composed solely of SNP alleles of non-generalized associations.
- We ran a GWAS of Diabetes in the HCHS/SOL.
- Reported in Qi et. al. (2017) "Genetics of Type 2 Diabetes in US Hispanic/Latino Individuals: Results from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL)", .
- The GWAS identified two genome-wide significant associations (p-value<5 × 10−8) in known regions.
- There were 76 known independent associations at the time.
- The power to detect these associations at the p-value<5 × 10−8 was low.
\begin{figure} \includegraphics[scale=0.4]{diabetes_discovery_power.pdf} \end{figure} Examples from our work - diabetes
- We approximated the power to detect the associations in generalization analysis using Bonferroni threshold. \begin{figure} \includegraphics[scale=0.4]{diabetes_generalization_power.pdf} \end{figure}
- 14 of the associations generalized in generalization analysis.
\begin{center} Question: could other associations generalize if we had more power? \end{center}
- To address this, we constructed a GRS by summing all non-generalized diabetes risk-alleles for all participants in the analysis.
- And tested the association of this GRS with diabetes.
- The resulting p-value=6.12 × 10−14.
- In the generalization manuscript we investigated approaches for generalization when entire GWAS is available
- Compared to the case where only lead SNPs are available.
- Reported in Sofer et. al. (2017), "A powerful statistical framework for generalization testing in GWAS, with application to the HCHS/SOL", .
- The GLGC consortium published a list of 74 lead SNPs, from 74 genomic regions, in Willer et al. (2013).
- European Ancestry (EA); ∼190, 000 individuals.
- In addition, the complete results from Willer et al.’s analysis are freely available online.
- In generalization analysis applied on these 74 SNPs .
- In generalization analysis applied on 4,106 SNPs with p-value<5 × 10−8 in the Willer et al. GWAS 2,206 SNPs generalized.
- These SNPs were from .
- 34 of the lead SNPs reported by Willer et al. generalized (only 33 of these generalized in the "usual" generalization analysis)
- And also non-lead SNPs from 8 additional genomics regions.
- In generalization analysis applied on 5,399 SNPs SNPs with p-value<1 × 10−6 in the Willer et al. GWAS 2,418 SNPs generalized.
- These SNPs were from .
The TC example demonstrates that
- Due to differences in LD structure, there are instances where the lead EA SNP is different than the lead SNP in HCHS/SOL.
- Applying generalization testing on more SNPs (not just the lead SNPs) is useful.
- Considering SNPs with higher p-value than the commonly-used 5 × 10−8 can increase power.
- I generated a data set based on generalization analysis that I have done for the diabetes GWAS manuscript in HCHS/SOL.
- The following exercise will take you through generalization analysis based on this data set.
- Use the command read.csv() to read the files and with
- Association results published in a Mahajan et al. (2014) paper with results of diabetes GWAS in the DIAGRAM consortium (altered a bit).
- Association results of a few more variants in the HCHS/SOL (also altered a bit).
More in the next slide...
-
Use the function match() to subset the results from HCHS/SOL to those from Mahajan et al.
-
How would you know if variants have the same direction of association in the HCHS/SOL and in the DIAGRAM consortium?
-
Use the function matchEffectAllele() to match the effect sizes in the HCHS/SOL to correspond the same effect allele as in the DIAGRAM.
-
Test which associations generalize to the HCHS/SOL.
- Take the number of tested associations in the DIAGRAM to be 106.
-
How many associations generalized?
-
Compare the effect allele frequencies between the two studies using plot() command.