From c906fb10ef1ef6648f49cd245825ef288cda62b8 Mon Sep 17 00:00:00 2001
From: Gregory Owens <5419829+owensgl@users.noreply.github.com>
Date: Mon, 11 Mar 2024 16:31:09 -0700
Subject: [PATCH] Create lab_9.md

---
 labs/lab_9.md | 320 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 labs/lab_9.md

diff --git a/labs/lab_9.md b/labs/lab_9.md
new file mode 100644
index 000000000..3d0b29d6b
--- /dev/null
+++ b/labs/lab_9.md
@@ -0,0 +1,320 @@
+---
+title: GWAS
+element: lab
+layout: default
+---
+
+## Objectives
+
+- To format data for a GWA analysis
+- To run a GWA
+- To plot a GWA analysis
+- To run a single association test manually
+
+****
+
+In today's lab we're going to run a genome-wide association (GWA) analysis
+on some example data. Log in to the server and remember to source
+your bash.sh to have access to the module system.
+
+First we're going to organize our directory and create a symbolic link
+to the data file.
+```shell
+cd ~
+mkdir lab_9
+mkdir lab_9/vcf
+mkdir lab_9/pca
+mkdir lab_9/gwas
+mkdir lab_9/info
+cd lab_9
+ln -s /project/ctb-grego/sharing/lab_9/chinook.gwas.vcf.gz vcf/
+```
+We've created a series of directories where we can put our different
+files to keep everything organized. Moving forward, remember to put data files and output
+in the correct location. Organization is very important because for many projects,
+you're going to be working on them for perhaps years, and you will forget where you
+put individual items.
+
+Next, we'll run a PCA on the genotypes with plink.
+```shell
+module load plink
+plink --vcf vcf/chinook.gwas.vcf.gz --out pca/chinook.gwas --pca --allow-extra-chr --double-id --autosome-num 95
+```
+
+Let's take a look at this to make sure none of the samples are behaving badly. For example, if we accidentally included a sample of the wrong species, it would sit very far from all the other samples in the PCA. Open RStudio Server.
+
+```R
+
+library(ggplot2)
+library(dplyr)
+library(readr)
+library(tidyr)
+library(forcats)
+
+pca_data <- read_table("lab_9/pca/chinook.gwas.eigenvec",
+                       col_names = c("family.id","sample.id",paste0("PC",1:20)))
+
+pca_data %>%
+  ggplot(.,aes(x=PC1, y=PC2)) +
+  geom_point()
+```
+![PCA plot](../figs/lab_9.1.png)
+
+### Question
+- Based on the sample names, color code the PCA plot by year.
+Are the two years different in the first two PCs? What about PCs 3 and 4?
+
+We want to use the PC values as covariates for our GWA. We're going to use
+plink for the GWA, so we have to follow its format. It wants the first two columns to
+contain the family ID and sample ID. If you had related individuals, you could use this
+to show which family each was in, but in our case each individual is a random fish
+and we don't expect any to be close relatives. So for each sample we're using the same
+value for family ID and sample ID (this is what `--double-id` does).
+
+```R
+pca_data %>%
+  select(family.id, sample.id, PC1, PC2, PC3, PC4, PC5) %>%
+  write_tsv("lab_9/info/chinook.gwas.pca.txt")
+
+```
+I've included the phenotype file in the shared directory, so we can copy it over.
+```shell
+cp /project/ctb-grego/sharing/lab_9/chinook.date.pheno info/chinook.gwas.pheno.txt
+```
+
+Now we're ready to run our first GWAS.
+```shell
+plink --vcf vcf/chinook.gwas.vcf.gz \
+  --out gwas/chinook.gwas \
+  --linear \
+  --covar info/chinook.gwas.pca.txt \
+  --allow-extra-chr \
+  --double-id \
+  --pheno info/chinook.gwas.pheno.txt \
+  --all-pheno \
+  --allow-no-sex
+```
+With this command, we're using a linear model to test the association between this trait
+and the genotype (since it's a quantitative trait). We're using the first 5 PCs as covariates
+to control for population stratification. We're passing the phenotype values in a separate file
+with the `--pheno` option.
+
+Now let's plot this in R.
+```R
+gwas <- read_table("lab_9/gwas/chinook.gwas.P1.assoc.linear")
+head(gwas)
+```
+```output
+> head(gwas)
+# A tibble: 6 × 10
+  CHR        SNP      BP A1    TEST  NMISS    BETA    STAT        P X10
+
+1 CM031216.1 .       455 T     ADD      95   -3.09  -1.67   9.79e- 2 NA
+2 CM031216.1 .       455 T     COV1     95   40.9    2.86   5.25e- 3 NA
+3 CM031216.1 .       455 T     COV2     95 -112     -7.34   9.92e-11 NA
+4 CM031216.1 .       455 T     COV3     95   46.6    3.34   1.24e- 3 NA
+5 CM031216.1 .       455 T     COV4     95    0.534  0.0378 9.70e- 1 NA
+6 CM031216.1 .       455 T     COV5     95   20.8    1.40   1.64e- 1 NA
+```
+For each SNP in the genome, we have multiple lines. One of the lines is the association between the SNP and our phenotype (after controlling for covariates), while the others are the
+effects of each covariate on the phenotype. We aren't particularly interested in the covariate
+effects, so we can filter those out.
+```R
+gwas <- gwas %>%
+  filter(TEST == "ADD")
+```
+Lastly, when plotting GWAS, it's easier to visualize -log10(p) values, so let's create that column.
+```R
+gwas <- gwas %>%
+  mutate(log_p = -log10(P))
+```
+When making GWA plots, we often have huge numbers of points. If you try to make a figure
+with millions of points, it often bogs down your computer or results in a pdf file that
+is way too big to handle. So, an easy workaround is to skip plotting
+SNPs that are completely unassociated (here, those with -log10(p) of 1 or less).
+```R
+gwas %>%
+  filter(log_p > 1) %>%
+  ggplot(.,aes(x=BP,y=log_p)) +
+  geom_point()
+```
+![GWAS plot](../figs/lab_9.2.png)
+
+We see a pretty big peak at the right side of the graph, as well as a
+couple of high points midway across the chromosome. Let's zoom in on these.
+
+```R
+gwas %>%
+  arrange(desc(log_p)) %>%
+  head()
+```
+```output
+# A tibble: 6 × 11
+  CHR        SNP         BP A1    TEST  NMISS   BETA  STAT          P X10   log_p
+
+1 CM031216.1 .     12211921 C     ADD     114 -12.5  -5.18 0.00000106 NA     5.97
+2 CM031216.1 .     33100788 A     ADD     114 -12.2  -5.18 0.00000107 NA     5.97
+3 CM031216.1 .     12224440 T     ADD     114 -12.6  -5.13 0.00000130 NA     5.88
+4 CM031216.1 .     33049078 T     ADD     114 -12.1  -5.10 0.00000146 NA     5.83
+5 CM031216.1 .     32303435 T     ADD     114  -9.04 -5.04 0.00000191 NA     5.72
+6 CM031216.1 .     33037939 T     ADD     113 -11.1  -5.04 0.00000195 NA     5.71
+```
+```R
+top_snps <- gwas %>%
+  arrange(desc(log_p)) %>%
+  head(n=10) %>%
+  pull(BP)
+
+plotting_range <- 200000
+gwas %>%
+  filter(BP > top_snps[1] - plotting_range,
+         BP < top_snps[1] + plotting_range) %>%
+  ggplot(.,aes(x=BP,y=log_p)) +
+  geom_point()
+
+gwas %>%
+  filter(BP > top_snps[2] - plotting_range,
+         BP < top_snps[2] + plotting_range) %>%
+  ggplot(.,aes(x=BP,y=log_p)) +
+  geom_point()
+```
+
+
+![GWAS zoom 1](../figs/lab_9.3.png)
+![GWAS zoom 2](../figs/lab_9.4.png)
+
+Now a big question we might ask is: what genes are under those peaks?
+We can take a look at that in R. I've organized a list of genes for the
+chromosome we're working on.
+
+```R
+genes <- read_tsv("/project/ctb-grego/sharing/lab_9/chinook.genes.txt")
+
+plot_1 <- gwas %>%
+  filter(BP > top_snps[1] - plotting_range,
+         BP < top_snps[1] + plotting_range) %>%
+  ggplot(.,aes(x=BP,y=log_p)) +
+  geom_point()
+
+plot_1 <- plot_1 +
+  geom_segment(data = genes %>% filter(start > top_snps[1] - plotting_range,
+                                       end < top_snps[1] + plotting_range),
+               aes(x = start, y = -0.1, xend = end, yend = -0.1), size = 2, color = "blue")
+
+plot_1
+```
+![GWAS zoom 3](../figs/lab_9.5.png)
+
+Now that we've done some GWAS calculations, let's see if we can validate our
+p-values for one SNP. We're going to extract the genotypes for one of our
+top SNPs and run our own linear model.
+
+```R
+library(vcfR)
+#We have to tell it the path to the original file, not the symbolic link
+vcf <- read.vcfR("/project/ctb-grego/sharing/lab_9/chinook.gwas.vcf.gz")
+#extract the genotypes
+gt_data <- vcfR2tidy(vcf, format_fields = 'GT')
+gt <- gt_data$gt
+
+#This step removes the gt_data, since it hogs memory
+rm(gt_data)
+
+#Get the genotypes for your top SNP candidate
+top_gt <- gt %>%
+  filter(POS == top_snps[1])
+
+phenotypes <- read_tsv("lab_9/info/chinook.gwas.pheno.txt",
+                       col_names = c("Indiv","spacer","phenotype"))
+
+top_gt %>%
+  inner_join(phenotypes, by = "Indiv") %>%
+  ggplot(aes(x=gt_GT, y=phenotype)) +
+  geom_boxplot()
+```
+
+![boxplot](../figs/lab_9.6.png)
+
+We can see from this that the alternate allele is associated with a smaller
+phenotype value. Based on this, we'd probably say that the alternate
+allele is recessive. Our simple model assumes it's additive, but the data
+fit that model well enough, which is why the p-value is low. Plink is basically
+just running a linear model, so let's try that ourselves.
+
+```R
+linear_model <- top_gt %>%
+  inner_join(phenotypes, by = "Indiv") %>%
+  mutate(genotype_numeric = case_when(gt_GT == "0/0" ~ 0,
+                                      gt_GT == "0/1" ~ 1,
+                                      gt_GT == "1/1" ~ 2,
+                                      TRUE ~ NA_real_)) %>%
+  lm(phenotype ~ genotype_numeric, data=.)
+
+summary(linear_model)
+```
+```output
+Call:
+lm(formula = phenotype ~ genotype_numeric, data = .)
+
+Residuals:
+    Min      1Q  Median      3Q     Max
+-41.965  -8.965  -1.965   5.785  53.035
+
+Coefficients:
+                 Estimate Std. Error t value Pr(>|t|)
+(Intercept)       204.965      1.828 112.122  < 2e-16 ***
+genotype_numeric  -14.001      3.450  -4.058  9.2e-05 ***
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+
+Residual standard error: 17.29 on 112 degrees of freedom
+Multiple R-squared:  0.1282, Adjusted R-squared:  0.1204
+F-statistic: 16.47 on 1 and 112 DF,  p-value: 9.2e-05
+```
+The p-value for this version of the test is 9.2e-5, which is significant but larger than the plink p-value for this SNP (1.06e-06).
+One difference is that we didn't control for the PCs, so let's try that for our linear model.
+
+```R
+pcas <- read_tsv("lab_9/info/chinook.gwas.pca.txt") %>%
+  rename(Indiv = family.id)
+
+linear_model_plus <- top_gt %>%
+  inner_join(phenotypes, by = "Indiv") %>%
+  inner_join(pcas, by = "Indiv") %>%
+  mutate(genotype_numeric = case_when(gt_GT == "0/0" ~ 0,
+                                      gt_GT == "0/1" ~ 1,
+                                      gt_GT == "1/1" ~ 2,
+                                      TRUE ~ NA_real_)) %>%
+  lm(phenotype ~ PC1 + PC2 + PC3 + PC4 + PC5 + genotype_numeric, data=.)
+
+summary(linear_model_plus)
+```
+```output
+Call:
+lm(formula = phenotype ~ PC1 + PC2 + PC3 + PC4 + PC5 + genotype_numeric,
+    data = .)
+
+Residuals:
+    Min      1Q  Median      3Q     Max
+-38.636  -6.602   1.060   6.247  31.398
+
+Coefficients:
+                 Estimate Std. Error t value Pr(>|t|)
+(Intercept)       204.620      1.242 164.699  < 2e-16 ***
+PC1                49.129     11.647   4.218 5.17e-05 ***
+PC2              -113.713     11.666  -9.747  < 2e-16 ***
+PC3                57.262     11.710   4.890 3.57e-06 ***
+PC4                16.984     11.705   1.451    0.150
+PC5               -11.305     12.002  -0.942    0.348
+genotype_numeric  -12.542      2.422  -5.178 1.06e-06 ***
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+
+Residual standard error: 11.65 on 107 degrees of freedom
+Multiple R-squared:  0.6223, Adjusted R-squared:  0.6011
+F-statistic: 29.38 on 6 and 107 DF,  p-value: < 2.2e-16
+```
+
+After controlling for the PCs, the genotype effect is still
+significant (p = 1.06e-06, matching the plink result for this SNP), suggesting
+it isn't only caused by correlation with population stratification.
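+
+To see why the covariates matter, here is a small simulated sketch (made-up data, not
+the chinook dataset). It assumes a single structure variable, `pc`, standing in for one PC,
+that shifts both the allele frequency and the phenotype. All of the variable names and
+effect sizes below are hypothetical and chosen only for illustration.
+
+```R
+# Simulated data only: every value here is an assumption for illustration
+set.seed(42)
+n <- 200
+pc <- rnorm(n)                                  # stand-in for one PC
+
+# Allele frequency depends on pc, so genotype is confounded with structure
+allele_freq <- plogis(pc)
+genotype <- rbinom(n, size = 2, prob = allele_freq)
+
+# Phenotype depends on both structure (+10 per unit pc) and genotype (-5 per allele)
+phenotype <- 200 + 10 * pc - 5 * genotype + rnorm(n, sd = 10)
+
+# Model without and with the structure covariate
+naive <- lm(phenotype ~ genotype)
+adjusted <- lm(phenotype ~ pc + genotype)
+
+# Pull the genotype row (estimate, standard error, t value, p-value) from each
+summary(naive)$coefficients["genotype", ]
+summary(adjusted)$coefficients["genotype", ]
+```
+
+In the first model the genotype estimate absorbs part of the `pc` effect, while in the
+second it should land closer to the simulated value of -5. The same indexing,
+`summary(linear_model_plus)$coefficients["genotype_numeric", ]`, pulls out the genotype
+row from the model we fit above.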