-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Yeast analysis example #70
Merged
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,154 @@ | ||
# Yeast breeding experiment | ||
In this example, we want to model a cross of two diploid yeast strains. Depending on which sequence data you have | ||
available, you can take multiple approaches depending on which genome data you have available. In this document we will | ||
demonstrate a workflow that makes us of variant call files, and a workflow that uses whole genome alignment. In both | ||
cases we study two beer yeasts from the [1002 Yeast Genome collection](http://1002genomes.u-strasbg.fr/): Orval | ||
trappist yeast strain DBVPG6695 from Belgium and American ale strain 1.3_Safale_US05 (codenames BRQ and CFD). | ||
|
||
First we set up a new gen repository to store and track our sequences. For this project we don't anticipate using | ||
more than one gen database file, and also don't need more than one collection. By setting defaults we won't have to | ||
enter these details in future commands. | ||
|
||
```console | ||
bvh@mbp:~$ gen init | ||
|
||
Gen repository initialized. | ||
``` | ||
``` console | ||
bvh@mbp:~$ gen defaults --database breeding_experiment.db --collection genome | ||
|
||
Default database set to breeding_experiment.db | ||
Default collection set to genome | ||
``` | ||
|
||
## Starting from variant calls | ||
The dataset we are working with includes variant call files that were generated by aligning sequencing reads to the | ||
S288C reference genome, specifically version R64-1-1. We'll start by downloading this reference sequence from the | ||
Saccharomyces Genome Database (SGD) and extracting the archive. | ||
|
||
```console | ||
bvh@mbp:~$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz | ||
``` | ||
```console | ||
bvh@mbp:~$ tar -xzvf S288C_reference_genome_R64-1-1_20110203.tgz | ||
``` | ||
|
||
Let's add the reference genome sequence to our gen repository | ||
|
||
```console | ||
bvh@mbp:~$ gen import --fasta S288C_reference_genome_R64-1-1_20110203\/S288C_reference_sequence_R64-1-1_20110203.fsa | ||
|
||
Created it | ||
``` | ||
|
||
The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in | ||
the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use | ||
bcftools to extract the two samples we are interested in. | ||
|
||
```console | ||
bvh@mbp:~$ wget http://1002genomes.u-strasbg.fr/files/1011Matrix.gvcf.gz | ||
``` | ||
|
||
We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file | ||
that contains the sample identifiers separated by a line break. | ||
```console | ||
bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf 1011Matrix.gvcf.gz | ||
bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf 1011Matrix.gvcf.gz | ||
``` | ||
|
||
Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed | ||
them by their accession identifier. We rename them in the VCF file using bcftools with a text file called | ||
contig_mapping.txt that has the following contents: | ||
``` | ||
chromosome1 ref|NC_001133| | ||
chromosome2 ref|NC_001134| | ||
chromosome3 ref|NC_001135| | ||
chromosome4 ref|NC_001136| | ||
chromosome5 ref|NC_001137| | ||
chromosome6 ref|NC_001138| | ||
chromosome7 ref|NC_001139| | ||
chromosome8 ref|NC_001140| | ||
chromosome9 ref|NC_001141| | ||
chromosome10 ref|NC_001142| | ||
chromosome11 ref|NC_001143| | ||
chromosome12 ref|NC_001144| | ||
chromosome13 ref|NC_001145| | ||
chromosome14 ref|NC_001146| | ||
chromosome15 ref|NC_001147| | ||
chromosome16 ref|NC_001148| | ||
chromosome17 ref|NC_001224| | ||
``` | ||
|
||
```console | ||
bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt BRQ.vcf -o BRQ_2.vcf | ||
bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt CFD.vcf -o CFD_2.vcf | ||
``` | ||
|
||
Because Gen also uses the sample names from the VCF file to create new block groups, or find an existing block group to | ||
update, we need to rename the samples as well so that the variants will be loaded into one sample called F1. We again | ||
use bcftools to do this, using a text file sample_mapping.txt with the following contents: | ||
``` | ||
BRQ F1 | ||
CFD F1 | ||
``` | ||
|
||
```console | ||
bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o BRQ3.vcf BRQ2.vcf | ||
bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o CFD3.vcf CFD2.vcf | ||
``` | ||
|
||
We can now import the individual VCF files into gen, which will then use the chromosome/contig and sample names in the | ||
VCF file to find the appropriate block groups to update or create. | ||
|
||
```console | ||
bvh@mbp:~$ gen update --vcf BRQ3.vcf | ||
bvh@mbp:~$ gen update --vcf CFD3.vcf | ||
``` | ||
|
||
Lastly, we export the F1 sample to a GFA file that can be used by graph-aware sequence aligners. | ||
|
||
```console | ||
bvh@mbp:~$ gen export --gfa cross.gfa | ||
``` | ||
|
||
## Starting from assembled genomes | ||
First we download the genome assembly archive and extract the genome sequence for the strains we want to mate. | ||
|
||
``` | ||
wget http://1002genomes.u-strasbg.fr/files/1011Assemblies.tar.gz | ||
tar -xzf 1011Assemblies.tar.gz GENOMES_ASSEMBLED/BRQ_4.re.fa GENOMES_ASSEMBLED/CFD_4.re.fa | ||
``` | ||
|
||
Then we use [Cactus](https://doi.org/10.1038/s41586-020-2871-y) to create a graph via whole genome alignment. | ||
The Comparative Genomics Toolkit provides a Docker container that includes all the infrastructure needed to do so. | ||
We launch the container and bind our working directory to the /data directory in the container: | ||
|
||
```console | ||
bvh@mbp:~$ docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash | ||
``` | ||
|
||
We tell Cactus which genomes to align by providing a text file containing names and file paths. You can set this up | ||
manually, or generate it for all fasta files in our current directory and subdirectories: | ||
|
||
```console | ||
bvh@mbp:~$ find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt | ||
``` | ||
|
||
Then we can start the Cactus pipeline using the command list below. `./js` specifies where intermediate files are stored, | ||
outName and outDir influence the pipeline, and `--gfa` tells Cactus to output a GFA file. While this is a graph based | ||
alignment, we must still designate one genome as reference, in this case we pick BRQ. | ||
```console | ||
bvh@mbp:~$ cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa | ||
``` | ||
|
||
After this completes (approximately 10 minutes), we can load it back into gen by running: | ||
|
||
```console | ||
bvh@mbp:~$ gen import --gfa cactus_output/cross.gfa | ||
``` | ||
|
||
## References | ||
1) https://www.nature.com/articles/s41586-018-0030-5 | ||
|
||
|
||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should remove the prefixes from your cpu so all this code is copy/pasteable without additional work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in my next goaround (David had some other suggestions to expand this example case)