Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yeast analysis example #70

Merged
merged 3 commits into from
Oct 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 17 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,8 @@ common features.
`<figure 2>`

## Installing from Source
Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the [rustup installer](https://rustup.rs/).
Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the [rustup
installer](https://rustup.rs/).


1. Clone the [source](https://github.com/ginkgobioworks/gen) with `git`:
Expand All @@ -43,15 +44,25 @@ Make sure you have a Rust compiler installed on your system. You can install the
2. Compile the gen package and its dependencies:

```
cargo build
cargo build --release
```

3. You can find the gen executable in ./targets/debug/ or execute it via cargo:
3. You can find the gen executable in ./target/release/ or execute it via cargo:

```
cargo run -- <arguments>
```


To cross-compile gen to run on a different architecture, you need to first add a target to the Rust toolchain and
install a linker. For macOS to Linux this can be done as follows:

```
rustup target add x86_64-unknown-linux-gnu
brew install SergioBenitez/osxct/x86_64-unknown-linux-gnu
cargo build --release --target=x86_64-unknown-linux-gnu
```

The executable will be placed in ./target/x86_64-unknown-linux-gnu/release/

## Usage

Expand All @@ -70,7 +81,8 @@ Make sure you have a Rust compiler installed on your system. You can install the

### Recording sequence changes
<!-- From a VCF file -->
Sequence variants observed through NGS can be imported into a gen repository via standard VCF file obtained from variant callers like Freebayes, GATK, or DeepVariant. [...]
Sequence variants observed through NGS can be imported into a gen repository via standard VCF file obtained from variant
callers like Freebayes, GATK, or DeepVariant. [...]
<!-- -From a sequence file that was edited externally -->

<!-- -From the gen command line -->
Expand Down
154 changes: 154 additions & 0 deletions examples/yeast_crosses/Analysis.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
# Yeast breeding experiment
In this example, we want to model a cross of two diploid yeast strains. Depending on which sequence data you have
available, you can take multiple approaches depending on which genome data you have available. In this document we will
demonstrate a workflow that makes us of variant call files, and a workflow that uses whole genome alignment. In both
cases we study two beer yeasts from the [1002 Yeast Genome collection](http://1002genomes.u-strasbg.fr/): Orval
trappist yeast strain DBVPG6695 from Belgium and American ale strain 1.3_Safale_US05 (codenames BRQ and CFD).

First we set up a new gen repository to store and track our sequences. For this project we don't anticipate using
more than one gen database file, and also don't need more than one collection. By setting defaults we won't have to
enter these details in future commands.

```console
bvh@mbp:~$ gen init
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove the prefixes from your cpu so all this code is copy/pasteable without additional work

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do in my next goaround (David had some other suggestions to expand this example case)


Gen repository initialized.
```
``` console
bvh@mbp:~$ gen defaults --database breeding_experiment.db --collection genome

Default database set to breeding_experiment.db
Default collection set to genome
```

## Starting from variant calls
The dataset we are working with includes variant call files that were generated by aligning sequencing reads to the
S288C reference genome, specifically version R64-1-1. We'll start by downloading this reference sequence from the
Saccharomyces Genome Database (SGD) and extracting the archive.

```console
bvh@mbp:~$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
```
```console
bvh@mbp:~$ tar -xzvf S288C_reference_genome_R64-1-1_20110203.tgz
```

Let's add the reference genome sequence to our gen repository

```console
bvh@mbp:~$ gen import --fasta S288C_reference_genome_R64-1-1_20110203\/S288C_reference_sequence_R64-1-1_20110203.fsa

Created it
```

The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in
the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use
bcftools to extract the two samples we are interested in.

```console
bvh@mbp:~$ wget http://1002genomes.u-strasbg.fr/files/1011Matrix.gvcf.gz
```

We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file
that contains the sample identifiers separated by a line break.
```console
bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf 1011Matrix.gvcf.gz
bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf 1011Matrix.gvcf.gz
```

Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed
them by their accession identifier. We rename them in the VCF file using bcftools with a text file called
contig_mapping.txt that has the following contents:
```
chromosome1 ref|NC_001133|
chromosome2 ref|NC_001134|
chromosome3 ref|NC_001135|
chromosome4 ref|NC_001136|
chromosome5 ref|NC_001137|
chromosome6 ref|NC_001138|
chromosome7 ref|NC_001139|
chromosome8 ref|NC_001140|
chromosome9 ref|NC_001141|
chromosome10 ref|NC_001142|
chromosome11 ref|NC_001143|
chromosome12 ref|NC_001144|
chromosome13 ref|NC_001145|
chromosome14 ref|NC_001146|
chromosome15 ref|NC_001147|
chromosome16 ref|NC_001148|
chromosome17 ref|NC_001224|
```

```console
bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt BRQ.vcf -o BRQ_2.vcf
bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt CFD.vcf -o CFD_2.vcf
```

Because Gen also uses the sample names from the VCF file to create new block groups, or find an existing block group to
update, we need to rename the samples as well so that the variants will be loaded into one sample called F1. We again
use bcftools to do this, using a text file sample_mapping.txt with the following contents:
```
BRQ F1
CFD F1
```

```console
bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o BRQ3.vcf BRQ2.vcf
bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o CFD3.vcf CFD2.vcf
```

We can now import the individual VCF files into gen, which will then use the chromosome/contig and sample names in the
VCF file to find the appropriate block groups to update or create.

```console
bvh@mbp:~$ gen update --vcf BRQ3.vcf
bvh@mbp:~$ gen update --vcf CFD3.vcf
```

Lastly, we export the F1 sample to a GFA file that can be used by graph-aware sequence aligners.

```console
bvh@mbp:~$ gen export --gfa cross.gfa
```

## Starting from assembled genomes
First we download the genome assembly archive and extract the genome sequence for the strains we want to mate.

```
wget http://1002genomes.u-strasbg.fr/files/1011Assemblies.tar.gz
tar -xzf 1011Assemblies.tar.gz GENOMES_ASSEMBLED/BRQ_4.re.fa GENOMES_ASSEMBLED/CFD_4.re.fa
```

Then we use [Cactus](https://doi.org/10.1038/s41586-020-2871-y) to create a graph via whole genome alignment.
The Comparative Genomics Toolkit provides a Docker container that includes all the infrastructure needed to do so.
We launch the container and bind our working directory to the /data directory in the container:

```console
bvh@mbp:~$ docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash
```

We tell Cactus which genomes to align by providing a text file containing names and file paths. You can set this up
manually, or generate it for all fasta files in our current directory and subdirectories:

```console
bvh@mbp:~$ find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt
```

Then we can start the Cactus pipeline using the command list below. `./js` specifies where intermediate files are stored,
outName and outDir influence the pipeline, and `--gfa` tells Cactus to output a GFA file. While this is a graph based
alignment, we must still designate one genome as reference, in this case we pick BRQ.
```console
bvh@mbp:~$ cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa
```

After this completes (approximately 10 minutes), we can load it back into gen by running:

```console
bvh@mbp:~$ gen import --gfa cactus_output/cross.gfa
```

## References
1) https://www.nature.com/articles/s41586-018-0030-5