ginkgobioworks · bobvh · Oct 25, 2024 · Oct 10, 2024 · Oct 11, 2024 · Oct 11, 2024
diff --git a/README.md b/README.md
@@ -30,7 +30,8 @@ common features.
 `<figure 2>`
 
 ## Installing from Source
-Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the [rustup installer](https://rustup.rs/).
+Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the [rustup
+installer](https://rustup.rs/).
 
 
 1. Clone the [source](https://github.com/ginkgobioworks/gen) with `git`:
@@ -43,15 +44,25 @@ Make sure you have a Rust compiler installed on your system. You can install the
 2. Compile the gen package and its dependencies:
 
     ```
-    cargo build
+    cargo build --release
     ```
 
-3. You can find the gen executable in ./targets/debug/ or execute it via cargo:
+3. You can find the gen executable in ./target/release/ or execute it via cargo:
 
     ```
     cargo run -- <arguments>
     ```
-
+
+To cross-compile gen to run on a different architecture, you need to first add a target to the Rust toolchain and
+install a linker. For macOS to Linux this can be done as follows:
+
+    ```
+    rustup target add x86_64-unknown-linux-gnu
+    brew install SergioBenitez/osxct/x86_64-unknown-linux-gnu
+    cargo build --release --target=x86_64-unknown-linux-gnu
+    ```
+
+The executable will be placed in ./target/x86_64-unknown-linux-gnu/release/
 
 ## Usage
 
@@ -70,7 +81,8 @@ Make sure you have a Rust compiler installed on your system. You can install the
 
 ### Recording sequence changes
 <!-- From a VCF file -->
-Sequence variants observed through NGS can be imported into a gen repository via standard VCF file obtained from variant callers like Freebayes, GATK, or DeepVariant. [...]
+Sequence variants observed through NGS can be imported into a gen repository via standard VCF file obtained from variant
+callers like Freebayes, GATK, or DeepVariant. [...]
 <!-- -From a sequence file that was edited externally -->
 
 <!-- -From the gen command line -->

diff --git a/examples/yeast_crosses/Analysis.md b/examples/yeast_crosses/Analysis.md
@@ -0,0 +1,154 @@
+# Yeast breeding experiment
+In this example, we want to model a cross of two diploid yeast strains. Depending on which sequence data you have 
+available, you can take multiple approaches depending on which genome data you have available. In this document we will
+demonstrate a workflow that makes us of variant call files, and a workflow that uses whole genome alignment. In both
+cases we study two beer yeasts from the [1002 Yeast Genome collection](http://1002genomes.u-strasbg.fr/): Orval 
+trappist yeast strain DBVPG6695 from Belgium and American ale strain 1.3_Safale_US05 (codenames BRQ and CFD).
+
+First we set up a new gen repository to store and track our sequences. For this project we don't anticipate using
+more than one gen database file, and also don't need more than one collection. By setting defaults we won't have to
+enter these details in future commands.
+
+```console
+bvh@mbp:~$ gen init
+
+Gen repository initialized.
+```
+``` console
+bvh@mbp:~$ gen defaults --database breeding_experiment.db --collection genome
+
+Default database set to breeding_experiment.db
+Default collection set to genome
+```
+
+## Starting from variant calls
+The dataset we are working with includes variant call files that were generated by aligning sequencing reads to the
+S288C reference genome, specifically version R64-1-1. We'll start by downloading this reference sequence from the
+Saccharomyces Genome Database (SGD) and extracting the archive.
+
+```console
+bvh@mbp:~$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
+```
+```console
+bvh@mbp:~$ tar -xzvf S288C_reference_genome_R64-1-1_20110203.tgz 
+```
+
+Let's add the reference genome sequence to our gen repository
+
+```console
+bvh@mbp:~$ gen import --fasta S288C_reference_genome_R64-1-1_20110203\/S288C_reference_sequence_R64-1-1_20110203.fsa 
+
+Created it
+```
+
+The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in
+the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use
+bcftools to extract the two samples we are interested in.
+
+```console
+bvh@mbp:~$ wget http://1002genomes.u-strasbg.fr/files/1011Matrix.gvcf.gz
+```
+
+We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file
+that contains the sample identifiers separated by a line break.
+```console
+bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf 1011Matrix.gvcf.gz 
+bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf 1011Matrix.gvcf.gz 
+```
+
+Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed
+them by their accession identifier. We rename them in the VCF file using bcftools with a text file called
+contig_mapping.txt that has the following contents:
+```
+chromosome1 ref|NC_001133|
+chromosome2 ref|NC_001134|
+chromosome3 ref|NC_001135|
+chromosome4 ref|NC_001136|
+chromosome5 ref|NC_001137|
+chromosome6 ref|NC_001138|
+chromosome7 ref|NC_001139|
+chromosome8 ref|NC_001140|
+chromosome9 ref|NC_001141|
+chromosome10 ref|NC_001142|
+chromosome11 ref|NC_001143|
+chromosome12 ref|NC_001144|
+chromosome13 ref|NC_001145|
+chromosome14 ref|NC_001146|
+chromosome15 ref|NC_001147|
+chromosome16 ref|NC_001148|
+chromosome17 ref|NC_001224|
+```
+
+```console
+bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt BRQ.vcf -o BRQ_2.vcf
+bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt CFD.vcf -o CFD_2.vcf
+```
+
+Because Gen also uses the sample names from the VCF file to create new block groups, or find an existing block group to
+update, we need to rename the samples as well so that the variants will be loaded into one sample called F1. We again
+use bcftools to do this, using a text file sample_mapping.txt with the following contents:
+```
+BRQ F1
+CFD F1
+```
+
+```console
+bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o BRQ3.vcf BRQ2.vcf
+bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o CFD3.vcf CFD2.vcf
+```
+
+We can now import the individual VCF files into gen, which will then use the chromosome/contig and sample names in the
+VCF file to find the appropriate block groups to update or create.
+
+```console
+bvh@mbp:~$ gen update --vcf BRQ3.vcf
+bvh@mbp:~$ gen update --vcf CFD3.vcf
+```
+
+Lastly, we export the F1 sample to a GFA file that can be used by graph-aware sequence aligners.
+
+```console
+bvh@mbp:~$ gen export --gfa cross.gfa
+```
+
+## Starting from assembled genomes
+First we download the genome assembly archive and extract the genome sequence for the strains we want to mate.
+
+```
+wget http://1002genomes.u-strasbg.fr/files/1011Assemblies.tar.gz
+tar -xzf 1011Assemblies.tar.gz GENOMES_ASSEMBLED/BRQ_4.re.fa GENOMES_ASSEMBLED/CFD_4.re.fa
+```
+
+Then we use [Cactus](https://doi.org/10.1038/s41586-020-2871-y) to create a graph via whole genome alignment.
+The Comparative Genomics Toolkit provides a Docker container that includes all the infrastructure needed to do so.
+We launch the container and bind our working directory to the /data directory in the container:
+
+```console
+bvh@mbp:~$ docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash
+```
+
+We tell Cactus which genomes to align by providing a text file containing names and file paths. You can set this up
+manually, or generate it for all fasta files in our current directory and subdirectories: 
+
+```console
+bvh@mbp:~$ find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt
+```
+
+Then we can start the Cactus pipeline using the command list below. `./js` specifies where intermediate files are stored,
+outName and outDir influence the pipeline, and `--gfa` tells Cactus to output a GFA file. While this is a graph based
+alignment, we must still designate one genome as reference, in this case we pick BRQ.
+```console
+bvh@mbp:~$ cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa
+```
+
+After this completes (approximately 10 minutes), we can load it back into gen by running:
+
+```console
+bvh@mbp:~$ gen import --gfa cactus_output/cross.gfa
+```
+
+## References
+1) https://www.nature.com/articles/s41586-018-0030-5
+
+
+