From a7141c98478f125ef0a81a927e16ff0ada9362e1 Mon Sep 17 00:00:00 2001
From: bobvh <1587584+bobvh@users.noreply.github.com>
Date: Thu, 10 Oct 2024 15:06:55 -0400
Subject: [PATCH 1/3] vcf and cactus

---
 README.md                          | 22 +++++++++---
 examples/yeast_crosses/Analysis.md | 54 ++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 5 deletions(-)
 create mode 100644 examples/yeast_crosses/Analysis.md
diff --git a/README.md b/README.md
index 2cc58bf..6b4e46e 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,8 @@ common features.
 `<figure 2>`
 
 ## Installing from Source
-Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the [rustup installer](https://rustup.rs/).
+Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the [rustup
+installer](https://rustup.rs/).
 
 
 1. Clone the [source](https://github.com/ginkgobioworks/gen) with `git`:
@@ -43,15 +44,25 @@ Make sure you have a Rust compiler installed on your system. You can install the
 2. Compile the gen package and its dependencies:
 
     ```
-    cargo build
+    cargo build --release
     ```
 
-3. You can find the gen executable in ./targets/debug/ or execute it via cargo:
+3. You can find the gen executable in ./target/release/ or execute it via cargo:
 
     ```
     cargo run -- <arguments>
     ```
-    
+
+To cross-compile gen to run on a different architecture, you need to first add a target to the Rust toolchain and
+install a linker. For macOS to Linux this can be done as follows:
+
+    ```
+    rustup target add x86_64-unknown-linux-gnu
+    brew install SergioBenitez/osxct/x86_64-unknown-linux-gnu
+    cargo build --release --target=x86_64-unknown-linux-gnu
+    ```
+
+The executable will be placed in ./target/x86_64-unknown-linux-gnu/release/
 
 ## Usage
 
@@ -70,7 +81,8 @@ Make sure you have a Rust compiler installed on your system. You can install the
 
 ### Recording sequence changes
 <!-- From a VCF file -->
-Sequence variants observed through NGS can be imported into a gen repository via standard VCF file obtained from variant callers like Freebayes, GATK, or DeepVariant. [...]
+Sequence variants observed through NGS can be imported into a gen repository via standard VCF file obtained from variant
+callers like Freebayes, GATK, or DeepVariant. [...]
 <!-- -From a sequence file that was edited externally -->
 
 <!-- -From the gen command line -->
diff --git a/examples/yeast_crosses/Analysis.md b/examples/yeast_crosses/Analysis.md
new file mode 100644
index 0000000..a643f27
--- /dev/null
+++ b/examples/yeast_crosses/Analysis.md
@@ -0,0 +1,54 @@
+# Yeast breeding experiment
+In this example, we want to model a cross of two diploid yeast strains. Depending on which sequence data you have 
+available, you can take multiple approaches depending on which genome data you have available. In this document we will
+demonstrate a workflow that makes us of variant call files, and a workflow that uses whole genome alignment. In both
+cases we start from two beer yeasts from the [1000 Yeast Genome collection](http://1002genomes.u-strasbg.fr/): Orval 
+trappist yeast strain DBVPG6695 from Belgium and American ale strain 1.3_Safale_US05 (codenames BRQ and CFD).
+
+## Starting from variant calls
+
+
+
+bcftools query -l 1011Matrix.gvcf.gz | head > samples.txt
+bcftools view -S samples.txt -r chromosome1,chromosome2 filtered.vcf.gz 1011Matrix.gvcf.gz
+bcftools annotate --rename-chrs mapping.txt 1011Matrix.gvcf.gz -o 1011Matrix_renamed.gvcf.gz
+
+
+
+https://www.nature.com/articles/s41586-018-0030-5
+
+Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
+http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
+
+## Starting from assembled genomes
+First we download the genome assembly archive and extract the genome sequence for the strains we want to mate.
+```
+wget http://1002genomes.u-strasbg.fr/files/1011Assemblies.tar.gz
+tar -xzf 1011Assemblies.tar.gz GENOMES_ASSEMBLED/BRQ_4.re.fa GENOMES_ASSEMBLED/CFD_4.re.fa
+```
+
+Then we use [Cactus](https://doi.org/10.1038/s41586-020-2871-y) to create a graph via whole genome alignment.
+The Comparative Genomics Toolkit provides a Docker container that includes all the infrastructure needed to do so.
+We launch the container and bind our working directory to the /data directory in the container:
+
+```
+ docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash
+```
+
+We tell Cactus which genomes to align by providing a text file containing names and file paths. You can set this up
+manually, or generate it for all fasta files in our current directory and subdirectories: 
+ 
+```
+find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt
+```
+
+Then we can start the Cactus pipeline using the command list below. `./js` specifies where intermediate files are stored,
+outName and outDir influence the pipeline, and `--gfa` tells Cactus to output a GFA file. While this is a graph based
+alignment, we must still designate one genome as reference, in this case we pick BRQ.
+```
+cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa
+```
+
+After this completes (approximately 10 minutes), we can load it back into gen by running:
+
+```gen --db main import --gfa cactus_output/cross.gfa --name my_collection```
\ No newline at end of file

From b37126586115e02d0915b6dc92d8845874c59ce7 Mon Sep 17 00:00:00 2001
From: bobvh <1587584+bobvh@users.noreply.github.com>
Date: Fri, 11 Oct 2024 13:44:50 -0400
Subject: [PATCH 2/3] vcf route

---
 examples/yeast_crosses/Analysis.md | 96 +++++++++++++++++++++++++-----
 1 file changed, 82 insertions(+), 14 deletions(-)

diff --git a/examples/yeast_crosses/Analysis.md b/examples/yeast_crosses/Analysis.md
index a643f27..7a1947a 100644
--- a/examples/yeast_crosses/Analysis.md
+++ b/examples/yeast_crosses/Analysis.md
@@ -2,26 +2,86 @@
 In this example, we want to model a cross of two diploid yeast strains. Depending on which sequence data you have 
 available, you can take multiple approaches depending on which genome data you have available. In this document we will
 demonstrate a workflow that makes us of variant call files, and a workflow that uses whole genome alignment. In both
-cases we start from two beer yeasts from the [1000 Yeast Genome collection](http://1002genomes.u-strasbg.fr/): Orval 
+cases we study two beer yeasts from the [1002 Yeast Genome collection](http://1002genomes.u-strasbg.fr/): Orval 
 trappist yeast strain DBVPG6695 from Belgium and American ale strain 1.3_Safale_US05 (codenames BRQ and CFD).
 
+First we set up a new gen repository to store and track our sequences. For this project we don't anticipate using
+more than one gen database file, and also don't need more than one collection. By setting defaults we won't have to
+enter these details in future commands.
+
+```console
+bvh@mbp:~$ gen init
+
+Gen repository initialized.
+```
+``` console
+bvh@mbp:~$ gen defaults --database breeding_experiment.db --collection genome
+
+Default database set to breeding_experiment.db
+Default collection set to genome
+```
+
 ## Starting from variant calls
+The dataset we are working with includes variant call files that were generated by aligning sequencing reads to the
+S288C reference genome, specifically version R64-1-1. We'll start by downloading this reference sequence from the
+Saccharomyces Genome Database (SGD) and extracting the archive.
 
+```console
+bvh@mbp:~$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
+```
+```console
+bvh@mbp:~$ tar -xzvf S288C_reference_genome_R64-1-1_20110203.tgz 
+```
 
+Let's add the reference genome sequence to our gen repository
 
-bcftools query -l 1011Matrix.gvcf.gz | head > samples.txt
-bcftools view -S samples.txt -r chromosome1,chromosome2 filtered.vcf.gz 1011Matrix.gvcf.gz
-bcftools annotate --rename-chrs mapping.txt 1011Matrix.gvcf.gz -o 1011Matrix_renamed.gvcf.gz
+```console
+bvh@mbp:~$ gen import --fasta S288C_reference_genome_R64-1-1_20110203\/S288C_reference_sequence_R64-1-1_20110203.fsa 
 
+Created it
+```
 
+The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use bcftools to extract the two samples we are interested in.
 
-https://www.nature.com/articles/s41586-018-0030-5
+```console
+bvh@mbp:~$ wget http://1002genomes.u-strasbg.fr/files/1011Matrix.gvcf.gz
+```
 
-Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
-http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
+We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file that contains the sample identifiers separated by a line break.
+```console
+bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf.gz 1011Matrix.gvcf.gz 
+bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf.gz 1011Matrix.gvcf.gz 
+```
+
+Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed them by their accession identifier. We rename them in the VCF file using bcftools with a text file called mapping.txt that has the following contents:
+```
+chromosome1 ref|NC_001133|
+chromosome2 ref|NC_001134|
+chromosome3 ref|NC_001135|
+chromosome4 ref|NC_001136|
+chromosome5 ref|NC_001137|
+chromosome6 ref|NC_001138|
+chromosome7 ref|NC_001139|
+chromosome8 ref|NC_001140|
+chromosome9 ref|NC_001141|
+chromosome10 ref|NC_001142|
+chromosome11 ref|NC_001143|
+chromosome12 ref|NC_001144|
+chromosome13 ref|NC_001145|
+chromosome14 ref|NC_001146|
+chromosome15 ref|NC_001147|
+chromosome16 ref|NC_001148|
+chromosome17 ref|NC_001224|
+```
+
+```console
+bvh@mbp:~$ bcftools annotate --rename-chrs mapping.txt BRQ.gvcf.gz -o BRQ.gvcf.gz
+bvh@mbp:~$ bcftools annotate --rename-chrs mapping.txt CFD.gvcf.gz -o CFD.gvcf.gz
+```
 
 ## Starting from assembled genomes
 First we download the genome assembly archive and extract the genome sequence for the strains we want to mate.
+
 ```
 wget http://1002genomes.u-strasbg.fr/files/1011Assemblies.tar.gz
 tar -xzf 1011Assemblies.tar.gz GENOMES_ASSEMBLED/BRQ_4.re.fa GENOMES_ASSEMBLED/CFD_4.re.fa
@@ -31,24 +91,32 @@ Then we use [Cactus](https://doi.org/10.1038/s41586-020-2871-y) to create a grap
 The Comparative Genomics Toolkit provides a Docker container that includes all the infrastructure needed to do so.
 We launch the container and bind our working directory to the /data directory in the container:
 
-```
- docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash
+```console
+bvh@mbp:~$ docker run --mount type=bind,source=$PWD,target=/data -it quay.io/comparative-genomics-toolkit/cactus:v2.9.0 bash
 ```
 
 We tell Cactus which genomes to align by providing a text file containing names and file paths. You can set this up
 manually, or generate it for all fasta files in our current directory and subdirectories: 
  
-```
-find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt
+```console
+bvh@mbp:~$ find . -type f -name "*.fa" -printf "%f %p\n" | awk '{ sub(/\..*$/, "", $1); print $1, $2 }' > file_list.txt
 ```
 
 Then we can start the Cactus pipeline using the command list below. `./js` specifies where intermediate files are stored,
 outName and outDir influence the pipeline, and `--gfa` tells Cactus to output a GFA file. While this is a graph based
 alignment, we must still designate one genome as reference, in this case we pick BRQ.
-```
-cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa
+```console
+bvh@mbp:~$ cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outName cross --reference BRQ_4 --gfa
 ```
 
 After this completes (approximately 10 minutes), we can load it back into gen by running:
 
-```gen --db main import --gfa cactus_output/cross.gfa --name my_collection```
\ No newline at end of file
+```console
+bvh@mbp:~$ gen --db main import --gfa cactus_output/cross.gfa --name my_collection
+```
+
+## References
+1) https://www.nature.com/articles/s41586-018-0030-5
+
+
+

From 1929e6959b7316e709d7b4e06b210a37d76e999e Mon Sep 17 00:00:00 2001
From: bobvh <1587584+bobvh@users.noreply.github.com>
Date: Fri, 11 Oct 2024 15:08:12 -0400
Subject: [PATCH 3/3] Variant call route

---
 examples/yeast_crosses/Analysis.md | 48 +++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/examples/yeast_crosses/Analysis.md b/examples/yeast_crosses/Analysis.md
index 7a1947a..6c52684 100644
--- a/examples/yeast_crosses/Analysis.md
+++ b/examples/yeast_crosses/Analysis.md
@@ -41,19 +41,24 @@ bvh@mbp:~$ gen import --fasta S288C_reference_genome_R64-1-1_20110203\/S288C_ref
 Created it
 ```
 
-The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use bcftools to extract the two samples we are interested in.
+The variant calls themselves are stored in a large Genomic VCF (GVCF) file that includes variants from all samples in
+the dataset. We download a compressed version of this file from the the 1002 Yeast Genomes project, and then use
+bcftools to extract the two samples we are interested in.
 
 ```console
 bvh@mbp:~$ wget http://1002genomes.u-strasbg.fr/files/1011Matrix.gvcf.gz
 ```
 
-We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file that contains the sample identifiers separated by a line break.
+We then use the bcftools application to extract just the samples we want. To do this, we must first create a text file
+that contains the sample identifiers separated by a line break.
 ```console
-bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf.gz 1011Matrix.gvcf.gz 
-bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf.gz 1011Matrix.gvcf.gz 
+bvh@mbp:~$ bcftools view -s BRQ -o BRQ.vcf 1011Matrix.gvcf.gz 
+bvh@mbp:~$ bcftools view -s CFD -o CFD.vcf 1011Matrix.gvcf.gz 
 ```
 
-Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed them by their accession identifier. We rename them in the VCF file using bcftools with a text file called mapping.txt that has the following contents:
+Unfortunately these variant call files refer to the chromosomes by their number, whereas the reference sequence listed
+them by their accession identifier. We rename them in the VCF file using bcftools with a text file called
+contig_mapping.txt that has the following contents:
 ```
 chromosome1 ref|NC_001133|
 chromosome2 ref|NC_001134|
@@ -75,8 +80,35 @@ chromosome17 ref|NC_001224|
 ```
 
 ```console
-bvh@mbp:~$ bcftools annotate --rename-chrs mapping.txt BRQ.gvcf.gz -o BRQ.gvcf.gz
-bvh@mbp:~$ bcftools annotate --rename-chrs mapping.txt CFD.gvcf.gz -o CFD.gvcf.gz
+bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt BRQ.vcf -o BRQ_2.vcf
+bvh@mbp:~$ bcftools annotate --rename-chrs contig_mapping.txt CFD.vcf -o CFD_2.vcf
+```
+
+Because Gen also uses the sample names from the VCF file to create new block groups, or find an existing block group to
+update, we need to rename the samples as well so that the variants will be loaded into one sample called F1. We again
+use bcftools to do this, using a text file sample_mapping.txt with the following contents:
+```
+BRQ F1
+CFD F1
+```
+
+```console
+bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o BRQ3.vcf BRQ2.vcf
+bvh@mbp:~$ bcftools reheader -s sample_mapping.txt -o CFD3.vcf CFD2.vcf
+```
+
+We can now import the individual VCF files into gen, which will then use the chromosome/contig and sample names in the
+VCF file to find the appropriate block groups to update or create.
+
+```console
+bvh@mbp:~$ gen update --vcf BRQ3.vcf
+bvh@mbp:~$ gen update --vcf CFD3.vcf
+```
+
+Lastly, we export the F1 sample to a GFA file that can be used by graph-aware sequence aligners.
+
+```console
+bvh@mbp:~$ gen export --gfa cross.gfa
 ```
 
 ## Starting from assembled genomes
@@ -112,7 +144,7 @@ bvh@mbp:~$ cactus-pangenome ./js file_list.txt --outDir ./cactus_output --outNam
 After this completes (approximately 10 minutes), we can load it back into gen by running:
 
 ```console
-bvh@mbp:~$ gen --db main import --gfa cactus_output/cross.gfa --name my_collection
+bvh@mbp:~$ gen import --gfa cactus_output/cross.gfa
 ```
 
 ## References