Skip to content

Commit

Permalink
Merge pull request #74 from ginkgobioworks/combinatorial-example
Browse files Browse the repository at this point in the history
Combinatorial example
  • Loading branch information
dkhofer authored Oct 24, 2024
2 parents 5d81be1 + 79fece4 commit 13bbb31
Show file tree
Hide file tree
Showing 11 changed files with 5,925 additions and 0 deletions.
266 changes: 266 additions & 0 deletions examples/combinatorial_plasmid_design/combinatorial_design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
# Combinatorial plasmid design

In this example we will design a library of expression plasmids for _E. coli_. We are looking for the combination of
promoter and ribosome binding site that result in the highest expression of an insulin precursor peptide.

We start by setting up a new gen repository and a default database and collection name. This way we don't have to keep
specifying which database file and collection to use in the `gen import` and `gen update` commands.

```console
bvh@mbp:~$ gen init

Gen repository initialized.
```
``` console
bvh@mbp:~$ gen defaults --database insulin.db --collection plasmids

Default database set to insulin.db
Default collection set to plasmids
```

Next, we import our base vector from a fasta file.
``` console
bvh@mbp:~$ gen import --fasta puc19.fa

Created it
```

By importing the sequence we have created one new block group and one new path. These are comprised of two edges from
the source node to the start of the imported sequence node, and from the end of the imported sequence node to the sink
node. We can verify this by querying the database _(todo: implement this functionality in gen)_.
```console
bvh@mbp:~$ sqlite3 --header insulin.db

SQLite version 3.39.5 2022-10-14 20:58:05
Enter ".help" for usage hints.

sqlite> select * from block_group;
id|collection_name|sample_name|name
1|plasmids||M77789.2

sqlite> select * from path;
id|block_group_id|name
1|1|M77789.2

sqlite> select * from edges;
id|source_node_id|source_coordinate|source_strand|target_node_id|target_coordinate|target_strand|chromosome_index|phased
1|1|0|+|3|0|+|0|0
2|3|2686|+|2|0|+|0|0

sqlite> .quit
```

We see that both the path and the block group are called M77789.2, is accession ID for the pUC19 plasmid, which was
extracted from the fasta header. Next, we will prepare a _gen update_ operation to insert the insulin operon variants
into the vector, more specifically between position 106 and 539. We need two files to specify the design: a _parts_ file
that contains the sequences of all of the genetic parts that go into the design, and a _library_ file that describes how
the parts should be arranged.

The library file is a simple CSV table without headers, where each column represents a 'slot' in the construct, and the
rows represent the possible parts to include in each slot. Gen will create a combinatorial design where all options for
each slot are combined with all options for the other slots. In the example below we have 3 slots, with respectively 5,
2, and 1 part options, this results in 10 possible outcomes (5x2x1).

<table align="center">
<th colspan="3" style="text-align:center">design.csv</th>
<tr>
<td>BBa_J23100</td>
<td>BBa_B0030</td>
<td>proinsulin</td>
</tr>
<tr>
<td>BBa_J23101</td>
<td>BBa_B0032</td>
<td> </td>
</tr>
<tr>
<td>BBa_J23102</td>
<td>BBa_B0034</td>
<td> </td>
</tr>
<tr>
<td>BBa_J23103</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>BBa_J23104</td>
<td> </td>
<td> </td>
</tr>
</table>

If you create this file by hand outside of a spreadsheet program, please ensure that empty cells are still be separated
by commas. We can then run the update operation using the following command:


```console
bvh@mbp:~$ gen update --path-name M77789.2 --start 106 --end 539 --library design.csv --parts parts.fa
```

This update operation did not create new block groups or paths, it only created new edges in the existing block group.
You can verify this by running the queries described above, or by rendering a graphical representation of the block
group.

## Visualizing the library
We start by exporting our block group to a GFA file that can be used by graph handling software like vg. The gen
'export' command can take a sample name as argument if you only want to export the block group associated with a
specific sample, but in this case we just have one block group and it is not associated with any sample.

```console
bvh@mbp:~$ gen export --gfa library.gfa
```

We will use the vg tools using Docker, which we will start as follows to automatically download the vg Docker image, and
bind the current working directory to the /data directory in the Docker container.

```console
bvh@mbp:~$ docker run --volume $PWD:/data --workdir /data --interactive -t quay.io/vgteam/vg:v1.60.0
```

Once we're inside the container, we still have to install the graphviz suite to render our images to SVG format.

```console
root@e97629b542ca:/vg# apt install -y graphviz
```

Then we run `vg view` and pipe its output to the `dot` program, which was installed as part of graphviz.
```console
root@e97629b542ca:/vg# vg view --gfa-in library.gfa --dot --color --simple-dot | dot -Tsvg -o library.svg
```

This results in the following image, in which the new nodes are highlighted in red. Nodes 1, 9, and 3 are the promoters;
nodes 7, 8 and 2 are ribosome binding sites. The original pUC19 sequence is also visible as the path over nodes [4, 5,
6]. ![visual representation of the graph](./library.svg)

Note that VG uses its own numbering for the nodes. To find out which gen nodes these correspond to, run the following
series of commands instead:

```console
root@e97629b542ca:/vg# vg convert --gfa-in library.gfa --gfa-trans translation_table.txt --vg-out | vg view --vg-in - --dot --color --simple-dot | dot -Tsvg -o library.svg
root@e97629b542ca:/vg# cp library.svg library_fixed.svg && while IFS=$'\t' read _ new old; do sed "s#font-size=\"14.00\">$old</text>#font-size=\"14.00\">$new</text>#g" library_fixed.svg > temp_file.html && mv temp_file.html library_fixed.svg; done < translation_table.txt
```

This results in the following output, where the nodes are referred to by their gen identifier in the block graph model
used when exporting to GFA. In this model, nodes are identified by two numbers separated by a period: the first number
is the original node identifier in the regular gen model, the number behind the period is the coordinate to where the
block starts. Since we asked to insert the library from coordinate 106 to 539 on the reference path, node 3 is split up
into 3.0, 3.106 and 3.159 in the block graph model. ![visual representation with the actual node
IDs](./library_fixed.svg)

## Searching the library
We can use a sequence-to-graph mapper like [minigraph](https://github.com/lh3/minigraph?tab=readme-ov-file) to search a
block group for the presence of a subsequence, traversing all valid edges if needed. Minigraph is a light weight
application that can be built from source like this:
```console
wget "https://github.com/lh3/minigraph/releases/download/v0.21/minigraph-0.21.tar.bz2"
tar -xzf minigraph-0.21.tar.bz2
cd minigraph-0.21
make
cp minigraph /usr/local/bin/
```

Queries are entered via a fasta file. If your query is particularly short you will need to tune minigraph's parameters.
For short queries where you expect an exact match these work well:
```console
minigraph -c -n 1,1 -m 1,1 library.gfa query.fa
```

## Analysing sequencing data
### Isolate
We will also use minigraph to map longread NGS reads obtained from a single colony isolate:
```console
minigraph -cx lr library.gfa sample1.fq -o sample1.gaf

[M::main::0.000*8.39] loaded the graph from "library.gfa"
[M::mg_index::0.002*3.58] indexed the graph
[M::mg_opt_update::0.002*3.37] occ_max1=50; lc_max_occ=2
[M::worker_pipeline::0.004*2.59] mapped 10 sequences
[M::main] Version: 0.21-r606
[M::main] CMD: minigraph -cx lr -o sample1.gaf library.gfa sample1.fq
[M::main] Real time: 0.005 sec; CPU: 0.012 sec; Peak RSS: 0.002 GB
```

The resulting GAF file has the path to which a read maps listed in the 6th column. We can extract all unique paths that
were identified as follows:
```console
cut -f6 sample1.gaf | sort | uniq

<3.539
<3.539<4.0
>3.0>8.0>10.0>4.0>3.539
>3.539
>4.0>3.539
```

Not all reads cover the graph from end to end, but by looking at the longest path `>3.0>8.0>10.0>4.0>3.539` we are able
to identify the genotype of a colony.

### Pool
Pooled DNA assembly is a great cost-effective way to access a lot of sequence diversity. Instead of making each
combination of parts in a separate sample during cloning, we add all possible parts to a single tube in a one-pot
cloning reaction. Analysing the population of sequences obtained from such a reaction is where long-read NGS and graph
sequence representations really shine.

```console
minigraph -cx lr library.gfa sample2.fq -o sample2.gaf
cut -f6 sample1.gaf | sort | uniq

<3.539
<3.539<3.106
<3.539<3.106<3.0
<3.539<4.0
<3.539<4.0<10.0<6.0<3.0
<3.539<4.0<10.0<7.0<3.0
<3.539<4.0<10.0<8.0<3.0
<3.539<4.0<11.0<5.0<3.0
<3.539<4.0<11.0<6.0<3.0
<3.539<4.0<11.0<7.0<3.0
[...]
```

As you can see in the output above, there are a lot more unique paths present in this sample. One path in particular you
may be interested in is `>3.0>3.106>3.539` or its reverse complement `<3.539<3.106<3.0`, which represent the empty
vector you started out with. For many cloning operations there will always be some carryover of the empty vector. In
conventional cloning this would be eliminated when you create an isolate by picking a single colony, but in pooled
sample you can't do that. This usually isn't a problem, as long as the frequency of empty vector is relative low. We can
measure this by counting the relative occurence of the empty vector paths amongst all observe paths that traverse node
3.0 and 3.539.

```console
total_count=$(cut -f6 sample2.gaf | grep '3\.0' | grep '3\.539' | wc -l)
(.venv) cut -f6 sample2.gaf | grep '3\.0' | grep '3\.539' | sort | uniq -c | awk -v total="$total_count" '{printf "%.2f %s\n", $1 / total * 100, $2}'

27.87 <3.539<3.106<3.0
1.64 <3.539<4.0<10.0<6.0<3.0
1.64 <3.539<4.0<10.0<7.0<3.0
4.92 <3.539<4.0<10.0<8.0<3.0
1.64 <3.539<4.0<11.0<5.0<3.0
1.64 <3.539<4.0<11.0<6.0<3.0
3.28 <3.539<4.0<11.0<7.0<3.0
3.28 <3.539<4.0<11.0<8.0<3.0
1.64 <3.539<4.0<11.0<9.0<3.0
[...]
```

Here we see that approximately 28% of all molecules in the sample are carryover, which isn't great. In a real sample
from a Golden Gate cloning reaction for example this will generally be much lower. Note that we haven't taken into
account any biases to length in this quick analysis.


# Appendix
## Simulating NGS reads using VG
The NGS data used in this example was simulated using the VG toolkit. For the isolate sample we started from a single
sequence and converted that into a graph first:
```
root@32a61b3f3998:/data# vg construct -r sample1.fa | vg convert --xg-out - > isolate.xg
root@32a61b3f3998:/data# vg sim -x isolate.xg -n 10 -l 2000 -a | vg view --fastq-out - > sample1.fq
```

For the pooled sample we use a GFA file exported by gen. VG doesn't handle lowercase nucleotides, so we convert
everything to uppercase first.
```
root@32a61b3f3998:/data# cat library.gfa | tr 'a-z' 'A-Z' > library_allcaps.gfa
root@32a61b3f3998:/data# vg convert --gfa-in library_allcaps.gfa --xg-out > library.xg
root@32a61b3f3998:/data# vg sim -x library.xg -n 1000 -l 2000 -a | vg view --fastq-out - > sample2.fq
```
5 changes: 5 additions & 0 deletions examples/combinatorial_plasmid_design/design.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
BBa_J23100,BBa_B0030,proinsulin
BBa_J23101,BBa_B0032,
BBa_J23102,BBa_B0034,
BBa_J23103,,
BBa_J23104,,
39 changes: 39 additions & 0 deletions examples/combinatorial_plasmid_design/library.gfa
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
S 6.0 tttacagctagctcagtcctaggtattatgctagc *
S 11.0 TCTAGAGTCACACAGGAAAGTACTAG *
S 4.0 atgcgcttcgtcaatcagcacctttgtggttctcacctcgttgaggctttgtaccttgtttgcggtgaacgtggtttcttctacactcctaagactcgcgaggcagaggacctgcaagtggggcaggtggagctgggcgggggccctggtgcaggcagcctgcaacccttggccctggaggggtccctgcaacgtggcattgtggaacaatgctgtaccagcatctgctccctctaccagctggagaactactgcaactaataa *
S 8.0 ctgatagctagctcagtcctagggattatgctagc *
S 5.0 ttgacggctagctcagtcctaggtacagtgctagc *
S 3.0 GCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATT *
S 3.106 AATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAG *
S 3.539 TTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTTCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGTTCTTTCCTGCGTTATCCCCTGATTCTGTGGATAACCGTATTACCGCCTTTGAGTGAGCTGATACCGCTCGCCGCAGCCGAACGACCGAGCGCAGCGAGTCAGTGAGCGAGGAAGCGGAAGA *
S 10.0 TCTAGAGAAAGAGGAGAAATACTAG *
S 12.0 TCTAGAGATTAAAGAGGAGAAATACTAG *
S 7.0 ttgacagctagctcagtcctaggtactgtgctagc *
S 9.0 ttgacagctagctcagtcctaggtattgtgctagc *
L 9.0 + 10.0 + 0M
L 3.0 + 8.0 + 0M
L 7.0 + 12.0 + 0M
L 3.0 + 9.0 + 0M
L 9.0 + 11.0 + 0M
L 7.0 + 11.0 + 0M
L 8.0 + 10.0 + 0M
L 3.0 + 7.0 + 0M
L 10.0 + 4.0 + 0M
L 4.0 + 3.539 + 0M
L 8.0 + 12.0 + 0M
L 5.0 + 11.0 + 0M
L 6.0 + 12.0 + 0M
L 12.0 + 4.0 + 0M
L 9.0 + 12.0 + 0M
L 5.0 + 10.0 + 0M
L 6.0 + 11.0 + 0M
L 8.0 + 11.0 + 0M
L 6.0 + 10.0 + 0M
L 7.0 + 10.0 + 0M
L 3.0 + 5.0 + 0M
L 5.0 + 12.0 + 0M
L 3.0 + 6.0 + 0M
L 11.0 + 4.0 + 0M
L 3.0 + 3.106 + 0M
L 3.106 + 3.539 + 0M
P M77789.2 3.0+,3.106+,3.539+ *
Loading

0 comments on commit 13bbb31

Please sign in to comment.