Solidifying the Open Virus Graphs Infrastructure for Deployment
Note that the following information has been produced in a collaborative open-science format. It has not been peer-reviewed, and should not be construed as medical advice, nor should it be used to inform clinical practice.
Do you know what HIV sequence you're working with? If not, you're probably a clinician working with patient samples. If yes, you're probably a basic scientist wishing to control as many variables as possible. Either way, you want to maximize sensitivity for HIV sequence while maintaining specificity. These questions can be addressed with appropriate linear reference genomes (standard), multiple reference genomes, graph reference genomes, and a special case of multiple linear reference genomes using transcript models.
Linear references are the gold standard for genomics applications, including capturing viral genome information and viral sequence recovery. Examples include HIV sequence detection and HIV genome assembly. HIV genome assembly can be loosely classified into whole (reference) genome assembly and HIV genotyping (partial assembly).
More soon..
More soon..
There is none!
For nucleobase graphs: NovoGraph, VG.
For approximate k-mer graphs: original SWIGG, implemented SWIGG.
In the original graph created by SWIGG without any filtering parameters, the number of nodes is usually so large that visualization and other downstream analyses become inefficient. We therefore implemented an algorithm that builds a compact de Bruijn graph. Contraction starts with a depth-first search from the source node of the graph. As the algorithm walks through the graph, it keeps adding nodes to the current supernode, and it creates a new supernode whenever it encounters a node with more than one neighbor. Contraction reduces the size of the graph substantially, for example from 38772 to 3781 nodes at k=16 (Table ?).
The algorithm to build and contract nodes in the de Bruijn graph is the same as finding maximal non-branching paths. A description of the algorithm can be found here.
To run the SWIGG extension, run `python3 swigg_ext.py -f seqfile.fasta -k kmer_length -o output_prefix`
For split-read mapping, HISAT2 was used, and the output .bam file was piped into StringTie, both in Galaxy.
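The contraction step described above can be illustrated with a minimal sketch. This is not the code in `swigg_ext.py`: the function names (`build_kmer_graph`, `contract`, `spell`) are hypothetical, the graph is built from exact k-mers rather than SWIGG's filtered k-mers, and the walk iterates over all supernode start points instead of running a single depth-first search from the source node. It merges every chain in which a link is the only exit of one node and the only entry of the next, which is the node-level form of maximal non-branching paths.

```python
from collections import defaultdict


def build_kmer_graph(sequences, k):
    """Nodes are k-mers; an edge joins k-mers adjacent in some sequence."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def contract(nodes, succ, pred):
    """Group nodes into supernodes along maximal non-branching chains."""

    def mergeable(u, v):
        # u -> v is contracted when it is u's only exit and v's only entry.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def starts_supernode(v):
        if len(pred[v]) != 1:
            return True
        (u,) = pred[v]
        return u == v or not mergeable(u, v)

    supernodes, seen = [], set()

    def walk(start):
        path, cur = [start], start
        seen.add(start)
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if nxt in seen or not mergeable(cur, nxt):
                break
            path.append(nxt)
            seen.add(nxt)
            cur = nxt
        supernodes.append(path)

    for v in sorted(nodes):
        if starts_supernode(v):
            walk(v)
    for v in sorted(nodes):  # anything left lies on an isolated cycle
        if v not in seen:
            walk(v)
    return supernodes


def spell(path):
    """Sequence spelled by a chain of overlapping k-mers."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


if __name__ == "__main__":
    nodes, succ, pred = build_kmer_graph(["ACGTA", "ACGTC"], k=3)
    for supernode in contract(nodes, succ, pred):
        print(spell(supernode))
```

In the toy input, the two sequences share a prefix and then diverge, so the shared k-mers collapse into one supernode and each divergent k-mer stays on its own.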
Gephi is used to visualize the output from SWIGG. SnapGene Graphviz
Graphs built with different k-mer lengths:
The above figure contains 5 graphs built with different k-mer lengths: (a) k=16, (b) k=20, (c) k=32, (d) k=50, (e) k=90. A longer k-mer is more likely to span an entire repetitive region, so fewer k-mers recur at distant positions and the resulting graph is simpler. Red rectangles highlight the large loop topology in graphs made with small k-mers. The loops result from repetitive k-mers shared between regions that are far apart.
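The effect of k-mer length on repeats can be checked on a toy sequence. The sketch below is an illustration only: `repeated_kmer_count` is a hypothetical helper, and the sequence is made up (a 7-base unit, a spacer, then the same unit again), not HIV data.

```python
from collections import Counter


def repeated_kmer_count(seq, k):
    """Number of distinct k-mers that occur at more than one position."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for c in counts.values() if c > 1)


if __name__ == "__main__":
    # Toy sequence: the unit GATTACA appears twice, separated by a spacer.
    toy = "GATTACA" + "CCGTGT" + "GATTACA"
    for k in (4, 7, 8):
        print(k, repeated_kmer_count(toy, k))
```

Once k exceeds the length of the repeated unit, no k-mer recurs, so a graph built from these k-mers has no node shared between the two distant copies and therefore no loop joining them.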
Table of number of nodes and edges before and after contraction
k-mer length | Nodes before contraction | Edges before contraction | Nodes after contraction | Edges after contraction |
---|---|---|---|---|
16 | 38772 | 55265 | 3781 | 4788 |
20 | 42442 | 55241 | 2867 | 3641 |
32 | 49041 | 55169 | 1350 | 1703 |
50 | 52794 | 55061 | 442 | 554 |
90 | 54449 | 54821 | 68 | 83 |
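To make the size reduction easier to compare across k-mer lengths, the counts in the table can be turned into fold reductions. The numbers below are copied from the table; the function name `fold_reduction` is a hypothetical helper introduced for this sketch.

```python
# (k, nodes before, edges before, nodes after, edges after), from the table.
ROWS = [
    (16, 38772, 55265, 3781, 4788),
    (20, 42442, 55241, 2867, 3641),
    (32, 49041, 55169, 1350, 1703),
    (50, 52794, 55061, 442, 554),
    (90, 54449, 54821, 68, 83),
]


def fold_reduction(before, after):
    """How many times smaller the contracted count is."""
    return before / after


if __name__ == "__main__":
    for k, n0, e0, n1, e1 in ROWS:
        print(
            f"k={k}: nodes {fold_reduction(n0, n1):.1f}x, "
            f"edges {fold_reduction(e0, e1):.1f}x"
        )
```

The fold reduction grows with k, in line with the observation that longer k-mers give simpler graphs.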
Current NovoGraph script.. (TBD).
Graphs built with multiple sequence alignment algorithm:
The above graph shows part of the viral graph generated with a multiple sequence alignment algorithm using vg. The lines in the graph show the possible connections between nucleotide nodes.
Figure: cDNA+PCR DNAseq ("classic RNAseq"). A. Coverage summary of reads mapped to HXB2 K03455 with HISAT2 with usegalaxy.eu. B. RmDup-processed reads, controlling for PCR duplicates after initial alignment. Search strategy = SRA. Search terms: "HIV-1 and RNAseq and virus". Bioproject: PRJNA320293, specifically SRR3472915. Viewed in IGV. Source: https://github.com/NCBI-Codeathons/Virus_Graphs/edit/master/README.md.
Will add predicted isoform models soon..
Alejandro Rafael Gener (Lead/Corresponding Author)
[email protected]; [email protected]
Baylor College of Medicine, Houston, TX, USA
MD Anderson Cancer Center, Houston, TX, USA
Universidad Central del Caribe, Bayamón, PR, USA
Nicolas Cooley
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA 15206
Charles Scott Kirby
[email protected]
Johns Hopkins University School of Medicine, Baltimore, MD, USA
Zhao Liu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
Rahil Sethi
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA 15206
Yutong Qiu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA