VirusGraphs3

Solidifying the Open Virus Graphs Infrastructure for Deployment

Note that the following information has been produced in a collaborative open-science format. It has not been peer-reviewed, and should not be construed as medical advice, nor should it be used to inform clinical practice.

Are HIV-1 Graph Reference Genomes Right for Me?

Do you know what HIV sequence you're working with? If not, you're probably a clinician working with patient samples. If yes, you're probably a basic scientist wishing to control as many variables as possible. Either way, you want to maximize sensitivity for HIV sequences while maintaining specificity. These goals can be approached with an appropriate linear reference genome (the standard), multiple linear reference genomes, graph reference genomes, or a special case of multiple linear reference genomes that uses transcript models.

Linear reference genomes

Linear references are the gold standard for genomics applications, including capturing viral genome information and recovering viral sequences. Examples include HIV sequence detection and HIV genome assembly. HIV genome assembly can be loosely classified into whole (reference) genome assembly and HIV genotyping (partial assembly).

Graph Reference genomes

Nucleobase graph reference genome

More soon..

Approximate k-mer graph reference genome

More soon..

HIV-1 transcript model

There is none!

Scope

Implementation

Reference graphs

For nucleobase graphs: NovoGraph, VG.

For approximate k-mer graphs: original SWIGG, implemented SWIGG.

Extension of SWIGG

In the original graph created by SWIGG without any filtering parameters, the number of nodes is usually so large that visualization and other downstream analyses become inefficient. Therefore, we implemented an algorithm to build a compact de Bruijn graph. Contraction starts with a depth-first search from the source node of the graph. Nodes are added to the current supernode as the algorithm walks through the graph, and a new supernode is created whenever the algorithm encounters a node with more than one neighbor. The contracted graph is substantially smaller than the original (Table ?).

The algorithm used to build and contract nodes in the de Bruijn graph is the same as finding the maximal non-branching paths. A description of the algorithm can be found here.
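The contraction idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in swigg_ext.py: it assumes a plain de Bruijn graph whose nodes are (k-1)-mers, and the function names (build_de_bruijn, maximal_non_branching_paths, spell) are hypothetical. Each returned path corresponds to one contracted unit; the repository's implementation groups nodes into supernodes and may differ in detail.

```python
def build_de_bruijn(sequences, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())  # make sure sink nodes exist
    return graph


def maximal_non_branching_paths(graph):
    """Return every maximal path whose internal nodes have in-degree 1 and out-degree 1."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for v in successors:
            indegree[v] += 1

    def one_in_one_out(node):
        return indegree[node] == 1 and len(graph[node]) == 1

    paths, visited = [], set()
    # Start a path at every branching node, source, or sink.
    for u in graph:
        if one_in_one_out(u):
            continue
        for v in sorted(graph[u]):
            path = [u, v]
            while one_in_one_out(v):
                visited.add(v)
                v = next(iter(graph[v]))
                path.append(v)
            paths.append(path)
    # Whatever is left unvisited lies on an isolated cycle.
    for u in graph:
        if one_in_one_out(u) and u not in visited:
            cycle, v = [u], next(iter(graph[u]))
            visited.add(u)
            while v != u:
                visited.add(v)
                cycle.append(v)
                v = next(iter(graph[v]))
            cycle.append(u)
            paths.append(cycle)
    return paths


def spell(path):
    """Spell the sequence of a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


toy_graph = build_de_bruijn(["TACGT", "TACGA"], k=3)
toy_paths = maximal_non_branching_paths(toy_graph)
toy_contigs = sorted(spell(p) for p in toy_paths)
```

In this toy case the two sequences share the prefix TACG and then diverge, so the shared prefix collapses into one path and each divergent base gets its own short path.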

To run the SWIGG extension, run: python3 swigg_ext.py -f seqfile.fasta -k kmer_length -o output_prefix

Transcript modeling

For split-read mapping, HISAT2 was used, and the output .bam was piped into StringTie, both in Galaxy.

Visualization

Gephi is used to visualize the output from SWIGG. Other tools: SnapGene, Graphviz.

Results

for SWIGG

Graphs built with different k-mer lengths:

The above figure contains five graphs built with different k-mer lengths: (a) k=16, (b) k=20, (c) k=32, (d) k=50, (e) k=90. Longer k-mers span more of each repetitive region, so fewer k-mers occur at more than one position, and the resulting graphs are simpler. Red rectangles highlight the large loop topology in graphs made with small k-mers. The loops result from repetitive k-mers that occur in regions far apart.
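The effect of k on repeats can be checked on a toy sequence. The sketch below is only an illustration: the helper name repeated_kmer_count and the 15-base toy sequence are made up for this example and are not part of SWIGG. It counts how many distinct k-mers occur at more than one position, which is what creates loops in the graph.

```python
from collections import Counter


def repeated_kmer_count(seq, k):
    """Count distinct k-mers that occur at more than one position in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for n in counts.values() if n > 1)


# Toy sequence: a 6-base repeat (ACGTAC) at both ends, separated by GGG.
toy_seq = "ACGTAC" + "GGG" + "ACGTAC"

repeats_by_k = {k: repeated_kmer_count(toy_seq, k) for k in (3, 6, 7)}
```

Once k exceeds the length of the repeat, no k-mer is shared between the two copies, so the loop between them disappears.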

Table of number of nodes and edges before and after contraction

k-mer length	# Nodes (before contraction)	# Edges (before contraction)	# Nodes (after contraction)	# Edges (after contraction)
16	38772	55265	3781	4788
20	42442	55241	2867	3641
32	49041	55169	1350	1703
50	52794	55061	442	554
90	54449	54821	68	83
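To make the size reduction concrete, the fold change in node and edge counts can be computed directly from the table. The snippet below is a small sketch that only restates the numbers above; the variable names are illustrative.

```python
# (k, nodes_before, edges_before, nodes_after, edges_after), copied from the table.
contraction_table = [
    (16, 38772, 55265, 3781, 4788),
    (20, 42442, 55241, 2867, 3641),
    (32, 49041, 55169, 1350, 1703),
    (50, 52794, 55061, 442, 554),
    (90, 54449, 54821, 68, 83),
]

# Fold reduction = count before contraction / count after contraction.
node_fold = {k: nb / na for k, nb, _, na, _ in contraction_table}
edge_fold = {k: eb / ea for k, _, eb, _, ea in contraction_table}

for k in node_fold:
    print(f"k={k}: nodes reduced {node_fold[k]:.1f}x, edges reduced {edge_fold[k]:.1f}x")
```

The node count drops by roughly 10-fold at k=16 and by roughly 800-fold at k=90, consistent with longer k-mers producing simpler graphs.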

for NovoGraph

Current NovoGraph script.. (TBD).

for VG

Graphs built with multiple sequence alignment algorithm:

The above graph shows part of the viral graph generated with vg using a multiple sequence alignment algorithm. The lines in the graph show the possible connections between nucleotide nodes.

HIV-1 transcript model

Figure: cDNA+PCR DNAseq ("classic RNAseq"). A. Coverage summary of reads mapped to HXB2 K03455 with HISAT2 on usegalaxy.eu. B. RmDup-processed reads, controlling for PCR duplicates after initial alignment. Search strategy = SRA. Search terms: "HIV-1 and RNAseq and virus". Bioproject: PRJNA320293, specifically SRR3472915. Viewed in IGV. Source: https://github.com/NCBI-Codeathons/Virus_Graphs/edit/master/README.md.

Will add predicted isoform models soon..

Team & contact info

Alejandro Rafael Gener (Lead/Corresponding Author)
[email protected]; [email protected]
Baylor College of Medicine, Houston, TX, USA
MD Anderson Cancer Center, Houston, TX, USA
Universidad Central del Caribe, Bayamón, PR, USA

Nicolas Cooley
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA 15206

Charles Scott Kirby
[email protected]
Johns Hopkins University School of Medicine, Baltimore, MD, USA

Zhao Liu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA

Rahil Sethi
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA 15206

Yutong Qiu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
