rahil19/VirusGraphs3

 
 


VirusGraphs3

Solidifying the VirusGraphs Infrastructure for Deployment

insert Charley's figure

Linear reference genomes

Linear references are the gold standard for genomics applications, including capturing viral genome information and recovering viral sequences. Examples include HIV sequence detection and HIV genome assembly. HIV genome assembly can be loosely classified into whole (reference) genome assembly and HIV genotyping (partial assembly).

Graph Reference genomes

Nucleobase graph reference genome

Approximate k-mer graph reference genome

Scope

Implementation

Reference graphs

For nucleobase graphs: NovoGraph, VG.

For approximate k-mer graphs: original SWIGG, implemented SWIGG.

Extension of SWIGG

In the original graph created by SWIGG without any filtering parameters, the number of nodes is usually so large that visualization and other downstream analyses become inefficient. Therefore, we implemented an algorithm that produces a compact de Bruijn graph. Contraction starts with a depth-first search from the source node of the graph. As the algorithm walks through the graph, it keeps adding nodes to the current supernode. A new supernode is created when the algorithm encounters a node with more than one neighbor. The contracted graph is significantly smaller than the original (Table ?).
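The contraction step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function name `contract_graph`, the edge-list input, and the exact branching rule (a chain stops when a node has more than one successor, or when its only successor has more than one predecessor) are assumptions made for this example.

```python
from collections import defaultdict

def contract_graph(edges, source):
    """Contract unbranched chains of a directed graph into supernodes.

    edges:  list of unique (u, v) pairs (hypothetical input format).
    source: node from which the depth-first walk starts.
    Returns (supernodes, super_edges): a list of node lists, and a set of
    (i, j) pairs indexing into that list.
    """
    succ = defaultdict(list)
    pred = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    supernode_of = {}   # node -> index of the supernode that contains it
    supernodes = []     # index -> ordered list of member nodes
    pending = []        # (supernode index, successor node) pairs
    stack = [source]    # chain starts still to visit (LIFO gives DFS order)

    while stack:
        start = stack.pop()
        if start in supernode_of:
            continue
        idx = len(supernodes)
        chain = [start]
        supernodes.append(chain)
        supernode_of[start] = idx
        node = start
        # Keep absorbing nodes while the walk is unambiguous.
        while len(succ[node]) == 1:
            nxt = succ[node][0]
            if len(pred[nxt]) != 1 or nxt in supernode_of:
                break
            chain.append(nxt)
            supernode_of[nxt] = idx
            node = nxt
        # Every successor of the chain end starts (or already is) a supernode.
        for nxt in succ[node]:
            pending.append((idx, nxt))
            stack.append(nxt)

    super_edges = {(i, supernode_of[n]) for i, n in pending}
    return supernodes, super_edges

# Toy graph (made up): a chain A-B-C that forks into D and E, rejoining at F-G.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"),
             ("D", "F"), ("E", "F"), ("F", "G")]
supernodes, super_edges = contract_graph(toy_edges, "A")
print(supernodes)
print(len(super_edges))
```

On this toy graph, seven nodes collapse into four supernodes; only nodes reachable from the source are contracted.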

Transcript modeling

For split-read mapping, HISAT2 was used, and the output .bam file was piped into StringTie; both tools were run in Galaxy.
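For readers working outside Galaxy, a command-line equivalent of this step might look like the sketch below. The Galaxy tool settings used for this project are not recorded here, so everything in the example is an assumption: the helper name `build_pipeline`, the file names, the use of `-U` (unpaired reads), and the intermediate `samtools sort` step. The script only builds and prints the commands; it does not run them.

```python
import shlex

def build_pipeline(index_prefix, reads_fastq, out_prefix):
    """Build HISAT2 -> samtools sort -> StringTie commands (hypothetical helper)."""
    sam = f"{out_prefix}.sam"
    bam = f"{out_prefix}.sorted.bam"
    gtf = f"{out_prefix}.gtf"
    return [
        # Spliced alignment of unpaired reads against a prebuilt HISAT2 index.
        ["hisat2", "-x", index_prefix, "-U", reads_fastq, "-S", sam],
        # StringTie expects coordinate-sorted alignments.
        ["samtools", "sort", "-o", bam, sam],
        # Assemble transcripts and write them as GTF.
        ["stringtie", bam, "-o", gtf],
    ]

# Hypothetical file names; the index would be built from the HXB2 reference.
commands = build_pipeline("hxb2_index", "SRR3472915.fastq", "hiv1")
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps, each command list could be passed to `subprocess.run(cmd, check=True)` on a machine where the three tools are installed.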

Visualization

Gephi is used to visualize the output from SWIGG. Other tools: SnapGene, Graphviz.

Results

for SWIGG

Graphs built with different k-mer lengths: SWIGG built for HIV1 (6 seqs)

The figure above contains five graphs built with different k-mer lengths: (a) k=16, (b) k=20, (c) k=32, (d) k=50, (e) k=90. Longer k-mers span more of each repetitive region, so fewer of them repeat, and the resulting graphs are simpler. Red rectangles highlight the large loop topology in graphs built from small k-mers. These loops result from repetitive k-mers shared by regions that are far apart.
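The effect of k-mer length on repeats can be illustrated on a toy sequence. The sketch below is not part of SWIGG; the function `repeated_kmers` and the sequence `toy` are made up for this example. It counts the k-mers that occur at more than one position, which are the ones that would link distant regions in a k-mer graph.

```python
from collections import Counter

def repeated_kmers(seq, k):
    """Return the set of k-mers that occur at more than one position in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}

# Made-up sequence: the 6-base repeat "ACGTAC" occurs twice,
# separated by unrelated bases.
toy = "ACGTAC" + "GGATTCCA" + "ACGTAC" + "TTGACG"

for k in (3, 6, 8):
    print(k, len(repeated_kmers(toy, k)))
```

For this sequence the number of repeated k-mers drops from 4 at k=3 to 1 at k=6 and 0 at k=8: once k exceeds the repeat length, no k-mer is shared between the two regions.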

Table of number of nodes and edges before and after contraction

         Before Contraction     After Contraction
k-mer    # Nodes    # Edges     # Nodes    # Edges
16       38772      55265       3781       4788
20       42442      55241       2867       3641
32       49041      55169       1350       1703
50       52794      55061       442        554
90       54449      54821       68         83
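To put the size reduction in relative terms, the fraction of nodes removed by contraction can be computed directly from the table. The snippet below is only a worked calculation on the numbers above; the variable names are chosen for this example.

```python
# (k, nodes_before, edges_before, nodes_after, edges_after), copied from the table.
rows = [
    (16, 38772, 55265, 3781, 4788),
    (20, 42442, 55241, 2867, 3641),
    (32, 49041, 55169, 1350, 1703),
    (50, 52794, 55061, 442, 554),
    (90, 54449, 54821, 68, 83),
]

# Fraction of nodes eliminated by contraction, per k-mer length.
node_reduction = {
    k: 1 - nodes_after / nodes_before
    for k, nodes_before, _, nodes_after, _ in rows
}

for k, frac in node_reduction.items():
    print(f"k={k}: {frac:.1%} fewer nodes after contraction")
```

The reduction grows with k, from roughly 90% of nodes removed at k=16 to more than 99% at k=90.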

for NovoGraph

Current NovoGraph script (TBD).

for VG

HIV-1 transcript model

Figure: cDNA+PCR DNAseq ("classic RNAseq"). A. Coverage summary of reads mapped to HXB2 (K03455) with HISAT2 on usegalaxy.eu. B. RmDup-processed reads, controlling for PCR duplicates after initial alignment. Search strategy: SRA. Search terms: "HIV-1 and RNAseq and virus". BioProject: PRJNA320293, specifically SRR3472915. Viewed in IGV. Source: https://github.com/NCBI-Codeathons/Virus_Graphs/edit/master/README.md.

Predicted isoform models will be added soon.

Team & contact info

Alejandro Gener (Lead/Corresponding Author)
[email protected]; [email protected]
Baylor College of Medicine, Houston, TX, USA
MD Anderson Cancer Center, Houston, TX, USA
Universidad Central del Caribe, Bayamón, PR, USA

Nicolas Cooley
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA

Charles Scott Kirby
[email protected]
Johns Hopkins University School of Medicine, Baltimore, MD, USA

Zhao Liu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA

Rahil Sethi

Yutong Qiu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
