Solidifying the VirusGraphs Infrastructure for Deployment

NCBI-Codeathons/VirusGraphs3


VirusGraphs3

Solidifying the Open Virus Graphs Infrastructure for Deployment

Note that the following information has been produced in a collaborative open-science format. It has not been peer-reviewed, and should not be construed as medical advice, nor should it be used to inform clinical practice.

Are HIV-1 Graph Reference Genomes Right for Me?

Do you know what HIV sequence you're working with? If not, you're probably a clinician working with patient samples. If yes, you're probably a basic scientist wishing to control as many variables as possible. Either way, you want to maximize sensitivity for HIV while maintaining specificity. These questions can be approached with an appropriate linear reference genome (the standard), multiple reference genomes, graph reference genomes, or transcript models, which are a special case of multiple linear reference genomes.

Linear reference genomes

Linear references are the gold standard for genomics applications, including capturing viral genome information and recovering viral sequence. Examples include HIV sequence detection and HIV genome assembly. HIV genome assembly can be loosely classified into whole (reference) genome assembly and HIV genotyping (partial assembly).

Graph reference genomes

Nucleobase graph reference genome

More soon..

Approximate k-mer graph reference genome

More soon..

HIV-1 transcript model

There is none!

Scope

Implementation

Reference graphs

For nucleobase graphs: NovoGraph, VG.

For approximate k-mer graphs: original SWIGG, implemented SWIGG.

Extension of SWIGG

In the original graph created by SWIGG without any filtering parameters, the number of nodes is usually so large that visualization and other downstream analyses become inefficient. We therefore implemented an algorithm that builds a compact de Bruijn graph. Contraction starts with a depth-first search from the source node of the graph. As the algorithm walks through the graph, it keeps adding nodes to the current supernode, and it starts a new supernode whenever it encounters a node with more than one neighbor. Contraction reduces the size of the graph substantially (see the table of node and edge counts under Results).

The algorithm used to build and contract nodes in the de Bruijn graph is equivalent to finding the maximal non-branching paths. A description of the algorithm can be found here.
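
As a rough illustration of maximal non-branching path contraction, here is a minimal Python sketch. It is not the code in swigg_ext.py: the function names (build_kmer_graph, contract, spell) are hypothetical, nodes are taken to be k-mers linked to the next k-mer in each sequence, and the sketch iterates over chain start nodes rather than running a depth-first search from a single source node.

```python
from collections import defaultdict


def build_kmer_graph(seqs, k):
    """Link each k-mer to the k-mer that follows it in any input sequence."""
    nodes, succ, pred = set(), defaultdict(set), defaultdict(set)
    for s in seqs:
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def contract(nodes, succ, pred):
    """Group nodes into supernodes (maximal non-branching chains)."""

    def starts_chain(v):
        # v opens a chain unless its single predecessor has v as its only successor.
        if len(pred[v]) != 1:
            return True
        (u,) = pred[v]
        return len(succ[u]) != 1

    visited, supernodes = set(), []
    for v in sorted(nodes):
        if v in visited or not starts_chain(v):
            continue
        chain = [v]
        visited.add(v)
        # Extend while the tail has one successor and that successor has one predecessor.
        while len(succ[chain[-1]]) == 1:
            (w,) = succ[chain[-1]]
            if len(pred[w]) != 1 or w in visited:
                break
            chain.append(w)
            visited.add(w)
        supernodes.append(chain)
    # Any node still unvisited lies on an isolated, branch-free cycle.
    for v in sorted(nodes):
        if v in visited:
            continue
        chain = [v]
        visited.add(v)
        (w,) = succ[v]
        while w not in visited:
            chain.append(w)
            visited.add(w)
            (w,) = succ[w]
        supernodes.append(chain)
    return supernodes


def spell(chain):
    """Spell the sequence of a supernode from its overlapping k-mers."""
    return chain[0] + "".join(kmer[-1] for kmer in chain[1:])
```

For the toy input ["ACGTT", "ACGTA"] with k=3, the four k-mer nodes collapse into three supernodes: the shared prefix ACGT and the two branch tips GTT and GTA.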

To run the SWIGG extension, run: python3 swigg_ext.py -f seqfile.fasta -k kmer_length -o output_prefix

Transcript modeling

For split-read mapping, HISAT2 was used, and the output .bam was piped into StringTie, both in Galaxy.
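
The analysis above was run in Galaxy, so no command lines are recorded here. As a hedged sketch of what a roughly equivalent local run might look like, the Python helper below only assembles the commands; it does not execute them. The helper name, the file names, and the assumption of single-end reads (-U) are all illustrative and not taken from the project.

```python
def splice_aware_pipeline(index_prefix, reads_fastq, out_prefix):
    """Return HISAT2 -> samtools sort -> StringTie commands as argument lists.

    Assumes single-end reads and a prebuilt HISAT2 index (hypothetical inputs).
    """
    sam = f"{out_prefix}.sam"
    bam = f"{out_prefix}.sorted.bam"
    gtf = f"{out_prefix}.gtf"
    return [
        # --dta reports alignments tailored for transcript assemblers such as StringTie.
        ["hisat2", "--dta", "-x", index_prefix, "-U", reads_fastq, "-S", sam],
        # StringTie expects a coordinate-sorted BAM file.
        ["samtools", "sort", "-o", bam, sam],
        ["stringtie", bam, "-o", gtf],
    ]
```

Each returned list could be passed to subprocess.run(cmd, check=True) once the tools are installed and the inputs exist.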

Visualization

Gephi is used to visualize the output from SWIGG. Other tools: SnapGene, Graphviz.

Results

for SWIGG

Graphs built with different k-mer lengths: SWIGG built for HIV1 (6 seqs)

The above figure contains five graphs built with different k-mer lengths: (a) k=16, (b) k=20, (c) k=32, (d) k=50, (e) k=90. Longer k-mers span more of each repetitive region, so fewer k-mers recur at distant positions and the resulting graphs are simpler. Red rectangles highlight the large loop topology in graphs built from short k-mers. These loops result from repetitive k-mers that occur in regions far apart.
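
To make the effect of k-mer length concrete, here is a small sketch that counts how many distinct k-mers occur more than once in a sequence. The toy sequence and the function name repeated_kmers are made up for illustration; they are not HIV-1 data or part of SWIGG.

```python
from collections import Counter


def repeated_kmers(seq, k):
    """Count distinct k-mers that occur at more than one position in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for c in counts.values() if c > 1)


# Toy sequence: a 7-base repeat (GATTACA) at both ends of a unique spacer.
toy = "GATTACA" + "CTGCAGTC" + "GATTACA"

for k in (3, 7, 8):
    print(k, repeated_kmers(toy, k))
```

Once k exceeds the length of the repeat, no k-mer recurs, so the two distant copies no longer collapse onto shared nodes and no loop forms.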

Table of number of nodes and edges before and after contraction

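| k-mer length | Nodes (before contraction) | Edges (before contraction) | Nodes (after contraction) | Edges (after contraction) |
|---|---|---|---|---|
| 16 | 38772 | 55265 | 3781 | 4788 |
| 20 | 42442 | 55241 | 2867 | 3641 |
| 32 | 49041 | 55169 | 1350 | 1703 |
| 50 | 52794 | 55061 | 442 | 554 |
| 90 | 54449 | 54821 | 68 | 83 |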
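
The size reduction can be read directly off the table. The short sketch below computes the fraction of nodes removed by contraction for each k-mer length; the counts are copied from the table above, while the variable and function names are illustrative only.

```python
# k-mer length -> (nodes before, edges before, nodes after, edges after)
COUNTS = {
    16: (38772, 55265, 3781, 4788),
    20: (42442, 55241, 2867, 3641),
    32: (49041, 55169, 1350, 1703),
    50: (52794, 55061, 442, 554),
    90: (54449, 54821, 68, 83),
}


def node_reduction(k):
    """Fraction of nodes removed by contraction for k-mer length k."""
    nodes_before, _, nodes_after, _ = COUNTS[k]
    return 1 - nodes_after / nodes_before


for k in sorted(COUNTS):
    print(f"k={k}: {node_reduction(k):.1%} fewer nodes after contraction")
```

The reduction grows with k: contraction removes about nine in ten nodes at k=16 and more than 99% at k=90.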

for NovoGraph

Current NovoGraph script.. (TBD).

for VG

Graphs built with multiple sequence alignment algorithm: VG built for HIV1 (6 seqs)

The above graph shows part of the viral graph generated with a multiple sequence alignment algorithm using vg. The lines in the graph show the possible connections between nucleotide nodes.

HIV-1 transcript model

Figure: cDNA+PCR DNAseq ("classic RNAseq"). A. Coverage summary of reads mapped to HXB2 (K03455) with HISAT2 on usegalaxy.eu. B. RmDup-processed reads, controlling for PCR duplicates after initial alignment. Search strategy: SRA. Search terms: "HIV-1 and RNAseq and virus". BioProject: PRJNA320293, specifically SRR3472915. Viewed in IGV. Source: https://github.com/NCBI-Codeathons/Virus_Graphs/edit/master/README.md.

Will add predicted isoform models soon..

Team & contact info

Alejandro Rafael Gener (Lead/Corresponding Author)
[email protected]; [email protected]
Baylor College of Medicine, Houston, TX, USA
MD Anderson Cancer Center, Houston, TX, USA
Universidad Central del Caribe, Bayamón, PR, USA

Nicolas Cooley
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA 15206

Charles Scott Kirby
[email protected]
Johns Hopkins University School of Medicine, Baltimore, MD, USA

Zhao Liu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA

Rahil Sethi
[email protected]
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA 15206

Yutong Qiu
[email protected]
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
