Skip to content

Latest commit

 

History

History
124 lines (82 loc) · 8.78 KB

Group3_Hackaton2024.md

File metadata and controls

124 lines (82 loc) · 8.78 KB

BCM hackathon project

August 2024

Analyzing Antimicrobial Resistance Genes in NCBI Sequence Read Archive

Antimicrobial resistance (AMR) is a growing global health concern, driven by the overuse and misuse of antibiotics. Detecting and monitoring the presence of AMR genes in various environments is crucial for understanding the spread of resistance and informing public health strategies. The Logan database of assembled contigs and unitigs, derived from a massive freeze of the NCBI Sequence Read Archive (SRA), offers a unique and comprehensive resource for studying genetic material across a wide array of samples (Chikhi et al., 2024).

In this project, we will align the genes of CARD (Comprehensive Antibiotic Resistance Database) (Alcock et al., 2023) to the Logan database to identify and catalog AMR genes present in the dataset. By leveraging the highly efficient unitigs and contigs of Logan, we aim to detect AMR genes with high accuracy and sensitivity, despite the inherent complexities and challenges of working with such a large-scale dataset (Bradley et al., 2019)[see section “Antibiotic resistance genes in the ENA”]. This work will provide valuable insights into the distribution and prevalence of AMR genes across a vast range of environments and host organisms.

Flowchart

AFlowchart

Summary of Procedures

  1. Data Preparation
    1. Download (subset) unitigs/contigs from the Logan database.
      2. Obtain the CARD database containing curated sequences of known AMR genes (link, file) [Daniel to github]
      3. Download the metadata (data/location) of SRA accessions [Kristen]
      4. Parse the metadata of SRA accessions
    2. Alignment and Detection
      1. Align the sequences from the CARD database to the Logan unitigs/contigs using appropriate bioinformatics tools using minimap2 (Li, 2018) and Diamond (Buchfink et al., 2015)[work in progress]. []
      2. Identify and annotate matches, focusing on high-confidence alignments that suggest the presence of AMR genes.
    3. Post-Processing
      1. Filter and curate the results to remove low-confidence hits (alignment length, alignment identity using NM tag)
      2. Finding literature for AMR genes in SRA
        1. Specific biological question [Hassan]
      3. Summarize the findings in terms of the presence, distribution, and frequency of different AMR genes across the samples including metadata.
      4. Interpret the results considering the limitations of the approach.

Possible Future Directions

  1. Annotation and Visualization:
    1. Develop scripts or pipelines to annotate AMR genes in the Logan dataset.
    2. Create visualizations (e.g., heatmaps, phylogenetic trees, geographic plots) to represent the distribution of AMR genes across samples.
  2. Statistical Analysis:
    1. Perform statistical tests to compare the prevalence of AMR genes across different environments or hosts.
    2. Investigate correlations between the presence of AMR genes and metadata (e.g., sample origin, sequencing platform).

This project will not only contribute to the understanding of AMR gene distribution but also provide participants with hands-on experience in handling large-scale genomic datasets and applying bioinformatics tools in a real-world context.

Extra information

The Logan database:

  • https://github.com/IndexThePlanet/Logan
  • The Logan database is a comprehensive collection of DNA and RNA sequences assembled from the entire NCBI Sequence Read Archive, offering an efficient and condensed representation of vast genomic data through unitigs and contigs.

The CARD database:

  • https://card.mcmaster.ca/

  • The CARD (Comprehensive Antibiotic Resistance Database) is a curated repository of sequences and associated data for known antimicrobial resistance genes, providing a critical resource for the identification and study of resistance mechanisms in various organisms.

  • For parsing CARD files, this code from this paper might be helpful.

    • aro_metadata.tsv & nucleotide_fasta_protein_homolog_model.fasta

Alignment results
Instructions from Rayan here

References

Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., … McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51(D1), D690–D699.
Bradley, P., den Bakker, H. C., Rocha, E. P. C., McVean, G., & Iqbal, Z. (2019). Ultrafast search of all deposited bacterial and viral genomic data. Nature Biotechnology, 37(2), 152–159.
Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60.
Chikhi, R., Raffestin, B., Korobeynikov, A., Edgar, R., & Babaian, A. (2024). Logan: Planetary-Scale Genome Assembly Surveys Life’s Diversity. In bioRxiv (p. 2024.07.30.605881). https://doi.org/10.1101/2024.07.30.605881
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics , 34(18), 3094–3100.

Team members:

Useful links

Zoom link for the group meetings: https://jhubluejays.zoom.us/my/majidian

Zoom link for the hackathon https://dnanexus.zoom.us/j/98866720215?pwd=MUl0WnBmYUQyc1dHU3l3cGpuNHAyQT09

Hackathon schedule: https://docs.google.com/document/d/1d9ZsMiFST9YFqCRGv23OA6V5S-iCwPbQXfJrdJWx5xc/edit#heading=h.ljuyhf5x89pp

GitHub page https://github.com/collaborativebioinformatics/2024_pangenomegraph

DNA nexus command line help page https://documentation.dnanexus.com/getting-started/cli-quickstart

Hackathon paper
https://docs.google.com/document/d/1VdB_qICMPKzlhnhZvWOAz54-u7s7aa_H2NSEp32tgx4/edit

name role task E-mail (gmail) Github dnaNexus
Daniel colead Sysadmin [email protected] DanielPAagustinho DanielPA
Sina colead [email protected] sinamajidian smajidian
Abohassan Flowchart/ writer1 [email protected] AbolhassanBahari abolhassan
Kristen [email protected] kdc10 kristendcurry
Aanuoluwa [email protected] aanuoluwaduro aanuoluwaduro
Christian
Jen-Yu writer2 [email protected] jyw-atgithub jenyuwang
Rayan
Narges [email protected] nargessangaranipour NargesSangaraniPour

roles to be distributed: Sysadmin DNAnexus /Github/ , Flowchart, presenter_day1, writer1, writer2

ToDo list: (day1 )

Higher priority on top

  • Prepare flowchart [Hassan]

  • Add people to the DNA nexus [Daniel]

  • Downloading Alignment to the DNA nexus [ ? ]

  • [ ]

  • [ ]

  • Filtering alignment (think of packages, command line) [Jen-Yen]

  • Work on the metadata of SRA accessions [Kristen]

  • Literature on AMR genes and SRA [Hassan]

  • Add people to the github

Updates: