---
title: "fastq-dl: efficiently download sequences from ENA and SRA"
tags:
  - fastq
  - download
  - python
  - bioinformatics
authors:
  - name: Robert A. Petit III
    orcid: 0000-0002-1350-9426
    affiliation: "1, 2"
  - name: Michael B. Hall
    orcid: 0000-0003-3683-6208
    affiliation: 3
  - name: Gerry Tonkin-Hill
    orcid: 0000-0002-1350-9426
    affiliation: 4
  - name: Jie Zhu
    affiliation: 5
  - name: Timothy D. Read
    orcid: 0000-0001-8966-9680
    affiliation: 2
affiliations:
  - name: Wyoming Public Health Laboratory, Wyoming Department of Health, Cheyenne, Wyoming, USA
    index: 1
  - name: Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA
    index: 2
  - name: Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Australia
    index: 3
  - name: Department of Biostatistics, University of Oslo, Oslo, Norway
    index: 4
  - name: Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, PR China
    index: 5
date: 17 June 2023
bibliography: paper.bib
---

Summary

fastq-dl is a bioinformatic tool that simplifies retrieving FASTQ files from ENA and SRA. It was developed to be easy to use and accessible to researchers from all backgrounds. By making it efficient to download publicly available FASTQ files, fastq-dl lets users easily integrate these data into their own research. fastq-dl can be installed from PyPI and Bioconda, and the source code is available at https://github.com/rpetit3/fastq-dl.

Statement of Need

High-throughput sequencing technologies have revolutionized the field of genomics, enabling researchers to generate vast amounts of data quickly and at relatively low cost. The European Nucleotide Archive (ENA) [@Burgin_2023] and the Sequence Read Archive (SRA) [@Katz_2022] are two major repositories that publicly host next-generation sequencing data from many research projects. Retrieving sequences from these repositories is often a multi-step process that is difficult for researchers who lack experience with bioinformatics. fastq-dl is a bioinformatic tool that simplifies the process of downloading sequences from SRA and ENA.

Implementation

fastq-dl is written in Python and is designed to be simple to use. Users can submit queries to the ENA, via a REST API [@Burgin_2023], or SRA, via pysradb [@Choudhary_2019], with fallback mechanisms in the event either repository is down. fastq-dl supports a range of query types: taxon IDs, species names, and accessions (BioSample, BioProject, Experiment, and Run). A query returns metadata for each hit and saves this metadata to a tab-delimited file. Unless disabled by the user, fastq-dl then downloads the available sequences for each hit of the query. When ENA is used, raw FASTQ files are downloaded from its FTP service; otherwise, fasterq-dump [@Katz_2022] is used to download from SRA. If a repository is unresponsive, download attempts are made against the other repository. When an Experiment or BioSample has multiple Run accessions associated with it, users can optionally merge these Run accessions. Upon completion, users are provided with a summary file, a metadata file, and FASTQ files for each query hit.
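
The fallback and merge behaviour described above can be illustrated with a minimal Python sketch. This is not fastq-dl's actual code: the function names, the `ProviderError` exception, the stub downloaders, and the placeholder accessions are all hypothetical, and the metadata column names (`run_accession`, `experiment_accession`) are assumed for illustration. Real downloaders would wrap the ENA FTP service and fasterq-dump.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# A downloader takes a Run accession and returns the FASTQ paths it wrote.
Downloader = Callable[[str], List[str]]


class ProviderError(Exception):
    """Hypothetical error raised when a repository cannot be reached."""


def download_with_fallback(
    run: str, providers: Sequence[Tuple[str, Downloader]]
) -> Tuple[str, List[str]]:
    """Try each provider in order and return (provider name, FASTQ paths)
    from the first one that succeeds."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(run)
        except ProviderError as exc:
            # Remember the failure and move on to the next repository.
            errors.append(f"{name}: {exc}")
    raise ProviderError(f"all providers failed for {run}: " + "; ".join(errors))


def group_runs(metadata: List[Dict[str, str]], key: str) -> Dict[str, List[str]]:
    """Group Run accessions by a metadata column (for example the Experiment
    accession) so that the runs in each group can be merged afterwards."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in metadata:
        groups[row[key]].append(row["run_accession"])
    return dict(groups)


# Stub downloaders standing in for the real ENA and SRA back ends.
def ena_stub(run: str) -> List[str]:
    raise ProviderError("FTP service unreachable")


def sra_stub(run: str) -> List[str]:
    return [f"{run}_1.fastq.gz", f"{run}_2.fastq.gz"]


# Placeholder accessions, not real records.
provider, files = download_with_fallback(
    "SRR0000001", [("ena", ena_stub), ("sra", sra_stub)]
)
```

Passing the providers as an ordered list keeps the preferred repository configurable: swapping the order makes SRA the first choice and ENA the fallback, without changing the retry logic.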

Funding

This project was partially supported by the Georgia Emerging Infections Program and the Wyoming Department of Health.
