This project comprises a standardized alignment pipeline for single-cell RNA-seq reads, mapping these reads to locations of known genes, and saving gene expression and alignment metadata in HDF5 format. This data is suitable for input in the scQuery system at https://scquery.cs.cmu.edu/ .
Multiple command-line scripts are provided, for different input types. These scripts share many command-line options, and ultimately save gene expression and alignment metadata in HDF5 format.
process_fastq_file.py
takes one or two FASTQ files as input: one file for single-end, and two files for paired-end.process_fastq_directory.py
searches for FASTQ files in a given directory, identifies paired-end files via_1.fastq
and_2.fastq
suffixes, aligns all single- or paired-end files found, and saves a single output file with gene expression for all FASTQ files.process_sra_file.py
is similar toprocess_fastq_file.py
, but converts reads from SRA to FASTQ format before alignment. Single- and paired-end data is automatically detected.process_sra_directory.py
processes all SRA files in a given directory.
These scripts share some command-line arguments:
-s
or--subprocesses
: number of subprocesses to use for alignment.--reference-path
: path on disk of the HISAT2 index. Note that this path is not of any actual file on disk, but is the "base name" of the HISAT2 index. For instance, if the index is labeled "mm10" and is stored in the/path/to/index
directory as filesmm10.1.ht2
throughmm10.8.ht2
, the reference path would be/path/to/index/mm10
.--output-file
: where to save the gene expression and alignment metadata. If omitted, scripts will save alignment results to an appropriate place. See each script's--help
information for details.--hisat2-options
: extra options passed directly to HISAT2. This must be provided as a single string; this value will be split on whitespace with shell quoting rules and each piece will be a separate argument/option to HISAT2. For multiple options, it will probably be necessary to surround the option string in quotes. For example:--hisat2-options="--mp 4,2 --phred64"
Expression data uploaded to the scQuery service must be aligned to the mouse
genome, release mm10
. A pre-built index for mm10
can be downloaded from the
HISAT2 website, but we recommend using a reference genome index which includes
known splice sites. Such an index is available for download at
https://s3.amazonaws.com/scquery/processed_data/mm10-splice-sites.tar . This
archive can be extracted to any appropriate location, for use with the
--reference-path
script option described above.
After alignment to a reference genome, reads are mapped to genes via a mapping from chromosome names to interval tree indexes. This index is built from the NCBI Consensus CDS (CCDS) data, and the current CCDS database for mouse is available at ftp://ftp.ncbi.nih.gov/pub/CCDS/current_mouse/CCDS.current.txt
If desired, one can build this index from a local copy of the CCDS database, using
the build_tree.py
script. For convenience, a prebuilt index is also available at
https://s3.amazonaws.com/scquery/processed_data/mouse-ccds-index.tar.xz -- this
archive should be extracted to the same directory containing this README file.
- Python 3.6 or newer
- Extra Python packages:
data-path-utils
,intervaltree
,pandas
,pytables
- HISAT2, version 2.1.0 or newer
- NCBI SRA Toolkit, if working with SRA files