scRNA-seq alignment pipeline

This project provides a standardized alignment pipeline for single-cell RNA-seq reads. It maps reads to the locations of known genes and saves gene expression and alignment metadata in HDF5 format. The output is suitable as input to the scQuery system at https://scquery.cs.cmu.edu/ .

Usage

Several command-line scripts are provided, one for each input type. They share many command-line options, and each saves gene expression and alignment metadata in HDF5 format.

process_fastq_file.py takes one or two FASTQ files as input: one file for single-end, and two files for paired-end.
process_fastq_directory.py searches for FASTQ files in a given directory, identifies paired-end files via _1.fastq and _2.fastq suffixes, aligns all single- or paired-end files found, and saves a single output file with gene expression for all FASTQ files.
process_sra_file.py is similar to process_fastq_file.py, but converts reads from SRA to FASTQ format before alignment. Single- and paired-end data is automatically detected.
process_sra_directory.py processes all SRA files in a given directory.
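
The pairing rule used by process_fastq_directory.py can be illustrated with a short sketch. This is not the script's actual code: the function name group_fastq_files and the choice of dictionary keys are assumptions made for illustration. Only the _1.fastq and _2.fastq suffix convention comes from the description above.

```python
def group_fastq_files(filenames):
    """Group FASTQ file names into single-end or paired-end sets.

    Hypothetical helper: a file ending in _1.fastq is treated as paired
    only when the matching _2.fastq file is also present.
    """
    names = set(filenames)
    groups = {}
    for name in sorted(names):
        if not name.endswith(".fastq"):
            continue  # ignore anything that is not a FASTQ file
        if name.endswith("_1.fastq"):
            base = name[: -len("_1.fastq")]
            mate = base + "_2.fastq"
            if mate in names:
                groups[base] = (name, mate)
                continue
        elif name.endswith("_2.fastq"):
            mate = name[: -len("_2.fastq")] + "_1.fastq"
            if mate in names:
                continue  # already grouped with its _1 mate
        # No mate found: treat the file as single-end.
        groups[name[: -len(".fastq")]] = (name,)
    return groups


if __name__ == "__main__":
    found = ["a_1.fastq", "a_2.fastq", "b.fastq", "notes.txt"]
    for sample, files in group_fastq_files(found).items():
        print(sample, files)
```

In this sketch, a file with a _1.fastq suffix and no mate is treated as single-end; whether the real script does the same is not stated here.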

These scripts share some command-line arguments:

-s or --subprocesses: number of subprocesses to use for alignment.
--reference-path: base name of the HISAT2 index. This is not the path of an actual file on disk; it is the common prefix shared by the index files. For instance, if the index is named "mm10" and is stored in the /path/to/index directory as files mm10.1.ht2 through mm10.8.ht2, the reference path is /path/to/index/mm10.
--output-file: where to save the gene expression and alignment metadata. If omitted, each script chooses a default output location; see the script's --help output for details.
--hisat2-options: extra options passed directly to HISAT2. This must be provided as a single string; this value will be split on whitespace with shell quoting rules and each piece will be a separate argument/option to HISAT2. For multiple options, it will probably be necessary to surround the option string in quotes. For example: --hisat2-options="--mp 4,2 --phred64"
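
Splitting "on whitespace with shell quoting rules" matches the behavior of shlex.split in the Python standard library. The sketch below shows how the example string above would be broken into separate arguments. That the scripts use shlex internally is an assumption, and split_hisat2_options is a hypothetical name.

```python
import shlex


def split_hisat2_options(option_string):
    """Split one option string into separate HISAT2 arguments.

    Hypothetical helper; shlex.split honors shell-style quoting, so a
    quoted value containing spaces stays together as one argument.
    """
    return shlex.split(option_string)


if __name__ == "__main__":
    print(split_hisat2_options("--mp 4,2 --phred64"))
    # prints ['--mp', '4,2', '--phred64']
```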

Data Requirements

Short Read Alignment

Expression data uploaded to the scQuery service must be aligned to the mouse genome, release mm10. A pre-built index for mm10 can be downloaded from the HISAT2 website, but we recommend using a reference genome index which includes known splice sites. Such an index is available for download at https://s3.amazonaws.com/scquery/processed_data/mm10-splice-sites.tar . This archive can be extracted to any appropriate location, for use with the --reference-path script option described above.
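
After extracting an index archive, it can be useful to confirm that a candidate --reference-path value really is the base name of a complete index. The sketch below checks for the files <base>.1.ht2 through <base>.8.ht2, following the naming in the --reference-path example above. The function name missing_index_files is hypothetical, and the file count and suffix are parameters because they are assumptions about the index in use.

```python
from pathlib import Path


def missing_index_files(reference_path, count=8, suffix="ht2"):
    """Return the expected index files that do not exist on disk.

    Hypothetical helper: reference_path is the index base name, e.g.
    /path/to/index/mm10, not the path of any single file.
    """
    expected = [
        Path(f"{reference_path}.{i}.{suffix}") for i in range(1, count + 1)
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_index_files("/path/to/index/mm10")
    if missing:
        print("Missing index files:")
        for path in missing:
            print(" ", path)
    else:
        print("Index looks complete.")
```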

Expression Quantification

After alignment to a reference genome, reads are mapped to genes via a mapping from chromosome names to interval tree indexes. This index is built from the NCBI Consensus CDS (CCDS) data, and the current CCDS database for mouse is available at ftp://ftp.ncbi.nih.gov/pub/CCDS/current_mouse/CCDS.current.txt
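
The lookup described above can be sketched as a dictionary keyed by chromosome name. For brevity, this sketch scans a plain list of intervals per chromosome instead of using a real interval tree, so it shows the query semantics only, not the performance. The function names, the gene names, and the half-open coordinate convention are all assumptions and are not taken from the pipeline.

```python
from collections import defaultdict


def build_index(records):
    """Build a mapping from chromosome name to a list of gene intervals.

    records: iterable of (chromosome, start, end, gene) tuples, with
    half-open [start, end) coordinates (an assumption of this sketch).
    """
    index = defaultdict(list)
    for chrom, start, end, gene in records:
        index[chrom].append((start, end, gene))
    return dict(index)


def genes_at(index, chrom, start, end):
    """Return sorted names of genes overlapping [start, end) on chrom."""
    hits = {
        gene
        for s, e, gene in index.get(chrom, [])
        if s < end and start < e  # standard half-open overlap test
    }
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical gene names and coordinates, for illustration only.
    index = build_index([
        ("chr1", 100, 200, "GeneA"),
        ("chr1", 150, 300, "GeneB"),
        ("chr2", 100, 200, "GeneC"),
    ])
    print(genes_at(index, "chr1", 180, 190))
```

An interval tree answers the same query without scanning every interval on the chromosome, which matters when each of many reads is looked up against all known genes.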

If desired, one can build this index from a local copy of the CCDS database, using the build_tree.py script. For convenience, a prebuilt index is also available at https://s3.amazonaws.com/scquery/processed_data/mouse-ccds-index.tar.xz . This archive should be extracted to the directory containing this README file.

Software Requirements

Python 3.6 or newer
Extra Python packages: data-path-utils, intervaltree, pandas, pytables
HISAT2, version 2.1.0 or newer
NCBI SRA Toolkit, if working with SRA files
