ProcessDE

Processing RNA-seq data for gene expression analysis

Starting from isoform-level counts quantified with Salmon (recommended), read counts obtained from BAM files with samtools idxstats, or vast-tools expression counts, ProcessDE uses the exactTest function of the R package edgeR to test each desired contrast for differential expression. The replicates belonging to each sample type, and the contrasts to test, are specified in CSV tables. A gene is considered differentially expressed if (1) its expression, measured in counts per million (CPM), is at least a given threshold in at least one sample type of the contrast; (2) its log2-fold change is at least a given threshold; and (3) its FDR is lower than a given threshold. The default thresholds can be changed.
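
The three criteria can be illustrated with a minimal Python sketch. This is not ProcessDE's code (the tool itself is written in R). The function name, the threshold values, the use of the absolute log2-fold change, and the use of a single CPM value per sample type are all assumptions made for illustration.

```python
def passes_filters(cpm_experimental, cpm_control, log2_fc, fdr,
                   min_cpm=5.0, min_log2_fc=1.0, max_fdr=0.05):
    """Apply the three criteria to one gene in one contrast.

    The threshold defaults are hypothetical, not ProcessDE's defaults.
    """
    # (1) expressed in at least one sample type of the contrast
    expressed = max(cpm_experimental, cpm_control) >= min_cpm
    # (2) fold change large enough; taking the absolute value is an assumption
    changed = abs(log2_fc) >= min_log2_fc
    # (3) FDR strictly below the cutoff
    significant = fdr < max_fdr
    return expressed and changed and significant


# Made-up values for four genes in a single contrast
genes = {
    "geneA": dict(cpm_experimental=40.0, cpm_control=8.0, log2_fc=2.3, fdr=0.001),
    "geneB": dict(cpm_experimental=2.0, cpm_control=1.0, log2_fc=1.0, fdr=0.01),
    "geneC": dict(cpm_experimental=50.0, cpm_control=45.0, log2_fc=0.15, fdr=0.6),
    "geneD": dict(cpm_experimental=3.0, cpm_control=30.0, log2_fc=-3.3, fdr=0.0004),
}
differential = [name for name, values in genes.items() if passes_filters(**values)]
print(differential)
```

In this toy data, geneB fails the expression criterion and geneC fails both the fold-change and FDR criteria, so only geneA and geneD are reported.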

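Note that exactTest performs pairwise comparisons between two groups, so this approach may not be appropriate if your experimental design is complex, for example if it involves several factors.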

Dependencies

R packages 'optparse', 'edgeR', 'gplots', and 'tximport'.

Usage

ProcessDE -h

ProcessDE -s <CSVFILE> -c <CSVFILE> --tx2gene <TSVFILE> --geneInfo <TSVFILE>

Input

  • Raw data files: Salmon transcript-level output files (recommended); output from samtools idxstats quantification of BAM files; or vast-tools expression output table
  • Sample table in CSV format with columns Sample and Type. Sample names must be unique. There must be replicates for at least one of the sample types in each contrast. In case of Salmon counts, an additional File column must be present that contains the path to the raw files.
  • Contrast table in CSV format with columns Experimental and Control. Each line defines a contrast, where the experimental and control types must match entries in the Type column of the sample table.
  • In case of Salmon input: A tab-separated file mapping transcripts to genes, with at least the columns transcID, geneID and geneName, where transcID and geneID match the IDs used for Salmon quantification.
  • In case of Salmon input: A tab-separated gene information file with at least the columns chrom, start, end, strand, geneID, geneName and biotype.
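
The requirements on the sample and contrast tables can be checked before running the tool. The following Python sketch uses the column names given above. The sample names, type names, and file paths are hypothetical, and the function `validate_tables` is illustrative and not part of ProcessDE.

```python
import csv
import io
from collections import Counter

# Hypothetical sample table for Salmon input (File column required)
SAMPLES = """Sample,Type,File
ctl_1,Control,quant/ctl_1/quant.sf
ctl_2,Control,quant/ctl_2/quant.sf
kcl_1,KCl,quant/kcl_1/quant.sf
kcl_2,KCl,quant/kcl_2/quant.sf
"""

# Hypothetical contrast table: one contrast per line
CONTRASTS = """Experimental,Control
KCl,Control
"""


def validate_tables(sample_csv, contrast_csv):
    """Return a list of problems found in the two tables (empty if none)."""
    samples = list(csv.DictReader(io.StringIO(sample_csv)))
    contrasts = list(csv.DictReader(io.StringIO(contrast_csv)))
    errors = []

    # Sample names must be unique
    name_counts = Counter(row["Sample"] for row in samples)
    duplicates = sorted(name for name, n in name_counts.items() if n > 1)
    if duplicates:
        errors.append(f"duplicate sample names: {duplicates}")

    # Number of samples (replicates) per sample type
    per_type = Counter(row["Type"] for row in samples)
    for row in contrasts:
        exp, ctl = row["Experimental"], row["Control"]
        # Both types must match entries in the Type column
        for sample_type in (exp, ctl):
            if sample_type not in per_type:
                errors.append(f"type '{sample_type}' not in sample table")
        # At least one of the two types needs replicates
        if max(per_type[exp], per_type[ctl]) < 2:
            errors.append(f"contrast {exp} vs {ctl}: no replicates in either type")
    return errors


print(validate_tables(SAMPLES, CONTRASTS))
```

An empty list means that the tables satisfy the three checked conditions. The sketch does not check that the listed files exist.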

Output

  • RPKM table per gene and sample
  • Table with differentially expressed genes that fulfill the expression level, fold-change and FDR criteria; one per contrast
  • Joint table with all differentially expressed genes from all contrasts
  • MDS plot (a dimensionality reduction similar to PCA) of individual samples
  • Correlation heatmap of individual sample pseudo-counts
  • MA-plots (log-fold change vs. abundance) for each contrast, with differential genes highlighted
  • Correlation heatmap of sample type RPKM
  • Correlation heatmap of contrast log2-fold change (if more than one contrast)
  • Scatter plots of log2-fold changes, comparing all contrasts (if more than one)
  • Plot of the biological coefficient of variation from edgeR
  • Tables of tximport effective gene lengths and pseudocounts
  • Optionally, folder with input files for GO analysis per contrast
  • Log file
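
For readers unfamiliar with the RPKM unit used in the output tables, the standard definition (reads per kilobase of gene length per million reads in the library) can be sketched in Python as follows. This shows the general formula only. Which gene length and library size ProcessDE uses internally is not stated here, and the function name and numbers are illustrative assumptions.

```python
def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million reads in the library.

    Equivalent to count / (gene_length_bp / 1e3) / (library_size / 1e6).
    """
    return count * 1e9 / (gene_length_bp * library_size)


# Made-up example: 500 reads on a 2 kb gene in a library of 10 million reads
value = rpkm(count=500, gene_length_bp=2000, library_size=10_000_000)
print(value)
```

The example gives 25 RPKM: 500 reads over 2 kb is 250 reads per kilobase, divided by 10 million reads in the library.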

Example input

The example folder provides input files based on Salmon counts of RNA-seq samples from mouse neurons depolarized with KCl (Maze et al., Neuron (2015); GEO: GSE69807). The subfolder auxFiles contains a transcript-to-gene mapping file and a gene information file generated from the GENCODE vM21 annotation.

Author

If you have questions or encounter errors, please let me know by email or by raising an issue.

Ulrich Braunschweig, University of Toronto

email

Reference

If you use ProcessDE in published work, please cite the DOI of the release. The latest one is: DOI