
Single-Cell Python Dimensionality Reduction Project from BENG 185


asheorann/scPyDR

 
 


scPyDR

Single-Cell Python Dimensionality Reduction

scPyDR is a Python package containing tools for the dimensionality reduction and visualization of single-cell RNA sequencing data. The two tools are simpler versions of Scanpy's scanpy.pp.pca and scanpy.tl.umap.

Note: scPyDR's UMAP tool runs independently of scPyDR's PCA tool for benchmarking purposes. It uses scanpy.pp.pca, scanpy.pp.neighbors, and scanpy.tl.leiden to perform dimensionality reduction and clustering.
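For orientation, the Scanpy calls named in the note can be chained as in the sketch below. This is not scPyDR's own source: the function name embed_with_scanpy is hypothetical, the input is assumed to be an already filtered and normalized AnnData object, and the default of 50 components is taken from the description of the -u option.

```python
def embed_with_scanpy(adata, n_pcs=50):
    """Run PCA, neighbor graph, Leiden clustering, and UMAP on an AnnData object.

    Hypothetical helper for illustration. `adata` is assumed to be
    preprocessed already (filtered, normalized, log-transformed).
    """
    # Imported lazily so this module can be loaded without Scanpy installed.
    import scanpy as sc

    sc.pp.pca(adata, n_comps=n_pcs)      # stores the PCs in adata.obsm["X_pca"]
    sc.pp.neighbors(adata, n_pcs=n_pcs)  # k-nearest-neighbor graph on the PCs
    sc.tl.leiden(adata)                  # cluster labels in adata.obs["leiden"]
    sc.tl.umap(adata)                    # 2D embedding in adata.obsm["X_umap"]
    return adata
```

After the call, sc.pl.umap(adata, color="leiden") would plot the embedding colored by cluster.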

Currently, scPyDR only supports the analysis of a single set of scRNA-seq data. Please see scPyDR Options and File Formats for more information.

Installation | Prerequisites | Basic Usage | scPyDR Options | File Formats | Contributors

Installation

scPyDR can be installed with the following commands:

git clone https://github.com/isabelwang30/scPyDR.git
cd scPyDR
python setup.py install

Note: if you do not have root access, scPyDR can be installed locally with the following commands:

git clone https://github.com/isabelwang30/scPyDR.git
cd scPyDR
python setup.py install --user

If the install was successful, the command scpydr --help should show a useful help message.

Note: if you get an error that says the scpydr command was not found, you may need to include the script installation path in your $PATH variable before calling scpydr. You can do this with the following command, replacing <user> with your own username:

export PATH=$PATH:/home/<user>/.local/bin
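If it is unclear whether the scpydr script is discoverable, a short Python check can help. The sketch below is only an illustration: the helper name find_scpydr is hypothetical, and ~/.local/bin is assumed to be the user-level script directory, as in the export command above.

```python
import os
import shutil
from pathlib import Path


def find_scpydr(extra_dirs=()):
    """Return the full path of the scpydr executable, or None if it is not found.

    Searches the current PATH plus any extra directories (hypothetical helper).
    """
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs += [str(d) for d in extra_dirs]
    return shutil.which("scpydr", path=os.pathsep.join(dirs))


# Assumption: a --user install places scripts in ~/.local/bin.
user_bin = Path.home() / ".local" / "bin"
location = find_scpydr([user_bin])
print(location or f"scpydr not found; consider adding {user_bin} to PATH")
```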

Prerequisites

scPyDR requires the following Python libraries to be installed:

  • numpy
  • pandas
  • matplotlib
  • anndata
  • scanpy
  • umap-learn
  • leidenalg

Specific versions can be found in requirements.txt.

These prerequisites can be installed with the following pip command from the scPyDR directory:

pip install -r requirements.txt

Note: if you do not have root access, the packages can be installed locally with the following command:

pip install --user -r requirements.txt
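To confirm that the prerequisites are importable after installation, something like the following sketch can be used. The helper name missing_packages is hypothetical. Note that the umap-learn distribution is imported under the module name umap.

```python
import importlib.util

# Distribution name -> import name for the prerequisites listed above.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "anndata": "anndata",
    "scanpy": "scanpy",
    "umap-learn": "umap",
    "leidenalg": "leidenalg",
}


def missing_packages(required=REQUIRED):
    """Return the distribution names whose modules cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
print("All prerequisites found" if not missing else f"Missing: {', '.join(missing)}")
```

This only checks that each module can be located, not that the installed version matches requirements.txt.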

Basic Usage

The basic usage of scPyDR is as follows:

scpydr [DATADIR] [other options]

To run scPyDR's PCA function on a small test example (see benchmark/data in this repo):

scpydr benchmark/data

With the same test example, to run both scPyDR's PCA and UMAP functions:

scpydr benchmark/data -u

Using the -u flag, this should produce the outputs below:

  • data_pca.txt, a matrix of the original data projected onto the new principal components (PCs)
  • data_pca_plot.png, a PCA plot of the top 2 principal components, which explain the most variance in the data
  • data_umap_plot.png, a UMAP embedding that visualizes the original high-dimensional data as clusters in 2D

A subset of the first row of data_pca.txt (produced with cat data_pca.txt | head -n 1) is shown below:

-1.219651175379992569e+00 6.347092843840126397e-01 2.027305610399767477e-01 -2.803473111483573810e+00 -1.714817537238526146e+00 8.638381501631499371e-02 1.062654968245341780e-01 -2.534042243052277321e+00 3.133832692251309893e-01 -3.090336782067153454e-01 2.307744277491030171e+00 -1.864360114156933257e-01 -6.550224765112774294e-01 -5.911898444561517474e-01 -4.995681008765330278e-01 7.358296476335741687e-01 -3.882210850330847229e-02 2.325693856623583522e-01 -9.621457267116122480e-01 6.094084032488079616e-01 5.809506945609355100e-01 -5.377958204574021517e-01 -5.736244394191321039e-01 -1.053734274308523400e+00 1.202608425858034735e+00 1.099484793472616850e+00 -5.583842415508543100e-01 7.165800657872374302e-01 -6.406463601142169395e-01 1.739142153271877600e+00 1.017837341541046881e+00 3.104882071587936609e-01 9.257044271902127308e-01 4.916145353522260453e-01 1.388882058282096210e+00 1.150715506217820039e+00 9.995956940623004217e-01 7.362629783144015727e-01 3.783095504935519715e-01 8.314976429007380210e-01 -7.011155732554025244e-02 -1.476198849609394070e+00 1.402970631186186035e+00 -5.277030063514125402e-01 -4.900406427198634729e-01 -6.710851577836234316e-01 -2.326839880301993624e+00 1.348014237802151460e-01 4.496566583885364121e-01
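To inspect this file from Python without extra dependencies, a minimal reader might look like the sketch below. It assumes the file is plain text with whitespace-separated floating-point values, one matrix row per line, as in the excerpt above. The function name read_pca_matrix is hypothetical.

```python
import os


def read_pca_matrix(path):
    """Read a whitespace-delimited text matrix into a list of rows of floats."""
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(value) for value in fields])
    return rows


# Only runs if the output file from the benchmark example is present.
if os.path.exists("data_pca.txt"):
    matrix = read_pca_matrix("data_pca.txt")
    print(f"{len(matrix)} rows x {len(matrix[0])} columns")
```

With NumPy installed, numpy.loadtxt("data_pca.txt") should read the same layout directly into an array.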

The plots for PCA and UMAP from the benchmark dataset are shown below.

Note: because UMAP is stochastic, the global layout of the plot may differ between runs. In other words, the local clusters will look similar, but they may be placed in different locations on the plot.

scPyDR Options

scPyDR requires the following input:

  • [DATADIR]: Directory containing 10x Genomics scRNA-seq data files. See below for format specifications on the 10x Genomics data files.

Additional options include:

  • -o, --output [OUTDIR]: Output directory to store results. Default: working directory.
  • -g, --min_genes [INT]: Minimum number of genes expressed per cell. Default: 200.
  • -c, --min_cells [INT]: Minimum number of cells expressing a gene. Default: 5.
  • -cr, --min_cell_reads [INT]: Minimum number of reads per cell. Default: None.
  • -gc, --min_gene_counts [INT]: Minimum number of counts per gene. Default: None.
  • -ntop, --n_top_genes [INT]: Number of highly variable genes to keep. Default: 500.
  • -t, --target_sum [FLOAT]: Number of reads per cell for normalization. Default: 1e4.
  • -n, --nComp [INT]: Number of principal components for PCA. Default: for n data points and m features, there are min(n-1,m) PCs.
  • --version: Print the version of scPyDR.
  • -u, --umap: Run UMAP for dimensionality reduction and visualization of the top 50 principal components.
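The options above map naturally onto an argparse parser. The sketch below mirrors the documented flags and defaults for illustration only; it is not scPyDR's actual source. The names build_parser and default_n_components are hypothetical, and --version is omitted because the version string is not stated here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented scPyDR options (illustrative)."""
    p = argparse.ArgumentParser(prog="scpydr")
    p.add_argument("datadir", help="directory with 10x Genomics scRNA-seq files")
    p.add_argument("-o", "--output", default=".", help="output directory")
    p.add_argument("-g", "--min_genes", type=int, default=200)
    p.add_argument("-c", "--min_cells", type=int, default=5)
    p.add_argument("-cr", "--min_cell_reads", type=int, default=None)
    p.add_argument("-gc", "--min_gene_counts", type=int, default=None)
    p.add_argument("-ntop", "--n_top_genes", type=int, default=500)
    p.add_argument("-t", "--target_sum", type=float, default=1e4)
    p.add_argument("-n", "--nComp", type=int, default=None)
    p.add_argument("-u", "--umap", action="store_true")
    return p


def default_n_components(n_points, n_features):
    """Documented default for --nComp: min(n-1, m) principal components."""
    return min(n_points - 1, n_features)


args = build_parser().parse_args(["benchmark/data", "-u", "-ntop", "300"])
print(args.datadir, args.umap, args.n_top_genes)
```

For example, with 100 cells and 500 retained genes, the documented default gives min(100-1, 500) = 99 principal components.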

File Formats

The input files should be the features.tsv.gz, barcodes.tsv.gz, and matrix.mtx.gz files produced by the cellranger count pipeline of 10x Genomics Cell Ranger. GEO is a widely used repository for publishing and finding such count matrices from scRNA-seq data, and the benchmarking data for this package was found there. Read more on the count matrix file format here.

Note: To run scPyDR, the barcodes, features, and matrix files should be gzip-compressed and placed together in a single directory. Pass the name of this directory (as a str) as the input to scPyDR.
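Before running scPyDR, the layout described above can be verified with a few lines of Python. This is a sketch with hypothetical helper names (missing_10x_files, is_gzip). It checks only that the three files exist and begin with the gzip magic bytes, not that their contents are valid.

```python
from pathlib import Path

# File names expected in the input directory, as listed above.
REQUIRED_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_10x_files(datadir):
    """Return the required file names that are absent from datadir."""
    datadir = Path(datadir)
    return [name for name in REQUIRED_FILES if not (datadir / name).is_file()]


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes 1f 8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


missing = missing_10x_files("benchmark/data")
print("Input directory looks complete" if not missing else f"Missing: {missing}")
```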

Contributors

This repository was generated by Anushka Sheoran, Isabel Wang, and Monica Park with inspiration from the mypileup project demo and the projects of peers. If any issues should arise, please submit a pull request with any corrections or suggestions. Thank you!
