gssnng

Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq).

This package is part of the scverse ecosystem and works with Scanpy AnnData objects stored as h5ad files.

  • Read the Docs!

  • Create an AnnData object with smoothed counts. New Sept. 2024 ===>>> Open In Colab

  • Score cells using .gmt gene set files ===>>> Open In Colab

  • Score cells using the Decoupler/Omnipath style API ===>>> Open In Colab

  • Read the paper ===>>> gssnng

The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq.

Nearest neighbor graphs (NNG) are constructed based on user defined groups (see the 'groupby' parameter below). The defined groups can be processed in parallel, speeding up the calculations. For example, a NNG could be constructed within each cluster or jointly by cluster and sample. Smoothing can be performed using either the adjacency matrix (all 1s) or the weighted graph to give less weight to more distant cells.
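The difference between the two smoothing modes can be illustrated with a small NumPy sketch. This is not the package's implementation: the helper name `smooth_counts`, the choice to include each cell in its own neighborhood, and the row normalization (a weighted mean instead of a sum) are all assumptions made for illustration.

```python
import numpy as np

def smooth_counts(X, W, mode="adjacency"):
    """Average each cell's expression with that of its graph neighbors.

    X: (n_cells, n_genes) expression matrix.
    W: (n_cells, n_cells) non-negative neighbor weights, zero where there is no edge.
    Hypothetical helper for illustration only.
    """
    W = np.asarray(W, dtype=float)
    if mode == "adjacency":
        A = (W > 0).astype(float)      # all 1s: every neighbor counts equally
    elif mode == "connectivity":
        A = W.copy()                   # weighted graph: distant cells count less
    else:
        raise ValueError("mode must be 'adjacency' or 'connectivity'")
    np.fill_diagonal(A, 1.0)           # assumption: a cell is part of its own neighborhood
    A /= A.sum(axis=1, keepdims=True)  # assumption: rows sum to 1, giving a weighted mean
    return A @ np.asarray(X, dtype=float)

# Three cells, two genes. Cell 0 is close to cell 1 and far from cell 2.
X = np.array([[10.0, 0.0],
              [8.0, 2.0],
              [0.0, 10.0]])
W = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.0],
              [0.1, 0.0, 0.0]])

smoothed_adj = smooth_counts(X, W, "adjacency")
smoothed_con = smooth_counts(X, W, "connectivity")
```

In this toy graph, the adjacency mode pulls cell 0 equally toward both neighbors, while the connectivity mode keeps it closer to cell 1, its nearer neighbor.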

Scoring Functions

The list of scoring functions:

geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold.

singscore: Normalised mean (median centered) ranks (requires ranked data)

ssGSEA: Single sample GSEA based on ranked data.

rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set.

robust_std: Med( (x - med)/mad ), median of robust standardized values (recommend unranked).

mean_z: Mean( (x - mean)/stddev ), average z score (recommend unranked).

average_score: Mean ranks or counts

median_score: Median of counts or ranks

summed_up: Sum up the ranks or counts.
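To make three of the formulas above concrete, here is a minimal sketch on a single (smoothed) expression profile. It is one reading of the descriptions, not the package's code: the function bodies, the use of all genes in the profile to compute the center and scale, and the toy values are assumptions.

```python
import numpy as np

def mean_z(x, idx):
    """Mean( (x - mean)/stddev ) over the gene set; mean and stddev taken over all genes."""
    z = (x - x.mean()) / x.std()
    return z[idx].mean()

def robust_std(x, idx):
    """Med( (x - med)/mad ) over the gene set; med and mad taken over all genes."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.median((x[idx] - med) / mad)

def geneset_overlap(x, idx, threshold=0.0, fraction=False):
    """Number (or fraction) of gene set genes expressed past the threshold."""
    n = int((x[idx] > threshold).sum())
    return n / len(idx) if fraction else n

# One cell's profile over five hypothetical genes, and a two-gene set.
genes = ["A", "B", "C", "D", "E"]
x = np.array([0.0, 1.0, 2.0, 3.0, 9.0])
gene_set = ["D", "E"]
idx = [genes.index(g) for g in gene_set]

z_score = mean_z(x, idx)
robust_score = robust_std(x, idx)
n_expressed = geneset_overlap(x, idx, threshold=5.0)
```

The outlier gene E inflates the standard deviation used by mean_z, whereas robust_std relies on the median and MAD, which are less sensitive to single extreme values.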

Parameters

These parameters are used with the "scores_cells.with_gene_sets" function:

adata: AnnData object from scanpy.read_*
AnnData containing the cells to be scored

gene_set_file: str[path]
The gene set file in GMT format, with one gene set per line. See the format definition: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29

groupby: [str, list, dict]
Either a column label in adata.obs, in which case all of its categories are used, or a dict that specifies one group. SEE DESCRIPTION BELOW

smooth_mode: "adjacency", "connectivity", or "off"
Dictates how to use the neighborhood graph: adjacency weights all neighbors equally; connectivity weights close neighbors more.

recompute_neighbors: int
Whether neighbors should be recomputed within each group: 0 for no; >0 for yes, in which case the value specifies N.

score_method: str
which scoring method to use

method_params: dict
Python dict with parameters for the chosen scoring method (see Method parameters).

ranked: bool
whether the gene expression counts should be rank ordered

cores: int
number of parallel processes to work through groupby groups
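As an illustration of the gene_set_file and keyword parameters listed above, the sketch below writes a tiny GMT file, reads it back, and collects the remaining arguments in a dict. The reader `read_gmt`, the toy gene sets, and the 'louvain' column name are hypothetical; the GMT layout assumed is tab-separated with the set name, a description, then the genes. The adata argument is omitted because it would come from scanpy.

```python
import os
import tempfile

def read_gmt(path):
    """Parse a GMT file into {set_name: [genes]} (hypothetical helper)."""
    gene_sets = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            name, _description, *genes = fields
            gene_sets[name] = [g for g in genes if g]
    return gene_sets

with tempfile.TemporaryDirectory() as tmp:
    gmt_path = os.path.join(tmp, "toy.gmt")
    with open(gmt_path, "w") as fh:
        # One gene set per line: name, description, genes.
        fh.write("T_CELL\tna\tCD3D\tCD3E\tCD2\n")
        fh.write("B_CELL\tna\tCD79A\tMS4A1\n")
    gene_sets = read_gmt(gmt_path)

# Keyword arguments mirroring the parameter list above (values are examples only).
call_kwargs = dict(
    gene_set_file="toy.gmt",
    groupby="louvain",            # hypothetical column label in adata.obs
    smooth_mode="connectivity",   # "adjacency", "connectivity", or "off"
    recompute_neighbors=0,        # 0 keeps the existing neighbor graph
    score_method="singscore",
    method_params={"normalization": "theoretical"},
    ranked=True,
    cores=2,
)
```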

Method parameters

Some methods have additional options. They are passed as a dictionary, method_params={param_name: param_value}:

singscore: {'normalization': 'theoretical'}, {'normalization': 'standard'}
The singscore manuscript describes the theoretical method of standardization, which involves determining the theoretical maximum and minimum ranks for the given gene set.

rank_biased_overlap: {'rbo_depth': n} (n: int)
Here, n is the depth that is descended down the ranks; at each step, the overlap with the gene set is measured and added to the score.

ssGSEA: {'omega': 0.75}
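Written as Python dicts, the options in this section look like the sketch below. The checking helper is hypothetical and only knows the option names listed here; the depth value 200 is an arbitrary example, not a recommended setting.

```python
# Option dictionaries named in this section, written as key: value pairs.
METHOD_PARAMS_EXAMPLES = {
    "singscore": [{"normalization": "theoretical"}, {"normalization": "standard"}],
    "rank_biased_overlap": [{"rbo_depth": 200}],  # 200 is an arbitrary example depth
    "ssGSEA": [{"omega": 0.75}],
}

def check_method_params(score_method, method_params):
    """Reject option names not listed above for this method (hypothetical helper)."""
    known = set()
    for params in METHOD_PARAMS_EXAMPLES.get(score_method, []):
        known.update(params)
    unknown = set(method_params) - known
    if unknown:
        raise ValueError(f"unknown options for {score_method}: {sorted(unknown)}")
    return method_params

ssgsea_params = check_method_params("ssGSEA", {"omega": 0.75})
```

A check like this catches a common slip: writing {'omega', 0.75} with a comma creates a Python set, not a dict, so the option would never reach the scoring function.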