lcsc (linked charts for single cells)

What is it about?

This package lets you classify single cells based on nearest-neighbor smoothing, without relying on unsupervised learning methods. The documentation can be found here.

Most often, single-cell data are annotated by graph-based clustering followed by differential gene expression analysis between the clusters to find marker genes.

In this package we propose to start directly with a marker gene: calculate a smoothed fraction for the marker gene and select cells based on whether they are above or below a chosen threshold. The smoothed fraction is calculated over the k nearest neighbors by dividing the sum of the marker gene counts over the nearest neighbors by the sum of their total counts (“library size”). This fraction is subsequently raised to the power of γ as a kind of gamma correction (akin to a Box-Cox transformation). The threshold can then be chosen simply by looking at the distribution of this smoothed expression fraction. A screenshot is provided below.
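The description above can be written as a single formula. This is a sketch of the stated definition, not notation taken from the package: the symbols f_i (smoothed fraction of cell i), N_k(i) (the k nearest neighbors of cell i), c_{gj} (counts of marker gene g in cell j) and s_j (total counts of cell j) are assumptions introduced here for illustration.

```latex
\[
  f_i \;=\; \left(
    \frac{\sum_{j \in N_k(i)} c_{gj}}{\sum_{j \in N_k(i)} s_j}
  \right)^{\gamma}
\]
```

As a small worked example under these assumptions: if a cell has two neighbors with marker counts 2 and 4 and library sizes of 10 each, the pooled fraction is 6/20 = 0.3, and with γ = 0.5 the smoothed value is about 0.548. With γ = 1 the value is just the pooled fraction; choosing γ < 1 spreads out small fractions, which can make the distribution easier to threshold.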


You can install the development version from GitHub with:

# install.packages("devtools")


As in the Guided Clustering Tutorial by Seurat, we will use the PBMC dataset from 10X, containing 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data are made available by 10X here.

We start off with the standard workflow consisting of quality control, normalization, and linear/non-linear dimensional reduction using Seurat.

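library(Seurat)

# Load the 10X count matrix and create the Seurat object
pbmc.data <- Read10X(data.dir = "inst/extdata/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
# Percentage of counts from mitochondrial genes, stored as a meta data column
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")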
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- ScaleData(pbmc, features = rownames(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:30)
pbmc <- RunUMAP(pbmc, dims = 1:30)
pbmc
#> An object of class Seurat 
#> 13714 features across 2700 samples within 1 assay 
#> Active assay: RNA (13714 features, 2000 variable features)
#>  2 dimensional reductions calculated: pca, umap

To run the linked charts for single cells, we need to extract the following data from the Seurat object.

  1. Sparse count matrix (rows = genes, cols = cells)
counts <- GetAssayData(pbmc, "counts")
dim(counts)
#> [1] 13714  2700
  2. PC coordinates
pc_space <- Embeddings(pbmc, "pca")
dim(pc_space)
#> [1] 2700   50
  3. Non-linear dimensional reduction embedding (2D)
embedding <- Embeddings(pbmc, "umap")
dim(embedding)
#> [1] 2700    2
  4. Cell meta data
meta_data <- pbmc[[]]
head(meta_data)
#>                  orig.ident nCount_RNA nFeature_RNA percent.mt
#> AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
#> AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
#> AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
#> AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
#> AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
#> AAACGCACTGGTAC-1     pbmc3k       2163          781  1.6643551

Based on this meta data we generate the cells table, which contains only the needed information and is used to subset the counts according to the selected sample.

s = "orig.ident"  # name of the sample column
cells = tibble::tibble(
  id = rownames(meta_data),
  sample = meta_data[[s]]
)
head(cells)
#> Registered S3 method overwritten by 'cli':
#>   method     from         
#>   print.boxx spatstat.geom
#> # A tibble: 6 × 2
#>   id               sample
#>   <chr>            <fct> 
#> 1 AAACATACAACCAC-1 pbmc3k
#> 2 AAACATTGAGCTAC-1 pbmc3k
#> 3 AAACATTGATCAGC-1 pbmc3k
#> 4 AAACCGTGCTTCCG-1 pbmc3k
#> 5 AAACCGTGTATGCG-1 pbmc3k
#> 6 AAACGCACTGGTAC-1 pbmc3k
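
To illustrate how a table like cells can drive the per-sample subsetting of the counts, here is a minimal sketch on toy data. The objects toy_counts, toy_cells and selected_sample are made up for this example; how lcsc performs the subsetting internally is not shown in this README, so this is only an assumption about the general idea.

```r
# Toy stand-ins for `counts` (genes x cells) and `cells` (id, sample).
toy_counts <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("c1", "c2", "c3", "c4"))
)
toy_cells <- data.frame(
  id = c("c1", "c2", "c3", "c4"),
  sample = factor(c("s1", "s2", "s1", "s2")),
  stringsAsFactors = FALSE
)

# Pick the cell ids of one sample and keep only those columns of the counts.
selected_sample <- "s1"
ids <- toy_cells$id[toy_cells$sample == selected_sample]
sub_counts <- toy_counts[, ids, drop = FALSE]
```

Here sub_counts keeps all genes but only the columns c1 and c3, the cells that belong to sample s1.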

Additionally, we need to generate a neighborhood graph per sample that contains the k nearest neighbors of each cell. For this computation the lcsc package provides the run_nn function, which also takes as input the number of PC dimensions (dim) to consider.
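
To give an idea of what such a neighborhood graph contains, the following sketch computes exact k nearest neighbors by brute force on toy data. The helper knn_brute_force is hypothetical and not part of lcsc; run_nn may well use a different (for example approximate) search, and this sketch ignores the split by sample. It only mirrors the idea of returning neighbor indices and distances, restricted to the first dim PC dimensions.

```r
# Hypothetical helper, not part of lcsc: exact k nearest neighbors by brute force.
# Assumes k >= 2 and a cells x PCs matrix as input.
knn_brute_force <- function(pc_space, k, dim) {
  # Pairwise Euclidean distances, using only the first `dim` PC dimensions.
  d <- as.matrix(dist(pc_space[, seq_len(dim), drop = FALSE]))
  # For every cell, order all cells by distance and keep the k closest.
  idx <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))
  # Look up the distances that belong to the selected neighbors.
  dists <- t(sapply(seq_len(nrow(d)), function(i) d[i, idx[i, ]]))
  list(idx = unname(idx), dists = unname(dists))
}

set.seed(1)
toy_pcs <- matrix(rnorm(20 * 5), nrow = 20)  # 20 toy cells, 5 toy PCs
toy_nn <- knn_brute_force(toy_pcs, k = 3, dim = 4)
```

In this sketch each cell is its own first neighbor at distance 0, so the first column of toy_nn$idx is simply 1 to 20. Brute force needs memory quadratic in the number of cells, which is fine for a toy example but is one reason dedicated nearest-neighbor routines exist.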

Start the linked charts application after generating a nearest-neighbor graph per sample.


k = 50
nn <- run_nn(cells, pc_space, k=k, dim=30)
str(nn)
#> List of 2
#>  $ idx  : num [1:2700, 1:50] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ dists: num [1:2700, 1:50] 0 0 0 0 0 0 0 0 0 0 ...
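
To make the role of the neighbor indices concrete, here is a minimal sketch of the smoothed fraction on toy data. The helper smoothed_fraction is hypothetical and not part of lcsc, and the values gamma = 0.5 and the threshold 0.4 are arbitrary choices for illustration. The index matrix follows the layout suggested by the output above: one row per cell, k columns, with the cell itself as its first neighbor.

```r
# Toy data: 3 genes x 4 cells, every cell with a library size of 10.
counts <- matrix(
  c(0, 5, 5,
    2, 3, 5,
    4, 1, 5,
    0, 0, 10),
  nrow = 3,
  dimnames = list(c("CD68", "geneB", "geneC"), paste0("cell", 1:4))
)

# Toy neighbor indices (k = 2): row i lists the neighbors of cell i.
idx <- rbind(c(1, 2), c(2, 3), c(3, 2), c(4, 1))

# Hypothetical helper, not part of lcsc.
smoothed_fraction <- function(counts, idx, gene, gamma = 0.5) {
  totals <- colSums(counts)  # library size per cell
  marker <- counts[gene, ]   # marker gene counts per cell
  # Sum marker counts and library sizes over each cell's neighbors.
  marker_sum <- rowSums(matrix(marker[idx], nrow = nrow(idx)))
  total_sum <- rowSums(matrix(totals[idx], nrow = nrow(idx)))
  (marker_sum / total_sum)^gamma
}

frac <- smoothed_fraction(counts, idx, gene = "CD68", gamma = 0.5)
selected <- colnames(counts)[frac > 0.4]  # cells above the chosen threshold
```

With these toy numbers the pooled fractions are 0.1, 0.3, 0.3 and 0, so after taking the square root only cell2 and cell3 lie above the threshold of 0.4. Note that cell1 has no CD68 counts of its own but still gets a non-zero smoothed value from its neighbor, which is the point of the smoothing.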

Now we can finally start the linked charts application. k refers to the number of nearest neighbors used for smoothing; see the description of the smoothed fraction above.

       k=50 # Smoothing the expression over 50 nearest neighbors

The application would look like this after selecting macrophages using the smoothed expression of CD68.

