PersiST is an exploratory method for analysing spatial transcriptomics (and other spatial 'omics) data sets. Given a spatial transcriptomics data set containing expression data on multiple genes resolved to a shared set of co-ordinates, PersiST computes a single score for each gene, called the Coefficient of Spatial Structure (CoSS), that measures the amount of spatial structure in that gene's expression pattern. This score can be used for multiple analytical tasks, as we show below.
For this tutorial, we shall be looking at spatial transcriptomics data on a sample from the Kidney Precision Medicine Project[1].
import pandas as pd
df = pd.read_csv('data/kpmp_30-10125_spatial_expression.csv')
df.head()
| | x_position | y_position | TSPAN6 | TNMD | DPM1 | SCYL3 | C1orf112 | FGR | CFH | FUCA2 | ... | ENSG00000288156 | ENSG00000288162 | ENSG00000288172 | ENSG00000288187 | ENSG00000288234 | ENSG00000288253 | ENSG00000288302 | ENSG00000288380 | ENSG00000288398 | SOD2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.548810 | 0.834208 | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.00000 | 117.633220 | 0.00000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1058.6990 |
1 | 0.589610 | 0.809106 | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.00000 | 86.865880 | 173.73177 | 86.86588 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1737.3176 |
2 | 0.571644 | 0.166174 | 75.90709 | 0.0 | 75.907090 | 0.0 | 0.00000 | 0.000000 | 151.81418 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2201.3057 |
3 | 0.539074 | 0.714422 | 382.89725 | 0.0 | 127.632416 | 0.0 | 0.00000 | 127.632416 | 0.00000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1148.6918 |
4 | 0.570493 | 0.468741 | 82.88438 | 0.0 | 0.000000 | 0.0 | 82.88438 | 0.000000 | 82.88438 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1989.2250 |
5 rows × 26026 columns
This is a pandas DataFrame where the first two columns correspond to the well co-ordinates, and the remaining columns contain the expression of each gene. This is the format PersiST expects spatial transcriptomics data to come in.
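As a rough illustration of the layout described above, the following sketch builds a tiny toy DataFrame in the same shape and runs a few sanity checks on it. The helper `check_persist_input` is hypothetical and not part of PersiST, and the checks (numeric columns, no missing co-ordinates) are assumptions about what a reasonable input looks like, not documented requirements. The gene names and values are made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_persist_input(df: pd.DataFrame) -> None:
    """Raise ValueError if df does not look like the layout described above.

    Hypothetical helper (not part of PersiST); the checks are assumptions.
    """
    if df.shape[1] < 3:
        raise ValueError("expected two co-ordinate columns plus at least one gene column")
    # Every column (co-ordinates and gene expression) is assumed to be numeric.
    non_numeric = [c for c, dt in zip(df.columns, df.dtypes) if not is_numeric_dtype(dt)]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric[:5]}")
    # The first two columns are taken to be the co-ordinates.
    if df.iloc[:, :2].isna().any().any():
        raise ValueError("co-ordinate columns contain missing values")


# Toy data in the same layout: two co-ordinate columns, then one column per gene.
toy = pd.DataFrame({
    "x_position": [0.1, 0.2, 0.3],
    "y_position": [0.5, 0.5, 0.6],
    "GENE_A": [0.0, 12.5, 3.0],
    "GENE_B": [7.0, 0.0, 0.0],
})
check_persist_input(toy)  # returns None when all checks pass
```

Running a check like this before the long computation can catch a stray text column (for example a barcode column) early.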
Let's compute CoSS scores for all the genes in this sample (this will take about 10 to 20 minutes).
from compute_persistence import run_persistence
metrics = run_persistence(df)
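Because this step takes several minutes, it can be convenient to save the results to disk and reload them in later sessions. The sketch below is one way to do that with plain pandas. `load_or_compute` is a hypothetical helper, the cache path is an arbitrary choice, and the only assumption about `run_persistence` is that it returns a pandas DataFrame (as its use in this tutorial suggests).

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_compute(cache_path, compute: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached table if it exists, otherwise compute and cache it.

    Hypothetical helper (not part of PersiST).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # index_col=0 restores the row index written by to_csv below.
        return pd.read_csv(cache_path, index_col=0)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    result.to_csv(cache_path)
    return result


# Possible usage with the objects from this tutorial (path is an assumption):
# metrics = load_or_compute('results/coss_metrics.csv', lambda: run_persistence(df))
```

Deleting the cached file forces the scores to be recomputed.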
Let's take a look at the genes with the highest CoSS scores:
metrics = metrics.sort_values('CoSS', ascending=False)
metrics.iloc[:10,:]
| | gene | CoSS | ratio | gene_rank | possible_artefact | svg |
---|---|---|---|---|---|---|
16443 | IGLC1 | 0.141620 | 0.651803 | 1.0 | No | Yes |
16483 | IGHG1 | 0.114255 | 0.467722 | 2.0 | No | Yes |
5372 | MT1G | 0.105850 | 0.335738 | 3.0 | No | Yes |
10798 | DEFB1 | 0.103534 | 0.376595 | 4.0 | No | Yes |
12467 | CCL19 | 0.101025 | 0.649770 | 5.0 | No | Yes |
22516 | C17orf113 | 0.098336 | 0.574433 | 6.0 | No | Yes |
6980 | ALDOB | 0.096201 | 0.271491 | 7.0 | No | Yes |
5750 | PODXL | 0.095475 | 0.327815 | 8.0 | No | Yes |
1102 | SLC12A3 | 0.095306 | 0.352575 | 9.0 | No | Yes |
11812 | UMOD | 0.094709 | 0.401716 | 10.0 | No | Yes |
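The `svg` and `possible_artefact` columns shown above can be used to narrow the results to a working gene list. The sketch below does this on a small made-up table. It assumes both columns hold the strings 'Yes' and 'No', as displayed in the output above. Excluding possible artefacts is a choice made for illustration, not a recommendation taken from PersiST itself. Gene names and scores here are placeholders.

```python
import pandas as pd

# Made-up results table with the same column names as the PersiST output.
metrics_toy = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "CoSS": [0.14, 0.09, 0.08, 0.01],
    "possible_artefact": ["No", "Yes", "No", "No"],
    "svg": ["Yes", "Yes", "Yes", "No"],
})

# Keep genes called as SV that are not flagged as possible artefacts.
is_svg = metrics_toy["svg"] == "Yes"
not_artefact = metrics_toy["possible_artefact"] == "No"
svgs = metrics_toy.loc[is_svg & not_artefact].sort_values("CoSS", ascending=False)

svg_genes = svgs["gene"].tolist()
print(svg_genes)  # ['GENE_A', 'GENE_C']
```

With the real results, replacing `metrics_toy` by `metrics` should give the corresponding list for this sample.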
from plotting_utils import plot_many_genes
plot_many_genes(df, list(metrics.gene)[:20])
We can see that PersiST effectively surfaces those genes with notable spatial structure.
From the CoSS scores, PersiST automatically calls each gene as spatially variable (SV) or not (this is the 'svg' column in the results). Once the data set has been reduced to the comparatively small number of genes PersiST typically calls as SV, simple clustering methods such as hierarchical clustering were, in our experience, effective at picking out groups of co-expressed spatially variable genes (SVGs).
Here is one such group of genes, all expressed in the glomeruli of this particular sample [2].
plot_many_genes(df, ['PODXL', 'PTGDS', 'IGFBP5', 'TGFBR2', 'IFI27', 'HTRA1'], numcols=3)
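The clustering step mentioned above can be sketched as follows, using SciPy on a synthetic data set in the same layout as `df`. This is a minimal illustration, not the procedure used to find the glomerular group. The correlation distance, average linkage and the number of clusters are all assumptions chosen for the example, and the gene names and expression values are simulated.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Simulate wells at random positions with two distinct spatial patterns.
rng = np.random.default_rng(0)
n_wells = 200
x = rng.random(n_wells)
y = rng.random(n_wells)
right_half = (x > 0.5).astype(float)  # pattern 1: high on the right
top_half = (y > 0.5).astype(float)    # pattern 2: high at the top

toy = pd.DataFrame({"x_position": x, "y_position": y})
for i in range(3):
    toy[f"RIGHT_{i}"] = 10 * right_half + rng.normal(0, 0.5, n_wells)
    toy[f"TOP_{i}"] = 10 * top_half + rng.normal(0, 0.5, n_wells)

# In practice this would be the list of genes called as SV.
svg_genes = list(toy.columns[2:])

# Rows are genes, columns are wells, so distances are computed between genes.
expr = toy[svg_genes].to_numpy().T
distances = pdist(expr, metric="correlation")
tree = linkage(distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

# Collect gene names per cluster label.
groups = {}
for gene, label in zip(svg_genes, labels):
    groups.setdefault(int(label), []).append(gene)
print(groups)
```

Note that correlation distance is undefined for genes with constant expression, so such genes would need to be dropped first. A group found this way could then be passed to `plot_many_genes` for visual inspection.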
[1] Blue B Lake et al. “An atlas of healthy and injured cell states and niches in the human kidney”. In: Nature 619.7970 (2023), pp. 585–594.
[2] PersiST paper (not yet published)