Automated cell type annotation from rich, labeled reference data
Repository: openproblems-bio/task_label_projection
A major challenge for integrating single cell datasets is creating matching cell type annotations for each cell. One of the most common strategies for annotating cell types is referred to as “cluster-then-annotate” whereby cells are aggregated into clusters based on feature similarity and then manually characterized based on differential gene expression or previously identified marker genes. Recently, methods have emerged to build on this strategy and annotate cells using known marker genes. However, these strategies pose a difficulty for integrating atlas-scale datasets as the particular annotations may not match.
To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then project labels from the reference to the new dataset.
Here, we compare methods for annotation based on a reference dataset. The datasets consist of two or more samples of single cell profiles that have been manually annotated with matching labels. These datasets are then split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.
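The task described above can be illustrated with a deliberately simple classifier. The sketch below is not one of the benchmarked methods. It assumes each cell is represented by a short embedding vector (standing in for `obsm["X_pca"]`) and applies a nearest-centroid rule. The function names `fit_centroids` and `project_labels` and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(train_emb, train_labels):
    """Average the embedding vectors of the training cells of each cell type."""
    groups = defaultdict(list)
    for vec, label in zip(train_emb, train_labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def project_labels(centroids, test_emb):
    """Give each test cell the label of its closest centroid (Euclidean distance)."""
    return [
        min(centroids, key=lambda label: dist(vec, centroids[label]))
        for vec in test_emb
    ]


# Toy data (assumption): two cell types in a 2-D embedding.
train_emb = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
train_labels = ["B cell", "B cell", "T cell", "T cell"]
test_emb = [[0.1, 0.0], [4.8, 5.2]]

centroids = fit_centroids(train_emb, train_labels)
label_pred = project_labels(centroids, test_emb)
```

The returned list plays the role of `obs["label_pred"]` in the prediction file: one predicted label per test cell, in the same order as the test cells.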
name | roles |
---|---|
Nikolay Markov | author, maintainer |
Scott Gigante | author |
Robrecht Cannoodt | author |
flowchart TB
file_common_dataset("<a href='https://github.com/openproblems-bio/task_label_projection#file-format-common-dataset'>Common Dataset</a>")
comp_process_dataset[/"<a href='https://github.com/openproblems-bio/task_label_projection#component-type-data-processor'>Data processor</a>"/]
file_solution("<a href='https://github.com/openproblems-bio/task_label_projection#file-format-solution'>Solution</a>")
file_test("<a href='https://github.com/openproblems-bio/task_label_projection#file-format-test-data'>Test data</a>")
file_train("<a href='https://github.com/openproblems-bio/task_label_projection#file-format-training-data'>Training data</a>")
comp_control_method[/"<a href='https://github.com/openproblems-bio/task_label_projection#component-type-control-method'>Control method</a>"/]
comp_metric[/"<a href='https://github.com/openproblems-bio/task_label_projection#component-type-metric'>Metric</a>"/]
comp_method[/"<a href='https://github.com/openproblems-bio/task_label_projection#component-type-method'>Method</a>"/]
file_prediction("<a href='https://github.com/openproblems-bio/task_label_projection#file-format-prediction'>Prediction</a>")
file_score("<a href='https://github.com/openproblems-bio/task_label_projection#file-format-score'>Score</a>")
file_common_dataset---comp_process_dataset
comp_process_dataset-->file_solution
comp_process_dataset-->file_test
comp_process_dataset-->file_train
file_solution---comp_control_method
file_solution---comp_metric
file_test---comp_control_method
file_test---comp_method
file_train---comp_control_method
file_train---comp_method
comp_control_method-->file_prediction
comp_metric-->file_score
comp_method-->file_prediction
file_prediction---comp_metric
A subset of the common dataset.
Example file: resources_test/common/cxg_immune_cell_atlas/dataset.h5ad
Format:
AnnData object
obs: 'cell_type', 'batch'
var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
obsm: 'X_pca'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'
Data structure:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | Cell type information. |
obs["batch"] | string | Batch information. |
var["feature_id"] | string | Unique identifier for the feature, usually an ENSEMBL gene id. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
obsm["X_pca"] | double | The resulting PCA embedding. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | The organism of the sample in the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
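A slot listing like the one above lends itself to a small pre-flight check. The following is a minimal sketch, assuming only that the object exposes `obs`, `var`, `obsm`, `layers` and `uns` attributes that support `in` for key lookup (an AnnData object does, but here plain dicts stand in for it so the example has no dependencies). The helper `missing_slots` is hypothetical, and the two slots marked optional in the table are left out of the required set.

```python
from types import SimpleNamespace

# Required slots of the common dataset, copied from the table above.
# uns["dataset_url"] and uns["dataset_reference"] are optional and omitted.
REQUIRED_SLOTS = {
    "obs": ["cell_type", "batch"],
    "var": ["feature_id", "feature_name", "hvg", "hvg_score"],
    "obsm": ["X_pca"],
    "layers": ["counts", "normalized"],
    "uns": [
        "dataset_id", "dataset_name", "dataset_summary",
        "dataset_description", "dataset_organism", "normalization_id",
    ],
}


def missing_slots(adata, required=REQUIRED_SLOTS):
    """Return a 'slot["key"]' string for every required key that is absent."""
    missing = []
    for slot, keys in required.items():
        container = getattr(adata, slot)
        missing.extend(f'{slot}["{key}"]' for key in keys if key not in container)
    return missing


# Stand-in object (assumption): dicts mimic the AnnData slots; X_pca is left out.
toy = SimpleNamespace(
    obs={"cell_type": [], "batch": []},
    var={"feature_id": [], "feature_name": [], "hvg": [], "hvg_score": []},
    obsm={},
    layers={"counts": None, "normalized": None},
    uns={key: "" for key in REQUIRED_SLOTS["uns"]},
)
problems = missing_slots(toy)
```

Here `problems` lists the single absent slot, which makes it easy to fail early with a readable message before a component runs.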
A label projection dataset processor.
Arguments:
Name | Type | Description |
---|---|---|
--input | file | A subset of the common dataset. |
--output_train | file | (Output) The training data. |
--output_test | file | (Output) The test data (without labels). |
--output_solution | file | (Output) The solution for the test data. |
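The three outputs in the table above can be pictured as a split by batch. The sketch below is an assumption about the general shape of that step, not the repository's implementation: how test batches are actually chosen is not stated here, and the `cell_id` field and the function `split_by_batch` are hypothetical. It does follow the file formats in this document, where `cell_type` becomes `label` in the training data and the solution, and the test data carries no label.

```python
def split_by_batch(cells, test_batches):
    """Split cell records into train, test (unlabeled) and solution (labeled)."""
    train, test, solution = [], [], []
    for cell in cells:
        base = {"cell_id": cell["cell_id"], "batch": cell["batch"]}
        if cell["batch"] in test_batches:
            # Test cells keep their batch but lose their label.
            test.append(dict(base))
            # The solution holds the ground truth for the same cells.
            solution.append({**base, "label": cell["cell_type"]})
        else:
            train.append({**base, "label": cell["cell_type"]})
    return train, test, solution


# Toy records (assumption): four cells from two batches.
cells = [
    {"cell_id": "c1", "batch": "a", "cell_type": "B cell"},
    {"cell_id": "c2", "batch": "a", "cell_type": "T cell"},
    {"cell_id": "c3", "batch": "b", "cell_type": "T cell"},
    {"cell_id": "c4", "batch": "b", "cell_type": "B cell"},
]
train, test, solution = split_by_batch(cells, test_batches={"b"})
```

Keeping the test records and the solution records in the same order means a prediction can later be compared with the solution position by position.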
The solution for the test data
Example file: resources_test/task_label_projection/cxg_immune_cell_atlas/solution.h5ad
Format:
AnnData object
obs: 'label', 'batch'
var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
obsm: 'X_pca'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'
Data structure:
Slot | Type | Description |
---|---|---|
obs["label"] | string | Ground truth cell type labels. |
obs["batch"] | string | Batch information. |
var["feature_id"] | string | Unique identifier for the feature, usually an ENSEMBL gene id. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
obsm["X_pca"] | double | The resulting PCA embedding. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | The organism of the sample in the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
The test data (without labels)
Example file: resources_test/task_label_projection/cxg_immune_cell_atlas/test.h5ad
Format:
AnnData object
obs: 'batch'
var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
obsm: 'X_pca'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_organism', 'normalization_id'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
var["feature_id"] | string | Unique identifier for the feature, usually an ENSEMBL gene id. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
obsm["X_pca"] | double | The resulting PCA embedding. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_organism"] | string | The organism of the sample in the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
The training data
Example file: resources_test/task_label_projection/cxg_immune_cell_atlas/train.h5ad
Format:
AnnData object
obs: 'label', 'batch'
var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
obsm: 'X_pca'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_organism', 'normalization_id'
Data structure:
Slot | Type | Description |
---|---|---|
obs["label"] | string | Ground truth cell type labels. |
obs["batch"] | string | Batch information. |
var["feature_id"] | string | Unique identifier for the feature, usually an ENSEMBL gene id. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
obsm["X_pca"] | double | The resulting PCA embedding. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_organism"] | string | The organism of the sample in the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
Quality control methods for verifying the pipeline.
Arguments:
Name | Type | Description |
---|---|---|
--input_train | file | The training data. |
--input_test | file | The test data (without labels). |
--input_solution | file | The solution for the test data. |
--output | file | (Output) The prediction file. |
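Unlike a regular method, a control method also receives the solution, which makes it possible to bracket the range of achievable scores. The sketch below shows two illustrative controls under that assumption: one that copies the ground truth (an upper bound) and one that draws labels at random from the training label set (a lower bound). The function names and toy labels are hypothetical, and this document does not state which controls the repository actually ships.

```python
import random


def true_labels_control(solution_labels):
    """Positive control: return the ground truth itself as the prediction."""
    return list(solution_labels)


def random_labels_control(train_labels, n_test, seed=0):
    """Negative control: draw each prediction uniformly from the training label set."""
    rng = random.Random(seed)  # seeded so the control is reproducible
    pool = sorted(set(train_labels))
    return [rng.choice(pool) for _ in range(n_test)]


# Toy labels (assumption).
train_labels = ["B cell", "T cell", "T cell", "NK cell"]
solution_labels = ["T cell", "B cell", "T cell"]

positive = true_labels_control(solution_labels)
negative = random_labels_control(train_labels, n_test=len(solution_labels))
```

A metric evaluated on `positive` should reach its best value, while its value on `negative` gives a rough floor against which real methods can be compared.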
A label projection metric.
Arguments:
Name | Type | Description |
---|---|---|
--input_solution | file | The solution for the test data. |
--input_prediction | file | The prediction file. |
--output | file | (Output) Metric score file. |
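A metric component compares the solution with a prediction and emits a score. As a minimal sketch, the example below uses plain accuracy, chosen here only for illustration since this document does not list the metrics in use. The returned dictionary mirrors the `uns` fields of the score file format. The identifiers `toy_dataset`, `toy_normalization` and `toy_method` are placeholders, and both function names are hypothetical.

```python
def accuracy(solution_labels, predicted_labels):
    """Fraction of test cells whose predicted label equals the ground truth."""
    if len(solution_labels) != len(predicted_labels):
        raise ValueError("solution and prediction must cover the same cells")
    if not solution_labels:
        raise ValueError("cannot score an empty solution")
    hits = sum(t == p for t, p in zip(solution_labels, predicted_labels))
    return hits / len(solution_labels)


def score_record(dataset_id, normalization_id, method_id, metrics):
    """Build the score fields; metric_ids and metric_values have equal length."""
    ids = list(metrics)
    return {
        "dataset_id": dataset_id,
        "normalization_id": normalization_id,
        "method_id": method_id,
        "metric_ids": ids,
        "metric_values": [metrics[i] for i in ids],
    }


# Toy labels (assumption): three of four predictions are correct.
solution = ["B cell", "T cell", "T cell", "NK cell"]
prediction = ["B cell", "T cell", "B cell", "NK cell"]
score = score_record(
    "toy_dataset", "toy_normalization", "toy_method",
    {"accuracy": accuracy(solution, prediction)},
)
```

Building `metric_ids` and `metric_values` from the same dictionary keeps the two lists aligned by construction.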
A label projection method.
Arguments:
Name | Type | Description |
---|---|---|
--input_train | file | The training data. |
--input_test | file | The test data (without labels). |
--output | file | (Output) The prediction file. |
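The argument table above maps directly onto a command-line interface. The following is a hedged sketch using Python's standard `argparse` as a stand-in. How the repository actually wires these arguments into its components is not shown in this document, and the file names passed in the example are placeholders.

```python
import argparse


def build_parser():
    """Parser exposing the three arguments listed for a label projection method."""
    parser = argparse.ArgumentParser(description="A label projection method (sketch).")
    parser.add_argument("--input_train", required=True, help="The training data.")
    parser.add_argument("--input_test", required=True,
                        help="The test data (without labels).")
    parser.add_argument("--output", required=True, help="The prediction file.")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--input_train", "train.h5ad",
    "--input_test", "test.h5ad",
    "--output", "prediction.h5ad",
])
```

A method has no `--input_solution` argument, so the parser rejects it, which reflects the rule that methods never see the ground truth for the test cells.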
The prediction file
Example file: resources_test/task_label_projection/cxg_immune_cell_atlas/prediction.h5ad
Format:
AnnData object
obs: 'label_pred'
uns: 'dataset_id', 'normalization_id', 'method_id'
Data structure:
Slot | Type | Description |
---|---|---|
obs["label_pred"] | string | Predicted labels for the test cells. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["method_id"] | string | A unique identifier for the method. |
Metric score file
Example file: resources_test/task_label_projection/cxg_immune_cell_atlas/score.h5ad
Format:
AnnData object
uns: 'dataset_id', 'normalization_id', 'method_id', 'metric_ids', 'metric_values'
Data structure:
Slot | Type | Description |
---|---|---|
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["method_id"] | string | A unique identifier for the method. |
uns["metric_ids"] | string | One or more unique metric identifiers. |
uns["metric_values"] | double | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |