SummarizedExperiment API
We have basically the same goals as for (G)Ranges: make SummarizedExperiment easier to use. This will eventually be implemented in the “plyomics” package.
The trouble with SE is that it is a collection of tables, like a database, not just a simple single table. There is a lot of pressure to denormalize SE into a table, so that it folds into existing R infrastructure that operates on tables. Meanwhile, we do not want to discard the semantic notions around feature-by-sample assay data coupled with metadata on the features (rows) and samples (columns).
There is a real risk of breaking the constraints that support those semantics, and that risk is higher than when treating GRanges as a table. For example, while someone could attempt to drop the “start” column from a GRanges, it should be fairly obvious that doing so will either fail or downgrade the GRanges to a tibble. However, in the case of SE, a user could, for example, break the rectangularity of the assay matrix in subtle ways. How can we help the user avoid mistakes?
One way would be to force the user to be explicit about features and samples. Besides preserving data integrity, this would help communicate semantics.
filter_features(x, pathway=="Glycolysis")
filter_samples(x, treatment=="A")
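As a rough illustration of what being explicit buys us, here is a minimal base-R sketch. It assumes a toy stand-in for an SE (a plain list holding a matrix and two metadata tables) rather than a real SummarizedExperiment, and the function bodies are hypothetical, not the plyomics implementation. Each verb evaluates its condition against exactly one metadata table and subsets the assay and that table together.

```r
# Toy stand-in for a SummarizedExperiment (all names are hypothetical):
# a features-by-samples matrix plus one metadata table per dimension.
exprs <- matrix(1:12, nrow = 4,
                dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
x <- list(
  exprs = exprs,
  feature_data = data.frame(
    pathway = c("Glycolysis", "TCA", "Glycolysis", "TCA"),
    row.names = rownames(exprs), stringsAsFactors = FALSE
  ),
  sample_data = data.frame(
    treatment = c("A", "B", "A"),
    row.names = colnames(exprs), stringsAsFactors = FALSE
  )
)

# Evaluate `cond` against the feature metadata only, then subset the assay
# rows and the feature metadata together so they cannot drift apart.
filter_features <- function(x, cond) {
  keep <- eval(substitute(cond), x$feature_data, parent.frame())
  stopifnot(is.logical(keep), length(keep) == nrow(x$exprs))
  keep <- keep & !is.na(keep)
  x$exprs <- x$exprs[keep, , drop = FALSE]
  x$feature_data <- x$feature_data[keep, , drop = FALSE]
  x
}

# Same idea for the sample dimension: columns of the assay, rows of the
# sample metadata.
filter_samples <- function(x, cond) {
  keep <- eval(substitute(cond), x$sample_data, parent.frame())
  stopifnot(is.logical(keep), length(keep) == ncol(x$exprs))
  keep <- keep & !is.na(keep)
  x$exprs <- x$exprs[, keep, drop = FALSE]
  x$sample_data <- x$sample_data[keep, , drop = FALSE]
  x
}

glyco <- filter_features(x, pathway == "Glycolysis")
trt_a <- filter_samples(x, treatment == "A")
```

Because the dimension is named in the verb, a condition that refers to a sample variable inside filter_features() fails early instead of silently recycling against the wrong axis.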
The semantics of SE are actually more general than features and samples. For example, the metadata accessors are rowData() and colData(), but it would be confusing to use those terms when they are inconsistent with the tabular view we are presenting.
The denormalization really becomes a problem when working with assay data. Let us assume the user wants to keep only the features whose average expression is above some value. The assay values should then be implicitly grouped by feature; otherwise, the user needs to do more work and will make mistakes doing that work. If we stored the assay values in arrays (as they are now), then the user could do:
filter_features(x, rowMeans(exprs) > 5)
But if we wanted to abstract away the array notion, we might have:
filter_features(x, mean(exprs) > 5)
We have effectively grouped the assay data by feature, when filtering features. We could do the opposite when filtering samples, such as when restricting to samples where at least half of their values are non-zero:
filter_samples(x, mean(exprs != 0L) >= 0.5)
When filtering by sample, we group by sample. When filtering by feature, we group by feature.
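One way to picture the implicit grouping is the following base-R sketch. It is only an assumption about how such verbs could behave, written against a toy list holding a single matrix (metadata is left out for brevity). The condition is evaluated once per feature (or once per sample), with `exprs` bound to just that row (or column) of the assay.

```r
exprs <- matrix(c(0, 0, 9, 8,    # sample s1
                  0, 7, 0, 6,    # sample s2
                  0, 0, 0, 4),   # sample s3
                nrow = 4,
                dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
x <- list(exprs = exprs)  # toy stand-in, not a real SummarizedExperiment

# Evaluate `cond` once per feature, with `exprs` bound to that feature's
# values across all samples (one row of the matrix).
filter_features <- function(x, cond) {
  cond <- substitute(cond)
  env <- parent.frame()
  keep <- vapply(seq_len(nrow(x$exprs)), function(i) {
    isTRUE(eval(cond, list(exprs = x$exprs[i, ]), env))
  }, logical(1))
  x$exprs <- x$exprs[keep, , drop = FALSE]
  x
}

# Evaluate `cond` once per sample, with `exprs` bound to that sample's
# values across all features (one column of the matrix).
filter_samples <- function(x, cond) {
  cond <- substitute(cond)
  env <- parent.frame()
  keep <- vapply(seq_len(ncol(x$exprs)), function(j) {
    isTRUE(eval(cond, list(exprs = x$exprs[, j]), env))
  }, logical(1))
  x$exprs <- x$exprs[, keep, drop = FALSE]
  x
}

# Features with mean expression above 5.
high <- filter_features(x, mean(exprs) > 5)
# Samples where at least half of the values are non-zero.
dense <- filter_samples(x, mean(exprs != 0) >= 0.5)
```

The per-row loop is deliberately naive; a real implementation would presumably translate common summaries to vectorized calls such as rowMeans().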
This is similar to the OLAP cube approach, where the user is ultimately interested in a denormalized table, but the data model preserves the structure in order to interpret high-level queries.
The current RangedSummarizedExperiment API directly supports many range operations. Do we want to do the same, for convenience and consistency? Semantically, SE is a different beast than a GRanges: it is primarily experimental data, with metadata, some of which happen to be ranges. It is the data, not the ranges, that are primary. If we do not support direct range operations, then we will need accessors to get and set the ranges.
Ideally users do not have to directly construct these objects, because they are derived from instrument output, or curated resources. There really is no standard mechanism for communicating data corresponding to a SummarizedExperiment, but Bioconductor provides interfaces that map different data sources to SE objects.
Users should be able to restrict by row or column. As indicated above, the assay values should probably be grouped for convenience.
High-level API:
- Features: filter_features()
- Samples: filter_samples()
The API should support aggregation by feature (row) or sample (column). A common use case of feature aggregation is moving from transcripts to genes, or from genes to pathways. Sample aggregation generally happens through linear modeling, i.e., samples are converted to contrasts. There is nothing wrong with the samples in an SE corresponding to contrasts. But simple sample-level aggregations might also make sense, for example, over technical replicates.
We should support grouping and summarizing over either dimension:
- group_samples(), summarize_samples()
- group_features(), summarize_features()
If we explicitly group by feature (or sample), then the assay values should be implicitly and additionally grouped by sample (or feature). This lets the user write:
summarize_features(x, exprs=mean(exprs))
Instead of:
summarize_features(x, exprs=colMeans(exprs))
Or is that being too smart?
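To make the trade-off concrete, here is a small base-R sketch of the "smart" behavior. The signature is a simplified, hypothetical one (a bare matrix, a grouping vector, and a summary function) rather than the proposed summarize_features(x, ...) interface. Within each feature group, the summary function only ever sees the values of one sample, so mean() ends up doing the job of colMeans().

```r
exprs <- matrix(1:12, nrow = 4,
                dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
pathway <- c("Glycolysis", "TCA", "Glycolysis", "TCA")

# Hypothetical simplified signature. `fun` is called on one vector at a
# time: the values of a single sample within a single feature group.
summarize_features <- function(exprs, group, fun) {
  idx <- split(seq_len(nrow(exprs)), group)
  do.call(rbind, lapply(idx, function(i) {
    apply(exprs[i, , drop = FALSE], 2, fun)
  }))
}

# One row per pathway, one column per sample.
by_pathway <- summarize_features(exprs, pathway, mean)

# What the user would otherwise have to write for a single group.
explicit <- colMeans(exprs[pathway == "Glycolysis", , drop = FALSE])
```

The result stays rectangular (groups by samples), which is the main argument for the implicit second grouping: the user cannot accidentally collapse the sample dimension.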
We might want to merge:
- Feature metadata
- Sample metadata
- Experiments as a whole
This suggests having join variants for each of those.
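For the feature-metadata case, a join has to leave the number and order of features untouched, or the metadata no longer lines up with the assay rows. The base-R sketch below shows one way to get that guarantee; the function name left_join_features() and its arguments are hypothetical. It looks annotation rows up with match() instead of calling merge(), which may reorder or duplicate rows.

```r
feature_data <- data.frame(gene_id = paste0("g", 1:4),
                           stringsAsFactors = FALSE)
annot <- data.frame(gene_id = c("g3", "g1", "g9"),
                    symbol = c("C", "A", "Z"),
                    stringsAsFactors = FALSE)

# Hypothetical left join that can only add columns, never rows.
# Note: a column of `annot` with the same name as an existing metadata
# column would overwrite it in this sketch.
left_join_features <- function(feature_data, annot, by) {
  if (anyDuplicated(annot[[by]])) {
    stop("duplicate keys in `annot` would duplicate features")
  }
  # Position of each feature's key in `annot`; NA where there is no match.
  idx <- match(feature_data[[by]], annot[[by]])
  for (nm in setdiff(names(annot), by)) {
    feature_data[[nm]] <- annot[[nm]][idx]
  }
  feature_data
}

joined <- left_join_features(feature_data, annot, by = "gene_id")
```

Unmatched features get NA rather than being dropped, and annotation rows without a matching feature (g9 here) are ignored, so the assay never needs to change.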
Sorting could happen in either dimension, suggesting:
arrange_features()
arrange_samples()
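A sorting verb mainly has to apply one permutation to everything that shares the dimension. The following base-R sketch, again on a hypothetical toy list rather than a real SE, computes the ordering from the feature metadata and applies it to both the assay rows and the metadata rows.

```r
exprs <- matrix(1:6, nrow = 3,
                dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
x <- list(
  exprs = exprs,
  feature_data = data.frame(score = c(3, 1, 2), row.names = rownames(exprs))
)

# Hypothetical arrange_features(): `key` is evaluated against the feature
# metadata, and the same ordering is applied to assay and metadata.
arrange_features <- function(x, key) {
  ord <- order(eval(substitute(key), x$feature_data, parent.frame()))
  x$exprs <- x$exprs[ord, , drop = FALSE]
  x$feature_data <- x$feature_data[ord, , drop = FALSE]
  x
}

ascending <- arrange_features(x, score)
descending <- arrange_features(x, -score)
```

arrange_samples() would be the mirror image, permuting assay columns and sample metadata rows.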
SE consists of multiple components and it sometimes will be desirable to manipulate them independently. That means we will need accessors for components like:
- Feature metadata
- Sample metadata
- Specific assays
- Ranges
For example,
set_feature_data(x, feature_data(x) %>% mutate(...) %>% etc)
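If components can be pulled out and put back, the setter is the natural place to enforce the constraints discussed earlier. A minimal base-R sketch, assuming the same kind of toy list as a stand-in for an SE and hypothetical bodies for feature_data() and set_feature_data(): the setter accepts any modified table as long as it still describes the same features in the same order.

```r
exprs <- matrix(1:6, nrow = 3,
                dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
x <- list(
  exprs = exprs,
  feature_data = data.frame(
    pathway = c("Glycolysis", "TCA", "Glycolysis"),
    row.names = rownames(exprs), stringsAsFactors = FALSE
  )
)

feature_data <- function(x) x$feature_data

# Refuse any replacement table whose rows no longer line up with the assay.
set_feature_data <- function(x, value) {
  stopifnot(is.data.frame(value),
            identical(rownames(value), rownames(x$exprs)))
  x$feature_data <- value
  x
}

# Adding a column is fine.
fd <- feature_data(x)
fd$is_glycolysis <- fd$pathway == "Glycolysis"
x2 <- set_feature_data(x, fd)

# Dropping rows on the detached table is caught when it is put back.
attempt <- try(set_feature_data(x, fd[fd$is_glycolysis, , drop = FALSE]),
               silent = TRUE)
rejected <- inherits(attempt, "try-error")
```

This keeps free-form table manipulation available while making it hard to break rectangularity by accident; row-dropping operations belong in filter_features() instead.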