High memory usage when using assay() on large RaggedExperiments #25

Open
biobenkj opened this issue Feb 4, 2020 · 3 comments

biobenkj commented Feb 4, 2020

RaggedExperiment continues to rule for all our 'omics-related work! I noticed something interesting yesterday while running compactSummarizedExperiment(). When I attempt to access the names of the assays in a large RE:

```
# RaggedExperiment in question
> aaml
class: RaggedExperiment
dim: 36019710 1401
assays(2): pc compartments
rownames: NULL
colnames(1401): 813584_Dx 814465_Dx ... RO02776B RO02815
colData names(99): Timepoint Gender ... MLLT10 KMT2A

# size
> object_size(aaml)
1.01 GB

# names access
# high memory usage (100s of GB)
names(assay(aaml))

# names access
# near instant
assayNames(aaml)
```

it is either near-instantaneous with assayNames(), or requires 100s of GB of memory with names(assay(my_RE)). Do you know why this might be the case? I'll work on getting a smaller reproducible example if there is interest.

Thanks again for all that you do and RaggedExperiments!

Contributor

mtmorgan commented Feb 4, 2020

I believe, without actually checking, that the names are stored independently of the underlying data representation, and the cost is associated with adding names and hence duplicating the underlying data. If it's 'easy' to simulate the data for a reproducible example that would be great.
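
As a starting point for such a simulation, here is a minimal sketch of a toy RaggedExperiment. The helper `simulate_sample()`, the sample sizes, and the use of a single `pc` column are illustrative assumptions, not the reporter's actual data. It contrasts `assayNames()`, which only reads metadata, with `assay()`, which materializes the full matrix.

```r
library(GenomicRanges)
library(RaggedExperiment)

set.seed(1)

# Hypothetical helper: one sample with n random ranges and a numeric "pc" column
simulate_sample <- function(n) {
    GRanges(
        seqnames = "chr1",
        ranges = IRanges(start = sort(sample.int(1e6, n)), width = 100),
        pc = rnorm(n)
    )
}

# Five samples with 1000 ranges each (sizes chosen only for illustration)
samples <- replicate(5, simulate_sample(1000), simplify = FALSE)
names(samples) <- paste0("sample", seq_along(samples))
re <- RaggedExperiment(GRangesList(samples))

# Metadata-only access: no matrix is built
assayNames(re)

# Full conversion: one row per range across all samples, one column per sample
m <- assay(re, "pc")
dim(m)

# Each range has a value only in its own sample's column; the rest is NA
mean(is.na(m))
```

Scaling the number of samples and ranges per sample in this sketch should show how quickly the converted matrix grows relative to the stored object.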

Collaborator

LiNk-NY commented Feb 4, 2020

Hi Ben, @biobenkj
I'm glad to hear you are making use of this data representation!
The trick behind RaggedExperiment involves providing a matrix representation from a GRangesList object. In the background, the stored representation is a GRangesList, so accessing the metadata is relatively straightforward. When using assay(), the GRangesList representation has to be converted to a matrix. This involves creating quite a large sparse matrix from the mcols in the original GRangesList, which is a costly operation.
I agree, a minimal and reproducible example would be helpful. We'll see what we can do to increase the efficiency of this conversion. Thank you.
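
A rough back-of-the-envelope estimate is consistent with the reported "100s of GB". The sketch below assumes the converted matrix is held as an ordinary dense R matrix (8 bytes per double cell, 4 bytes per integer cell) and uses the dimensions printed in the original report; it ignores dimnames and any temporary copies made during conversion.

```r
# Dimensions reported for the RaggedExperiment in this issue
n_rows <- 36019710
n_cols <- 1401

# Number of cells in a dense matrix of that shape
n_cells <- n_rows * n_cols

# Approximate size in GB (10^9 bytes) for double and integer storage
size_gb <- round(c(double = n_cells * 8, integer = n_cells * 4) / 1e9, 1)
size_gb
# double: 403.7, integer: 201.9
```

Under these assumptions, a single dense copy already needs roughly 200 to 400 GB, compared with about 1 GB for the stored GRangesList-backed object.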

Collaborator

LiNk-NY commented Nov 12, 2020

@biobenkj Any updates on this?
Would a dgCMatrix representation help? Have you tested this?
We can create additional functionality to return this data representation.
If you can provide a reproducible example to help this move along, that would be great. Thanks!
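
To illustrate what a dgCMatrix representation could look like, here is a minimal sketch using `Matrix::sparseMatrix()` on toy data. The sample sizes and values are made up, and this is not how RaggedExperiment builds its assay; it only shows that storing one value per range avoids allocating every cell.

```r
library(Matrix)

# Toy stand-in: three samples contributing 3, 2 and 4 ranges (assumed sizes)
n_per_sample <- c(sample1 = 3L, sample2 = 2L, sample3 = 4L)
values <- c(0.1, 0.2, 0.3, 1.1, 1.2, 2.1, 2.2, 2.3, 2.4)

# One row per range; the column is the sample that range belongs to
i <- seq_along(values)
j <- rep(seq_along(n_per_sample), times = n_per_sample)

sp <- sparseMatrix(
    i = i, j = j, x = values,
    dims = c(length(values), length(n_per_sample)),
    dimnames = list(NULL, names(n_per_sample))
)

class(sp)       # "dgCMatrix"
length(sp@x)    # stored values: one per range
prod(dim(sp))   # cells a dense matrix of the same shape would hold
```

One caveat with this design: a dgCMatrix treats unstored cells as 0 rather than NA, so a true zero measurement and a missing range would need to be distinguished some other way.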
