
Capacity Error with readXenium() Function and Issues Combining Samples Using cbind() #48

Open
aasingh2 opened this issue Aug 5, 2024 · 7 comments


aasingh2 commented Aug 5, 2024

Hello,

Thank you for the package! I am encountering a couple of issues and would appreciate your guidance.

Issue 1: Capacity Error with readXenium()

I am working with multiple Xenium samples and can successfully read most of them using the readXenium() function. However, I receive the following error message for a few samples:
```
Error: Capacity error: array cannot contain more than 2147483646 bytes, have 2157274215
```

It seems that this error is related to the arrow package used to read Parquet files. Is there a way to resolve this issue, or a workaround that you would recommend?

Issue 2: Combining Samples with cbind()

I intended to use cbind() to combine multiple samples into a single SpatialFeatureExperiment. Unfortunately, I encountered an error because my samples have different numbers of rows (e.g., differing number of control probes, antisense probes, etc.). The error is as follows:

```
Error in FUN(X[[i]], ...) : column(s) 'ID' in 'mcols' are duplicated and the data do not match
```

The error disappears when I use cbind() after subsetting to only the genes on the Xenium gene panel (i.e., excluding the negative control probes, antisense probes, etc.). Is there a way to combine the samples without subsetting them first?
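For reference, a minimal sketch of that subsetting workaround, assuming `sfe1` and `sfe2` are SpatialFeatureExperiment objects already read in (the object names are hypothetical):

```r
# cbind() requires identical features (rownames/rowData) in every sample,
# so restrict each object to the genes shared by all samples first.
shared_genes <- Reduce(intersect, list(rownames(sfe1), rownames(sfe2)))
sfe_all <- cbind(sfe1[shared_genes, ], sfe2[shared_genes, ])
```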

Thank you for the help!

@alikhuseynov (Collaborator)

Hi, could you please include the output of traceback() when the error occurs?

Issue 1:

  • yes, it is arrow-related, but I have never seen this error when loading Xenium data. What version of XOA is this data from? I think we will need some time to tackle that.

Issue 2:

  • cbind() would only work if the genes are the same in all samples. We don't support a full join like merge() yet, but see this issue:
  • Merge method for SFE #29
  • also, using genes present in some samples but not in others would bias the downstream analysis in any case.
  • If you want to keep all background probes, renaming them to the same names across all samples would probably work.
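One way that renaming idea could look, assuming the background probes differ across samples only by a per-sample suffix (the suffix pattern and list name below are purely hypothetical):

```r
# Harmonize probe names so rownames match across samples; the regex
# strips an assumed per-sample suffix such as "_sample1" (hypothetical).
harmonize_names <- function(sfe) {
  rownames(sfe) <- sub("_sample\\d+$", "", rownames(sfe))
  sfe
}
sfe_list <- lapply(sfe_list, harmonize_names)
```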

@lambdamoses (Collaborator)

Actually, I have encountered the arrow error before. I haven't implemented it yet, but I can try modifying the code to split the transcript spots, write them to multiple smaller GeoParquet files, and then use DuckDB to concatenate the files.
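A rough sketch of the splitting idea (not the package's actual implementation), assuming `mols` is the large sf object of transcript spots that overflows Arrow's ~2 GB per-array limit; the chunk count is an assumption:

```r
library(sfarrow)

# Write the spots in chunks so each GeoParquet file stays well under
# Arrow's 2^31-byte array capacity.
n_chunks <- 4
chunk_id <- cut(seq_len(nrow(mols)), breaks = n_chunks, labels = FALSE)
for (i in seq_len(n_chunks)) {
  sfarrow::st_write_parquet(mols[chunk_id == i, ],
                            sprintf("tx_spots_part%02d.parquet", i))
}
```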

@lambdamoses (Collaborator)

Seurat-style full join is very problematic in that genes present in sample 1 but not in sample 2 are NAs in sample 2, but Seurat fills in 0, which is not an appropriate stand-in for NA.

aasingh2 (Author) commented Aug 6, 2024

Hi @alikhuseynov and @lambdamoses ,

Thank you for the quick reply. Regarding the issue with joining samples, I am now planning to perform quality control (QC) prior to merging and to remove the probes that are not on the gene panel. I believe this will allow cbind() to work correctly.

We are using XOA version 3.0.0.15. Below is the traceback of the error encountered in Issue 1:

```
> traceback()
9: Table__from_dots(dots, schema, option_use_threads())
8: arrow::Table$create(df)
7: sfarrow::st_write_parquet(mols, file_out)
6: withCallingHandlers(expr, warning = function(w) if (inherits(w,
       classes)) tryInvokeRestart("muffleWarning"))
5: suppressWarnings(sfarrow::st_write_parquet(mols, file_out))
4: formatTxSpots(file, dest = dest, spatialCoordsNames = spatialCoordsNames,
       gene_col = gene_col, z = z, phred_col = phred_col, min_phred = min_phred,
       split_col = split_col, flip = flip, z_option = z_option,
       file_out = file_out, BPPARAM = BPPARAM, return = TRUE)
3: addTxSpots(sfe, file = fn, sample_id = sample_id, spatialCoordsNames = spatialCoordsNames,
       gene_col = gene_col, z = z, phred_col = "qv", min_phred = min_phred,
       split_col = split_col, z_option = z_option, flip = flip,
       file_out = file_out, BPPARAM = BPPARAM)
2: addTxTech(sfe, data_dir, sample_id, tech = "Xenium", min_phred = min_phred,
       BPPARAM = BPPARAM, flip = (flip == "geometry"), file_out = file_out)
1: readXenium(data_dir = "./output-XETG00291__0018868__1802-2017__20240726__093125",
       sample_id = "1802_2017", image = "morphology_focus", segmentations = c("cell",
       "nucleus"), add_molecules = TRUE, file_out = "Xe_1802_2017")
```

I had the error for 5 out of the 13 samples that I tried to read in.

alikhuseynov (Collaborator) commented Aug 6, 2024

Thanks, so it is multimodal Xenium data, and those 5 samples probably have very large transcript (tx) files.
The error happens when writing the tx file; we will add support for splitting large tx files. Until then, if you don't need transcript coordinates in your analysis, you can set add_molecules = FALSE, or file_out = NULL (which reads but does not write the processed transcript data).
You can do QC per sample and subset before combining, but again, cbind() would only work if all samples have the same number of features with the same names.
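Applied to the readXenium() call from the traceback above, the workaround would look like this (arguments taken from the original call):

```r
# Skip the transcript-spot processing step that triggers the Arrow
# capacity error by not loading the molecules at all:
sfe <- readXenium(
  data_dir = "./output-XETG00291__0018868__1802-2017__20240726__093125",
  sample_id = "1802_2017",
  image = "morphology_focus",
  segmentations = c("cell", "nucleus"),
  add_molecules = FALSE  # or keep add_molecules = TRUE with file_out = NULL
)
```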

@lambdamoses (Collaborator)

We might not need to use DuckDB after all. I just found out that the sfarrow package can partition a large sf object before writing to GeoParquet: https://wcjochem.github.io/sfarrow/reference/write_sf_dataset.html
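A sketch of how that could look with `write_sf_dataset()`, assuming `mols` is the sf object of transcript spots and using an artificial chunk column for partitioning (the column name and chunk count are assumptions):

```r
library(dplyr)
library(sfarrow)

# write_sf_dataset() partitions by the dplyr grouping columns, writing
# one GeoParquet file per group under the output directory.
mols$chunk <- cut(seq_len(nrow(mols)), breaks = 4, labels = FALSE)
mols |>
  group_by(chunk) |>
  write_sf_dataset("tx_spots_dataset", format = "parquet")
```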

alikhuseynov (Collaborator) commented Sep 23, 2024

That's great! Splitting using partitioning would be the easiest way, I think.
https://wcjochem.github.io/sfarrow/articles/example_sfarrow.html#partitioned-datasets-1
