Error: Out of memory when using arrow #57
I've been able to simplify the issue to this code:

```r
library(dplyr)
covariates <- arrow::open_dataset("d:/temp/vignetteSccsArrow/covariates", format = "feather")
x <- covariates %>%
  arrange(.data$covariateId, .data$stratumId, .data$rowId)
arrow::write_dataset(x, "d:/temp/test.arrow", format = "feather")
# R crashes
```
In contrast, using RSQLite, I never exceed 0.5GB of memory usage, and it is about as fast as arrow (on this task):

```r
# Collect covariates to RAM, then save as a SQLite table:
covRam <- covariates %>% collect()
sqlite <- RSQLite::dbConnect(RSQLite::SQLite(), "d:/temp/test.sqlite")
DBI::dbWriteTable(conn = sqlite, name = "covariates", value = covRam)
DBI::dbDisconnect(sqlite)
rm(covRam)

# Open the SQLite database, sort the table, and process in batches:
sqlite <- RSQLite::dbConnect(RSQLite::SQLite(), "d:/temp/test.sqlite")
result <- DBI::dbSendQuery(sqlite, "SELECT * FROM covariates ORDER BY covariateId, stratumId, rowId;")
while (!DBI::dbHasCompleted(result)) {
  batch <- DBI::dbFetch(result, n = 100000)
  print(nrow(batch))
}
# Release the result set and close the connection:
DBI::dbClearResult(result)
DBI::dbDisconnect(sqlite)
```

@ablack3: This looks like a deal-breaker for the arrow implementation at this moment. There are actual studies I need to do that I can't do with the arrow implementation but that run just fine with the SQLite implementation. There's a blog post here on how they have a better sort implementation in Rust. Maybe they'll port that over to R one day?
I did look into this a bit. You are correct: the sort is done in memory. It accumulates the batches in memory and then sorts the data. It seems to be single-threaded as well. They have a warning to this effect in the docs for the node that does this computation, and the same warning appears elsewhere in the docs. One of their main C++ developers has an item on his personal roadmap for next year to address these "pipeline breakers".
I did explore some other libraries that do spilling, among them the Rust implementation. For example, with duckdb you can set a memory limit and a temp directory to spill to:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb(), read_only = FALSE)
dbExecute(con, "
  PRAGMA temp_directory='./tmp/';
  SET memory_limit='10GB';
  COPY (SELECT * FROM './parquetTestHuge/*.parquet' ORDER BY rowId, stratumId, covariateId)
  TO 'arrowTestResults/' (FORMAT PARQUET, PER_THREAD_OUTPUT 1, ALLOW_OVERWRITE 1);")
dbDisconnect(con, shutdown = TRUE)
```

I think we should at least make an issue in the arrow repo and explain our need for computations on larger-than-memory data, and that currently the sorting (and maybe other operations) is stopping us from adopting arrow. I think from their response we can see whether this is coming in the near future or not.
@egillax Thanks for looking into this issue. Would you be willing to open that issue on arrow and link it here? @schuemie I agree, this does seem like a deal breaker for Andromeda-arrow. So I guess we stick with SQLite for now and keep the arrow branch alive as experimental. Almost everything I do with arrow on my Mac crashes R, so it's not working great for me right now.
@ablack3 I think that's a good idea, at least with regard to this issue, which is a blocker for the transition. I made an issue in the arrow repo a month ago. The response was that this is one of the top priorities.
Agreed @ablack3. It is a pity given all the work we put into this, but I don't see us moving forward with arrow until this is resolved.
OK, sounds good. Hopefully it will get resolved at some point. I'll keep the arrow branch alive. The issue with using arrow on Mac is also a major challenge for me.
@schuemie @egillax I'm most of the way through the arrow course, and my initial thought is that we don't even need Andromeda anymore. We can just use the arrow package directly, and it works quite well. Out-of-memory sorting is a difficult problem, but I would like to suggest that we use arrow as a general-purpose tool, abandon the Andromeda wrapper, and implement the specific operations needed in each of the other packages that depend on Andromeda (e.g., find alternatives to a general-purpose out-of-memory sort). Is that crazy? Another idea (orthogonal to/independent of the suggestion above) is to use both arrow and duckdb and pass data back and forth to work around the limitations of each. Both support different subsets of dplyr, and moving data between them is almost zero cost.
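To make that last idea concrete, here is a minimal sketch of the hand-off, assuming the arrow, duckdb, and dbplyr packages are installed; the filter value and the rowNumber column are made up for illustration:

```r
library(dplyr)

# Open the dataset lazily with arrow (same hypothetical path as the examples above).
covariates <- arrow::open_dataset("d:/temp/vignetteSccsArrow/covariates", format = "feather")

result <- covariates %>%
  filter(covariateId == 1002) %>%        # this step runs in arrow
  arrow::to_duckdb() %>%                 # register the data with duckdb; no copy is made
  group_by(stratumId) %>%
  dbplyr::window_order(rowId) %>%
  mutate(rowNumber = row_number()) %>%   # window function: duckdb supports it, arrow's dplyr backend does not
  arrow::to_arrow() %>%                  # hand the result back to arrow as a record batch reader
  collect()
```

Both directions stream Arrow record batches rather than materializing a data frame, which is why moving between the two engines is cheap.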
I certainly like arrow a lot, but there are two reasons why I prefer keeping the Andromeda wrapper:

Do you have a suggestion for how to solve the out-of-memory problem using an alternative approach?
I'm fine with continuing development on Andromeda if you think the abstraction adds value. I think it would help if I spent some time really understanding exactly what operations we need. For out-of-memory sorting, I think duckdb can do this, so I might just add a special function that converts the data to duckdb, sorts it, and then converts it back to arrow. So maybe try something like this?
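The snippet that followed is not captured above; as a rough sketch of what such a helper could look like (the function name, defaults, and paths are hypothetical, and it assumes duckdb can spill to disk as in the earlier COPY example):

```r
library(dplyr)

# Hypothetical helper: sort a larger-than-memory feather dataset by routing it
# through duckdb, then write the sorted result back out with arrow.
sortDatasetViaDuckdb <- function(inputPath, outputPath, orderBy,
                                 memoryLimit = "8GB", tempDir = "./duckdb_tmp/") {
  con <- DBI::dbConnect(duckdb::duckdb())
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE))

  # Let duckdb spill intermediate sort runs to disk instead of holding them all in RAM.
  DBI::dbExecute(con, sprintf("PRAGMA temp_directory='%s';", tempDir))
  DBI::dbExecute(con, sprintf("SET memory_limit='%s';", memoryLimit))

  # Register the arrow dataset with duckdb, sort there, and stream the result back to arrow.
  arrow::open_dataset(inputPath, format = "feather") %>%
    arrow::to_duckdb(con = con) %>%
    arrange(!!!rlang::syms(orderBy)) %>%
    arrow::to_arrow() %>%
    arrow::write_dataset(outputPath, format = "feather")
}

# For example, with the paths and sort keys used earlier in this thread:
# sortDatasetViaDuckdb("d:/temp/vignetteSccsArrow/covariates", "d:/temp/sorted",
#                      orderBy = c("covariateId", "stratumId", "rowId"))
```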
It seems to work but takes a long time. I think my suggestion is to examine exactly where we need to sort and to question whether we really need to sort a very large unpartitioned dataset. If we can partition or filter before the sort, this would help speed up the operation. Yeah, my suggestion is to be more specific about the exact operations we need and try to avoid needing to sort a gigantic table; sorting will require at least one full pass through the data. What about replacing sqlite with duckdb? That might be an easier change than moving from sqlite to arrow.
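To illustrate how small that swap could be, here is an untested sketch of the RSQLite batching example from the top of this thread with duckdb dropped in; it reuses the covRam data frame and the hypothetical paths from above:

```r
# Same pattern as the RSQLite example, but with duckdb as the backend.
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "d:/temp/test.duckdb")
DBI::dbWriteTable(conn = con, name = "covariates", value = covRam)

# Sort the table and process the result in batches:
result <- DBI::dbSendQuery(con, "SELECT * FROM covariates ORDER BY covariateId, stratumId, rowId;")
while (!DBI::dbHasCompleted(result)) {
  batch <- DBI::dbFetch(result, n = 100000)
  print(nrow(batch))
}
DBI::dbClearResult(result)
DBI::dbDisconnect(con, shutdown = TRUE)
```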
The Andromeda abstraction/interface is closely tied to the database backend and does not map so well onto arrow. For example, you cannot "append rows" to a parquet file, so appendToTable will need to be dropped from Andromeda; you have to rewrite the entire file. You also cannot add indexes. So if we switch to arrow, all the HADES code that uses these Andromeda functions will need to change anyway. There is also no notion of closing a connection to an arrow file system dataset. I think if we want to use arrow, the exported Andromeda interface will need to change. duckdb, on the other hand, would be very similar to SQLite. All Andromeda references: https://github.com/search?q=org%3AOHDSI%20Andromeda%3A%3A&type=code

I still think it is worth exploring deprecating Andromeda, but it requires a careful review of all the larger-than-memory operations required by HADES. And leaving Andromeda alone (SQLite) is also a good option since it requires no additional work.
Out-of-memory sort code locations: I'm trying to identify all instances where out-of-memory sorting is used in HADES. https://github.com/OHDSI/CohortDiagnostics/blob/85bae994c71aafade7598c7a2a5bfb3eb619e5b1/R/CohortRelationship.R#L161

Well, this search is not so quick and easy to do, since I'm looking for places where arrange is called on an Andromeda object. But that's what I found so far.
It seems arrow doesn't play very nicely with dplyr as far as memory is concerned. When I call batchApply() on the results of a dplyr query, this currently causes the result of the dplyr query to be written to a temp Andromeda object. However, for large data objects this causes my system (with 64GB of RAM) to run out of memory. This is a bit unfortunate, because the whole idea of Andromeda is to handle objects that don't fit in memory. What is especially odd is that the data isn't even that big: if I collect the data, it is 17GB in memory.

I notice the same behavior when not writing to a temp Andromeda but using as_record_batch_reader() instead, as implemented in PR #56. It seems that executing the dplyr query is what causes the massive memory use. The query in my test case is a simple sort.
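For reference, here is a minimal sketch of the pattern described above: a lazy sorted query turned into a record batch reader and processed batch by batch. This is my reconstruction using the paths from the first comment, not the actual code from PR #56:

```r
library(dplyr)

covariates <- arrow::open_dataset("d:/temp/vignetteSccsArrow/covariates", format = "feather")

query <- covariates %>%
  arrange(covariateId, stratumId, rowId)

# Turn the (lazy) dplyr query into a stream of record batches and handle them one at a time.
reader <- arrow::as_record_batch_reader(query)
repeat {
  batch <- reader$read_next_batch()
  if (is.null(batch)) break
  print(nrow(batch))   # placeholder for the real per-batch work
}
```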