Incremental Save File stalls causing lost cached information #131
Looking at the internals, the incremental file is written with `readr::write_csv`: the existing records are read back in and the whole file is rewritten on every save. Using a csv at all is also problematic, because we get issues like this one (where a task fails mid-write) as well as race conditions, etc. But this is a pretty straightforward key-value store, and we should never need to hold the entire block in memory at all.
Simply appending to the csv file would be sufficient, provided the checksum process works for each task. I may be missing something, but reading the existing data at all seems completely unnecessary; all we care about is not overwriting it. Incremental mode is also not something that needs to be audited or transferred with study results, so using a csv should not be a hard requirement here.
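A minimal sketch of what an append-only record could look like, assuming a hypothetical helper name, column layout, and `recordKeepingFile` path; this is not the package's current API, just an illustration of appending rather than rewriting:

```r
library(readr)

# Hypothetical helper: append one completed task to the incremental file
# instead of reading the existing records and rewriting the whole file.
# The column names here are illustrative, not the package's actual schema.
recordTaskDone <- function(cohortId, checksum, recordKeepingFile) {
  newRow <- data.frame(cohortId = cohortId,
                       checksum = checksum,
                       timeStamp = Sys.time())
  # Write the header only when the file does not exist yet; otherwise
  # append a single row to the end of the csv.
  readr::write_csv(newRow,
                   file = recordKeepingFile,
                   append = file.exists(recordKeepingFile))
}
```

With an append-only record, a stall or crash mid-write can at most lose the row currently being written; everything previously flushed to disk stays intact.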
Noting this in case others run into this problem, which we observed when running a large number of cohorts through CohortGenerator via Strategus. It appears to be related to writing csv files with multiple threads, and as a work-around we've applied this code in the CohortGeneratorModule:

```r
# Setting readr.num_threads = 1 to prevent multi-threading for reading
# and writing csv files, which sometimes causes the module to hang on
# machines with multiple processors. This option is only overridden
# in the scope of this function.
withr::local_options(list(readr.num_threads = 1))
```

By setting the `readr.num_threads` option to 1, readr's multi-threaded csv reading and writing is disabled for the scope of the call. I'd like to repurpose this issue to provide a mechanism to store this incremental information in the database, so that it lives with the cohort table(s) rather than on the file system. We'll leave the file-based approach in v0.x of the package and aim to find a new home for file-based incremental operation tracking at some point.
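A rough sketch of what database-backed incremental tracking might look like, using DatabaseConnector (recent versions with the `databaseSchema` argument on `insertTable`). The table name `cg_incremental`, its columns, and the helper names are assumptions for illustration only, not a committed design:

```r
library(DatabaseConnector)

# Hypothetical: record a completed generation task in a table that lives
# alongside the cohort table(s) instead of in a csv on the file system.
# Assumes the cg_incremental table has already been created.
recordTaskDoneInDb <- function(connection, cohortDatabaseSchema, cohortId, checksum) {
  row <- data.frame(cohort_id = cohortId,
                    checksum = checksum,
                    time_stamp = Sys.time())
  DatabaseConnector::insertTable(connection = connection,
                                 databaseSchema = cohortDatabaseSchema,
                                 tableName = "cg_incremental",
                                 data = row,
                                 dropTableIfExists = FALSE,
                                 createTable = FALSE)
}

# Hypothetical: check whether a cohort was already generated with this checksum.
isTaskDoneInDb <- function(connection, cohortDatabaseSchema, cohortId, checksum) {
  sql <- "SELECT COUNT(*) FROM @schema.cg_incremental
          WHERE cohort_id = @cohort_id AND checksum = '@checksum';"
  count <- DatabaseConnector::renderTranslateQuerySql(connection,
                                                      sql,
                                                      schema = cohortDatabaseSchema,
                                                      cohort_id = cohortId,
                                                      checksum = checksum)
  count[1, 1] > 0
}
```

Because each completed task becomes a single row insert in the same database as the cohort tables, a stalled csv write on network-attached storage is taken out of the picture, and the tracking data travels with the cohort table(s).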
This may be a bit tricky to reproduce, as I hit this issue while generating 17k cohorts using CohortGenerator.
Symptoms
As the incremental file grows (it's just a csv with the cohort hash ID and generation), the `write_csv` call will eventually hang. The problem is that it hangs in the middle of writing the results, losing everything that wasn't flushed to disk during the save. Subsequent runs then think the original cohort list was never generated, so a cohort set of 8k finished cohorts reverts back to 1k finished cohorts because the csv write hung part-way through the file.

I suspect it's something to do with `readr`, but I haven't been able to find any reported issues there. The other possibility is that I am running in an EC2 instance, possibly writing to NAS-attached storage, and something flakes out while writing to disk. When I open the file (.csv) it doesn't report that another process is locking it; the file just seems to stop receiving I/O and the R process hangs. It is the hang-and-restart-R cycle, together with the partially written .csv, that leads to the lost state. The current work-around is to disable incremental mode so that it doesn't hang on writing the file.
Ideas for work-arounds:
If the problem is the volume of data being written to disk at once, perhaps the solution is to write the contents to the file through a series of batched append calls to the csv (see the sketch below).
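A minimal sketch of that batch-append idea, assuming the completed records are already in a data frame; the helper name and chunk size are arbitrary:

```r
library(readr)

# Hypothetical helper: write the incremental records in small appended
# batches rather than one large write_csv call, so a stall part-way
# through only affects the chunk currently being written.
appendInChunks <- function(records, recordKeepingFile, chunkSize = 1000) {
  if (nrow(records) == 0) {
    return(invisible(NULL))
  }
  starts <- seq(1, nrow(records), by = chunkSize)
  for (start in starts) {
    end <- min(start + chunkSize - 1, nrow(records))
    # Write the header only for a brand-new file; append afterwards.
    readr::write_csv(records[start:end, , drop = FALSE],
                     file = recordKeepingFile,
                     append = file.exists(recordKeepingFile))
  }
}
```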
That's basically all my ideas; I'm not sure of the exact nature of the failure, so it's hard to propose more specific ways to address it.