
Faux-atomic write #8

Open
clbarnes opened this issue Jun 15, 2022 · 2 comments

@clbarnes
Collaborator

Mentioned elsewhere but worth its own issue:

It would be really helpful for downstream processing if in-progress data were written to a file named differently from the final output, and that file were renamed at the end of the process. Currently, it's nontrivial to tell whether a file is still being written to or whether it's complete. For per-slice post-processing (e.g. converting to a sensible format), it would be nice to regularly run a script which just looks for files of the right name and deals with them.

This should be a relatively small change: the current software should just write to f"{currentname}.part" and then do rename(f"{currentname}.part", currentname) at the end of the process.
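A minimal sketch of the write-then-rename idea in Python, in case it helps. The function name write_then_rename and the bytes-in-memory interface are assumptions for illustration only; the acquisition software's real write path will look different.

```python
import os
from pathlib import Path


def write_then_rename(final_path, data: bytes) -> Path:
    """Write data to '<final_path>.part', then rename it to final_path.

    Hypothetical helper: the name and signature are illustrative only.
    """
    final_path = Path(final_path)
    # Append ".part" to the full name, so "x.dat" becomes "x.dat.part".
    part_path = final_path.with_name(final_path.name + ".part")

    with open(part_path, "wb") as f:
        f.write(data)
        # Push the bytes to disk before the file appears under its final name.
        f.flush()
        os.fsync(f.fileno())

    # os.replace is a single rename when both paths are on the same file
    # system, so a watcher sees either no final file or a complete one.
    os.replace(part_path, final_path)
    return final_path
```

A downstream script can then glob for *.dat and ignore anything ending in .part.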

@trautmane
Member

To address this at Janelia, a companion .keep file is written after each dat file write completes.

For example:

/cygdrive/d/UploadFlags/0522-09_ZF-Card^E^^Images^Zebrafish^Y2022^M07^D12^Merlin-6257_22-07-12_153254_0-0-1.dat^keep

is written for

/cygdrive/e/Images/Zebrafish/Y2022/M07/D12/Merlin-6257_22-07-12_153254_0-0-1.dat

I'm not sure how/where this is done since I'm just a consumer of this data, but it might be available to you already.
I like the simplicity of your suggested .part naming scheme - the .keep file names are horrid because they are embedding so much information into the name.

However, a few advantages to the .keep file approach are:

  • You can see what is done and ready for transfer in one place (you don't need to scan the filesystem).
  • You can remove the .keep files post-transfer to easily track what remains to be transferred/processed. This could also be accomplished by removing the .dat from the scope, but we have not done that.
  • The .keep file name also includes a data set or project name (0522-09_ZF-Card in the example above) that is useful for organizing the data post transfer. This could be pulled from .dat header data instead.
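For anyone consuming these, here is a rough sketch of splitting a .keep file name back into a project name and a .dat path. The encoding rules are guessed from the single example above (project name, then "^", then a drive letter followed by "^^", then path components separated by "^", then a trailing "^keep"); they are not a documented format, and parse_keep_name is a hypothetical helper.

```python
from typing import Tuple


def parse_keep_name(name: str) -> Tuple[str, str]:
    """Return (project, cygwin-style .dat path) for a keep-file name.

    Assumed layout, inferred from one example only:
        <project>^<DRIVE>^^<dir>^<dir>^...^<file>.dat^keep
    A project name containing "^" would break this parsing.
    """
    suffix = "^keep"
    if not name.endswith(suffix):
        raise ValueError(f"not a keep file name: {name!r}")
    body = name[: -len(suffix)]

    # Everything before the first "^" is taken to be the project name.
    project, _, encoded = body.partition("^")

    # "E^^Images^..." is read as drive "E" plus "^"-separated components.
    drive, sep, rest = encoded.partition("^^")
    if not sep or not drive:
        raise ValueError(f"no drive marker in keep file name: {name!r}")

    dat_path = "/cygdrive/" + drive.lower() + "/" + "/".join(rest.split("^"))
    return project, dat_path
```

Applied to the example name above, this gives the project 0522-09_ZF-Card and the .dat path shown in the example.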

I'm not a big fan of the .keep file setup, but I thought it was worth mentioning that it exists and how we currently use it.

@clbarnes
Collaborator Author

Thanks! That is another way of doing it.

A halfway house would be to have the part files kept in a parallel directory hierarchy (under an in_progress/ directory or something) and then moved into the complete/ hierarchy. So long as they're on the same file system, this should be just as fast, while keeping the first advantage you listed. There could be an equivalent processed/ hierarchy which satisfies the second advantage. I think the third property is probably best addressed another layer up, if possible.
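Roughly what I have in mind, as a sketch. The promote function and the root directory arguments are placeholders for illustration, not anything that exists in the current software.

```python
import os
from pathlib import Path


def promote(finished_file, in_progress_root, complete_root) -> Path:
    """Move a finished file to the same relative location under complete_root.

    Hypothetical helper: names and layout are illustrative only.
    """
    finished_file = Path(finished_file)
    # Preserve the sub-directory structure below the in-progress root.
    relative = finished_file.relative_to(in_progress_root)
    destination = Path(complete_root) / relative
    destination.parent.mkdir(parents=True, exist_ok=True)

    # A plain rename: cheap when both roots share a file system, and it
    # raises OSError rather than silently copying if they do not.
    os.replace(finished_file, destination)
    return destination
```

The same function would cover the complete/ to processed/ step by passing different roots.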
