Sarus file-lock acquisition times out with NFS4 shares #36

Open
matteoguglielmi opened this issue Jun 28, 2024 · 3 comments

@matteoguglielmi

Sarus file-lock acquisition times out for files in ~/.sarus when user homes are shared via NFS4.X (but works with NFS3):

[1229.860589957] [node01-6321] [Flock] [WARN] Still attempting to acquire lock on file "/cluster/raid/home/software/.sarus/metadata.json-ujjdgvwdlmqabqug" after 800 ms (will timeout after 1000 milliseconds)...

Thank you for any help.

@Madeeks
Member

Madeeks commented Jun 28, 2024

Hi @matteoguglielmi, thanks for reporting this.
Does NFS4.X support the flock(2) system call?
Sarus is using that function to implement atomic access to the local repository metadata file.
It's possible that some shared/networked file systems do not offer support (either partial or complete) for flock(2).
That would explain the inability to acquire a lock.
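
For readers unfamiliar with the pattern, here is a minimal Python sketch of flock(2)-based locking with a polling timeout. It is not Sarus's actual code: the function name `acquire_flock` and the `poll_ms` parameter are made up for illustration, and the 1000 ms default only mirrors the timeout shown in the warning above.

```python
import fcntl
import os
import time


def acquire_flock(path, exclusive=True, timeout_ms=1000, poll_ms=100):
    """Try to flock(2) `path`, polling until `timeout_ms` has elapsed.

    Returns the open file descriptor on success (closing it releases the
    lock), or None if the lock could not be obtained in time.
    Hypothetical helper, not part of Sarus.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fd, mode | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            pass  # lock is held elsewhere; retry until the deadline
        except OSError:
            # Any other error (e.g. the filesystem rejects the call)
            # is surfaced to the caller rather than retried.
            os.close(fd)
            raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll_ms / 1000.0)
```

Calling `acquire_flock(path, exclusive=False)` takes a shared lock, so several readers can hold it at once while a writer waits.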

@matteoguglielmi
Author

Hi @Madeeks, I found this thread, which seems to explain why flock(2) is not working with NFS4 and suggests using alternative functions. I also compiled and ran the C code posted there, and it produces the same error message on an NFS4 share.
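
The issue does not say which alternative functions the thread recommends. One commonly used option is POSIX record locking through fcntl(2), which Python exposes as `fcntl.lockf`. The sketch below assumes that is the kind of alternative meant; `try_posix_lock` is a hypothetical name.

```python
import errno
import fcntl
import os


def try_posix_lock(path):
    """Attempt a non-blocking exclusive POSIX record lock on `path`.

    Returns the open descriptor on success, or None if another process
    holds a conflicting lock. Hypothetical helper for illustration.
    """
    # An exclusive record lock needs a descriptor opened for writing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # lockf wraps fcntl(2) locking; with no length given it covers
        # the whole file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return None  # held by another process
        raise
    return fd
```

Note that POSIX record locks belong to the process rather than to the descriptor: two calls from the same process never conflict, and closing any descriptor for the file drops all of that process's locks on it.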

@Madeeks
Member

Madeeks commented Jul 2, 2024

Thanks for confirming the missing flock(2) support on NFS4.
In the short term, we can make the lock implementation selectable through a configuration parameter and reintroduce the old implementation, which is based on an explicit lockfile created by Sarus, as an alternative to flock-based locking.
The old code supports only exclusive locking (which causes noticeable delays when starting O(1000) containers) and its cleanup is not very robust, but it should work on any kind of filesystem.
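
As a rough illustration of the explicit-lockfile approach, the following Python sketch relies on exclusive file creation (`O_CREAT | O_EXCL`). It is an assumption about how such a scheme typically works, not the old Sarus code; the function names and the `.lock` suffix are invented for the example.

```python
import os
import time


def acquire_lockfile(target, timeout_ms=1000, poll_ms=100):
    """Create `<target>.lock` exclusively, polling until the timeout.

    Returns the lockfile path on success, or None on timeout.
    Hypothetical helper; the ".lock" suffix is an assumption.
    """
    lock_path = target + ".lock"
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        try:
            # O_EXCL makes creation fail if the lockfile already exists,
            # so only one caller can succeed at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            os.close(fd)
            return lock_path
        except FileExistsError:
            pass  # someone else holds the lock; retry
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_ms / 1000.0)


def release_lockfile(lock_path):
    """Release the lock by removing the lockfile."""
    os.unlink(lock_path)
```

The sketch shows both limitations mentioned above: every holder is exclusive, so concurrent readers serialize, and a process that dies before calling `release_lockfile` leaves a stale lockfile behind that blocks everyone until it is removed.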
