This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Duplicate transfer request within the same job prevents cleanup of the job's logical volume #1003

Open
dlherms-ibm opened this issue Apr 30, 2021 · 1 comment

Comments

@dlherms-ibm
Contributor

When the same transfer definition is submitted twice within a job with the same handle and contribid, one of the transfers succeeds and the other fails. That is expected. However, after both operations have run to completion, file locks remain on the compute node. These file locks prevent both the unmount of the path to the job's logical volume and the deletion of that logical volume. The workaround is to end bbProxy for the compute node, manually unmount the path to the logical volume, and then restart bbProxy. On restart, bbProxy automatically deletes the orphaned logical volume.

Although performing the same transfer twice with the same handle and contribid is an application error, the system should not be left in a state where the job's logical volume cannot be properly cleaned up.

@dlherms-ibm dlherms-ibm added this to the CS21B milestone Apr 30, 2021
@dlherms-ibm dlherms-ibm self-assigned this Apr 30, 2021
@dlherms-ibm
Contributor Author

So, the problem is that when the second start transfer runs, it opens the source file, builds a file handle, and inserts it into the file handle registry. That insert overlays the first file handle, which leaks the first fd. The failing second transfer then removes the file handle it just inserted into the registry. When the first transfer actually completes, it cannot find any file handle to close for the file.
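To make the sequence concrete, here is a minimal Python sketch of the overlay, not the actual bbProxy code. The dict-based registry, the `(handle, contribid, source_index)` key, and the function names are all assumptions made for illustration, as is the assumption that the failing transfer closes the descriptor it removes.

```python
import os
import tempfile


def start_transfer(registry, key, path):
    """Open the source file and insert its fd into the registry unconditionally."""
    fd = os.open(path, os.O_RDONLY)
    registry[key] = fd  # overlays any existing entry without closing it
    return fd


def finish_transfer(registry, key):
    """Remove and close the registered fd; return False if no entry is found."""
    fd = registry.pop(key, None)
    if fd is None:
        return False
    os.close(fd)
    return True


def is_open(fd):
    """Report whether fd still refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def demonstrate_leak():
    registry = {}  # hypothetical stand-in for the file handle registry
    key = (1, 0, 0)  # hypothetical (handle, contribid, source_index)
    with tempfile.NamedTemporaryFile() as src:
        fd1 = start_transfer(registry, key, src.name)
        start_transfer(registry, key, src.name)  # duplicate overlays fd1
        # The failing duplicate removes (and here closes) the entry it inserted.
        second_cleanup = finish_transfer(registry, key)
        # The first transfer completes but finds nothing in the registry.
        first_cleanup = finish_transfer(registry, key)
        leaked = is_open(fd1)  # fd1 was never closed
        os.close(fd1)  # tidy up so the sketch itself does not leak
    return second_cleanup, first_cleanup, leaked
```

In this model the leaked descriptor is what would keep the file, and therefore the mount, busy.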

I added info-level logging for the fh and fh registry logic, and the following bbProxy log shows what is happening. The entry at timestamp 2021-04-29 13:10:25.802870 shows the overlay, and the entry at timestamp 2021-04-29 13:10:28.938363 shows where we can't find the file handle upon completion of the actual transfer.

Issue1003_Problem log.pdf

The solution being pursued is to check whether the same file handle entry already exists in the registry before inserting it. If it is a duplicate, the operation errors out at that point, so that we do not leak the file handle and file descriptor.
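A rough Python sketch of that check follows. It is an illustration under assumptions, not the committed fix: the dict-based registry, the key layout, `DuplicateFileHandleError`, and the function names are all hypothetical.

```python
import os
import tempfile


class DuplicateFileHandleError(Exception):
    """Hypothetical error raised when a registry entry already exists."""


def start_transfer_checked(registry, key, path):
    """Open the source file, but refuse to overlay an existing registry entry."""
    fd = os.open(path, os.O_RDONLY)
    if key in registry:
        os.close(fd)  # close the new fd so the rejected duplicate leaks nothing
        raise DuplicateFileHandleError(f"entry already registered for {key}")
    registry[key] = fd
    return fd


def finish_transfer(registry, key):
    """Remove and close the registered fd; return False if no entry is found."""
    fd = registry.pop(key, None)
    if fd is None:
        return False
    os.close(fd)
    return True


def is_open(fd):
    """Report whether fd still refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def run_checked_demo():
    registry = {}  # hypothetical stand-in for the file handle registry
    key = (1, 0, 0)  # hypothetical (handle, contribid, source_index)
    with tempfile.NamedTemporaryFile() as src:
        fd1 = start_transfer_checked(registry, key, src.name)
        try:
            start_transfer_checked(registry, key, src.name)
            duplicate_rejected = False
        except DuplicateFileHandleError:
            duplicate_rejected = True
        # The first transfer still owns its entry and can close it normally.
        first_cleanup = finish_transfer(registry, key)
        fd1_still_open = is_open(fd1)
    return duplicate_rejected, first_cleanup, fd1_still_open
```

Because the duplicate is rejected before it touches the registry, the first transfer's entry survives and its descriptor is closed on completion.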

dlherms-ibm added a commit to dlherms-ibm/CAST that referenced this issue Apr 30, 2021
tgooding added a commit that referenced this issue Jun 21, 2021