Fix to check bytes accessed by MPIIO xfers and retry if possible #501

Open
wants to merge 1 commit into
base: main

mtmoore-hpe

Currently the MPIIO backend assumes every completed read or write call accessed all requested bytes, and it always returns hints->transferSize bytes. This allows silent partial writes that only become visible during verification, and it can skew reported bandwidth, since all bytes are assumed transferred when the rate is calculated.

This patch follows the retry model in the POSIX backend, looping up to MAX_RETRY times, but it retries the full request instead of only the remaining bytes. The full I/O is retried for simplicity, and because it is ambiguous whether a short count always means that exactly the first N bytes were accessed across all MPI implementations, including with strided datatypes. We've only observed zero-byte partial transfers in practice.
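The all-or-nothing retry described above can be sketched in plain C without MPI. This is only an illustration, not the patch itself: `SKETCH_MAX_RETRY`, `xfer_fn`, `xfer_full_retry`, and the `flaky_xfer` test double are hypothetical names, and the callback stands in for an MPIIO read or write whose accessed byte count would, in real code, presumably come from something like `MPI_Get_count` on the returned status.

```c
#include <assert.h>

/* Hypothetical cap; stands in for the MAX_RETRY limit mentioned above. */
#define SKETCH_MAX_RETRY 5

/* Hypothetical transfer callback: tries to move `len` bytes and returns
 * the number of bytes actually accessed (possibly 0 or a short count). */
typedef long long (*xfer_fn)(void *ctx, char *buf, long long len);

/* All-or-nothing retry: reissue the WHOLE request until one attempt moves
 * all `len` bytes or the retry budget is spent. Returns the byte count of
 * the last attempt, so a persistent short transfer stays visible. */
long long xfer_full_retry(xfer_fn xfer, void *ctx, char *buf, long long len)
{
    long long done = 0;

    assert(len >= 0);
    for (int attempt = 0; attempt < SKETCH_MAX_RETRY; attempt++) {
        done = xfer(ctx, buf, len);   /* same buffer, same length each time */
        if (done == len)
            break;
    }
    return done;
}

/* Test double: reports 0 bytes for the first `fail_count` calls, then
 * reports a complete transfer. */
struct flaky_ctx {
    int fail_count;
    int calls;
};

long long flaky_xfer(void *ctx, char *buf, long long len)
{
    struct flaky_ctx *f = ctx;

    (void)buf;
    f->calls++;
    return (f->calls <= f->fail_count) ? 0 : len;
}
```

With a `flaky_ctx` that fails twice, `xfer_full_retry` makes three calls and returns the full length. With one that never succeeds, it stops after `SKETCH_MAX_RETRY` calls and returns the last short count.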

Collective MPIIO transfers aren't retried. A short access on any one rank would require all ranks to retry the collective call. Supporting that would require an additional synchronous MPI call on every transfer so that each rank learns every other rank's transferred byte count. Instead, a short collective MPIIO call returns the actual bytes transferred without retrying.
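To illustrate why collective retry needs that extra exchange, here is a minimal MPI-free model of the decision every rank would have to reach together. It is a sketch under assumptions, not part of the patch: the function names are hypothetical, and the array of per-rank counts stands in for what a reduction such as `MPI_Allreduce` with `MPI_MIN` would make available on every rank.

```c
#include <assert.h>

/* Models the result of a MIN reduction over every rank's byte count.
 * In an MPI program each rank holds only its own count, so obtaining
 * this value costs one extra synchronous collective per transfer. */
long long min_bytes_across_ranks(const long long *bytes, int nranks)
{
    long long m;

    assert(nranks > 0);
    m = bytes[0];
    for (int r = 1; r < nranks; r++) {
        if (bytes[r] < m)
            m = bytes[r];
    }
    return m;
}

/* All ranks must make the same choice: if any single rank came up short,
 * every rank has to re-enter the collective call. Returns 1 for retry. */
int collective_needs_retry(const long long *bytes, int nranks,
                           long long expected)
{
    return min_bytes_across_ranks(bytes, nranks) < expected;
}
```

For example, with per-rank counts `{1024, 1024, 0, 1024}` and an expected size of 1024, the minimum is 0, so all four ranks would need to retry even though three of them completed their access.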

@adilger
Contributor

adilger commented Dec 13, 2024

Is there a reason not to do the incremental write retry instead of doing a full overwrite each time? The benefit of the incremental write is that it makes forward progress on every attempt. If the write is deliberately being split for some underlying reason (e.g. it is too large for the underlying storage), then a full retry may write the first part repeatedly without ever attempting the later parts.
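For comparison, the incremental approach raised here can be sketched as below. This is a hedged illustration in plain C, not code from either backend: `SKETCH_MAX_RETRY`, `part_xfer_fn`, `xfer_incremental_retry`, and the `capped_xfer` test double are hypothetical, and the sketch assumes a short count always means the first `rc` bytes of the request were accessed, which is exactly the assumption in question for strided datatypes and file views.

```c
#include <assert.h>

/* Hypothetical cap on the number of transfer calls per request. */
#define SKETCH_MAX_RETRY 5

/* Hypothetical transfer callback: returns bytes accessed, or < 0 on error. */
typedef long long (*part_xfer_fn)(void *ctx, char *buf, long long len);

/* Incremental retry: after a short transfer, advance past the bytes that
 * were accessed and request only the remainder. ASSUMES a short count
 * covers a contiguous prefix of the request. Returns total bytes moved. */
long long xfer_incremental_retry(part_xfer_fn xfer, void *ctx,
                                 char *buf, long long len)
{
    long long remaining = len;
    char *p = buf;
    int attempts = 0;

    assert(len >= 0);
    while (remaining > 0 && attempts < SKETCH_MAX_RETRY) {
        long long rc = xfer(ctx, p, remaining);

        attempts++;
        if (rc < 0)
            break;                /* hard error: stop retrying */
        assert(rc <= remaining);
        p += rc;                  /* forward progress is kept */
        remaining -= rc;
    }
    return len - remaining;
}

/* Test double: moves at most `cap` bytes per call, modelling storage that
 * splits large requests. */
struct capped_ctx {
    long long cap;
    int calls;
};

long long capped_xfer(void *ctx, char *buf, long long len)
{
    struct capped_ctx *c = ctx;

    (void)buf;
    c->calls++;
    return (len < c->cap) ? len : c->cap;
}
```

With a 4-byte cap, a 10-byte request completes in three calls (4 + 4 + 2), whereas a full-request retry against the same callback would get 4 bytes on every attempt and never finish.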

@mtmoore-hpe
Author

mtmoore-hpe commented Dec 13, 2024

> Is there a reason not to do the incremental write retry instead of doing a full overwrite each time?

I agree it would be ideal to simply retry remaining bytes if counts other than 0 or expected length occur in the wild.

It wasn't clear whether it is safe to assume, for any given MPI implementation, that a short count always means only the first N bytes were accessed, particularly in the --mpiio.useStridedDatatype and --mpiio.useFileView case. In that case the initial Xfer call accesses hints->segmentCount * hints->transferSize bytes of data in a single MPIIO call, and the backend reports that a single segment's worth of data was accessed. Subsequent calls immediately return hints->transferSize bytes, since the access of all segments already happened in the first call (src/aiori-MPIIO.c:468). Any partial retry would be blockSize-aligned due to the file view, even if part of a second or later segment had been successfully accessed. If the first call accessed more than hints->transferSize bytes but did not progress further, the return value would be greater than length and cause a (perhaps) confusing abort. Then again, any backend that doesn't return length from its Xfer aborts.

To avoid that ambiguity and mixed behavior between options, and in case some MPI implementation accesses segments out of order in the strided + file view case, I went with an all-or-nothing approach that is consistent across all possible paths.


2 participants