Fix to check bytes accessed by MPIIO xfers and retry if possible #501

mtmoore-hpe · 2024-12-13T16:14:05Z

Currently the MPIIO backend assumes that every completed read or write call accessed all requested bytes, and it always returns hints->transferSize bytes. This allows silent partial writes that are only visible during verification, and in theory it skews the reported bandwidth, since all bytes are assumed transferred when calculating the rate.
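To make the bandwidth skew concrete, here is a small arithmetic sketch. The helper names and the numbers (100 transfers of 1 MiB, 4 of them silently moving zero bytes, 2 seconds elapsed) are hypothetical and are not taken from IOR or from any measured run.

```c
#include <assert.h>

#define MIB (1024.0 * 1024.0)

/* Bandwidth in MiB/s for `bytes` moved in `seconds`. */
double bandwidth_mib_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / MIB / seconds;
}

/* Relative overstatement when `assumed` bytes are counted
 * but only `actual` bytes were really transferred. */
double overstatement(double assumed, double actual)
{
    assert(actual > 0.0);
    return (assumed - actual) / actual;
}
```

With those made-up numbers, counting all 100 MiB gives 50 MiB/s while only 48 MiB/s was actually achieved, an overstatement of roughly 4%.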

This patch follows the retry model in the POSIX backend, looping up to MAX_RETRY times, but retries the full request instead of only the remaining bytes. The full I/O is retried for simplicity, and because it is ambiguous whether a short transfer always covers exactly the first N bytes across all MPI implementations, including with strided datatypes. We've only observed zero-byte partial transfers in practice.
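The all-or-nothing retry described above can be sketched as follows. This is not the code from the patch: the callback type `xfer_fn`, the function `xfer_full_retry`, the mock backend, and the value given to `MAX_RETRY` here are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RETRY 10 /* illustrative value only, not IOR's definition */

/* Hypothetical transfer callback: returns the number of bytes accessed. */
typedef long long (*xfer_fn)(void *ctx, char *buf, long long length);

/* Reissue the whole request until exactly `length` bytes are accessed
 * or MAX_RETRY attempts have been made. Returns `length` on success,
 * otherwise the byte count of the last attempt. */
long long xfer_full_retry(xfer_fn xfer, void *ctx, char *buf, long long length)
{
    assert(xfer != NULL && length > 0);
    long long accessed = 0;
    for (int attempt = 0; attempt < MAX_RETRY; attempt++) {
        accessed = xfer(ctx, buf, length); /* always the full request */
        if (accessed == length)
            return length;
    }
    return accessed;
}

/* Mock backend: reports 0 bytes for the first `short_calls` calls,
 * then the full length. Mirrors the zero-byte case seen in practice. */
struct mock_state { int short_calls; int calls; };

long long mock_xfer(void *ctx, char *buf, long long length)
{
    struct mock_state *s = ctx;
    (void)buf;
    s->calls++;
    return (s->calls <= s->short_calls) ? 0 : length;
}
```

In this sketch the caller can still compare the return value against the requested length, so a transfer that never completes is reported as short rather than silently counted as complete.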

Collective MPIIO transfers aren't retried. If any one rank had a short access, all ranks would have to retry the collective call. Supporting that would require an additional synchronous MPI call on every transfer to exchange each rank's bytes transferred with all other ranks. A short MPIIO collective call therefore returns the actual bytes transferred without retrying.

adilger · 2024-12-13T18:05:19Z

Is there a reason not to do the incremental write retry instead of doing a full overwrite each time? The benefit of the incremental write is that it makes forward progress each time. If the write is deliberately being split for some underlying reason (e.g. it is too large for the underlying storage), then a full retry may write the first part repeatedly without ever attempting the later parts.
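For comparison, a minimal sketch of the incremental retry being asked about. It assumes a short transfer always covers the first N bytes of the request, which is exactly the assumption under discussion in this thread. The names `xfer_at_fn`, `xfer_incremental_retry`, and the capped mock backend are hypothetical, and the `MAX_RETRY` value is illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RETRY 10 /* illustrative value only, not IOR's definition */

/* Hypothetical transfer callback: accesses up to `length` bytes at
 * `offset`, returns bytes accessed or a negative value on error. */
typedef long long (*xfer_at_fn)(void *ctx, char *buf,
                                long long length, long long offset);

/* Resume after the bytes already accessed instead of restarting.
 * Assumes a short transfer covers a prefix of the request. */
long long xfer_incremental_retry(xfer_at_fn xfer, void *ctx, char *buf,
                                 long long length, long long offset)
{
    assert(xfer != NULL && length > 0);
    long long done = 0;
    for (int attempt = 0; attempt < MAX_RETRY && done < length; attempt++) {
        long long rc = xfer(ctx, buf + done, length - done, offset + done);
        if (rc < 0)
            return rc;   /* hard error: give up */
        done += rc;      /* forward progress, possibly zero */
    }
    return done;
}

/* Mock backend that never accesses more than `cap` bytes per call. */
struct capped_state { long long cap; int calls; long long last_offset; };

long long capped_xfer(void *ctx, char *buf, long long length, long long offset)
{
    struct capped_state *s = ctx;
    (void)buf;
    s->calls++;
    s->last_offset = offset;
    return (length < s->cap) ? length : s->cap;
}
```

With a backend capped at 4 bytes per call, a 10-byte request finishes in three calls here, whereas reissuing the full request each time would never get past the first 4 bytes.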

mtmoore-hpe · 2024-12-13T19:06:12Z

> Is there a reason not to do the incremental write retry instead of doing a full overwrite each time?

I agree it would be ideal to simply retry the remaining bytes if counts other than 0 or the expected length occur in the wild.

It wasn't clear whether it was safe to assume that only the first N bytes are accessed for any given MPI implementation, particularly in the --mpiio.useStridedDatatype and --mpiio.useFileView case. In that case the initial Xfer call accesses hints->segmentCount * hints->transferSize bytes of data in a single MPIIO call, and the backend reports that a single segment's worth of data was accessed. Subsequent calls immediately return hints->transferSize bytes, since all segments were already accessed in the first call (src/aiori-MPIIO.c:468). Any partial retry would be blockSize aligned due to the file view, even if part of a second or later segment had been successfully accessed. If the first call accessed more than hints->transferSize bytes but did not progress further, the return value would be greater than length and cause a (perhaps) confusing abort. Then again, any backend that doesn't return length from its Xfer aborts.

To avoid that ambiguity and mixed behavior between options, and in case some MPI implementation accesses segments out of order in the strided + file view case, I went with an all-or-nothing approach that is consistent across all possible paths.

Fix to check bytes accessed by MPIIO xfers and retry if possible

b98f10f
