dsync generates MPI error when I'm not the owner of the source path #550

Open
Aelmazaty opened this issue Jun 13, 2023 · 4 comments

@Aelmazaty

Hello,

I've installed mpifileutils version 0.11.1 using spack.
I always get an MPI error when I am not the owner of the source file/directory, even though I have at least read permission.
The files are copied, but the error is still reported. This is a problem because when the command is submitted as an LSF or SLURM job, the job is marked as failed.
No errors are generated if I am the owner of the source.

Example:
[aelmazaty@codon-dm-06 lsf-hx-wp]# ls -l /hps/scratch/sysinf/power_usage
-rw-r--r-- 1 root root 17035 Sep 5 2022 /hps/scratch/sysinf/power_usage

[aelmazaty@codon-dm-06 lsf-hx-wp]# mpirun -np 4 dsync -v --progress 1 /hps/scratch/sysinf/power_usage /hps/scratch/sysinf/aelmazaty/
[2023-06-13T16:01:14] Walking source path
[2023-06-13T16:01:14] Walking /hps/scratch/sysinf/power_usage
[2023-06-13T16:01:14] Walked 1 items in 0.001 secs (882.196 items/sec) ...
[2023-06-13T16:01:14] Walked 1 items in 0.001 seconds (818.132 items/sec)
[2023-06-13T16:01:14] Walking destination path
[2023-06-13T16:01:14] Walking /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:01:14] Walked 1 items in 0.002 secs (617.520 items/sec) ...
[2023-06-13T16:01:14] Walked 1 items in 0.002 seconds (606.374 items/sec)
[2023-06-13T16:01:14] Comparing file sizes and modification times of 1 items
[2023-06-13T16:01:14] Started : Jun-13-2023, 16:01:14
[2023-06-13T16:01:14] Completed : Jun-13-2023, 16:01:14
[2023-06-13T16:01:14] Seconds : 0.000
[2023-06-13T16:01:14] Items : 1
[2023-06-13T16:01:14] Item Rate : 1 items in 0.000158 seconds (6310.263012 items/sec)
[2023-06-13T16:01:14] Updating timestamps on newly copied files

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[11234,1],0]
Exit code: 1

The file is copied, but an error is reported.

When I try with a file I own:
[aelmazaty@codon-dm-06 lsf-hx-wp]# ls -l /hps/scratch/sysinf/power_usage_aelmazaty
-rw-r--r-- 1 aelmazaty systems 17035 Jun 13 13:54 /hps/scratch/sysinf/power_usage_aelmazaty
[aelmazaty@codon-dm-06 lsf-hx-wp]# mpirun -np 4 dsync -v --progress 1 /hps/scratch/sysinf/power_usage_aelmazaty /hps/scratch/sysinf/aelmazaty/
[2023-06-13T16:02:17] Walking source path
[2023-06-13T16:02:17] Walking /hps/scratch/sysinf/power_usage_aelmazaty
[2023-06-13T16:02:17] Walked 1 items in 0.001 secs (872.339 items/sec) ...
[2023-06-13T16:02:17] Walked 1 items in 0.001 seconds (804.228 items/sec)
[2023-06-13T16:02:17] Walking destination path
[2023-06-13T16:02:17] Walking /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:02:17] Walked 1 items in 0.000 secs (2210.726 items/sec) ...
[2023-06-13T16:02:17] Walked 1 items in 0.000 seconds (2045.349 items/sec)
[2023-06-13T16:02:17] Comparing file sizes and modification times of 1 items
[2023-06-13T16:02:17] Started : Jun-13-2023, 16:02:17
[2023-06-13T16:02:17] Completed : Jun-13-2023, 16:02:17
[2023-06-13T16:02:17] Seconds : 0.000
[2023-06-13T16:02:17] Items : 1
[2023-06-13T16:02:17] Item Rate : 1 items in 0.000162 seconds (6177.720668 items/sec)
[2023-06-13T16:02:17] Deleting items from destination
[2023-06-13T16:02:17] Removing 1 items
[2023-06-13T16:02:17] Removed 1 items in 0.003 seconds (327.228 items/sec)
[2023-06-13T16:02:17] Copying items to destination
[2023-06-13T16:02:17] Copying to /hps/scratch/sysinf/aelmazaty
[2023-06-13T16:02:17] Items: 1
[2023-06-13T16:02:17] Directories: 0
[2023-06-13T16:02:17] Files: 1
[2023-06-13T16:02:17] Links: 0
[2023-06-13T16:02:17] Data: 16.636 KiB (16.636 KiB per file)
[2023-06-13T16:02:17] Creating 1 files.
[2023-06-13T16:02:17] Copying data.
[2023-06-13T16:02:17] Copy data: 16.636 KiB (17035 bytes)
[2023-06-13T16:02:17] Copy rate: 1.207 MiB/s (17035 bytes in 0.013 seconds)
[2023-06-13T16:02:17] Syncing data to disk.
[2023-06-13T16:02:17] Sync completed in 0.020 seconds.
[2023-06-13T16:02:17] Setting ownership, permissions, and timestamps.
[2023-06-13T16:02:17] Updated 1 items in 0.003 seconds (298.208 items/sec)
[2023-06-13T16:02:17] Syncing directory updates to disk.
[2023-06-13T16:02:17] Sync completed in 0.001 seconds.
[2023-06-13T16:02:17] Started: Jun-13-2023,16:02:17
[2023-06-13T16:02:17] Completed: Jun-13-2023,16:02:17
[2023-06-13T16:02:17] Seconds: 0.043
[2023-06-13T16:02:17] Items: 1
[2023-06-13T16:02:17] Directories: 0
[2023-06-13T16:02:17] Files: 1
[2023-06-13T16:02:17] Links: 0
[2023-06-13T16:02:17] Data: 16.636 KiB (17035 bytes)
[2023-06-13T16:02:17] Rate: 391.203 KiB/s (17035 bytes in 0.043 seconds)
[2023-06-13T16:02:17] Updating timestamps on newly copied files

It works normally without getting any errors.

I tried different OpenMPI versions, all installed via spack; the latest is 4.1.5. I get the same error with all of them.

Is this a known issue? How can I avoid these errors?
Best regards,
Ahmed

@carbonneau1
Collaborator

carbonneau1 commented Nov 26, 2024

The dsync utility synchronizes files and directories. This means that once the files/directories are copied over, the process checks for differences in owner (uid) and group (gid), among other things, and sets them in the destination to match the source.
Having permission to read the source does not mean you have permission to set the owner, group, and other attributes on the destination files/directories. If you run the process as root, it can update these attributes even if you're not the owner.
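To make the permission rule concrete, here is a minimal Python sketch of the usual Linux ownership check. It is not dsync's code. The function names, the uid/gid value 1000, and the simplification "euid 0 means privileged" (ignoring capabilities such as CAP_CHOWN) are assumptions made for illustration.

```python
import os


def can_replicate_ownership(src_uid, src_gid, euid, groups):
    """Return True if a process with this euid/groups could set a file it
    created to (src_uid, src_gid) under the usual Linux chown rules."""
    if euid == 0:
        # Simplification: treat root as the only privileged case.
        return True
    if src_uid != euid:
        # An unprivileged user cannot give a file to another user.
        return False
    # The owner may only assign a group they are a member of.
    return src_gid in groups


def check_path(path):
    """Apply the check to a real path for the current process."""
    st = os.stat(path)
    groups = set(os.getgroups()) | {os.getegid()}
    return can_replicate_ownership(st.st_uid, st.st_gid, os.geteuid(), groups)


if __name__ == "__main__":
    # Source owned by root:root, copied by a hypothetical user 1000.
    print(can_replicate_ownership(0, 0, 1000, {1000}))        # False
    # Source owned by the same hypothetical user.
    print(can_replicate_ownership(1000, 1000, 1000, {1000}))  # True
```

Running `check_path` on the source before a sync gives a rough indication of whether the ownership step can succeed for the current user.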

Example output: [2024-11-26T15:32:36] Setting ownership, permissions, and timestamps.

In this case you would be better off using 'dcp', which does only the copy part. To preserve the owner and other attributes with dcp, you would have to specify the following flag:

-p, --preserve - preserve permissions, ownership, timestamps (see also --xattrs)

@carbonneau1
Collaborator

If there is no objection, I will close this issue.

@adilger
Contributor

adilger commented Nov 27, 2024

AFAICS, one reason to use dsync instead of dcp is to avoid copying all of the files again (if they already exist in the target), and to allow deleting old files in the target, like the difference between rsync and cp.

Running rsync as a non-root user silently ignores differences in the target file ownership (strace didn't show any failed attempts at fchown() or lchown() when running as a non-root user).
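The behaviour described above can be sketched in Python as follows. This is only an illustration of the "skip ownership unless root" idea, not rsync's or dsync's actual implementation; the function name `copy_metadata_lenient` is hypothetical.

```python
import os
import shutil
import tempfile


def copy_metadata_lenient(src, dst):
    """Copy mode and timestamps from src to dst; copy ownership only as root.

    Returns True if ownership was set, False if that step was skipped.
    """
    st = os.stat(src)
    owner_set = False
    if os.geteuid() == 0:
        # Only a privileged process attempts chown, so a non-root run
        # never makes a failing ownership call.
        os.chown(dst, st.st_uid, st.st_gid)
        owner_set = True
    os.chmod(dst, st.st_mode & 0o7777)
    os.utime(dst, ns=(st.st_atime_ns, st.st_mtime_ns))
    return owner_set


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.txt")
        dst = os.path.join(tmp, "dst.txt")
        with open(src, "w") as f:
            f.write("example\n")
        os.chmod(src, 0o640)
        shutil.copyfile(src, dst)  # copies file contents only
        print("ownership set:", copy_metadata_lenient(src, dst))
```

With this approach a non-root run still preserves permissions and timestamps and exits cleanly, at the cost of leaving the destination owned by the user who ran the copy.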

@carbonneau1
Collaborator

I totally agree with you on the use of dsync instead of dcp. I was trying to explain the error he saw. I was not able to reproduce it using another user's directory that I have read permission on. Shall we close the issue?
