Restart with a different number of ranks #535

Open
adammoody opened this issue Feb 22, 2023 · 0 comments


adammoody commented Feb 22, 2023

SCR currently allows an application to restart with a different number of ranks. However, one cannot call the SCR restart API in that case.

https://scr.readthedocs.io/en/latest/users/integration.html#restart-without-scr

This is awkward for applications that can otherwise use the SCR restart API when restarting with the same number of ranks, since they then need to have two code paths:

  1. if restarting with same number of ranks --> use SCR restart API
  2. if restarting with different number of ranks --> do not use the SCR restart API

It would be nice to merge these. It should be possible when leaving the files on the parallel file system, but there are checks and logic in the fetch process that currently do not support it.

One known problem is in reading the rank2file map. The read scatters the file info using kvtree, and it currently requires that the file be read by exactly the same number of ranks that wrote it.

if (kvtree_read_scatter(rank2file, filelist, scr_comm_world) != KVTREE_SUCCESS) {

We could work around that by redistributing the file info across the ranks in the current run. Either kvtree decides how the info gets spread out, or we modify the kvtree API so that the calling ranks can specify the new mapping.
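As a rough illustration of the first option (letting the library pick the mapping), here is a minimal sketch of a block distribution that assigns the per-rank entries written by the previous run to the ranks of the current run. The names `entry_range` and `remap_entries` are hypothetical and are not part of SCR or kvtree; the sketch only computes which entries a rank would receive and does not read or scatter anything.

```c
#include <assert.h>

/* Hypothetical helper, not part of SCR or kvtree.
 * Describes a contiguous slice of the rank2file entries
 * written by the previous run. */
typedef struct {
  int start; /* index of the first old-rank entry assigned to this rank */
  int count; /* number of old-rank entries assigned to this rank */
} entry_range;

/* Assign old_ranks entries to new_ranks ranks in contiguous blocks.
 * The first (old_ranks % new_ranks) ranks get one extra entry, so
 * the slices cover every entry exactly once. */
entry_range remap_entries(int old_ranks, int new_ranks, int rank)
{
  assert(old_ranks >= 0);
  assert(new_ranks > 0);
  assert(rank >= 0 && rank < new_ranks);

  int base = old_ranks / new_ranks; /* entries every rank gets */
  int rem  = old_ranks % new_ranks; /* leftover entries */

  entry_range r;
  r.count = base + (rank < rem ? 1 : 0);
  r.start = rank * base + (rank < rem ? rank : rem);
  return r;
}
```

With 10 entries written by the old run and 4 ranks in the new run, ranks 0 and 1 would each take 3 entries and ranks 2 and 3 would each take 2. If the new run has more ranks than the old one, some ranks receive an empty slice. A block layout is only one possible choice; an application-supplied mapping would need the API change mentioned above.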

For the remainder of the function, we stat each file to verify that it exists. It would be nice to keep that, and it's easy to handle.

scr/src/scr_fetch.c

Lines 251 to 258 in 79ff7ed

/* just stat the file to check that it exists */
for (i = 0; i < num_files; i++) {
  if (access(src_filelist[i], R_OK) < 0) {
    /* either can't read this file or it doesn't exist */
    success = 0;
    break;
  }
}
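The check above does not depend on how many ranks wrote the files, which is why it should carry over unchanged. A minimal standalone sketch, assuming each rank is simply handed whatever list of paths the new mapping gives it (the function name `all_files_readable` is hypothetical, not an SCR function):

```c
#include <unistd.h>

/* Hypothetical helper, not part of SCR.
 * Returns 1 if every path in files[0..num_files-1] is readable
 * by the calling process, and 0 otherwise. An empty list counts
 * as success, which covers ranks that were assigned no files. */
int all_files_readable(const char** files, int num_files)
{
  for (int i = 0; i < num_files; i++) {
    if (access(files[i], R_OK) < 0) {
      /* either can't read this file or it doesn't exist */
      return 0;
    }
  }
  return 1;
}
```

Each rank would still need to combine its local result with the other ranks before the fetch is declared successful; that step is omitted here.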

The trickier part is that we then fill in the local filemap data structure with info about each file that a rank "owns". It's not clear what to do in this case. One option would be to have each rank register every file as though all files are shared by all ranks. This is not exactly scalable, but perhaps it's the safest option, since we don't know how they will be accessed.

scr/src/scr_fetch.c

Lines 267 to 272 in 79ff7ed

/* create a filemap for the files we just read in */
scr_filemap* map = scr_filemap_new();
for (i = 0; i < num_files; i++) {
  /* get source and destination file names */
  const char* src_file = src_filelist[i];
  const char* dest_file = dest_filelist[i];
