Lustre DataIn Data Movements need a way to set striping #220

Closed
bdevcich opened this issue Oct 28, 2024 · 11 comments
@bdevcich
Contributor

Rabbit Lustre filesystems default to a stripe count of 1. When creating Lustre filesystems with multiple OSTs, DataIn Data Movements will only be able to transfer data up to the size of a single OST. This causes dcp to error out once that OST fills up:

Copied 16.462 TiB (59%) in 2001.381 secs (8.423 GiB/s) 1403 secs left ...
ABORT: rank X on HOST: Failed to write file /mnt/nnf/ab40f16f-cc36-4f2d-9d78-0c53495bfb1b-0/testfile errno=28 (No space left on device) @ /deps/mpifileutils/src/common/mfu_io.c:1051

We need a way to set the stripe count appropriately prior to Data Movement performed in the DataIn state.
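
For context, here is a hedged sketch of how the problem shows up on the destination (the mount point below is a placeholder, not taken from this issue): with the default layout the directory reports a stripe count of 1, and `lfs df` shows a single OST filling while the others stay mostly empty.

```sh
# Show the default layout inherited by new files under the DataIn destination
lfs getstripe -d /mnt/nnf/<fileset>

# Per-OST usage; with a one-stripe layout only one OST fills before ENOSPC
lfs df -h /mnt/nnf/<fileset>
```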

@bdevcich
Contributor Author

Setting a stripe count of -1 prior to data movement in DataIn resolves the issue.

The initial plan is to make this part of the NnfDataMovementProfile so that a command (or commands) can be run prior to data movement.
We need to determine whether this is something we want to do before every data movement or just for DataIn. In DataOut situations, the compute application/script can essentially do whatever it needs to do on the target filesystem to prepare for DataOut, but there is no parallel for that in DataIn.

I think we also need to consider whether doing something like setstripe -c -1 just to suit the needs of data movement in DataIn is going to cause any issues for the running compute node application if that application has its own striping requirements.

Does this mean that users will want to use a different DM profile for DataIn? Or do we just limit these commands (if supplied) to run only during DataIn?
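
A minimal sketch of the workaround described above, assuming the destination directory is known before the transfer starts (paths and MPI launch parameters are placeholders): widen the directory's default layout, then run the DataIn copy.

```sh
# Stripe new files under the destination across all available OSTs
lfs setstripe -c -1 /mnt/nnf/<fileset>

# Then run the DataIn transfer as usual with mpifileutils dcp
mpirun -np 8 dcp /lus/global/<source-dir> /mnt/nnf/<fileset>
```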

@weloewe

weloewe commented Oct 28, 2024

Another consideration might be to use Progressive File Layout (PFL) to avoid the situation of filling a single OST. In that case, very small files would go to the MDT, and then as a file's extents grow it would stripe across more and more OSTs:

lfs setstripe -E 64K -L mdt -E 16m -c 1 -S 16m -E 1G -c 2 -E 4G -c 4 -E 16G -c 8 -E 64G -c 16 -E -1 -c -1

In any case, the specifics of the striping could be determined as needed, while maintaining a default (PFL) that avoids filling an individual OST.
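
As a hedged usage note (the directory path is a placeholder; this simply restates the layout above applied to a destination directory), the PFL default is set once on the directory and inherited by files created under it, and `lfs getstripe -d` confirms the components that were applied:

```sh
# Apply the PFL default to the DataIn destination and verify it
lfs setstripe -E 64K -L mdt -E 16m -c 1 -S 16m -E 1G -c 2 -E 4G -c 4 \
    -E 16G -c 8 -E 64G -c 16 -E -1 -c -1 /mnt/nnf/<fileset>
lfs getstripe -d /mnt/nnf/<fileset>
```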

@behlendorf
Collaborator

Setting a PFL layout here would be a nice way to handle this, since then files created by users of the filesystem would also benefit from reasonable striping defaults. We should be able to use the new postActivate functionality for this, or we could add a new dedicated field to the NnfStorageProfile.

@bdevcich
Contributor Author

Would this be the default for all Rabbit Lustre filesystems, or do we need to do something different for DataIn situations?

@bdevcich
Contributor Author

We should be able to use the new postActivate functionality for this, or we could add a new dedicated field to the NnfStorageProfile.

@behlendorf Unfortunately not. postActivate commands are run on the Lustre server side, so we're going to have to add a new field to the NnfStorageProfile and some new behavior to mount the filesystem and run commands from the Lustre client.

The postActivate field already does this for XFS/GFS2, but not for Lustre.

@behlendorf
Collaborator

We'd likely want to set this for all Rabbit Lustre filesystems, although perhaps slightly differently depending on how many Rabbits are part of the filesystem. You're right, I forgot this was a client-side thing. It does seem like we'll need some more machinery for that.

@bdevcich
Contributor Author

bdevcich commented Nov 5, 2024

PostMount and PreUnmount commands are being added to the NnfStorageProfiles to support this.

Additionally, for XFS and GFS2, the existing PostActivate and PreDeactivate commands will be renamed to PostMount and PreUnmount, since those actions already take place on the client side.

For Lustre, all four will be supported.

perhaps slightly differently depending on how many rabbits are part of the filesystem

Would something like $NUM_RABBITS in the commands work or does it need to be more dynamic than that?
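
For illustration only (the variable name and $MOUNT_PATH are assumptions about how the substitution might look, not the final NnfStorageProfile contract), a PostMount command could scale the layout with the expanded value:

```sh
# Hypothetical PostMount command after variable expansion:
# set the stripe count to the number of Rabbits backing the filesystem
lfs setstripe -c $NUM_RABBITS $MOUNT_PATH
```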

@bdevcich self-assigned this Nov 5, 2024
@bdevcich moved this from 📋 Open to 🏗 In progress in Issues Dashboard Nov 5, 2024
@behlendorf
Collaborator

behlendorf commented Nov 7, 2024

Would something like $NUM_RABBITS in the commands work or does it need to be more dynamic than that?

We should also add $NUM_MDTS and $NUM_OSTS since they're relevant when deciding on a layout.
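
As a hedged example of why those counts matter when deciding on a layout (variable names and paths remain illustrative assumptions), $NUM_OSTS could size the widest PFL component and $NUM_MDTS could drive directory striping:

```sh
# Hypothetical PostMount commands using the proposed variables
lfs setstripe -E 1G -c 2 -E -1 -c $NUM_OSTS $MOUNT_PATH   # widest component spans all OSTs
lfs setdirstripe -D -c $NUM_MDTS $MOUNT_PATH              # default MDT striping for new subdirectories
```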

@bdevcich
Contributor Author

Addressed via NearNodeFlash/nnf-sos#416

We should also add $NUM_MDTS and $NUM_OSTS since they're relevant when deciding on a layout.

This will be addressed in a subsequent PR.

@bdevcich
Contributor Author

bdevcich commented Dec 4, 2024

PR to add $NUM_MDTS and $NUM_OSTS here: NearNodeFlash/nnf-sos#424

@bdevcich moved this from 🏗 In progress to 👀 In review in Issues Dashboard Dec 4, 2024
@bdevcich
Contributor Author

bdevcich commented Dec 9, 2024

@bdevcich closed this as completed Dec 9, 2024
@github-project-automation bot moved this from 👀 In review to ✅ Closed in Issues Dashboard Dec 9, 2024