
docs: fixed the init_module and deepspeed #20175

Open · wants to merge 4 commits into base: master

Conversation

@alyakin314 commented Aug 7, 2024

What does this PR do?

Fixes the init_module docs, which suggest doing the following:

# Recommended for FSDP, TP and DeepSpeed
with fabric.init_module(empty_init=True):
    model = GPT3()  # parameters are placed on the meta-device

model = fabric.setup(model)  # parameters get sharded and initialized at once

# Make sure to create the optimizer only after the model has been set up
optimizer = torch.optim.Adam(model.parameters())
optimizer = fabric.setup_optimizers(optimizer)

However, setting up the optimizer and the model separately breaks with DeepSpeed because of the following check in Fabric:

def _validate_setup_optimizers(self, optimizers: Sequence[Optimizer]) -> None:
    self._validate_launched()
    if isinstance(self._strategy, (DeepSpeedStrategy, XLAStrategy)):
        raise RuntimeError(
            f"The `{type(self._strategy).__name__}` requires the model and optimizer(s) to be set up jointly"
            " through `.setup(model, optimizer, ...)`."
        )

Changed the docs to reflect the correct syntax.
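
For reference, the joint-setup pattern the change documents looks roughly like the following sketch (it reuses the placeholder GPT3 model from the existing snippet; the exact wording in the PR may differ):

# DeepSpeed (and XLA) require the model and optimizer(s) to be set up jointly
with fabric.init_module(empty_init=True):
    model = GPT3()  # parameters are placed on the meta-device

optimizer = torch.optim.Adam(model.parameters())

# joint setup: parameters get sharded and initialized, and the optimizer is
# set up together with the model in a single call
model, optimizer = fabric.setup(model, optimizer)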

Furthermore, discussed the necessity of using this with DeepSpeed Stage 3, as stated here:
#17792 (comment)

I only discussed DeepSpeed Stage 3, as I cannot find a statement on whether this is also needed for Stage 2 to work correctly. From personal experience, it led to an improvement by allowing a slightly larger batch size (2 -> 4) for me. I am personally confused as to why this helps, since Stage 2 does not shard the parameters, but my understanding of these strategies is highly limited. Perhaps @awaelchli or someone else could enlighten me on if/why it does? Happy to change the text to include Stage 2 (or also Stage 1?).

Update: I only discussed Stage 3, as that is where this is necessary. I believe I traced what improved my performance down to the larger batch size, and it was not the inclusion of init_module.

Just as a note: I read the docs-editing README, which says that building locally is required. I had some issues building the docs on my local machine, but I triple-checked that the change follows the .rst format, and it is a simple change.

Fixes: No issue, as this is a (likely straightforward) documentation fix.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

[x] Yes!
Debugging why DeepSpeed wasn't working was not as fun, though. This could potentially benefit from an article in the docs, even though it is an experimental feature in Fabric?


📚 Documentation preview 📚: https://pytorch-lightning--20175.org.readthedocs.build/en/20175/

@github-actions bot added the docs (Documentation related) and fabric (lightning.fabric.Fabric) labels on Aug 7, 2024
@alyakin314 alyakin314 marked this pull request as draft August 7, 2024 06:57
@alyakin314 alyakin314 marked this pull request as ready for review August 7, 2024 06:59

@alyakin314 (Author) commented Aug 7, 2024

Just as a note: I was able to test that this is the correct way to call .setup with Stage 2 only (where it helped). I cannot run Stage 3 due to the old issue with non-DeepSpeed, non-Fabric checkpoints.

@Borda changed the title from "fixed the init_module and deepspeed docs" to "docs: fixed the init_module and deepspeed" on Aug 7, 2024
@@ -79,6 +79,17 @@ When training distributed models with :doc:`FSDP/TP <model_parallel/index>` or D
optimizer = torch.optim.Adam(model.parameters())
optimizer = fabric.setup_optimizers(optimizer)

With DeepSpeed Stage 3, the use of the :meth:`~lightning.fabric.fabric.Fabric.init_module` context manager is necessary for the model to be sharded correctly instead of being loaded onto the GPU in its entirety. DeepSpeed requires the model and optimizer to be set up jointly.

Contributor:

The first sentence is not wrong, but it is not that different from FSDP, and therefore the code above was already appropriate.

I think what you wanted to mention was the second sentence, "DeepSpeed requires the model and optimizer to be set up jointly", right? Can we do it so we don't repeat ourselves, e.g., just a comment about calling model, optimizer = fabric.setup(model, optimizer) for DeepSpeed?

Author:

Totally. I wrote that sentence by rewording your comment from the issue linked above, but I agree that in general it applies to FSDP as well.

Just to be clear that I understood you: keep the old block of code with a text comment about DeepSpeed below it? Or still two blocks of code? Or one block showing both ways (one commented out)?

Contributor:

I think one code block if possible, with a code comment or a text comment outside the block, would be perfect. Great idea, thanks!
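
For example, something roughly along these lines (a sketch only; the exact comment wording and placement are up for discussion):

# Recommended for FSDP, TP and DeepSpeed
with fabric.init_module(empty_init=True):
    model = GPT3()  # parameters are placed on the meta-device

model = fabric.setup(model)  # parameters get sharded and initialized at once

# Make sure to create the optimizer only after the model has been set up
optimizer = torch.optim.Adam(model.parameters())
optimizer = fabric.setup_optimizers(optimizer)

# Note: DeepSpeed (and XLA) require the model and optimizer(s) to be set up
# jointly instead of the two separate calls above:
# model, optimizer = fabric.setup(model, optimizer)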

Labels: docs (Documentation related), fabric (lightning.fabric.Fabric)
Projects: None yet
3 participants