
Encoder training: new regimes and specs #506

Merged (10 commits, Nov 28, 2023)

Conversation

@nkemnitz changed the title from "Encoder training" to "WIP: encoder training" on Sep 5, 2023
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from 315a580 to 38418b0 (September 6, 2023 12:49)
codecov bot commented Sep 6, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (8b754d3) 100.00% vs. head (8f0872a) 100.00%.

❗ Current head 8f0872a differs from the pull request's most recent head eb16f5a. Consider uploading reports for commit eb16f5a to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #506   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          128       129    +1     
  Lines         4284      4301   +17     
=========================================
+ Hits          4284      4301   +17     


@nkemnitz force-pushed the nkem/train-encoder branch 2 times, most recently from d124e82 to 541552e (September 6, 2023 19:42)
@nkemnitz force-pushed the nkem/train-encoder branch 6 times, most recently from 42da668 to 01eb94f (September 20, 2023 10:30)
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from ee4af7f to b563936 (October 18, 2023 13:38)
@nkemnitz force-pushed the nkem/train-encoder branch 2 times, most recently from 94aff8f to f03235b (October 27, 2023 16:26)
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from 281dceb to 1d7a3eb (November 8, 2023 10:12)
@nkemnitz marked this pull request as ready for review November 8, 2023 10:13
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from c1811a0 to cc794e3 (November 8, 2023 11:39)
@nkemnitz changed the title from "WIP: encoder training" to "Encoder training: new regimes and specs" on Nov 8, 2023
@supersergiy (Member) left a comment:

Hopefully the .py specs won't be too intimidating for whoever has to pick this up!
Let's stick with .cue in the future for now...

zetta_utils/__init__.py (outdated; resolved)
@supersergiy (Member) left a comment:

💯

else builder.build(val_dataloader),
full_state_ckpt_path=full_state_ckpt_path,
)
else:
Contributor left a comment:

There are no statements after the `else`, so we can return early and remove this block, saving one level of indentation.
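The suggested early-return refactor can be sketched as follows; `build_local_trainer` and `build_remote_trainer` are illustrative stand-ins, not functions from zetta_utils:

```python
# Hypothetical sketch of the early-return refactor; the builder functions
# and their return values are placeholders, not code from the PR.

def build_local_trainer() -> str:
    return "local trainer"

def build_remote_trainer() -> str:
    return "remote trainer"

# Before: the remote path lived in an `else:` block even though the `if`
# branch already returned, adding an unnecessary indentation level.
def launch_before(local: bool) -> str:
    if local:
        return build_local_trainer()
    else:
        return build_remote_trainer()

# After: since the `if` branch returns, the `else:` wrapper can be dropped
# and the remote path dedented.
def launch_after(local: bool) -> str:
    if local:
        return build_local_trainer()
    return build_remote_trainer()
```

Both versions behave identically; the second just reads flatter.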

Comment on lines 299 to 310
if train_args["trainer"]["accelerator"] == "gpu":
num_devices = int(resource_limits["nvidia.com/gpu"]) # type: ignore
trainer_devices = train_args["trainer"]["devices"]
if (
isinstance(trainer_devices, int)
and trainer_devices != -1
and trainer_devices != num_devices
):
raise ValueError(
f"Trainer specification uses {trainer_devices} devices, "
f"while `nvidia.com/gpu` limit is {num_devices}."
)
Contributor left a comment:

This check applies to `num_nodes=1` as well; the rest looks good.
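The reviewer's point is that the per-node GPU limit should be validated regardless of `num_nodes`. A hedged sketch of the check from the snippet above, pulled into a standalone helper (here `resource_limits` is passed explicitly for illustration; in the PR it is a surrounding variable):

```python
# Sketch of the device-count validation discussed above. The dict keys
# ("trainer", "devices", "nvidia.com/gpu") mirror the quoted snippet;
# wrapping it in a helper is an assumption for illustration.

def validate_gpu_devices(train_args: dict, resource_limits: dict) -> None:
    if train_args["trainer"]["accelerator"] != "gpu":
        return
    num_devices = int(resource_limits["nvidia.com/gpu"])
    trainer_devices = train_args["trainer"]["devices"]
    # `devices=-1` means "use all available GPUs", so only explicit
    # integer counts are compared against the Kubernetes GPU limit.
    if isinstance(trainer_devices, int) and trainer_devices not in (-1, num_devices):
        raise ValueError(
            f"Trainer specification uses {trainer_devices} devices, "
            f"while `nvidia.com/gpu` limit is {num_devices}."
        )
```

Because the comparison is per node, it is valid for any `num_nodes`, including 1.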

@supersergiy merged commit ec335cf into main Nov 28, 2023
11 checks passed
@@ -267,13 +293,31 @@ def _lightning_train_remote(
Creates a volume mount for `train.cue` in `/opt/zetta_utils/specs`.
Runs the command `zetta run specs/train.cue` on one or more worker pods.
"""
if train_args["trainer"]["accelerator"] == "gpu":
Contributor left a comment:

The [example DDP spec] sets `accelerator` to `"cuda"`, resulting in a `NotImplementedError`.
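PyTorch Lightning accepts both `"gpu"` and `"cuda"` as accelerator names, so a guard that compares only against `"gpu"` skips specs that say `"cuda"` and falls through to the unimplemented path. A minimal normalization sketch, assuming a helper that is not part of zetta_utils:

```python
# Hypothetical fix for the accelerator-string mismatch flagged above.
# The set of aliases reflects PyTorch Lightning's accepted names for the
# CUDA accelerator; the helper name is an assumption for illustration.

GPU_ACCELERATOR_ALIASES = {"gpu", "cuda"}

def is_gpu_accelerator(accelerator: str) -> bool:
    """Return True for any accelerator string that selects NVIDIA GPUs."""
    return accelerator.lower() in GPU_ACCELERATOR_ALIASES
```

With this, the check in `_lightning_train_remote` could read `if is_gpu_accelerator(train_args["trainer"]["accelerator"]):` and cover both spellings.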

supersergiy added a commit that referenced this pull request Feb 2, 2024
Encoder training: new regimes and specs
4 participants