Encoder training: new regimes and specs #506
nkemnitz commented on Sep 5, 2023
- includes / requires feat: imgaug library support #446
- includes / requires Support mapping as input for most tensor ops #498
- includes / requires fix/feat(lightning_train_remote): accept dict / PartialBuilder spec #513
- includes / requires feat: preserve tuples in JSON #514
- requires fix(convblock): skip connection beyond last convolution with post-act… #537
315a580 to 38418b0
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff            @@
##              main      #506   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files          128       129    +1
  Lines         4284      4301   +17
=========================================
+ Hits          4284      4301   +17
d124e82 to 541552e
42da668 to 01eb94f
ee4af7f to b563936
94aff8f to f03235b
281dceb to 1d7a3eb
c1811a0 to cc794e3
cc794e3 to fff0522
Hopefully the .py specs won't be too intimidating for whoever has to pick this up!
Let's stick with .cue going forward, at least for now...
fff0522 to f88e713
f88e713 to de91eb8
💯
7701678 to 59b2491
55d3981 to 02e1e88
02e1e88 to 46f534a
        else builder.build(val_dataloader),
        full_state_ckpt_path=full_state_ckpt_path,
    )
else:
There are no statements after the else, so we can return from the if branch and remove this block: one less level of indentation.
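To illustrate the suggestion above, here is a minimal sketch of the early-return pattern. The function and parameter names (`run_training`, `local_run`, `train_local`, `train_remote`) are hypothetical stand-ins, not the actual code under review; only the control-flow shape is the point.

```python
from typing import Callable


def run_training(
    local_run: bool,
    train_local: Callable[[], str],
    train_remote: Callable[[], str],
) -> str:
    """Hypothetical stand-in for a function that ends in an if/else."""
    if local_run:
        # Returning here means the remaining code needs no `else:` wrapper.
        return train_local()

    # Formerly the body of the `else:` block, now one level shallower.
    return train_remote()
```

Because nothing follows the original `else` block, the behavior is unchanged; only the nesting is reduced.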
if train_args["trainer"]["accelerator"] == "gpu":
    num_devices = int(resource_limits["nvidia.com/gpu"])  # type: ignore
    trainer_devices = train_args["trainer"]["devices"]
    if (
        isinstance(trainer_devices, int)
        and trainer_devices != -1
        and trainer_devices != num_devices
    ):
        raise ValueError(
            f"Trainer specification uses {trainer_devices} devices, "
            f"while `nvidia.com/gpu` limit is {num_devices}."
        )
This check applies to num_nodes=1 as well; the rest looks good.
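One way to act on this comment might be to pull the check into a helper that never consults `num_nodes`, so single-node and multi-node runs are validated the same way. The sketch below is an assumption about how that could look: the helper name `check_gpu_device_count` is hypothetical, and only the dictionary keys and the error message come from the diff above.

```python
from typing import Any, Mapping


def check_gpu_device_count(
    trainer_args: Mapping[str, Any],
    resource_limits: Mapping[str, Any],
) -> None:
    """Hypothetical helper: raise if the trainer's device count disagrees
    with the `nvidia.com/gpu` limit. `num_nodes` is deliberately ignored."""
    if trainer_args["accelerator"] != "gpu":
        return

    num_devices = int(resource_limits["nvidia.com/gpu"])
    trainer_devices = trainer_args["devices"]
    # -1 means "use all available devices", so it can never conflict.
    if (
        isinstance(trainer_devices, int)
        and trainer_devices != -1
        and trainer_devices != num_devices
    ):
        raise ValueError(
            f"Trainer specification uses {trainer_devices} devices, "
            f"while `nvidia.com/gpu` limit is {num_devices}."
        )
```

Calling this once before branching on the number of nodes would cover both cases without duplicating the condition.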
25560c9 to d8dda04
8f0872a to eb16f5a
@@ -267,13 +293,31 @@ def _lightning_train_remote(
    Creates a volume mount for `train.cue` in `/opt/zetta_utils/specs`.
    Runs the command `zetta run specs/train.cue` on one or more worker pods.
    """
    if train_args["trainer"]["accelerator"] == "gpu":
The [example DDP spec] sets accelerator to "cuda", resulting in a NotImplementedError.
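A possible fix, sketched under the assumption that "cuda" should be treated the same as "gpu" for the purposes of this branch, is to compare against a small set of names instead of a single string. The names `GPU_ACCELERATORS` and `uses_nvidia_gpu` are hypothetical; the alternative of changing the example spec to "gpu" would work just as well.

```python
# Accelerator names assumed to mean "NVIDIA GPU" for resource checks.
GPU_ACCELERATORS = frozenset({"gpu", "cuda"})


def uses_nvidia_gpu(accelerator: str) -> bool:
    """Hypothetical predicate replacing the `== "gpu"` comparison."""
    return accelerator in GPU_ACCELERATORS
```

With this predicate, a spec that sets accelerator to "cuda" would take the GPU branch rather than fall through to the unsupported case.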