
Encoder training: new regimes and specs #506

Merged (10 commits, Nov 28, 2023)

Conversation

@nkemnitz changed the title from "Encoder training" to "WIP: encoder training" on Sep 5, 2023
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from 315a580 to 38418b0 (September 6, 2023 12:49)
codecov bot commented Sep 6, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (8b754d3) 100.00% vs. head (8f0872a) 100.00%.

❗ Current head 8f0872a differs from the pull request's most recent head eb16f5a. Consider uploading reports for commit eb16f5a to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #506   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          128       129    +1     
  Lines         4284      4301   +17     
=========================================
+ Hits          4284      4301   +17     


@nkemnitz force-pushed the nkem/train-encoder branch 2 times, most recently from d124e82 to 541552e (September 6, 2023 19:42)
@nkemnitz force-pushed the nkem/train-encoder branch 6 times, most recently from 42da668 to 01eb94f (September 20, 2023 10:30)
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from ee4af7f to b563936 (October 18, 2023 13:38)
@nkemnitz force-pushed the nkem/train-encoder branch 2 times, most recently from 94aff8f to f03235b (October 27, 2023 16:26)
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from 281dceb to 1d7a3eb (November 8, 2023 10:12)
@nkemnitz marked this pull request as ready for review November 8, 2023 10:13
@nkemnitz force-pushed the nkem/train-encoder branch 3 times, most recently from c1811a0 to cc794e3 (November 8, 2023 11:39)
@nkemnitz changed the title from "WIP: encoder training" to "Encoder training: new regimes and specs" on Nov 8, 2023
@supersergiy (Member) left a comment:

Hopefully the .py specs won't be too intimidating for whoever has to pick this up!
Let's stick with .cue in the future for now...

zetta_utils/__init__.py (outdated; resolved)
@supersergiy (Member) left a comment:

💯

else builder.build(val_dataloader),
full_state_ckpt_path=full_state_ckpt_path,
)
else:
Contributor left a comment:

There are no statements after the `else`, so we can return early and remove this block, saving one level of indentation.
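The suggested early-return refactor can be sketched as follows; `build_local_trainer` and `build_remote_trainer` are illustrative stand-ins, not functions from zetta_utils:

```python
# Hypothetical sketch of the early-return refactor; the builder functions
# and their return values are placeholders, not code from the PR.

def build_local_trainer() -> str:
    return "local trainer"

def build_remote_trainer() -> str:
    return "remote trainer"

# Before: the remote path lived in an `else:` block even though the `if`
# branch already returned, adding an unnecessary indentation level.
def launch_before(local: bool) -> str:
    if local:
        return build_local_trainer()
    else:
        return build_remote_trainer()

# After: since the `if` branch returns, the `else:` wrapper can be dropped
# and the remote path dedented.
def launch_after(local: bool) -> str:
    if local:
        return build_local_trainer()
    return build_remote_trainer()
```

Both versions behave identically; the second just reads flatter.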

Comment on lines 299 to 310
if train_args["trainer"]["accelerator"] == "gpu":
num_devices = int(resource_limits["nvidia.com/gpu"]) # type: ignore
trainer_devices = train_args["trainer"]["devices"]
if (
isinstance(trainer_devices, int)
and trainer_devices != -1
and trainer_devices != num_devices
):
raise ValueError(
f"Trainer specification uses {trainer_devices} devices, "
f"while `nvidia.com/gpu` limit is {num_devices}."
)
Contributor left a comment:

This check applies to `num_nodes=1` as well; the rest looks good.
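The reviewer's point is that the per-node GPU limit should be validated regardless of `num_nodes`. A hedged sketch of the check from the snippet above, pulled into a standalone helper (here `resource_limits` is passed explicitly for illustration; in the PR it is a surrounding variable):

```python
# Sketch of the device-count validation discussed above. The dict keys
# ("trainer", "devices", "nvidia.com/gpu") mirror the quoted snippet;
# wrapping it in a helper is an assumption for illustration.

def validate_gpu_devices(train_args: dict, resource_limits: dict) -> None:
    if train_args["trainer"]["accelerator"] != "gpu":
        return
    num_devices = int(resource_limits["nvidia.com/gpu"])
    trainer_devices = train_args["trainer"]["devices"]
    # `devices=-1` means "use all available GPUs", so only explicit
    # integer counts are compared against the Kubernetes GPU limit.
    if isinstance(trainer_devices, int) and trainer_devices not in (-1, num_devices):
        raise ValueError(
            f"Trainer specification uses {trainer_devices} devices, "
            f"while `nvidia.com/gpu` limit is {num_devices}."
        )
```

Because the comparison is per node, it is valid for any `num_nodes`, including 1.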

@supersergiy merged commit ec335cf into main Nov 28, 2023
11 checks passed
@@ -267,13 +293,31 @@ def _lightning_train_remote(
Creates a volume mount for `train.cue` in `/opt/zetta_utils/specs`.
Runs the command `zetta run specs/train.cue` on one or more worker pods.
"""
if train_args["trainer"]["accelerator"] == "gpu":
Contributor left a comment:

The [example DDP spec] sets `accelerator` to `"cuda"`, resulting in a `NotImplementedError`.
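PyTorch Lightning accepts both `"gpu"` and `"cuda"` as accelerator names, so a guard that compares only against `"gpu"` skips specs that say `"cuda"` and falls through to the unimplemented path. A minimal normalization sketch, assuming a helper that is not part of zetta_utils:

```python
# Hypothetical fix for the accelerator-string mismatch flagged above.
# The set of aliases reflects PyTorch Lightning's accepted names for the
# CUDA accelerator; the helper name is an assumption for illustration.

GPU_ACCELERATOR_ALIASES = {"gpu", "cuda"}

def is_gpu_accelerator(accelerator: str) -> bool:
    """Return True for any accelerator string that selects NVIDIA GPUs."""
    return accelerator.lower() in GPU_ACCELERATOR_ALIASES
```

With this, the check in `_lightning_train_remote` could read `if is_gpu_accelerator(train_args["trainer"]["accelerator"]):` and cover both spellings.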

supersergiy added a commit that referenced this pull request Feb 2, 2024
Encoder training: new regimes and specs
4 participants