Implementing loss scaling scheduler callback and schedulers #270

Merged
laserkelvin merged 38 commits into IntelLabs:main on Aug 9, 2024

Conversation

laserkelvin
Collaborator

This PR adds a new callback, called LossScalingScheduler, and scheduler classes that will modify the relative loss weights over the course of training.

  • Two scheduler types are implemented: LinearScalingSchedule and SigmoidScalingSchedule. The former generates a linear ramp from a start value to an end value over steps or epochs, while the latter ramps up gradually along a sigmoid curve.
  • The LossScalingScheduler callback is configured by passing schedules mapped to task keys; at every training step or epoch (whichever each schedule specifies), it applies the new weighting value to the corresponding task.
  • The example script examples/callbacks/loss_scheduling.py is provided to show how it is configured.

This is useful for implementing curricula, where we prioritize learning of different properties over time.
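
To make the scheduling idea concrete, below is a minimal, self-contained sketch of a linear ramp in the spirit of LinearScalingSchedule. The class name, constructor arguments, and standalone usage are illustrative assumptions, not the actual matsciml API; see examples/callbacks/loss_scheduling.py for the real configuration.

```python
# Illustrative sketch only: a standalone linear ramp analogous to the
# LinearScalingSchedule described above. All names and arguments here are
# assumptions for demonstration, not the actual matsciml API.
import numpy as np


class LinearRampSchedule:
    """Yields loss-scaling values that ramp linearly from start to end."""

    def __init__(self, key: str, start: float, end: float, num_steps: int) -> None:
        self.key = key
        self._values = iter(np.linspace(start, end, num_steps))

    def step(self) -> float:
        # Raises StopIteration once the ramp is exhausted, which is the case
        # the callback guards against in the review excerpt further down.
        return float(next(self._values))


if __name__ == "__main__":
    # Ramp the "energy" task weight from 0.1 to 1.0 over ten scheduler steps.
    schedule = LinearRampSchedule(key="energy", start=0.1, end=1.0, num_steps=10)
    for _ in range(10):
        print(f"{schedule.key} scaling -> {schedule.step():.2f}")
```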

@laserkelvin added the enhancement (New feature or request) and training (Issues related to model training) labels on Aug 8, 2024
@laserkelvin
Collaborator Author

I think the same tests fail as in #266

@melo-gonzo (Collaborator) left a comment

Just one general comment, otherwise looks good!

Comment on lines 1572 to 1585
target_key = schedule.key
self._logger.debug(
    f"Attempting to advance {target_key} schedule on step."
)
try:
    new_scaling_value = schedule.step()
    pl_module.task_loss_scaling[target_key] = new_scaling_value
    self._logger.debug(
        f"Advanced {target_key} to new value: {new_scaling_value}"
    )
except StopIteration:
    self._logger.warning(
        f"{target_key} has run out of scheduled values; this may be unintentional."
    )
melo-gonzo (Collaborator)

Looks like you could combine this common bit between the on_x_end functions, and pass in a 'step' or 'epoch' string variable to use in the if statement and log message.
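
For illustration, a hedged sketch of what that consolidation could look like; the helper name `_advance_schedules`, the `granularity` argument, and the `self.schedules` / `schedule.granularity` attributes are hypothetical, and the actual change landed in d3b97da.

```python
# Hypothetical sketch of the suggested refactor: hoist the shared logic out of
# the on_*_end hooks into one helper and pass "step" or "epoch" through to the
# guard and the log messages. Attribute names mirror the excerpt above;
# everything else is an assumption for illustration.
def _advance_schedules(self, pl_module, granularity: str) -> None:
    for schedule in self.schedules:
        # Only advance schedules that tick at this granularity.
        if schedule.granularity != granularity:
            continue
        target_key = schedule.key
        self._logger.debug(
            f"Attempting to advance {target_key} schedule on {granularity}."
        )
        try:
            new_scaling_value = schedule.step()
            pl_module.task_loss_scaling[target_key] = new_scaling_value
            self._logger.debug(
                f"Advanced {target_key} to new value: {new_scaling_value}"
            )
        except StopIteration:
            self._logger.warning(
                f"{target_key} has run out of scheduled values; "
                "this may be unintentional."
            )
```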

laserkelvin (Collaborator, Author)
Done in d3b97da

@laserkelvin merged commit 3e8aa98 into IntelLabs:main on Aug 9, 2024
2 of 3 checks passed
@laserkelvin deleted the loss-scaling-scheduler branch on August 9, 2024 at 22:08