Let users join federated training mid session #741

JulienVig · 2024-08-06T15:43:50Z

Fixes the federated case of #718

Have the server store the last aggregated model and send it back to participants who join late or have fallen behind
Participants notify the server when disconnecting from a task
Implement a "waiting room" to have at least 2 participants before starting a federated training
Handle the case where a user ends up alone during a training session (all the others either stopped or already finished). This user could pause their training and wait until someone else joins
Handle receiving an invalid server update. Currently, a client waits for a single server update per round. If it receives an old or invalid server update, the client drops it and skips to the next round. However, the update may simply be delayed. Should the client skip to the next round or wait for another potential update?
Display the current status in the trainingboard, especially when waiting for new participants.
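
As an illustration of the first item, here is a minimal sketch of how a server could keep the last aggregated model and hand it to a participant that joins late or has fallen behind. The names (`LatestModelStore`, `save`, `catchUp`) and the flat `number[]` weights are assumptions made for this example, not the actual Disco.js API.

```typescript
// Hypothetical types and names for illustration; not the Disco.js API.
type Weights = number[];

interface GlobalModel {
  round: number;
  weights: Weights;
}

class LatestModelStore {
  private latest: GlobalModel | undefined = undefined;

  // Called by the server after each successful aggregation.
  save(round: number, weights: Weights): void {
    this.latest = { round, weights: weights.slice() };
  }

  // Returns the model a participant should resume from, or undefined when
  // nothing has been aggregated yet or the participant is already up to date.
  catchUp(participantRound: number): GlobalModel | undefined {
    if (this.latest === undefined) return undefined;
    if (participantRound >= this.latest.round) return undefined;
    return { round: this.latest.round, weights: this.latest.weights.slice() };
  }
}
```

In this sketch a participant reports the round of the model it currently holds; it receives the stored model only if that round is older than the server's.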

Refactoring:

Split server/router into routes and controllers as commonly done in Express. routes set up the router API and controllers handle the actual federated and decentralized logic (via initTask and handle)
Renamed some files to more explicit names imo (e.g. client/client.ts, client/federated_client.ts, client/decentralized_client.ts)
Renamed 'feai' and 'deai' occurrences (e.g. server API) into 'federated' and 'decentralized'
Renamed the TasksAndModels object to TaskInitializer and added doc

martinjaggi · 2024-08-10T10:19:08Z

BTW for delayed updates during training, the easiest (and algorithmically valid) way is to define a maximum delay threshold, i.e. a time delta (in number of steps/rounds): if an update arrives that was computed from a model older than this delta, the update is dropped. Otherwise it's included.

the aggregation is simply triggered once the number of present valid update candidates (buffer size) is sufficient (this part is already implemented).

that would work well both in decentralized and federated
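
A minimal sketch of the rule described above, under stated assumptions: updates are flat `number[]` vectors, aggregation is an element-wise mean, and the names `StalenessBuffer`, `maxDelay` and `bufferSize` are hypothetical rather than taken from the Disco.js aggregator.

```typescript
// Hypothetical names for illustration; not the Disco.js aggregator API.
type Update = { modelRound: number; weights: number[] };
type AddResult = "dropped" | "buffered" | "aggregated";

// Element-wise mean of equally sized vectors (assumes at least one vector).
function elementwiseMean(vectors: number[][]): number[] {
  return vectors[0].map(
    (_, i) => vectors.reduce((sum, v) => sum + v[i], 0) / vectors.length
  );
}

class StalenessBuffer {
  private readonly maxDelay: number;
  private readonly bufferSize: number;
  private buffer: Update[] = [];
  round = 0;
  model: number[] = [];

  constructor(maxDelay: number, bufferSize: number) {
    this.maxDelay = maxDelay;
    this.bufferSize = bufferSize;
  }

  add(update: Update): AddResult {
    // Delay is measured in rounds between the model the update was computed
    // on and the current round.
    const delay = this.round - update.modelRound;
    if (delay < 0 || delay > this.maxDelay) return "dropped";
    this.buffer.push(update);
    if (this.buffer.length < this.bufferSize) return "buffered";
    // Enough valid candidates: aggregate, reset the buffer, advance the round.
    this.model = elementwiseMean(this.buffer.map((u) => u.weights));
    this.buffer = [];
    this.round += 1;
    return "aggregated";
  }
}
```

With `maxDelay` set to 0 only updates computed on the current round's model are accepted; a value of 1 or 2 also admits updates that are one or two rounds late.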

JulienVig · 2024-08-12T07:58:12Z

Yes! This should already be implemented, although current defaults always set the time delta to 0. We can probably increase the cutoff to 1 or 2.

JulienVig · 2024-08-29T16:33:11Z

Sorry for the ugly commit history, I had to debug a test case that failed only in GitHub Actions but passed locally.
FYI it was because of network delay letting a client send its contribution before it received the server message to wait for more participants (which should prevent it from sending its contribution). I've solved it by letting the server accept contributions even when below the minimum number of participants but preventing the aggregation until the threshold is reached.
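
To make the fix concrete, here is a rough sketch of the idea, not the PR's actual code. `WaitingRoomServer` and its methods are hypothetical names, and the aggregation condition (enough connected participants, and every connected participant has contributed) is an assumption chosen for this example.

```typescript
// Hypothetical names for illustration; not the Disco.js server API.
class WaitingRoomServer {
  private readonly minNbOfParticipants: number;
  private readonly participants = new Set<string>();
  private readonly pending = new Map<string, number[]>();

  constructor(minNbOfParticipants: number) {
    this.minNbOfParticipants = minNbOfParticipants;
  }

  join(id: string): void {
    this.participants.add(id);
  }

  // Contributions are always stored, even below the participant minimum,
  // so an early contribution sent because of network delay is not lost.
  contribute(id: string, weights: number[]): number[] | undefined {
    this.pending.set(id, weights);
    return this.tryAggregate();
  }

  // Aggregation is deferred until enough participants are connected
  // (assumption: it also waits for all connected participants).
  private tryAggregate(): number[] | undefined {
    if (this.participants.size < this.minNbOfParticipants) return undefined;
    if (this.pending.size < this.participants.size) return undefined;
    const all = Array.from(this.pending.values());
    const mean = all[0].map(
      (_, i) => all.reduce((sum, w) => sum + w[i], 0) / all.length
    );
    this.pending.clear();
    return mean;
  }
}
```

In this sketch, a lone participant's contribution is kept rather than rejected, and it is averaged in as soon as a second participant joins and contributes.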

tharvik

wouhou, great work, that's a very neat fix for an important feature. thanks for tests also, very useful to avoid breaking subtle status changes 🎉

a few beautifulization comments, nothing really important

.github/workflows/lint-test-build.yml

discojs/src/aggregator/mean.ts

discojs/src/client/messages.ts

server/src/task_initializer.ts

server/tests/e2e/federated.spec.ts

martinjaggi · 2024-09-02T20:31:28Z

impressive work, big thanks!

JulienVig force-pushed the 718-join-session-julien branch 2 times, most recently from dce0d20 to 5a9ce14 Compare August 12, 2024 16:05

JulienVig added this to the v4.0.0 milestone Aug 13, 2024

JulienVig self-assigned this Aug 13, 2024

JulienVig added the federated, discojs and decentralized labels and removed the decentralized label Aug 13, 2024

JulienVig added 20 commits August 14, 2024 17:37

Fix merge conflicts

eb5b533

Simplify isValidContribution flow

17ec537

Enable server to send latest global model to stale participants

dc43227

Make client rely on the aggregator's round rather than the trainer's …

7044836

…round

Rm unused server logs

dd162fd

Prevent overflowing text for narrow screens

b25bfc9

Enable multiple tries for clients to receive a server update

6718fc9

Fix node ans browser ws API incompatibility

67ecb81

Make client disconnect gracefully when leaving training session

f87b60e

Add a flag with min participant threshold when client connects to server

3d1791a

Refactor server router following Express conventions

6f450c8

Rename 'feai' and 'deai' routes into 'federated' and 'decentralized'

7d20de2

Split controller abstraction following MVC pattern

f8f07ca

Create one training controller per task rather than one for all tasks

0a2f5b6

Add a space before DISCOllaboratives

10088a8

Make info toaster less aggressive

15c2730

Enable waiting for more participants when below a minimum number of c…

fd554bd

…lients

Rename Base class and files into FederatedClient and DecentralizedClient

6fdf569

Rename Base abstract class into Client

713ab20

Let client and trainer update the training status through the Logger …

e323bc9

…object

JulienVig added 10 commits August 29, 2024 16:13

Fix disco object undefined during cleanup

804ed49

increase beforeEach timeout

a5a40a6

Increase wait time for status to update

3b62c55

Increase wait time for status to update

6a40184

Debug github action test

f06dbae

Show debug statements in github actions

9d902ae

debug github actions test

5864334

Reject contribution when not enough participants

44ce301

Decrease wait time for status update

21314a3

Don't aggregate when below task minNbOfParticipants

97b35c0

JulienVig requested a review from tharvik August 29, 2024 16:30

tharvik approved these changes Sep 2, 2024

JulienVig added 14 commits September 2, 2024 16:06

Rm verbose debug

699645b

Fix attribute increment

b236578

Rm unused import

94bfd7c

Rm useless assertion

a262b34

Clean chai expects

25d9dfb

make emit a private method

f02df76

Expose a task getter

9f926a8

Rely on task initializer set rather than duplicate it in router

fac2708

Export EventEmitter

03c989a

Support async callbacks

4285832

Rm Digest feature

98f1df5

Fix event emitter callbacks

954b6e1

Rm promise type callbacks

6ea8dbf

Prevent duplicate round increment

e6f09e1

JulienVig merged commit 21cfc55 into develop Sep 2, 2024
23 checks passed

JulienVig deleted the 718-join-session-julien branch September 2, 2024 16:28
