Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let users join federated training mid session #741

Merged
merged 98 commits into from
Sep 2, 2024

Conversation

JulienVig
Copy link
Collaborator

@JulienVig JulienVig commented Aug 6, 2024

Fixes the federated case of #718

  • Have server store the last aggregated model and send it back to participants joining late, or that have fallen behind
  • Participants notify server when disconnecting from a task
  • Implement a "waiting room" to have at least 2 participants before starting a federated training
  • Handle the case when a user ends up alone during in a training session (all the others either stopped or already finished). This user could pause their training and wait until someone else joins
  • Handle receiving an invalid server update. Currently, a client waits for a single server update per round. If it receives an old or invalid server updates, the client drops it and skips to the next round. However, the update may simply be delayed. Should the client skip to the next round or wait for another potential update?
  • Display the current status in the trainingboard, especially when waiting for new participants.
Screenshot 2024-08-14 at 17 30 34

Refactoring:

  • Split server/router into routes and controllers as commonly done in Express. routes setup the router API and controllers handle the actual federated and decentralized logic (via initTask and handle)
  • Renamed some files into more explicit names imo (e.g. client/client.ts, client/federated_client.ts, client/decentralized_client.ts)
  • Renamed 'feai' and 'deai' occurrences (e.g. server API) into 'federated' and 'decentralized'
  • Renamed the TasksAndModels object to TaskInitializer and added doc

@martinjaggi
Copy link
Member

BTW for delayed updates during training, the easiest way (and algorithmically valid) is to define a maximum threshold delay, so a time delta (in number of steps/rounds): if an update arrives which was computed based on a model older than this delta, the update is dropped. otherwise it's included.

the aggregation is simply triggered once the number of present valid update candidates (buffer size) is sufficient (this part is already implemented).

that would work well both in decentralized and federated

@JulienVig
Copy link
Collaborator Author

Yes! This should already be implemented, although current defaults always set the time delta to 0. We can probably increase the cutoff to 1 or 2.

@JulienVig JulienVig force-pushed the 718-join-session-julien branch 2 times, most recently from dce0d20 to 5a9ce14 Compare August 12, 2024 16:05
@JulienVig JulienVig added this to the v4.0.0 milestone Aug 13, 2024
@JulienVig JulienVig self-assigned this Aug 13, 2024
@JulienVig JulienVig added federated For the federated setting discojs Related to Disco.js decentralized For the decentralized setting and removed decentralized For the decentralized setting labels Aug 13, 2024
@JulienVig JulienVig requested a review from tharvik August 29, 2024 16:30
@JulienVig
Copy link
Collaborator Author

Sorry for the ugly commit history I had to debug a test case that failed only in the github actions but passed locally.
FYI it was because of network delay letting a client send its contribution before they received the server message to wait for more participants (which should prevent them from sending their contribution). I've solved it by letting the server accept contributions even when below the minimum number of participants but preventing the aggregation until above the threshold.

Copy link
Collaborator

@tharvik tharvik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouhou, great work, that's a very neat fix for an important feature. thanks for tests also, very useful to avoid breaking subtle status changes 🎉

a few beautifulization comments, nothing really important

.github/workflows/lint-test-build.yml Outdated Show resolved Hide resolved
discojs/src/aggregator/mean.ts Outdated Show resolved Hide resolved
discojs/src/client/messages.ts Outdated Show resolved Hide resolved
server/src/task_initializer.ts Outdated Show resolved Hide resolved
server/src/task_initializer.ts Outdated Show resolved Hide resolved
server/tests/e2e/federated.spec.ts Outdated Show resolved Hide resolved
server/tests/e2e/federated.spec.ts Outdated Show resolved Hide resolved
server/tests/e2e/federated.spec.ts Outdated Show resolved Hide resolved
server/tests/e2e/federated.spec.ts Outdated Show resolved Hide resolved
server/tests/e2e/federated.spec.ts Outdated Show resolved Hide resolved
@JulienVig JulienVig merged commit 21cfc55 into develop Sep 2, 2024
23 checks passed
@JulienVig JulienVig deleted the 718-join-session-julien branch September 2, 2024 16:28
@martinjaggi
Copy link
Member

impressive work, big thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discojs Related to Disco.js federated For the federated setting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants