-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support branch parallel for evoformer #14
base: main
Are you sure you want to change the base?
Conversation
Thank you, I will review this in the weekend. |
@@ -49,6 +49,7 @@ def main(args) -> None: | |||
), "Must specify batch size either with --batch-size" | |||
metrics.reset() | |||
|
|||
args.seed += args.dp_rank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using a hybrid distributed parallel strategy, such as DP-BP, the parameters and data in the same BP group need to be the same, so the seeds need to be the same.
@@ -137,6 +139,9 @@ def distributed_init(args): | |||
if torch.cuda.is_available(): | |||
dist.all_reduce(torch.zeros(1).cuda()) | |||
|
|||
scg.init_group(bp_degree=args.bp_degree, dap_degree=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this affect the normal c10d, no_c10d mode?
Can we make "bp" a choice, like currently c10d, no_c10d?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure about this question. This PR is just to show how to use BP, not to merge this PR into UniCore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I may miss some contexts.
|
||
return outer_grad.clone(), msa_grad.clone(), pair_grad.clone() | ||
|
||
def sync_evoformer_results(outer, msa, pair, training): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the functions in this file are better to be in Uni-Fold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem as above. It is necessary to design the code together and merge them into UniFold and UniCore respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support Branch Parallelism in paper Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism