RF cum_concat_step simplify and other RF things #1665
Conversation
What about the other backends? It seems strange to me that only one backend can pad with non-scalar values. If we only implement this for torch, I'd rather adjust the RF cum_concat_step impl to not rely on rf.pad but on a more general operation that works across all backends, no?
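A minimal sketch of such a backend-generic fallback, written in plain PyTorch rather than the RF API (the function name and shapes here are illustrative assumptions, not the actual impl): padding with per-batch values can be expressed as an explicit concat, so it does not depend on the backend's pad op accepting non-scalar values.

```python
# Illustrative sketch only, not the RF implementation: pad the time axis
# with a per-batch/per-feature value by building the pad block explicitly
# and concatenating, instead of relying on a pad op with non-scalar values.
import torch


def pad_right_with_values(x: torch.Tensor, pad_len: int, pad_values: torch.Tensor) -> torch.Tensor:
    """x: [batch, time, feat]; pad_values: [batch, feat], broadcast over the new frames."""
    pad_block = pad_values.unsqueeze(1).expand(-1, pad_len, -1)  # [batch, pad_len, feat]
    return torch.cat([x, pad_block], dim=1)  # [batch, time + pad_len, feat]


x = torch.randn(2, 4, 3)
out = pad_right_with_values(x, pad_len=2, pad_values=torch.zeros(2, 3))
assert out.shape == (2, 6, 3)
```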
There is not really any other backend currently which fully works and is actively used, so I don't think this has such a high priority now.
But is it much work to change the impl to be generic by default? If we ever add more tier 1 backends, that is one less thing to worry about?
I don't exactly understand? Via this PR,
I mean wrt.
But I still don't understand?
Specifically for cross attention, it could happen that max(q_seq_len+k_seq_len-1) != shape.
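A minimal numeric sketch of where that count comes from (variable names are just illustrative): in cross attention the query and key sequences have independent lengths, and the relative offsets k - q cover exactly q_seq_len + k_seq_len - 1 distinct values, which need not match a shape derived from a single dim.

```python
# Illustrative only: distinct relative positions in cross attention.
q_seq_len, k_seq_len = 3, 5
rel_positions = sorted({k - q for q in range(q_seq_len) for k in range(k_seq_len)})
# Offsets range from -(q_seq_len - 1) to (k_seq_len - 1): -2 .. 4 here.
assert len(rel_positions) == q_seq_len + k_seq_len - 1  # 7 distinct values
```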
Compare 77135fa to 7b2882a
(This is just here to at least run the test once before I merge it into main.)
(The other commits accidentally ended up in here, but that cannot hurt. Let's just not squash them together but rebase instead.)