Add skew-symmetric BLAS operations #805
This operation negates a scalar (both real and imaginary parts).
- Add `bli_?negs/bli_?negris` to negate a scalar.
- Add `bli_?setr0s` to zero out only the real part of a complex scalar.
- Add (void) to silence unused variable warnings in several level-0 macros.
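The semantics of the two new scalar operations can be sketched in plain C. The type `zsketch` and the functions `sketch_negs` and `sketch_setr0s` are hypothetical stand-ins for illustration, not the BLIS macros themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for illustration; not the BLIS definition. */
typedef struct { double real; double imag; } zsketch;

/* Negate both the real and imaginary parts (the semantics described for negs). */
static void sketch_negs(zsketch *x)
{
    assert(x != NULL);
    x->real = -x->real;
    x->imag = -x->imag;
}

/* Zero only the real part and leave the imaginary part alone (setr0s). */
static void sketch_setr0s(zsketch *x)
{
    assert(x != NULL);
    x->real = 0.0;
}
```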
Add `mkskewsymm` and `mkskewherm` operations to explicitly skew-symmetrize or skew-hermitize a matrix. For a skew-symmetric matrix, the diagonal is explicitly set to zero, while for a skew-hermitian matrix only the real part of the diagonal is set to zero.
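As a rough illustration of what these two operations do, the sketch below fills the upper triangle of a column-major matrix from its stored lower triangle. The function names, the `zsketch` type, and the choice of lower storage are assumptions made for the example; the actual BLIS operations work on `obj_t` objects and support either triangle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for illustration; not the BLIS definition. */
typedef struct { double real; double imag; } zsketch;

/* Real case: A(j,i) = -A(i,j) for i > j, and the diagonal becomes zero. */
static void sketch_mkskewsymm_lower(size_t n, double *a, size_t lda)
{
    assert(a != NULL && lda >= n);
    for (size_t j = 0; j < n; ++j) {
        a[j + j * lda] = 0.0;
        for (size_t i = j + 1; i < n; ++i)
            a[j + i * lda] = -a[i + j * lda];
    }
}

/* Complex case: A(j,i) = -conj(A(i,j)); only the real part of the
   diagonal is zeroed, since a skew-hermitian diagonal is purely imaginary. */
static void sketch_mkskewherm_lower(size_t n, zsketch *a, size_t lda)
{
    assert(a != NULL && lda >= n);
    for (size_t j = 0; j < n; ++j) {
        a[j + j * lda].real = 0.0;
        for (size_t i = j + 1; i < n; ++i) {
            a[j + i * lda].real = -a[i + j * lda].real;
            a[j + i * lda].imag =  a[i + j * lda].imag;
        }
    }
}
```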
Add `BLIS_SKEW_SYMMETRIC` and `BLIS_SKEW_HERMITIAN` matrix structures along with associated helper functions and macros. Note that this requires increasing the number of bits used to represent a `struc_t` in the `obj_t::info` member. A compile-time check has also been added to guard against accidental bit overflow in the future.
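A compile-time overflow guard of this kind might look like the sketch below. The enum values, the shift, and the field width are invented for illustration and are not the actual BLIS bit layout; the point is only that adding an enumerator beyond the field's capacity fails the build instead of silently corrupting neighbouring bits.

```c
#include <assert.h>   /* assert and, in C11, the static_assert macro */

/* Hypothetical structure enum; values are illustrative only. */
typedef enum {
    SKETCH_GENERAL = 0,
    SKETCH_HERMITIAN,
    SKETCH_SYMMETRIC,
    SKETCH_TRIANGULAR,
    SKETCH_SKEW_SYMMETRIC,
    SKETCH_SKEW_HERMITIAN,
    SKETCH_NUM_STRUC            /* count of structures, not a real value */
} sketch_struc_t;

/* Hypothetical position and width of the structure field in an info word. */
enum { SKETCH_STRUC_SHIFT = 5, SKETCH_STRUC_NBITS = 3 };
#define SKETCH_STRUC_MASK (((1u << SKETCH_STRUC_NBITS) - 1u) << SKETCH_STRUC_SHIFT)

/* Six structures need three bits; two bits would trip this check. */
static_assert(SKETCH_NUM_STRUC <= (1 << SKETCH_STRUC_NBITS),
              "struc_t no longer fits in its bit field");

static unsigned sketch_set_struc(unsigned info, sketch_struc_t s)
{
    return (info & ~SKETCH_STRUC_MASK) | ((unsigned)s << SKETCH_STRUC_SHIFT);
}

static sketch_struc_t sketch_get_struc(unsigned info)
{
    return (sketch_struc_t)((info & SKETCH_STRUC_MASK) >> SKETCH_STRUC_SHIFT);
}
```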
This operation sets only the real part of a matrix diagonal to the given value.
The function signature for dotaxpyf has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations which differ in more than just conjugation of A and A^T.
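To see why one shared `alpha` is not enough, consider a real-valued sketch of a fused dot/axpy kernel. The name `sketch_dotaxpyf` and its argument order are hypothetical and do not reproduce the BLIS kernel signature. Because a skew-symmetric matrix satisfies A^T = -A, the implied (transposed) half of the product needs the opposite sign from the stored half, which a caller would express here as `alpha_dot = -alpha_axpy`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fused kernel on an m x b column-major panel A:
     y := beta * y + alpha_dot  * A^T w   (b dot products)
     z := z        + alpha_axpy * A   x   (b axpy updates)
   Each column of A is read once and used for both sub-problems. */
static void sketch_dotaxpyf(size_t m, size_t b,
                            double alpha_dot, double alpha_axpy,
                            const double *a, size_t lda,
                            const double *w, const double *x,
                            double beta, double *y, double *z)
{
    assert(lda >= m);
    for (size_t j = 0; j < b; ++j) {
        const double *aj = a + j * lda;
        const double  ax = alpha_axpy * x[j];
        double dot = 0.0;
        for (size_t i = 0; i < m; ++i) {
            dot  += aj[i] * w[i];   /* dot sub-problem  */
            z[i] += ax * aj[i];     /* axpy sub-problem */
        }
        y[j] = beta * y[j] + alpha_dot * dot;
    }
}
```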
Add `skmv` (skew-symmetric matrix times vector), `shmv` (skew-hermitian matrix times vector), `skr2` (skew-symmetric rank-2 update), and `shr2` (skew-hermitian rank-2 update) operations. Note that a rank-1 skew-symmetric update is not possible, and a rank-1 skew-hermitian update is not particularly useful.
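For readers unfamiliar with these operations, here is a minimal real-valued reference sketch. It assumes that `skmv` computes y := beta*y + alpha*A*x and that `skr2` computes A := A + alpha*(x*y^T - y*x^T), by analogy with `symv` and `syr2`; the function names and the lower-triangle storage are likewise assumptions. The `skr2` form also shows why a rank-1 update cannot exist: setting y = x makes the update term vanish identically.

```c
#include <assert.h>
#include <stddef.h>

/* y := beta*y + alpha*A*x, with A skew-symmetric and only its strictly
   lower triangle read (column-major). The diagonal is treated as zero. */
static void sketch_skmv_lower(size_t n, double alpha, const double *a,
                              size_t lda, const double *x,
                              double beta, double *y)
{
    assert(lda >= n);
    for (size_t i = 0; i < n; ++i)
        y[i] *= beta;
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j + 1; i < n; ++i) {
            const double aij = a[i + j * lda];
            y[i] += alpha * aij * x[j];   /* stored element A(i,j)         */
            y[j] -= alpha * aij * x[i];   /* implied element A(j,i) = -aij */
        }
    }
}

/* A := A + alpha*(x*y^T - y*x^T), updating only the strictly lower triangle. */
static void sketch_skr2_lower(size_t n, double alpha, const double *x,
                              const double *y, double *a, size_t lda)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = j + 1; i < n; ++i)
            a[i + j * lda] += alpha * (x[i] * y[j] - y[i] * x[j]);
}
```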
The reference packing kernels have been updated to support skew-symmetric and skew-hermitian matrix structures. No updates are needed to the dense reference packing kernel (`bli_?packm_ckx_<arch>_ref`) or to any optimized packing kernels, since `bli_?packm_struc_cxk` handles the negation of the unstored region by modifying `kappa`.
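The idea of folding the sign into `kappa` can be illustrated with a simplified, real-valued sketch that expands a lower-stored skew-symmetric matrix into a dense buffer. The function `sketch_pack_skew_lower` is hypothetical and far simpler than the real packing code, which works on micro-panels; it only shows that the unstored region can reuse an ordinary scaled copy with `-kappa` in place of `kappa`.

```c
#include <assert.h>
#include <stddef.h>

/* Expand an n x n skew-symmetric matrix, stored in its strictly lower
   triangle (column-major), into a dense n x n buffer P scaled by kappa. */
static void sketch_pack_skew_lower(size_t n, double kappa,
                                   const double *a, size_t lda,
                                   double *p, size_t ldp)
{
    assert(lda >= n && ldp >= n);
    const double kappa_unstored = -kappa;  /* sign folded into the scalar */
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            if (i > j)        /* stored region: ordinary scaled copy */
                p[i + j * ldp] = kappa * a[i + j * lda];
            else if (i < j)   /* unstored region: read the transposed element */
                p[i + j * ldp] = kappa_unstored * a[j + i * lda];
            else              /* diagonal of a real skew-symmetric matrix */
                p[i + j * ldp] = 0.0;
        }
    }
}
```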
Add `skmm` (skew-symmetric matrix times dense matrix), `shmm` (skew-hermitian matrix times dense matrix), `skr2k` (skew-symmetric rank-2k update), and `shr2k` (skew-hermitian rank-2k update) operations. Note that a rank-k skew-symmetric update is not possible, and a rank-k skew-hermitian update is not particularly useful.
[ci skip]
@myeh01 @nick-knight @Aaron-Hutchinson can the SiFive team please review commit b986782? I had to delve into the RISC-V assembly there and I'm only ~80% sure I did it right.
@fgvanzee again the Travis CI build failed to trigger...
I don't remember if there was anything we could do to fix it on our end. Might be that we just have to wait and then make a dummy commit to try to trigger again?
I don't remember either, but it is annoying.
There we go.
@devinamatthews Confirming I got your message. It looks like the register allocation in |
Travis CI failed for x280 so I guess I did do something wrong. |
Running the testsuite, it looks like |
After looking at the objdump, it looks like the compiler is using |
@myeh01 A quicker fix might be to add clobbers. We should be using these anyway whenever we use inline asm with explicit register allocation, whether X-, F-, or V-registers. Going the other direction, I think we might be better off using generic C for all the scalar stuff, and only using inline asm for the vector stuff (when intrinsics don't suffice). I don't think we'll lose much in performance, and it would make the code much more maintainable and retargetable. |
@nick-knight I originally tried just adding the output register to the clobber list of any floating-point load, e.g.
But the compiler still uses
Edit: Yeah, I think replacing the scalar stuff with generic C would be more robust. I started doing this for some parts of
Edit 2: After looking through some of the code again, I think I would also like to replace some inline asm code with intrinsics.
Correct. If you don't want the compiler to overwrite
Correct.
I agree. Our code still has lurking risks related to our explicit allocation of V-registers: we are trusting the compiler not to generate any vector code between each pair of |
@devinamatthews How would you like to proceed? There are a few short-term solutions we discussed above. Longer-term, I'd like to rewrite the inline assembly files to be more robust (probably using intrinsics where it won't significantly impact performance). Please let me know what I can do to help.
One perspective is that this ukernel interface change exposed a bug in SiFive's implementation of the legacy ukernel interface. To proceed, the defective ukernel implementation could be deleted in this PR (reverting to a generic implementation) and then reintroduced, upgraded and corrected, in a subsequent PR. This might be the cleanest way forward.
I'm not in a huge rush to merge. If it takes, say, a month or less to fix it properly, then I can wait. Otherwise, yes, we could revert to generic and fix later. This wouldn't require deleting anything, just commenting out the kernel registration.
Mind if we sync up in a week or two? I'll start working on it this week and hopefully by then I'll have a sense of how much more time it would take.
Sure.
@devinamatthews I'm steadily working through cleaning up all the kernels, but I don't think I'll be able to finish it in the next two weeks. I'm also trying to balance this work with other projects I need to work on, so it may take a few more weeks. It may be best to follow Nick's suggestion and temporarily disable the |
I don't propose disabling the whole configuration, just removing the one ukernel that's causing issues. IIUC, this will cause BLIS to default to a generic/reference implementation.
@myeh01 We've decided not to include this PR in the next release so there's not much time pressure.
@devinamatthews I opened a PR (#822) converting all our inline assembly to intrinsics, not sure if you've seen it yet.
This PR adds a number of level-2 and level-3 skew-symmetric (and skew-hermitian) BLAS operations, defining the essential operations of a "Skew-BLAS" interface. These operations have been added as full 1st-class citizens of the BLIS API complete with testsuite and mixed-precision/mixed-domain support (level-3 only).