-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TensorOps kernels refactoring #3346
base: develop
Are you sure you want to change the base?
Conversation
const void* alpha0_, | ||
const void* alpha1_, | ||
const void* beta_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check this conversation
https://github.com/ROCm/MIOpen/pull/3346/files#r1824480257
Probably alpha0/1
must not be a part of the PD, ideally beta
as well, but right now it has to be there..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a bool marking if alpha0/... has a "default" value meaning no additional work required suffice?
size_t Aoffset; | ||
size_t Boffset; | ||
size_t Coffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle this internally? IIRC it should be possible to externally pass any subtensor via changing pointer+descriptor. If so this is a duplicated functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the main point is the pointer is void *
and actual type is an miopen_Type_t
enum. That's why you can't just add them without special helpers.
const void* alpha0_, | ||
const void* alpha1_, | ||
const void* beta_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a bool marking if alpha0/... has a "default" value meaning no additional work required suffice?
…arate some of them into unique solvers and tidy the code
Please provide a comparison of the average only CPU time (new solver vs old api) measurements for 100 calls with same problem and the costs associated with the first call of the unique problem configuration. |
Here is a comparison of average host time between old and new structure
New structure is faster on average for 20ms for first runs and it is slower for 0.1ms for other 100 calls or 0.001ms per call |
The results are very strange; we need to obtain the experiment protocol. How was the program executed, and what was used for measurement? |
…tentially boost performance on host side
I did some profiling to compare old and new structure and saw that creation of network_config is slower than before, this is more visible for bigger dimension tensors and it is consequence of the new format of network_config. Network_config creation for 5d tensors is around 4 times slower than in the old structure and around 3 times slower compared to 1d tensor network_config in the new structure. Because of all of that I switched to using string and got speed up of around 2.2 times compared to using stream. After that I run 500 iterations of old and new structure for all tensor kernels and got the result that the new version is faster for 0.0005ms on average, which is around 20% faster than old structure. |
I am not seeing a lot of testing coverage for OpTensor before the changes. Would it be possible to add new tests to the gtest suite to ensure correctness for the new solvers being added? |
src/CMakeLists.txt
Outdated
solver/tensorOp/Op2dTensorLite.cpp | ||
solver/tensorOp/Op2dTensorSquash.cpp | ||
solver/tensorOp/Op3dTensorGeneric.cpp | ||
solver/tensorOp/OpTensorFwdBias.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you alphabetize these?
if(is4dLite) | ||
{ | ||
// for naive tensor ops | ||
const std::string data_type = GetDataType(bTensorDesc.GetType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Effectively unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why either compilers or clang-tidy does not complain about it. Usually, they do..
|
||
size_t TENS_LEN = cTensorDesc.GetElementSize(); | ||
size_t RD_BLCK = (TENS_LEN % 4 == 0) ? 4 : (TENS_LEN % 2 == 0) ? 2 : 1; | ||
const std::string READ_TYPE = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused--is something missing?
Current test for tensorOp is covering all solvers except for Op2dTensorSquash but I did some changes and tested it locally and it worked fine. There is a plan to switch this test to gtest and then those improvements of testing tensorOps will be implemented. As a part of this PR I will add some unit tests for Problem Descriptor, so please do not merge this yet |
This is draft PR for refactoring tensor ops kernels to solver structure, so far only Op1dTensorGeneric kernel is switched