[New Operator] - logcumsumexp operator development #1006

PetrelYy · 2024-04-19T06:41:50Z

The development plan can follow these milestones:

Design document, xx.xx~xx.xx
Development and self-testing, xx.xx~xx.xx
Open PR/MR, xx.xx~xx.xx
Review (3 approvals), xx.xx~xx.xx
Merged by maintainer

PetrelYy · 2024-05-15T03:48:19Z

@shouhoo please post a progress update.

shouhoo · 2024-05-16T01:37:41Z

The PR has been submitted and is awaiting review. Please reply if anything else needs to be submitted.

shouhoo · 2024-05-20T03:30:51Z

#996

shouhoo · 2024-05-22T05:59:10Z

The PR has been resubmitted:
#1027

shouhoo · 2024-06-02T03:11:06Z

The operator parameter registration has been submitted; please approve:
Cambricon/mlu-ops-proto#96

PetrelYy · 2024-06-12T08:31:51Z

https://github.com/Cambricon/mlu-ops/pull/1027/files contains both documentation and code. Has testing been completed, and is it ready for review?

PetrelYy · 2024-06-12T08:52:12Z

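The proto has been reviewed.
After talking with @shouhoo: the documentation still needs changes. @shouhoo, please post the expected completion date for the documentation and the current progress.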

shouhoo · 2024-06-14T04:13:13Z

I have read the comments and will address them all next week, then post an update.

shouhoo · 2024-06-21T04:41:42Z

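Code changes:
1. The parameter order of the mluOpLogcumsumexp interface has been adjusted;
2. min, max, etc. now use the inline functions in common.h;
3. The header ordering in the test program logcumsumexp.cpp has been changed;
4. The test program now uses GTEST_CHECK instead of assert;
5. Negative dim inputs are now supported;
6. The PARAM_CHECK entries have been completed.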
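
For item 5, a minimal sketch of how a negative dim is commonly mapped to a non-negative axis index. It assumes the PyTorch-style convention (dim = -1 means the last axis); the helper name normalize_dim is hypothetical and is not taken from the PR.

```python
def normalize_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative axis index into the range [0, ndim).

    Assumes the PyTorch-style convention: -1 is the last axis.
    """
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} is out of range for a {ndim}-D tensor")
    return dim + ndim if dim < 0 else dim


# For a 4-D input such as [2, 135, 45, 256], dim=-2 and dim=2 name the same axis.
print(normalize_dim(-2, 4), normalize_dim(2, 4))
```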

shouhoo · 2024-06-28T02:45:41Z

The changes are complete; please review again.

shouhoo · 2024-07-19T01:18:48Z

All comments have been addressed; the test report is being prepared.

shouhoo · 2024-07-25T19:08:05Z

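Logcumsumexp Test Report

Test environment:

Tesla V100:

GPU name: Tesla V100-SXM2-16GB

CUDA version: 12.1

PyTorch version: 2.2.1

MLU 372:

MLU name: MLU370-X4[mtp_372.42]

mluop version: 1.1.1

Driver version: 5.10.10

Test data:

Typical sizes:

A hardware time below 15x the GPU time is rated acceptable; below 10x is rated good.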

FP32:

Input	V100 time (µs)	MLU time (µs)	Rating
[2, 135, 45, 256], dim=2	52.22	275	Good
[21, 41, 44], dim=0	21.5	9	Good
[10, 60, 8, 43], dim=1	43.01	168	Good
[648, 50], dim=1	19.46	117	Good
[1160, 28], dim=1	18.43	112	Good
[15200, 15], dim=1	368.64	162	Good
[16, 166], dim=1	19.46	82	Good
[9388608], dim=0	229.38	523	Good
[4194304], dim=0	119.81	272	Good
[1048576], dim=0	48.13	93	Good
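
As an illustration, the rating rule stated before the FP32 table can be applied to a few of its rows as follows. This is a sketch only: the function name rate and the label "not acceptable" for ratios of 15x or more are assumptions, since the report does not name that case.

```python
def rate(gpu_us: float, mlu_us: float) -> str:
    """Rate MLU hardware time against GPU time using the report's thresholds."""
    ratio = mlu_us / gpu_us
    if ratio < 10:
        return "good"
    if ratio < 15:
        return "acceptable"
    return "not acceptable"  # assumed label; the report does not name this case


# (input, V100 time in µs, MLU time in µs), copied from the FP32 table
rows = [
    ("[2, 135, 45, 256], dim=2", 52.22, 275),
    ("[648, 50], dim=1", 19.46, 117),
    ("[15200, 15], dim=1", 368.64, 162),
]
for name, gpu_us, mlu_us in rows:
    print(f"{name}: {mlu_us / gpu_us:.2f}x -> {rate(gpu_us, mlu_us)}")
```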

FP16:

Input	V100 time (µs)	MLU time (µs)	Rating
[2, 135, 45, 256], dim=2	47.1	235	Good
[21, 41, 44], dim=0	21.5	9	Good
[10, 60, 8, 43], dim=1	41.98	81	Good
[648, 50], dim=1	19.46	80	Good
[1160, 28], dim=1	18.43	78	Good
[15200, 15], dim=1	377.86	137	Good
[16, 166], dim=1	19.46	45	Good

Other sizes (correctness check):

FP32:

Input	Result
[1], dim=0	Pass
[147457], dim=0	Pass
[200000000], dim=0	Pass

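Analysis of test results:

Precision limitations

Because of FP16's limited precision, the three typical sizes "[9388608], dim=0", "[4194304], dim=0" and "[1048576], dim=0" cannot meet the accuracy requirement. Using the high-precision modes of __mluop_exp() and __mluop_log() did not solve this; the error comes mainly from FP16 having too few significant digits to withstand a large number of accumulations.

In dimOneKernel, each NRAM accumulates at most 36,864 half-precision floats per pass, treated as a 128×288 matrix during computation, so the longest chain of consecutive accumulations is 288, which is not many. We therefore believe the precision problem is hard to address by improving the algorithm.

In testing, the largest size that still meets the accuracy requirement is about [55000].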
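
To illustrate why a limited FP16 significand hurts accumulation, the sketch below emulates half precision in software (via the `struct` module's `"e"` format) and runs a plain sequential logcumsumexp over an all-zero input, where the exact last element is log(n). This is not the dimOneKernel algorithm, and the input and n = 5000 are arbitrary choices for illustration; it only shows the failure mode in its simplest form: once the running value is large enough, each new term is smaller than half an FP16 ulp and is rounded away.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def logcumsumexp_last(xs, round_fn=lambda v: v):
    """Last element of a sequential logcumsumexp, rounding after every step."""
    acc = round_fn(xs[0])
    for x in xs[1:]:
        x = round_fn(x)
        hi, lo = max(acc, x), min(acc, x)
        # log(exp(hi) + exp(lo)) = hi + log1p(exp(lo - hi))
        delta = round_fn(math.log1p(round_fn(math.exp(round_fn(lo - hi)))))
        acc = round_fn(hi + delta)
    return acc


n = 5000
xs = [0.0] * n
exact = math.log(n)                         # exact last element for all-zero input
full = logcumsumexp_last(xs)                # double precision throughout
half = logcumsumexp_last(xs, to_fp16)       # every intermediate rounded to FP16
print(f"exact={exact:.4f} double={full:.4f} fp16={half:.4f}")
```

In the emulated FP16 run the running value stops growing well before the end of the input, while the double-precision run tracks log(n) closely.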

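Performance analysis

Most of the typical test cases are small and cannot make full use of the bandwidth and compute, so the analysis here focuses on the FP32 "[9388608], dim=0" case. The data are as follows (ComputeForce: 1.024e+12 op/s, IoBandWidth: 307.2 GB/s):

Size	Hardware time	Theory_Ops	Theory_IOs	Compute efficiency	IO efficiency
[9388608], dim=0	523 µs	56331648 ops	75108864 bytes	10.5%	46.8%

The operator is therefore memory bound. As the size grows further, the IO efficiency can reach about 50%. The main factor limiting IO efficiency is synchronization between cores and between clusters; because the data dependencies across cores and clusters cannot be avoided, the efficiency is hard to improve further.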
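
The two efficiency figures can be recomputed from the reported numbers as achieved rate divided by peak rate. The sketch below assumes 1 GB = 1e9 bytes for IoBandWidth; the variable names are illustrative.

```python
n = 9_388_608                 # elements in the "[9388608], dim=0" case
hardware_time_s = 523e-6      # reported hardware time
theory_ops = 56_331_648       # reported Theory_Ops
theory_io_bytes = 75_108_864  # reported Theory_IOs
compute_force = 1.024e12      # op/s
io_bandwidth = 307.2e9        # bytes/s, assuming 1 GB = 1e9 bytes

# The reported totals correspond to 6 ops and 8 bytes per element
# (8 bytes is consistent with one FP32 read plus one FP32 write).
assert theory_ops == 6 * n
assert theory_io_bytes == 8 * n

compute_eff = theory_ops / hardware_time_s / compute_force
io_eff = theory_io_bytes / hardware_time_s / io_bandwidth
print(f"compute efficiency: {compute_eff:.3f}, IO efficiency: {io_eff:.3f}")
```

Under these assumptions the result agrees with the reported 10.5% and 46.8% to within rounding, and the IO efficiency being several times the compute efficiency is what marks the operator as memory bound.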

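Other algorithms

Because the algorithm involves a large amount of row-by-row accumulation, consecutive instructions carry heavy data dependencies and cannot form an instruction-level pipeline. We therefore tried the Blelloch algorithm as a replacement for the row-by-row accumulation. Hardware times for some test cases after switching to Blelloch:

Test case	Row-by-row accumulation (µs)	Blelloch (µs)
[648, 50], dim=1	117	169
[1160, 28], dim=1	112	160
[15200, 15], dim=1	162	158
[16, 166], dim=1	82	213

Only in the third test case, where axis_size is small, does Blelloch reach an efficiency close to that of row-by-row accumulation; in the other cases it is clearly worse. Although Blelloch avoids part of the data dependencies, its computation cost is clearly higher, so it was abandoned.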
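
For reference, a minimal host-side sketch of the trade-off described above: a textbook Blelloch (up-sweep/down-sweep) scan using logaddexp as the combine operator, compared with a sequential scan, with the number of combine operations counted. It is restricted to power-of-two lengths for brevity and is not the variant that was actually tried on the MLU.

```python
import math


def logaddexp(a: float, b: float) -> float:
    """Stable log(exp(a) + exp(b)); -inf acts as the identity element."""
    hi, lo = (a, b) if a >= b else (b, a)
    if hi == -math.inf:
        return -math.inf
    return hi + math.log1p(math.exp(lo - hi))


class CountingOp:
    """Wrap a binary operator and count how often it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.fn(a, b)


def sequential_scan(xs, op):
    """Inclusive scan where each step depends on the previous result."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(out[-1], x))
    return out


def blelloch_scan(xs, op, identity):
    """Inclusive scan built from Blelloch's exclusive scan (power-of-two n)."""
    n = len(xs)
    assert n > 0 and n & (n - 1) == 0, "this sketch needs a power-of-two length"
    a = list(xs)
    d = 1
    while d < n:  # up-sweep: build partial reductions in place
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d *= 2
    a[-1] = identity
    d = n // 2
    while d >= 1:  # down-sweep: push prefixes back down the tree
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] = op(a[i], left)
        d //= 2
    # a now holds the exclusive scan; combine with the input to make it inclusive.
    return [op(prefix, x) for prefix, x in zip(a, xs)]


xs = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
seq_op, ble_op = CountingOp(logaddexp), CountingOp(logaddexp)
seq = sequential_scan(xs, seq_op)
ble = blelloch_scan(xs, ble_op, -math.inf)
print("results match:", all(math.isclose(s, b, abs_tol=1e-12) for s, b in zip(seq, ble)))
print("combine ops: sequential =", seq_op.calls, " blelloch =", ble_op.calls)
```

In this variant the sequential scan needs n - 1 combine operations and the Blelloch scan needs 3n - 2 (7 versus 22 for n = 8). The operations inside each sweep level are independent of each other, which is what removes part of the data dependencies at the price of more total work.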

shouhoo self-assigned this Apr 22, 2024

PetrelYy added the New Op (Contribute a new operator) and ICT labels Apr 23, 2024
