Transform: support sdpa to flash attention kernel conversion #131

Draft
wants to merge 1 commit into main

Conversation

yifeizh2 (Contributor) commented Jun 13, 2024

Tracking issue #147.

TODO:

  • Check correctness
  • Align performance
  • Allow tuning for default config
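
For context, here is a minimal NumPy sketch of what the converted kernel computes: blockwise SDPA with an online softmax in the FlashAttention style. The block size, shapes, and function names are illustrative only and do not reflect this pass's actual tiling, fusion, or brgemm configuration.

```python
import numpy as np

def sdpa_reference(q, k, v):
    # plain scaled dot-product attention: softmax(Q @ K^T / sqrt(d)) @ V
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def flash_attention(q, k, v, block=32):
    # blockwise SDPA with an online softmax: K/V are visited tile by tile,
    # so the full seq_len x seq_len score matrix is never materialized
    seq, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((seq, d), dtype=np.float32)
    m = np.full((seq, 1), -np.inf)     # running row-wise maxima
    l = np.zeros((seq, 1))             # running softmax denominators
    for j in range(0, seq, block):
        kj, vj = k[j:j + block], v[j:j + block]
        s = (q @ kj.T) * scale                          # first matmul of the kernel
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)                           # the elementwise exp
        corr = np.exp(m - m_new)                        # rescale previous partial sums
        l = l * corr + p.sum(axis=-1, keepdims=True)
        out = out * corr + p @ vj                       # second matmul of the kernel
        m = m_new
    return out / l

q, k, v = (np.random.rand(384, 64).astype(np.float32) for _ in range(3))
assert np.allclose(flash_attention(q, k, v), sdpa_reference(q, k, v), atol=1e-4)
```

The sketch only shows the algorithmic structure (two matmuls plus an exp per K/V tile); the transform itself works at the linalg/scf level.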

@yifeizh2 yifeizh2 changed the base branch from main to zhicong/deep_tile_matmul June 13, 2024 07:52
@zhczhong zhczhong force-pushed the zhicong/deep_tile_matmul branch 3 times, most recently from 206fead to 65dfab8 Compare June 14, 2024 03:10
@zhczhong zhczhong force-pushed the zhicong/deep_tile_matmul branch 3 times, most recently from 79b277d to d69856f Compare June 27, 2024 01:35
yifeizh2 (Contributor, Author) commented Jul 16, 2024

As for performance evaluation, there are two issues:

  • Deep-tiled matmul does not take effect when the parent op of the matmul is scf::forall, so we need to either support this scenario or invoke the deep-tiled matmul directly in the flash attention kernel.
  • There is a roughly 5x performance gap:

| SEQ LENGTH / DTYPE | graph compiler v1 (ms) | v2, block 32, brgemm invoked (ms) | Ratio |
| --- | --- | --- | --- |
| 384 / fp32 | 8.28744 | 48.996 | 5.912079 |
| 768 / fp32 | 39.1193 | 190.557 | 4.871176 |
| 1536 / fp32 | 177.446 | 752.762 | 4.242203 |
| 2304 / fp32 | 389.382 | 1837.45 | 4.718888 |
| 3072 / fp32 | 682.228 | 3273.3813 | 4.798075 |

Next steps for performance alignment are:

  • Compare the precise brgemm config used in both cases (v1 vs. MLIR)
  • Perform a more detailed performance breakdown

ZhennanQin (Contributor) commented:

> As for performance evaluation, there are two issues […]

Please try brgemm instead of matmul, which can provide better performance results.
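
(For readers outside the project: brgemm here refers to batch-reduce GEMM, a microkernel call that accumulates a whole batch of small A/B tiles into one C tile instead of issuing a separate matmul per block. The NumPy sketch below shows only those semantics under that assumption; it is not the actual microkernel or its blocking configuration.)

```python
import numpy as np

def brgemm(a_blocks, b_blocks, c):
    # batch-reduce GEMM semantics: C += sum_i A_i @ B_i, i.e. one call
    # reduces a whole batch of small tiles into a single C tile
    for a, b in zip(a_blocks, b_blocks):
        c += a @ b
    return c

# e.g. reducing a K dimension of 64 in two blocks of 32 into one 32x32 C tile
a_blocks = [np.random.rand(32, 32).astype(np.float32) for _ in range(2)]
b_blocks = [np.random.rand(32, 32).astype(np.float32) for _ in range(2)]
c = brgemm(a_blocks, b_blocks, np.zeros((32, 32), dtype=np.float32))
```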

yifeizh2 (Contributor, Author) commented:

> Please try brgemm instead of matmul, which can provide better performance results.

I dumped the final LLVM IR and verified that the current numbers were collected with brgemm invoked. Previously, when brgemm was not in effect, the performance was 10x worse. I think I need to do a more detailed analysis to find exactly where the performance gap is.

@zhczhong zhczhong force-pushed the zhicong/deep_tile_matmul branch 12 times, most recently from f959a73 to ed5180d Compare July 27, 2024 02:09
@zhczhong zhczhong force-pushed the zhicong/deep_tile_matmul branch 3 times, most recently from b3bf8dc to 23dfa97 Compare August 1, 2024 01:27
@yifeizh2 yifeizh2 changed the base branch from zhicong/deep_tile_matmul to main August 2, 2024 08:27
yifeizh2 (Contributor, Author) commented Aug 2, 2024

Latest performance:

| SEQ LENGTH / DTYPE | graph compiler v1 (ms) | v2, block 64, brgemm invoked (ms) | Ratio |
| --- | --- | --- | --- |
| 384 / fp32 | 8.28744 | 22.482 | 2.71 |
| 768 / fp32 | 39.1193 | 93.392 | 2.387 |
| 1536 / fp32 | 177.446 | 377.7458 | 2.128 |
| 2304 / fp32 | 389.382 | 810.249 | 2.080 |
| 3072 / fp32 | 682.228 | 1514.491 | 2.220 |

yifeizh2 (Contributor, Author) commented Aug 2, 2024

The currently observed gaps from v1 are the following:

  • No vectorization
  • No fast transpose
  • No post-op fusion
  • linalg.exp takes about 1/3 of total execution time and needs to be lowered to an optimized version (see the sketch below)
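
As a rough illustration of what an optimized elementwise exponential usually looks like (split the exponent into an integer part applied exactly through the float exponent field and a fractional part approximated by a short polynomial), here is a NumPy sketch. The coefficients are simply the truncated Taylor series of 2**f; a real lowering would use minimax coefficients and more terms, so treat this as the shape of the solution, not this pass's implementation.

```python
import numpy as np

def fast_exp(x):
    # Approximate elementwise exp for fp32 via exp(x) = 2**(x * log2(e)):
    # the integer part of the exponent is applied exactly with ldexp,
    # the fractional part is approximated by a short polynomial.
    y = x.astype(np.float32) * np.float32(1.4426950408889634)  # x * log2(e)
    n = np.floor(y)
    f = y - n                                                   # f in [0, 1)
    # truncated Taylor series of 2**f = exp(f * ln 2)
    p = 1.0 + f * (0.6931472 + f * (0.2402265 + f * 0.0555041))
    return np.ldexp(p, n.astype(np.int32)).astype(np.float32)

x = np.linspace(-10.0, 10.0, 1001, dtype=np.float32)
rel_err = np.max(np.abs(fast_exp(x) - np.exp(x)) / np.exp(x))
print(rel_err)  # roughly 1e-3 to 1e-2 with this low-degree polynomial
```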

Successfully merging this pull request may close these issues:

  • [Experimental] Scaled Dot Product Attention FlashAttention Algorithm Conversion