Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add loop unrolling for matmul #11

Merged
merged 3 commits into from
Jul 16, 2024

Conversation

junjihashimoto
Copy link
Collaborator

@junjihashimoto junjihashimoto commented Jul 15, 2024

WGSL of dawn/tint does not have the optimization of loop unrolling.
So this PR implements loop unrolling for matmul to check the performance of it.
In my environment(m2 pro), the result is as follows:

Matmul version GFLOPS
1D tiling without unrolling 779.57 GFLOPS
2D tiling without unrolling 475.57 GFLOPS
1D tiling with unrolling 862.77 GFLOPS
2D tiling with unrolling 1288.09 GFLOPS

@austinvhuang
Copy link
Contributor

Wow if that holds up the performance jump is a lot. Side note - any idea why your m2 pro underperforming my m1 macbook pro for 1D tiling without unrolling (I get ~ 1.3 tflops)?

On one hand, if these performance improvements hold up it's probably worth it. On the other hand:

  • For user-facing example code this is a bit hairy and i'm not sure we'd want to suggest regex optimization passes as a common practice everyone should be doing. We might hide unroll() in experimental/wgsl.h and if it's worth this rabbit hole maybe explore going beyond regex for code transformations if this is generally useful and eventually use that method consistently for the simple template substitutions like {{precision}} as well.
  • I would add a comment explaining the regex a bit (chatgpt/claude could potentially help). How fragile is it - eg if there's an extra space here and there or other minor code variations?
  • How large does K need to be before this breaks?

My suggestion for now is:

  • move the code transformation functions to experimental/wgsl.h
  • check failure conditions with a focus on accidental firings that lead valid code to transform to invalid code, as this would be more annoying than missing an optimization opportunity
  • verify the breaking point of K. I would have a conservative check (that should be reasonably safe across different platforms) and skip the optimization (possibly with a warning) if the value is over it.

@junjihashimoto
Copy link
Collaborator Author

junjihashimoto commented Jul 15, 2024

Thank you for your reviewing!

any idea why your m2 pro underperforming my m1 macbook pro for 1D tiling without unrolling (I get ~ 1.3 tflops)?

I also got 1.3 tflops, but it was due to rounding the timer precision to seconds.

How large does K need to be before this breaks?

It should be a trade-off between instruction cache misses and pipeline stall.
When the pipeline is 5 stages, then there may be a loss of about 5 cycles if there is a loop.
(The number of Classic RISC stages is 5.)
When we are expecting an improvement of about 10%, the unroll limit should be around 32, because 5 / 32 == 15 %.
I think K is greater than 32.

check failure conditions with a focus on accidental firings that lead valid code to transform to invalid code, as this would
be more annoying than missing an optimization opportunity

Since the AST is not checked, it is difficult to check failure conditions.
Comments and tokens are ignored in regular expressions. They would cause failure.
Maybe we could add some tests that run both the rolled version and the unrolled version and compare the results.
We might want to use a template engine like smatry.

Comment on lines 10 to 46
// Loop-unrolling optimization with regex
//
// Note: Be cautious, as it does not correctly recognize comments or lexical tokens.
std::string loopUnrolling(const std::string& code) {
// This regex pattern matches a for loop with the following structure:
// for (var <varName>: u32 = <start>; <varName> < <end>; <varName>++) { <loopBody> }
std::regex forLoopPattern(R"(for\s*\(\s*var\s+(\w+):\s*u32\s*=\s*(\d+)\s*;\s*\1\s*<\s*(\d+)\s*;\s*\1\+\+\s*\)\s*\{\s*([^{}]*)\})");
// Explanation of the regex:
// for\s*\( : Matches 'for (' with optional whitespace
// \s*var\s+ : Matches 'var ' with optional whitespace
// (\w+) : Captures the variable name (alphanumeric characters and underscores)
// :\s*u32\s*=\s* : Matches ': u32 = ' with optional whitespace
// (\d+) : Captures the start value (one or more digits)
// \s*;\s* : Matches ';' with optional whitespace
// \1\s*<\s* : Matches the captured variable name followed by '<' with optional whitespace
// (\d+) : Captures the end value (one or more digits)
// \s*;\s* : Matches ';' with optional whitespace
// \1\+\+\s* : Matches the captured variable name followed by '++' with optional whitespace
// \)\s*\{\s* : Matches ')' followed by '{' with optional whitespace
// ([^{}]*) : Captures the loop body (anything except '{' or '}')
// \} : Matches the closing '}'

// Example:
//
// Input code:
// for (var i: u32 = 0; i < 3; i++) { std::cout << i << std::endl; }
//
// Matches:
// varName = "i"
// start = "0"
// end = "3"
// loopBody = "std::cout << i << std::endl;"
//
// Unrolled:
// std::cout << 0 << std::endl;
// std::cout << 1 << std::endl;
// std::cout << 2 << std::endl;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments for the regex.

@austinvhuang
Copy link
Contributor

Thanks for the updates. Will go ahead and merge this.

I might further move some things around after merging to keep examples/ code pedagogically focused while build up more experimental sota matmul implementations in experimental/ (eventually transferring to examples when they're sufficiently mature).

Great idea, clearly this makes a big difference. If you have other ideas feel free to make a PR (BTW one of my TODOs after the release tasks are taken care of is to implement some autotuning functionality).

@austinvhuang austinvhuang merged commit af6e1e0 into AnswerDotAI:main Jul 16, 2024
1 check passed
@junjihashimoto junjihashimoto deleted the feature/unrolling branch July 16, 2024 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants