
Use graph optimizer for gpu tensor prepack #19814

Closed
chenfucn wants to merge 5 commits from the cfu_transform_prepack branch

Conversation

@chenfucn (Contributor) commented on Mar 7, 2024

Description

Use a graph optimizer for CUDA operator prepacking.

Motivation and Context

Our first sm80 quantized GEMM kernel requires prepacking. Using the current prepack infrastructure in a GPU operator causes memory bloat: the old memory buffer is not released.

So we define the prepacking logic in a new graph optimizer instead. This solves the memory bloat problem, but it also introduces the following problems (a hedged sketch of the approach follows the list):

  1. Rewriting of the initializer tensors is restricted by the operator's shape inference rules. For example, MatMulNBits has three initializers; we cannot combine them into a single tensor even if we want to, and we have to make sure the operator's shape inference logic does NOT verify the initializers' shapes.
  2. The rewriting logic is tightly coupled to each GPU operator. It really should be defined together with those operators, instead of in a separate graph optimizer module.
  3. The prepacking logic depends on the underlying GPU hardware. Currently this part is hard-coded for SM80.
  4. The operator has to expose a property recording whether it was prepacked by the optimizer. This property is visible to the user, but it should never be set by the user.
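
A minimal sketch of the approach, assuming ONNX Runtime's GraphTransformer interface. GpuPrePackTransformer, PrepackForSm80, and the "prepacked" attribute name are illustrative stand-ins, not this PR's actual code:

```cpp
#include "core/graph/constants.h"
#include "core/optimizer/graph_transformer.h"

namespace onnxruntime {

// Hypothetical helper that rewrites a tensor into the SM80 tile layout
// (problem 3: the transformation is hardware-specific).
ONNX_NAMESPACE::TensorProto PrepackForSm80(const ONNX_NAMESPACE::TensorProto& src);

class GpuPrePackTransformer : public GraphTransformer {
 public:
  GpuPrePackTransformer() : GraphTransformer("GpuPrePackTransformer") {}

 private:
  Status ApplyImpl(Graph& graph, bool& modified, int /*graph_level*/,
                   const logging::Logger& /*logger*/) const override {
    for (auto& node : graph.Nodes()) {
      if (node.OpType() != "MatMulNBits" ||
          node.GetExecutionProviderType() != kCudaExecutionProvider) {
        continue;
      }
      // MatMulNBits carries several initializers (quantized weight, scales,
      // zero points). Shape inference constrains each of them, so they must
      // be rewritten one by one rather than merged into a single tensor
      // (problem 1). Only the weight is shown here.
      const std::string& weight_name = node.InputDefs()[1]->Name();
      const ONNX_NAMESPACE::TensorProto* weight = nullptr;
      if (!graph.GetInitializedTensor(weight_name, weight)) continue;

      // Replace the initializer in place; reusing the name keeps the node's
      // input edge intact, and the old buffer goes away with the old proto
      // instead of lingering as it does with in-operator PrePack.
      ONNX_NAMESPACE::TensorProto packed = PrepackForSm80(*weight);
      packed.set_name(weight_name);
      graph.RemoveInitializedTensor(weight_name);
      graph.AddInitializedTensor(packed);

      // Problem 4: the kernel must be told its inputs are already packed,
      // via an attribute that users can see but must never set themselves.
      node.AddAttribute("prepacked", int64_t{1});
      modified = true;
    }
    return Status::OK();
  }
};

}  // namespace onnxruntime
```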

@chenfucn requested review from yufenglee and wejoncy on March 7, 2024 05:33
@chenfucn force-pushed the cfu_transform_prepack branch 3 times, most recently from 01a6521 to 4b6097b on March 8, 2024 21:32
Review thread on the `blkq4_gemm_sm80` declaration (diff context):

    bool column_wise_blocking,
    bool small_m,
    bool has_offsets>
    Status blkq4_gemm_sm80(int m, int n, int k, cudaStream_t stream,
Contributor: I assume this can be supported by sm86, sm89?

chenfucn (Contributor Author): It should.
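
For context, a hedged sketch of a compute-capability guard that would admit sm86 and sm89 at runtime; the predicate name is hypothetical, and the comment reflects CUDA's minor-version binary compatibility guarantee rather than anything verified in this PR:

```cpp
#include <cuda_runtime.h>

// Illustrative guard: admit any compute capability 8.x device (sm80, sm86,
// sm89). A binary built for sm_80 is forward-compatible with later minor
// revisions of the same major version, so the sm80 kernel can run there.
bool DeviceSupportsBlkq4Sm80Kernel(int device_id) {
  cudaDeviceProp props{};
  if (cudaGetDeviceProperties(&props, device_id) != cudaSuccess) {
    return false;
  }
  return props.major == 8;
}
```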

Review thread on the block-size switch (diff context):

    }
    break;

    case 64:
Contributor: Would you support case 128, which is widely used?

chenfucn (Contributor Author): That is difficult for this kernel. I am working on another version that will hopefully support it.
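
For readers following the thread, a hedged sketch of the kind of block-size dispatch this comment is attached to; launch_blkq4_gemm and dispatch_blkq4_gemm are hypothetical stand-ins that mirror, not reproduce, the PR's switch:

```cpp
#include <cuda_runtime.h>
#include "core/common/common.h"

using onnxruntime::common::Status;

// Hypothetical launcher: forwards a runtime block size to the kernel's
// compile-time template parameter; defined elsewhere in this sketch.
template <int BlockSize>
Status launch_blkq4_gemm(int m, int n, int k, cudaStream_t stream);

Status dispatch_blkq4_gemm(int block_size, int m, int n, int k,
                           cudaStream_t stream) {
  switch (block_size) {
    case 16:
      return launch_blkq4_gemm<16>(m, n, k, stream);
    case 32:
      return launch_blkq4_gemm<32>(m, n, k, stream);
    case 64:
      return launch_blkq4_gemm<64>(m, n, k, stream);
    // case 128 is absent: per the reply above, the current sm80 tiling
    // cannot express it; a reworked kernel is planned.
    default:
      return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                             "Unsupported block size: ", block_size);
  }
}
```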

@chenfucn force-pushed the cfu_transform_prepack branch 3 times, most recently from d7bce4d to 65573be on March 13, 2024 23:17
@chenfucn force-pushed the cfu_transform_prepack branch 2 times, most recently from 63d5836 to 44a95f8 on March 14, 2024 23:17
@chenfucn force-pushed the cfu_transform_prepack branch from 44a95f8 to 38d1851 on March 15, 2024 16:12
@chenfucn (Contributor Author): Restarting in another PR

@chenfucn closed this on Jun 13, 2024