
Bug fixes and Llama 2 inference support

@cli99 released this 18 Aug 06:30 · 41 commits to main since this release

This release:

  • adds grouped-query attention (GQA) support (see the first sketch after this list)
  • changes the inference activation memory calculation to assume the maximum tensor buffer size (see the second sketch after this list)
  • fixes the KV cache size calculation (see the first sketch after this list)
  • adds a GPU cost analysis for inference (see the third sketch after this list)
  • adds a Llama 2 inference case study
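
For reference, here is a minimal sketch of how a KV cache size estimate looks under GQA, where keys and values are stored for a reduced number of KV heads rather than one per query head. The function and parameter names are illustrative assumptions, not this repo's API:

```python
# Minimal sketch: KV cache size under grouped-query attention (GQA).
# All names here are illustrative, not the repo's actual API.

def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,   # with GQA this is smaller than the number of query heads
    head_dim: int,
    bytes_per_element: int = 2,  # fp16/bf16
) -> int:
    # 2x for the key tensor and the value tensor, cached for every layer.
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_element)

# Example: Llama 2 70B uses GQA with 8 KV heads (vs. 64 query heads),
# 80 layers, and head_dim 128; at a 4096-token context this gives ~1.25 GiB.
print(kv_cache_bytes(1, 4096, 80, 8, 128) / 1024**3, "GiB")
```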
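The activation memory change can be read as follows: inference has no backward pass, so intermediate activations can be freed layer by layer, and peak usage is roughly the size of the largest live tensor buffer rather than the sum over all layers. A sketch of that assumption, again with illustrative names:

```python
# Minimal sketch of the "maximum tensor buffer" assumption for inference
# activation memory: peak usage is approximated by the largest single
# intermediate tensor, not the sum across layers. Illustrative names only.

def max_activation_buffer_bytes(
    batch_size: int,
    seq_len: int,
    hidden_dim: int,
    intermediate_dim: int,   # MLP expansion width, e.g. 4 * hidden_dim
    bytes_per_element: int = 2,
) -> int:
    # The widest live tensor is typically the MLP intermediate activation.
    largest_elements = batch_size * seq_len * max(hidden_dim, intermediate_dim)
    return largest_elements * bytes_per_element
```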
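Finally, a minimal sketch of the kind of GPU cost estimate the new analysis enables, converting an hourly GPU price and an inference latency into a cost per 1k tokens; the prices and latencies below are placeholder assumptions, not outputs of this repo:

```python
# Minimal sketch: GPU cost per 1k generated tokens from an hourly GPU price
# and an end-to-end latency estimate. Placeholder values throughout.

def cost_per_1k_tokens(
    latency_s: float,           # end-to-end latency for one request
    num_tokens: int,            # tokens generated in that request
    num_gpus: int,
    gpu_price_per_hour: float,  # e.g. cloud on-demand price in USD
) -> float:
    request_cost = latency_s / 3600 * num_gpus * gpu_price_per_hour
    return request_cost / num_tokens * 1000

# Example: 8 GPUs at $2/hr each, 2 s to generate 256 tokens.
print(f"${cost_per_1k_tokens(2.0, 256, 8, 2.0):.4f} per 1k tokens")
```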