
Bug fixes and Llama 2 inference support

@cli99 released this 18 Aug 06:30 · 41 commits to main since this release

This release:

  • adds grouped-query attention (GQA) support (see the first sketch after this list)
  • changes the inference activation memory calculation to assume the maximum tensor buffer size (see the second sketch after this list)
  • fixes the KV cache size calculation (see the first sketch after this list)
  • adds a GPU cost analysis for inference (see the third sketch after this list)
  • adds a Llama 2 inference case study
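
For reference, here is a minimal sketch of how a KV cache size estimate looks under GQA, where keys and values are stored for a reduced number of KV heads rather than one per query head. The function and parameter names are illustrative assumptions, not this repo's API:

```python
# Minimal sketch: KV cache size under grouped-query attention (GQA).
# All names here are illustrative, not the repo's actual API.

def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,   # with GQA this is smaller than the number of query heads
    head_dim: int,
    bytes_per_element: int = 2,  # fp16/bf16
) -> int:
    # 2x for the key tensor and the value tensor, cached for every layer.
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_element)

# Example: Llama 2 70B uses GQA with 8 KV heads (vs. 64 query heads),
# 80 layers, and head_dim 128; at a 4096-token context this gives ~1.25 GiB.
print(kv_cache_bytes(1, 4096, 80, 8, 128) / 1024**3, "GiB")
```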
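The activation memory change can be read as follows: inference has no backward pass, so intermediate activations can be freed layer by layer, and peak usage is roughly the size of the largest live tensor buffer rather than the sum over all layers. A sketch of that assumption, again with illustrative names:

```python
# Minimal sketch of the "maximum tensor buffer" assumption for inference
# activation memory: peak usage is approximated by the largest single
# intermediate tensor, not the sum across layers. Illustrative names only.

def max_activation_buffer_bytes(
    batch_size: int,
    seq_len: int,
    hidden_dim: int,
    intermediate_dim: int,   # MLP expansion width, e.g. 4 * hidden_dim
    bytes_per_element: int = 2,
) -> int:
    # The widest live tensor is typically the MLP intermediate activation.
    largest_elements = batch_size * seq_len * max(hidden_dim, intermediate_dim)
    return largest_elements * bytes_per_element
```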
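Finally, a minimal sketch of the kind of GPU cost estimate the new analysis enables, converting an hourly GPU price and an inference latency into a cost per 1k tokens; the prices and latencies below are placeholder assumptions, not outputs of this repo:

```python
# Minimal sketch: GPU cost per 1k generated tokens from an hourly GPU price
# and an end-to-end latency estimate. Placeholder values throughout.

def cost_per_1k_tokens(
    latency_s: float,           # end-to-end latency for one request
    num_tokens: int,            # tokens generated in that request
    num_gpus: int,
    gpu_price_per_hour: float,  # e.g. cloud on-demand price in USD
) -> float:
    request_cost = latency_s / 3600 * num_gpus * gpu_price_per_hour
    return request_cost / num_tokens * 1000

# Example: 8 GPUs at $2/hr each, 2 s to generate 256 tokens.
print(f"${cost_per_1k_tokens(2.0, 256, 8, 2.0):.4f} per 1k tokens")
```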