Update TensorRT-LLM Release branch (#1192)
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <[email protected]>
kaiyux and Shixiaowei02 authored Feb 29, 2024
1 parent 2f169d1 commit 5955b8a
Showing 1,337 changed files with 3,804,632 additions and 2,009,981 deletions.
2 changes: 2 additions & 0 deletions .dockerignore
@@ -1,5 +1,7 @@
build
cpp/*build*
cpp/cmake-*
cpp/.ccache
cpp/tests/resources/models
tensorrt_llm/libs
**/__pycache__
116 changes: 116 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -0,0 +1,116 @@
name: "Bug Report"
description: Submit a bug report to help us improve TensorRT-LLM
labels: [ "bug" ]
body:
- type: textarea
id: system-info
attributes:
label: System Info
description: Please share your system info with us.
placeholder: |
- CPU architecture (e.g., x86_64, aarch64)
- CPU/Host memory size (if known)
- GPU properties
- GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
- GPU memory size (if known)
- Clock frequencies used (if applicable)
- Libraries
- TensorRT-LLM branch or tag (e.g., main, v0.7.1)
- TensorRT-LLM commit (if known)
- Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
- Container used (if running TensorRT-LLM in a container)
- NVIDIA driver version
- OS (Ubuntu 22.04, CentOS 7, Windows 10)
- Any other information that may be useful in reproducing the bug
validations:
required: true

- type: textarea
id: who-can-help
attributes:
label: Who can help?
description: |
To expedite the response to your issue, it would be helpful if you could identify the appropriate person
to tag using the **@** symbol. Here is a general guideline on **whom to tag**.
Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
you can leave it blank, and a core maintainer will make sure to involve the appropriate person.
Please tag fewer than 3 people.
Quantization: @Tracin
Documentation: @juney-nvidia
Feature request: @ncomly-nvidia
Performance: @kaiyux
Others: @byshiue
placeholder: "@Username ..."

- type: checkboxes
id: information-scripts-examples
attributes:
label: Information
description: 'The problem arises when using:'
options:
- label: "The official example scripts"
- label: "My own modified scripts"

- type: checkboxes
id: information-tasks
attributes:
label: Tasks
description: "The tasks I am working on are:"
options:
- label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
- label: "My own task or dataset (give details below)"

- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction
description: |
Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
Additionally, if you have any error messages, or stack traces related to the problem, please include them here.
Remember to use code tags to properly format your code. You can refer to
https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.
Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and codes.
placeholder: |
Steps to reproduce the behavior:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior
description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."

- type: textarea
id: actual-behavior
validations:
required: true
attributes:
label: Actual behavior
description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."

- type: textarea
id: additional-notes
validations:
required: true
attributes:
label: Additional notes
description: "Provide any additional context here you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."
6 changes: 4 additions & 2 deletions .pre-commit-config.yaml
@@ -15,7 +15,8 @@ repos:
rev: v4.1.0
hooks:
- id: check-added-large-files
exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
exclude: |
(?x)^(.*cubin.cpp)$
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
@@ -45,4 +46,5 @@ repos:
args:
- --skip=".git,3rdparty"
- --exclude-file=examples/whisper/tokenizer.py
- --ignore-words-list=rouge,inout,atleast,strat
- --ignore-words-list=rouge,inout,atleast,strat,nd
exclude: 'tests/llm-test-defs/turtle/test_input_files'
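As a quick sanity check of the new `check-added-large-files` exclude pattern above (a sketch, not part of the commit; the file paths are made-up examples), the regex can be exercised with Python's `re` module:

```python
import re

# The updated pre-commit exclude pattern: (?x) enables verbose mode,
# and the pattern matches any path ending in "cubin.cpp", replacing
# the previous directory-based exclusion.
pattern = re.compile(r"(?x)^(.*cubin.cpp)$")

# Hypothetical paths for illustration only.
assert pattern.match("cpp/tensorrt_llm/kernels/fmha_v2_cubin.cpp")   # excluded from the hook
assert not pattern.match("cpp/tensorrt_llm/kernels/attention.cpp")   # still checked
```

pre-commit applies `exclude` against each file's path relative to the repository root, so this pattern skips generated cubin sources anywhere in the tree.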
36 changes: 36 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,41 @@
# Change Log

## Versions 0.7.0 / 0.7.1

* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMa with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)

## Versions 0.6.0 / 0.6.1

* Models
