Wrong output on Llama 3.2 1B, but 3B ok #2492

Open
lucasavila00 opened this issue Nov 24, 2024 · 4 comments
Labels: triaged (Issue has been triaged by maintainers), waiting for feedback

Comments


lucasavila00 commented Nov 24, 2024

System Info

Both RTX 2070 and RTX A6000

Reproduction

I'm using the latest main (535c9cc), with the wheel image built from main via make.

I built the 3B model with

cd /workspace/TensorRT-LLM/examples/llama
python convert_checkpoint.py --model_dir /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95 \
                              --output_dir ./tllm_checkpoint_1gpu_fp16 \
                              --dtype float16 \
                              --use_embedding_sharing

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
            --output_dir ./engine3b \
            --gemm_plugin auto \
            --max_batch_size 1

 python run.py --engine_dir ./llama/engine3b/ --tokenizer_dir meta-llama/Llama-3.2-3B-Instruct  --max_output_len 100

And it runs as expected:

root@genius-maxwell:/workspace/TensorRT-LLM/examples#  python run.py --engine_dir ./llama/engine3b/ --tokenizer_dir meta-llama/Llama-3.2-3B-Instruct  --max_output_len 100
[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[11/24/2024-19:46:55] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4194304
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4194304) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 6949 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 6943 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 96.52 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.02 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 47.54 GiB, available: 39.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5244
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 335616
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5244
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 35.85 GiB for max tokens in paged KV cache (335616).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/24/2024-19:47:03] [TRT-LLM] [I] Load engine takes: 8.084513187408447 sec
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter and engraver before turning to sculpture. He was a prominent figure in the development of the Art Nouveau movement, and his work is characterized by its sinuous, organic forms and its use of natural materials such as wood and stone. Soyer's sculptures often feature fantastical creatures and hybrid beings, which he depicted in a highly detailed and realistic style. He was also known for his work in the field of decorative arts, and his designs for furniture, textiles, and other objects are highly prized"
[TensorRT-LLM][INFO] Refreshed the MPI local session

However, when I do the same for Llama 3.2 1B:

python convert_checkpoint.py --model_dir /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6 \
                              --output_dir ./tllm_checkpoint_1gpu_fp16_1b \
                              --dtype float16 \
                              --use_embedding_sharing


trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_1b \
            --output_dir ./engine1b \
            --gemm_plugin auto \
            --max_batch_size 1

It just repeats the same token:

root@genius-maxwell:/workspace/TensorRT-LLM/examples# python ./run.py --engine_dir ./llama/engine1b/ --max_output_len 100
[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
[11/24/2024-19:42:47] [TRT-LLM] [W] tokenizer_dir is not specified. Try to infer from model_name, but this may be incorrect.
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[11/24/2024-19:42:48] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4194304
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4194304) * 16
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2893 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 448.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2890 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 96.52 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.02 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 47.54 GiB, available: 43.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 20193
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1292352
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 20193
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 39.44 GiB for max tokens in paged KV cache (1292352).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/24/2024-19:42:51] [TRT-LLM] [I] Load engine takes: 3.7401185035705566 sec
Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: ";ther  ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther    ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;"
[TensorRT-LLM][INFO] Refreshed the MPI local session

This happens on an RTX 2070 and on a RTX A6000.
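
To help isolate whether the problem is in the checkpoint itself or in the conversion/build step, here is a minimal sketch that runs the same snapshot directly through transformers (the path is the one passed to convert_checkpoint.py above; greedy decoding keeps the comparison deterministic). If this produces coherent text, the conversion or engine build is the likely culprit.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same local snapshot passed to convert_checkpoint.py above
model_dir = "/root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="cuda"
)

# Same prompt run.py used above
inputs = tokenizer("Born in north-east France, Soyer trained as a", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))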

Expected behavior

Both the 3B and the 1B engines should produce correct output.

Actual behavior

Only the 3B engine works; the 1B engine repeats the same token until max_output_len.

Additional notes

None.

@lucasavila00 lucasavila00 added the bug Something isn't working label Nov 24, 2024
@lucasavila00 lucasavila00 changed the title Wrong output on Llama 3.2 1B, but 3b ok Wrong output on Llama 3.2 1B, but 3B ok Nov 24, 2024
@hello-11 hello-11 added the triaged Issue has been triaged by maintainers label Nov 25, 2024
@nv-guomingz
Collaborator

Hi @lucasavila00,

Llama 3.2 1B works well on my side with the latest local code base. Here is the output log:

 Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[11/26/2024-03:20:37] [TRT-LLM] [I]
 Output : [[' James Best, best known for his role as Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," has died at 88 after a brief illness.']]
[11/26/2024-03:20:37] [TRT-LLM] [I] ---------------------------------------------------------
[TensorRT-LLM][INFO] Refreshed the MPI local session
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.646428108215332 sec)
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1328)
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 235.19293517043315)
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[11/26/2024-03:20:43] [TRT-LLM] [I]   rouge1 : 28.354788894782246
[11/26/2024-03:20:43] [TRT-LLM] [I]   rouge2 : 9.131720488724032
[11/26/2024-03:20:43] [TRT-LLM] [I]   rougeL : 21.17537259072882
[11/26/2024-03:20:43] [TRT-LLM] [I]   rougeLsum : 24.939229523350047

Could you please try again with today's update?
If the issue still exists, please share your package details, such as your transformers version (I'm using 4.45.1).
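
For example, the installed versions can be collected with (assuming both packages were installed via pip):

pip show transformers tensorrt_llm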

@nv-guomingz nv-guomingz added waiting for feedback and removed bug Something isn't working labels Nov 26, 2024
@jayakommuru

@nv-guomingz I'm using TensorRT-LLM version 0.14.0 and facing the same issue with a fine-tuned Llama 3.2 1B Instruct model:

python3 run.py --engine_dir=model_16bit_engine --max_output_len 100 --tokenizer_dir model_16bit/ --input_text "10 sal ki ladki ke kapde ladies\n"

Input [Text 0]: "<|begin_of_text|>10 sal ki ladki ke kapde ladies\n"
Output [Text 0 Beam 0]: "\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10"
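
For reference, the transformers-level sanity check sketched earlier in this thread can be pointed at the fine-tuned checkpoint instead (assuming model_16bit/ also holds the HF weights) to confirm whether the repetition predates the TensorRT-LLM conversion.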

@jayakommuru

@lucasavila00 were you able to find a solution for this?

@nv-guomingz is this issue similar to #121?

@jayakommuru

@hello-11 @nv-guomingz I have tried with TensorRT-LLM version 0.15.0 as well, but I am still facing the issue: the output repeats until max tokens. Is there any solution for this?
