Wrong output on Llama 3.2 1B, but 3B ok #2492

Open
lucasavila00 opened this issue Nov 24, 2024 · 4 comments
Labels: triaged (Issue has been triaged by maintainers), waiting for feedback

Comments


lucasavila00 commented Nov 24, 2024

System Info

Both RTX 2070 and RTX A6000

Reproduction

I'm using the latest main (535c9cc), with the wheel image built from main via make.

I built the 3B model with

cd /workspace/TensorRT-LLM/examples/llama
python convert_checkpoint.py --model_dir /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95 \
                              --output_dir ./tllm_checkpoint_1gpu_fp16 \
                              --dtype float16 \
                              --use_embedding_sharing

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
            --output_dir ./engine3b \
            --gemm_plugin auto \
            --max_batch_size 1

 python run.py --engine_dir ./llama/engine3b/ --tokenizer_dir meta-llama/Llama-3.2-3B-Instruct  --max_output_len 100

And it runs as expected:

root@genius-maxwell:/workspace/TensorRT-LLM/examples#  python run.py --engine_dir ./llama/engine3b/ --tokenizer_dir meta-llama/Llama-3.2-3B-Instruct  --max_output_len 100
[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[11/24/2024-19:46:55] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4194304
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4194304) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 6949 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 6943 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 96.52 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.02 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 47.54 GiB, available: 39.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5244
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 335616
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5244
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 35.85 GiB for max tokens in paged KV cache (335616).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/24/2024-19:47:03] [TRT-LLM] [I] Load engine takes: 8.084513187408447 sec
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter and engraver before turning to sculpture. He was a prominent figure in the development of the Art Nouveau movement, and his work is characterized by its sinuous, organic forms and its use of natural materials such as wood and stone. Soyer's sculptures often feature fantastical creatures and hybrid beings, which he depicted in a highly detailed and realistic style. He was also known for his work in the field of decorative arts, and his designs for furniture, textiles, and other objects are highly prized"
[TensorRT-LLM][INFO] Refreshed the MPI local session

However, when I do the same for Llama 3.2 1B:

python convert_checkpoint.py --model_dir /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6 \
                              --output_dir ./tllm_checkpoint_1gpu_fp16_1b \
                              --dtype float16 \
                              --use_embedding_sharing


trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_1b \
            --output_dir ./engine1b \
            --gemm_plugin auto \
            --max_batch_size 1

It just repeats the same token:

root@genius-maxwell:/workspace/TensorRT-LLM/examples# python ./run.py --engine_dir ./llama/engine1b/ --max_output_len 100
[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
[11/24/2024-19:42:47] [TRT-LLM] [W] tokenizer_dir is not specified. Try to infer from model_name, but this may be incorrect.
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[11/24/2024-19:42:48] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024111900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4194304
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4194304) * 16
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2893 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 448.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2890 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 96.52 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.02 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 47.54 GiB, available: 43.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 20193
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1292352
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 20193
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 39.44 GiB for max tokens in paged KV cache (1292352).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/24/2024-19:42:51] [TRT-LLM] [I] Load engine takes: 3.7401185035705566 sec
Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: ";ther  ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther    ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;ther   ;"
[TensorRT-LLM][INFO] Refreshed the MPI local session

This happens on an RTX 2070 and on a RTX A6000.
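
To help isolate whether the problem is in the checkpoint itself or in the conversion/build step, here is a minimal sketch that runs the same snapshot directly through transformers (the path is the one passed to convert_checkpoint.py above; greedy decoding keeps the comparison deterministic). If this produces coherent text, the conversion or engine build is the likely culprit.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same local snapshot passed to convert_checkpoint.py above
model_dir = "/root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="cuda"
)

# Same prompt run.py used above
inputs = tokenizer("Born in north-east France, Soyer trained as a", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))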

Expected behavior

Both the 3B and the 1B engines should produce correct output.

Actual behavior

Only the 3B engine works; the 1B engine repeats the same token until max_output_len.

Additional notes

None.

@lucasavila00 lucasavila00 added the bug Something isn't working label Nov 24, 2024
@lucasavila00 lucasavila00 changed the title Wrong output on Llama 3.2 1B, but 3b ok Wrong output on Llama 3.2 1B, but 3B ok Nov 24, 2024
@hello-11 hello-11 added the triaged Issue has been triaged by maintainers label Nov 25, 2024
@nv-guomingz
Collaborator

Hi @lucasavila00,

Llama 3.2 1B works well on my side with the latest local code base. Here is the output log:

 Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[11/26/2024-03:20:37] [TRT-LLM] [I]
 Output : [[' James Best, best known for his role as Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," has died at 88 after a brief illness.']]
[11/26/2024-03:20:37] [TRT-LLM] [I] ---------------------------------------------------------
[TensorRT-LLM][INFO] Refreshed the MPI local session
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.646428108215332 sec)
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1328)
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 235.19293517043315)
[11/26/2024-03:20:43] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[11/26/2024-03:20:43] [TRT-LLM] [I]   rouge1 : 28.354788894782246
[11/26/2024-03:20:43] [TRT-LLM] [I]   rouge2 : 9.131720488724032
[11/26/2024-03:20:43] [TRT-LLM] [I]   rougeL : 21.17537259072882
[11/26/2024-03:20:43] [TRT-LLM] [I]   rougeLsum : 24.939229523350047

Could you please try again with today's update?
If the issue still exists, please share your package details, such as your transformers version (I'm using 4.45.1).
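
For example, the installed versions can be collected with (assuming both packages were installed via pip):

pip show transformers tensorrt_llm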

@nv-guomingz nv-guomingz added waiting for feedback and removed bug Something isn't working labels Nov 26, 2024
@jayakommuru

@nv-guomingz I'm using TensorRT-LLM version 0.14.0 and facing the same issue with a fine-tuned Llama 3.2 1B Instruct model:

python3 run.py --engine_dir=model_16bit_engine --max_output_len 100 --tokenizer_dir model_16bit/ --input_text "10 sal ki ladki ke kapde ladies\n"

Input [Text 0]: "<|begin_of_text|>10 sal ki ladki ke kapde ladies\n"
Output [Text 0 Beam 0]: "\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10 year old girls clothes\n\n10"
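
For reference, the transformers-level sanity check sketched earlier in this thread can be pointed at the fine-tuned checkpoint instead (assuming model_16bit/ also holds the HF weights) to confirm whether the repetition predates the TensorRT-LLM conversion.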

@jayakommuru

@lucasavila00 were you able to find a solution for this?

@nv-guomingz is this issue similar to #121?

@jayakommuru

@hello-11 @nv-guomingz I have tried with TensorRT-LLM version 0.15.0 as well, but I am still facing the issue: the output repeats until max tokens. Is there any solution for this?
