Encoding error in stream response from Triton server #2544

Open
3 of 4 tasks
Wonder-donbury opened this issue Dec 6, 2024 · 0 comments
Labels
bug Something isn't working

Comments


Wonder-donbury commented Dec 6, 2024

System Info

  • CPU architecture : x86_64
  • GPU properties
    • GPU name : NVIDIA L4
    • 24GiB x 4
  • Libraries
    • TensorRT-LLM branch : v0.14.0
    • Versions of CUDA : 12.6
  • NVIDIA driver version : using the container nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
  • OS : Ubuntu 22.04

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Launch docker

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /home/trtis/tensorrtllm_backend:/tensorrtllm_backend \
    -v /home/tensorrt/Qwen2.5_7B_instruct_240920:/Qwen2.5_7B_instruct \
    -v /home/trtis/engines:/bash_commands \
    -v /home/trtis/tensorrtllm_openai/TensorRT-LLM:/tensorrtllm_openai \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3

Load presets to /opt/tritonserver/

cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/

Convert the checkpoint

cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
python3 convert_checkpoint.py --model_dir /Qwen2.5_7B_instruct --dtype float16 --tp_size 4 --pp_size 1 --output_dir /c-model/Qwen2.5_7B_instruct/fp16/4-gpu

Build engine

trtllm-build --checkpoint_dir /c-model/Qwen2.5_7B_instruct/fp16/4-gpu --gemm_plugin float16 --max_num_tokens 131072 --paged_kv_cache enable --output_dir /engines

Fill the model config templates

TOKENIZER_DIR=/Qwen2.5_7B_instruct/
TOKENIZER_TYPE=qwen2
ENGINE_DIR=/engines
DECOUPLED_MODE=true
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=2
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MICROSECONDS=10000
TRITON_BACKEND=tensorrtllm
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
KV_CACHE_FREE_GPU_MEM_FRACTION=0.8
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION}

Launch Triton Server

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=4 --model_repo=/opt/tritonserver/inflight_batcher_llm
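
Before sending requests, the server's readiness can be confirmed via Triton's standard KServe health endpoint (this check is not part of the original reproduction):

import requests

# Returns HTTP 200 once the server and all loaded models are ready.
r = requests.get("http://localhost:8000/v2/health/ready")
print(r.status_code)  # expect 200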

Send Streaming Request to Triton Server

import requests
import json
import time
import sys
sys.stdout.reconfigure(line_buffering=False)


url = "http://localhost:8000/v2/models/ensemble/generate_stream"

headers = {
    "Content-Type": "application/json"
}
max_tokens = 3000
# Korean: "You are a kind and smart AI assistant. Please give an answer in
# Korean that addresses the point of the question."
system_prompt_en = f"""<|im_start|>system
    당신은 친절하고 똑똑한 AI 어시스턴트입니다. 질문의 요지에 맞는 한국어 답변을 해주세요.<|im_end|>"""
# Korean: "Tell me how to study well"
user_prompt = f"""<|im_start|>user
    공부 잘 하는 방법 좀 알려줘<|im_end|>"""
ai_trigger_prompt = "<|im_start|>assistant"

prompt = system_prompt_en+user_prompt+ai_trigger_prompt

system_prompt = system_prompt_en
raw_data = json.dumps({
    "text_input": prompt,
    "maximum_input_length": 16384,
    "max_tokens": max_tokens,
    "bad_words": "",
    "stop_words": "<|endoftext|>",
    "temperature":1,
    "repetition_penalty":1.2,
    "no_repeat_ngram_size":5,
    "pad_id": 151643, # Qwen2 models use <|endoftext|>:151643 for padding token <https://github.com/QwenLM/Qwen2.5/issues/486>
    "end_id": 151645,
    "top_k": 40,
    "top_p":0.95,
    "min_p":0.1,
    "stream":True
})

response = requests.post(url, headers=headers, data=raw_data,stream=True)


if response.status_code == 200:
    text_aggr = ''
    # iter_lines yields one complete SSE event per line, so a JSON payload
    # is never split across reads the way a fixed chunk_size=1024 can be
    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            # removeprefix strips the literal "data: " marker; the original
            # lstrip('data: ') strips any of the characters d, a, t, :, space
            payload = json.loads(line.removeprefix('data: '))
            text_aggr += payload['text_output']
            print(text_aggr, flush=True)

Expected behavior

In text_output, there should not be any U+FFFD replacement characters (�, UTF-8 bytes \xEF\xBF\xBD).
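
Equivalently, a hypothetical check on the aggregated client output (text_aggr from the script above):

# The aggregated stream output should contain no U+FFFD replacement characters.
assert "\ufffd" not in text_aggr, "text_output contains \\xEF\\xBF\\xBD"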

Actual behavior

I have tested responses in multiple languages, and so far the error occurs only for Korean.

The non-streaming generation response from the Triton server never shows this error, so the Qwen tokenizer itself is unlikely to be the cause. I suspect the UTF-8 text-encoding step on the Triton streaming server's side.

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"\n"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":""}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"assistant"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"\n"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"공"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"부"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"를"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" 효"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"율"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"적으로"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" 하는"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" 방법"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"들을"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" 소개"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"해"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" �"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"�"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"리"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"겠습니다"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":".\n\n"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"###"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" 기"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"초"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" �"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"�"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"도"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"\n\n"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"####"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":" �"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"�"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"습"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_index":0,"sequence_start":false,"text_output":"의"}

Additional notes

I really appreciate your team's work.
