import requests
import json
import time
import sys

sys.stdout.reconfigure(line_buffering=False)

url = "http://localhost:8000/v2/models/ensemble/generate_stream"
headers = {
    "Content-Type": "application/json"
}

max_tokens = 3000
# System prompt (Korean): "You are a kind and smart AI assistant. Please answer in Korean, addressing the point of the question."
system_prompt_en = f"""<|im_start|>system 당신은 친절하고 똑똑한 AI 어시스턴트입니다. 질문의 요지에 맞는 한국어 답변을 해주세요.<|im_end|>"""
# User prompt (Korean): "Tell me how to study well"
user_prompt = f"""<|im_start|>user 공부 잘 하는 방법 좀 알려줘<|im_end|>"""
ai_trigger_prompt = "<|im_start|>assistant"
prompt = system_prompt_en + user_prompt + ai_trigger_prompt
system_prompt = system_prompt_en

raw_data = json.dumps({
    "text_input": prompt,
    "maximum_input_length": 16384,
    "max_tokens": max_tokens,
    "bad_words": "",
    "stop_words": "<|endoftext|>",
    "temperature": 1,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 5,
    "pad_id": 151643,  # Qwen2 models use <|endoftext|>:151643 as the padding token <https://github.com/QwenLM/Qwen2.5/issues/486>
    "end_id": 151645,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.1,
    "stream": True
})

response = requests.post(url, headers=headers, data=raw_data, stream=True)

if response.status_code == 200:
    text_aggr = ''
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            chunk = chunk.decode('utf-8')
            chunk = chunk.lstrip('data: ')   # strip the SSE "data: " prefix (the JSON body starts with '{', so lstrip leaves it intact)
            chunk = json.loads(chunk)
            chunk = chunk['text_output']
            text_aggr += chunk
            print(text_aggr, flush=True)
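For reference, the replacement characters described under "Expected behavior" below can be spotted directly in the aggregated output; a small check building on the script above (text_aggr is the variable from the loop):
# U+FFFD is the Unicode replacement character; its UTF-8 encoding is \xEF\xBF\xBD.
if '\ufffd' in text_aggr:
    offsets = [i for i, ch in enumerate(text_aggr) if ch == '\ufffd']
    print(f"found {len(offsets)} replacement character(s) at offsets {offsets}", flush=True)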
System Info
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
launch docker
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /home/trtis/tensorrtllm_backend:/tensorrtllm_backend \
    -v /home/tensorrt/Qwen2.5_7B_instruct_240920:/Qwen2.5_7B_instruct \
    -v /home/trtis/engines:/bash_commands \
    -v /home/trtis/tensorrtllm_openai/TensorRT-LLM:/tensorrtllm_openai \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
load presets to /opt/tritonserver/
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/
checkpointing
cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
python3 convert_checkpoint.py --model_dir /Qwen2.5_7B_instruct --dtype float16 --tp_size 4 --pp_size 1 --output_dir /c-model/Qwen2.5_7B_instruct/fp16/4-gpu
build engine
trtllm-build --checkpoint_dir /c-model/Qwen2.5_7B_instruct/fp16/4-gpu --gemm_plugin float16 --max_num_tokens 131072 --paged_kv_cache enable --output_dir /engines
preprocessing
TOKENIZER_DIR=/Qwen2.5_7B_instruct/
TOKENIZER_TYPE=qwen2
ENGINE_DIR=/engines
DECOUPLED_MODE=true
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=2
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=10000
TRITON_BACKEND=tensorrtllm
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
KV_CACHE_FREE_GPUP_FRACTION=0.8
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPUP_FRACTION}
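As an optional sanity check (a minimal sketch, assuming the model repository copied to /opt/tritonserver/inflight_batcher_llm above), one can list any ${...} placeholders that are still unsubstituted in the filled configs:
# Minimal sketch: list any ${...} template placeholders left behind by fill_template.py.
for name in ["preprocessing", "postprocessing", "tensorrt_llm", "tensorrt_llm_bls", "ensemble"]:
    path = f"/opt/tritonserver/inflight_batcher_llm/{name}/config.pbtxt"
    with open(path) as f:
        leftovers = [line.strip() for line in f if "${" in line]
    print(name, "OK" if not leftovers else leftovers)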
Launch Triton Server
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=4 --model_repo=/opt/tritonserver/inflight_batcher_llm
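Before sending the streaming request, a quick probe of Triton's standard health endpoint (a minimal sketch, assuming the default HTTP port 8000) confirms the server and models are up:
import requests

# Triton's KServe-v2 HTTP health endpoint returns 200 once the server and models are ready.
r = requests.get("http://localhost:8000/v2/health/ready")
print("server ready" if r.status_code == 200 else f"not ready: HTTP {r.status_code}")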
Send Streaming Request to Triton Server (using the Python script at the top of this issue)
Expected behavior
In text_output, there shouldn't be any replacement characters � (UTF-8 byte sequence \xEF\xBF\xBD).
actual behavior
I've tested responses in multiple languages, but so far the error occurs only in Korean.
In non-streaming generation responses from the Triton server I haven't encountered this error at all, so I assume it is unlikely to come from the Qwen tokenizer. I suspect the UTF-8 text-encoding step on the Triton streaming server's side.
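One plausible mechanism (an assumption on my part, not confirmed from the backend code): if the streaming postprocessor decodes each chunk's bytes independently, a Korean syllable whose three UTF-8 bytes straddle a chunk boundary comes out as U+FFFD, whereas an incremental decoder would buffer the dangling bytes. A minimal sketch of the effect:
import codecs

data = "공부".encode("utf-8")          # 6 bytes, 3 per Hangul syllable
first, second = data[:4], data[4:]     # split inside the second character

# Decoding each piece on its own (as a per-chunk decode would) produces U+FFFD:
print(first.decode("utf-8", errors="replace"))   # '공' followed by '\ufffd'
print(second.decode("utf-8", errors="replace"))  # two '\ufffd'

# An incremental decoder buffers the incomplete tail and yields clean text:
dec = codecs.getincrementaldecoder("utf-8")()
print(dec.decode(first) + dec.decode(second))    # '공부'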
additional notes
I really appreciate your team's work.