Description:
When running inference on a 72B model with a long context length (40960), the process is extremely slow, taking approximately 40 minutes to generate results. However, the same task takes only about 5 minutes with the standard transformers package.
Details:
Model name: qwen2.5:72b-instruct-q8_0
Model Size: 72B
Context Length: 40960
Issue: Inference time is around 40 minutes, compared to about 5 minutes with the standard transformers package.
Expected Behavior: Inference should complete within a reasonable time frame, closer to the 5-minute benchmark seen with the transformer package.
Actual Behavior: Severe slowdown, resulting in 40-minute inference times when JSON formatting is enforced.
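A likely source of the overhead is the per-token logits processing: a character-level JSON enforcer must decide, at every decoding step, which of the vocabulary's tokens keep the output schema-valid. The toy sketch below only illustrates the shape of that per-step work; it is not lm-format-enforcer's actual implementation.

```python
# Toy illustration of per-step masking cost in constrained decoding.
# NOT lm-format-enforcer's real algorithm -- just the O(vocab_size)
# scan that happens on every generated token.

def mask_logits(logits, vocab, is_allowed):
    """Set logits of disallowed tokens to -inf; one vocab pass per step."""
    NEG_INF = float("-inf")
    return [l if is_allowed(tok) else NEG_INF for l, tok in zip(logits, vocab)]

# Example: suppose only digit tokens are valid at the current JSON position
vocab = ["1", "ab", "23", "{", "7"]
logits = [0.5, 1.2, 0.3, 2.0, 0.1]
masked = mask_logits(logits, vocab, lambda t: t.isdigit())
```

With a real vocabulary of ~150k tokens and tens of thousands of generated tokens, this per-step scan adds up, which is consistent with the slowdown appearing only when JSON formatting is enforced.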
Steps to Reproduce:
Set up the 72B model with long context inputs and enforce a JSON output format with multiple fields.
Run the inference.
Compare the time taken against the standard transformer package.
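To make the timing comparison in step 3 reproducible, a small stdlib-only wrapper (a hypothetical helper, not part of either library) can bracket each backend's generate call:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage with the LlamaCPP instance from the test code:
#   output, seconds = timed(llm_llamacpp.complete, prompt)
#   print(f"inference took {seconds / 60:.1f} min")
```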
Environment:
llama_cpp_python 0.3.0
torch 2.4.1+cu121
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
Test Code
from typing import Optional

from llama_index.llms.llama_cpp import LlamaCPP
from llama_cpp import LogitsProcessorList
from lmformatenforcer import CharacterLevelParser, JsonSchemaParser
from lmformatenforcer.integrations.llamacpp import build_llamacpp_logits_processor
from pydantic import BaseModel

llm_llamacpp = LlamaCPP(
    model_path="/root/Qwen/qwen2.5:72b-instruct-q8_0.gguf",
    model_kwargs={
        "n_gpu_layers": -1,  # if compiled to use GPU
    },
    max_new_tokens=40960,  # 131072
    context_window=40960,
    temperature=0,
    verbose=True,
)

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""

def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int
    # ... about 50 fields in total

question = <4k-length content>
question_with_schema = f'{question}{AnswerFormat.schema_json()}'
prompt = get_prompt(question_with_schema)

def llamaindex_llamacpp_lm_format_enforcer(
    llm: LlamaCPP, prompt: str,
    character_level_parser: Optional[CharacterLevelParser],
) -> str:
    logits_processors: Optional[LogitsProcessorList] = None
    if character_level_parser:
        logits_processors = LogitsProcessorList(
            [build_llamacpp_logits_processor(llm._model, character_level_parser)]
        )
    # If changing the character level parser each call, inject it before calling complete.
    # If it's the same format each time, you can set it once after creating the LlamaCPP model.
    llm.generate_kwargs['logits_processor'] = logits_processors
    output = llm.complete(prompt)
    text: str = output.text
    return text

result = llamaindex_llamacpp_lm_format_enforcer(llm_llamacpp, prompt, JsonSchemaParser(AnswerFormat.schema()))
print(result)
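When the run does complete, the enforced output should be a JSON object matching AnswerFormat. A minimal stdlib-only sanity check against the four fields shown above (the sample string is hypothetical, for illustration only):

```python
import json

# Hypothetical model output; checks it parses and contains the expected fields
sample_output = (
    '{"first_name": "Michael", "last_name": "Jordan", '
    '"year_of_birth": 1963, "num_seasons_in_nba": 15}'
)
parsed = json.loads(sample_output)
required = {"first_name", "last_name", "year_of_birth", "num_seasons_in_nba"}
missing = required - parsed.keys()
assert not missing, f"missing fields: {missing}"
```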