Error Occurs After Asking Consecutive Questions in LLM-Chatbot #2421
Comments
Are you talking about https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot? Can you provide more details about your system, please (SoC, amount of memory, OS, version of Python, etc.)? Can you provide example prompts, please?
Thank you for your response. I am using the model (llama-3-8b-instruct) and code from this project: https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot/llm-chatbot.ipynb

Here are my system details:
OS: Ubuntu 24.04

The prompts I am using also come from the examples in the project, such as: "hello there! How are you doing?" Please let me know if you need any further information.
Have you seen errors or warnings in the steps for conversion and compression? Do you see the same when using the INT8 or FP16 variant instead of the INT4 variant? Do you start the Jupyter notebook from within a virtual environment (with a "guaranteed" set of component versions), or "global-local" (using the components installed globally on your machine)? Do you use a specific version or branch of the OpenVINO Notebooks repo, or the latest head revision? When running under Windows 11 with the latest version, I can query multiple prompts without problems using the INT4 model... (but my laptop has 64 GB of RAM, Core Ultra 7 155H)
Thank you for your suggestions. I did not see any errors or warnings during the model conversion and compression steps. We have not yet tried using the non-INT4 variants, as the focus of our research project is primarily on INT4 models. We are running the Jupyter notebook in a Python virtual environment and following the steps outlined in the llm-chatbot.ipynb notebook. This research project requires the use of the Ubuntu 24.04 system, so we are hoping to resolve the issue within this setup. (During the execution of the chatbot, the memory usage is approximately 7 GB, so the errors are not due to insufficient memory.)
For conversion and compression I would in any case expect the operating system to start swapping memory to HDD/SSD if the system memory is not big enough... Let's see if someone else can reproduce it under a similar environment; sorry, I don't see the problems you described. Can you reproduce it with another model?
Thank you for your follow-up. I have also tried using the llama-2-7b-chatbot model with INT4, INT8, and FP16, and I encountered the same issue in all cases. Additionally, I would like to clarify that I have not made any modifications to the code or the model.
@tim102187S, it looks like this is due to a 30-second timeout. Could you try increasing the value, or removing it altogether, in this row and check:
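For reference, a minimal sketch of that change, assuming the notebook builds its streamer with transformers' TextIteratorStreamer (the tokenizer/model variables here are stand-ins for the notebook's own objects):

```python
from threading import Thread
from transformers import TextIteratorStreamer

# With timeout=30.0, queue.get() inside the streamer raises _queue.Empty
# (the traceback below) whenever no new token arrives within 30 seconds.
# timeout=None blocks until the next token instead.
streamer = TextIteratorStreamer(
    tokenizer,                # assumed: the notebook's tokenizer instance
    timeout=None,             # was timeout=30.0 -- increase it or drop it
    skip_prompt=True,
    skip_special_tokens=True,
)

inputs = tokenizer("hello there! How are you doing?", return_tensors="pt")
Thread(target=model.generate,   # assumed: the compiled OpenVINO model
       kwargs=dict(**inputs, max_new_tokens=256, streamer=streamer)).start()

for new_text in streamer:       # no longer times out between tokens
    print(new_text, end="", flush=True)
```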
I am using OpenVINO 2024.4.0 and have downloaded the llama-3-8b-instruct model for use. When I run multiple consecutive queries (usually on the third query), an error occurs. I have checked my device’s memory usage, and it has not exceeded 100%.
Here is the error report I received:
Selected model llama-3-8b-instruct
Checkbox(value=True, description='Prepare INT4 model')
Checkbox(value=False, description='Prepare INT8 model')
Checkbox(value=False, description='Prepare FP16 model')
Size of model with INT4 compressed weights is 5085.79 MB
Loading model from /home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama-3-8b-instruct/INT4_compressed_weights
Compiling the model to CPU ...
Running on local URL: http://127.0.0.1:7861
To create a public link, set share=True in launch().
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Traceback (most recent call last):
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/blocks.py", line 1935, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/blocks.py", line 1532, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 671, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 664, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2405, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 914, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 647, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 809, in gen_wrapper
response = next(iterator)
^^^^^^^^^^^^^^
File "/home/adv/Downloads/EAS_GenAI_Intel14th/docker_build/llm_chatbot/run_chatbot.py", line 532, in bot
for new_text in streamer:
File "/home/adv/openvino-llm/lib/python3.12/site-packages/transformers/generation/streamers.py", line 223, in next
value = self.text_queue.get(timeout=self.timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/queue.py", line 179, in get
raise Empty
_queue.Empty
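Side note on the log above: the final _queue.Empty is exactly what a 30-second streamer timeout produces, and the repeated attention-mask warnings are most likely unrelated to the crash. Those warnings can be silenced by passing an explicit attention_mask and pad_token_id to generate(). A minimal sketch, assuming the model was exported with optimum-intel as in the notebook (the model directory comes from the log above; the other names are illustrative):

```python
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

# Path printed in the log above (INT4-compressed weights).
model_dir = "llama-3-8b-instruct/INT4_compressed_weights"
tokenizer = AutoTokenizer.from_pretrained(model_dir)   # assumes tokenizer files were saved alongside the model
model = OVModelForCausalLM.from_pretrained(model_dir)  # compiles for CPU by default

inputs = tokenizer("hello there! How are you doing?", return_tensors="pt")
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # addresses "The attention mask ... not set"
    pad_token_id=tokenizer.eos_token_id,      # addresses "Setting pad_token_id to eos_token_id"
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```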