
[BUG] Server can't handle two streaming connections at the same time #897

Open
ArtyomZemlyak opened this issue Nov 10, 2023 · 4 comments
Labels: bug, documentation

Comments

@ArtyomZemlyak

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

  1. Start the llama-cpp-python server in a Docker container
  2. Open 2 terminals
  3. In each, start a streaming completion session
  4. Either: chunks from both streams are processed and returned, so both terminals receive a streaming response at the same time,
  5. Or: one terminal first receives its full result, and then the other terminal receives its full result (as currently happens for non-streaming requests)

Current Behavior

  1. Start the llama-cpp-python server in a Docker container
  2. Open 2 terminals
  3. In each, start a streaming completion session (see the client sketch after this list)
  4. The first terminal crashes with ValueError: invalid literal for int() with base 16: b''
  5. The second terminal does not crash and receives the streaming response
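
For reference, a minimal streaming client along these lines reproduces the behavior when run in two terminals at once. This is only a sketch, not the reporter's examples/streaming.py; it assumes the legacy openai 0.x client (as seen in the traceback below), and the model name, prompt, and server address are placeholders.

import openai

# Point the legacy openai 0.x client at the local llama-cpp-python server.
# The key is not checked by the server; the base URL is an assumption.
openai.api_key = "sk-no-key-required"
openai.api_base = "http://localhost:8000/v1"

# Request a streamed completion; run this script in two terminals at the
# same time to trigger the error shown below.
response = openai.Completion.create(
    model="local-model",            # placeholder model name
    prompt="Count from 1 to 100:",  # placeholder prompt
    max_tokens=256,
    stream=True,
)

# Print each streamed chunk as it arrives.
for chunk in response:
    print(chunk["choices"][0]["text"], end="", flush=True)
print()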

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  20
  On-line CPU(s) list:   0-19
Vendor ID:               GenuineIntel
  Model name:            12th Gen Intel(R) Core(TM) i7-12700F
    CPU family:          6
    Model:               151
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           1
    Stepping:            2
    CPU max MHz:         4900.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4224.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmu
                         lqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad
                          fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req umip pku ospke waitpkg gfni vae
                         s vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   512 KiB (12 instances)
  L1i:                   512 KiB (12 instances)
  L2:                    12 MiB (9 instances)
  L3:                    25 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-19
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:
Linux gpu-serv-2 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • SDK version:
main branch 
https://github.com/abetlen/llama-cpp-python/commit/82072802ea0eb68f7f226425e5ea434a3e8e60a0

Failure Information (for bugs)

Error on client side (first terminal):

text = "12345 67890 10Traceback (most recent call last):
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kuruhuru/dev/llm/gguf-server/examples/streaming.py", line 21, in <module>
    for a in response:
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 166, in <genexpr>
    return (
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/openai/api_requestor.py", line 692, in <genexpr>
    return (
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/openai/api_requestor.py", line 115, in parse_stream
    for line in rbody:
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/requests/models.py", line 865, in iter_lines
    for chunk in self.iter_content(
  File "/home/kuruhuru/micromamba/envs/promptflow/lib/python3.9/site-packages/requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

Error on server side:

ERROR:    ASGI callable returned without completing response.
Llama.generate: prefix-match hit
disconnected
Disconnected from client (via refresh/close) Address(host='172.19.0.1', port=53080)
@ArtyomZemlyak (Author)

I found this server parameter:

interrupt_requests: bool = Field(
    default=True,
    description="Whether to interrupt requests when a new request is received.",
)

Changing it to False resolved the issue: streaming now works from both terminals.
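
For anyone hitting the same thing, here is a sketch of starting the server in-process with that field disabled, assuming the create_app/Settings API from llama_cpp.server.app at the commit referenced above; the model path, host, and port are placeholders.

import uvicorn
from llama_cpp.server.app import Settings, create_app

# Disable request interruption so an in-flight stream is not cancelled
# when a second request arrives.
settings = Settings(
    model="/models/your-model.gguf",  # placeholder model path
    interrupt_requests=False,
)

app = create_app(settings=settings)
uvicorn.run(app, host="0.0.0.0", port=8000)

The bundled python3 -m llama_cpp.server CLI is generated from the same Settings model, so the equivalent flag or environment variable should also work, though that is not verified here.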

@abetlen (Owner) commented Nov 10, 2023

@ArtyomZemlyak that should be better documented or maybe not the default behaviour. Currently working on #771 which will improve this by allowing multiple requests to efficiently be processed in parallel.

@abetlen added the bug and documentation labels on Dec 22, 2023
@wac81 commented Feb 14, 2024

> @ArtyomZemlyak that should be better documented or maybe not the default behaviour. Currently working on #771 which will improve this by allowing multiple requests to efficiently be processed in parallel.

Agreed, because a streaming request should keep receiving its response until it finishes, rather than being interrupted by a new request.
