File "/home/ubuntu/.local/lib/python3.8/site-packages/llama_cpp/llama.py", line 506, in _create_completion
prompt_tokens: List[llama_cpp.llama_token] = self.tokenize(
File "/home/ubuntu/.local/lib/python3.8/site-packages/llama_cpp/llama.py", line 189, in tokenize
raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')
RuntimeError: Failed to tokenize: text="b" ### Human:Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n\xe6\xa8\xaa\xe5\xba\x97\xe9\x9b\x86\xe5\x9b\xa2\xe4\xb8\x9c\xe7\xa3\x81\xe8\x82\xa1\xe4\xbb\xbd\xe6\x9c\x89\xe9\x99\x90\xe5\x85\xac\xe5\x8f\xb8 \n \n \n \n1
And when I use your PDF, it cannot generate an answer, or it is too slow to generate one.
It seems like the main problem is that the context length is being exceeded. First, try editing these lines in app.py:
Line 59: try lower values for chunk_size and chunk_overlap, e.g. 800 and 150.
If that doesn't help:
Line 78: lower the k value from 4 to 3 (this is the number of retrieved text chunks). A sketch of both edits follows below.
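For reference, here is a minimal sketch of what those two edits typically look like, assuming app.py is built on LangChain. The splitter class, the FAISS store, and the FakeEmbeddings stand-in below are illustrative only; match the names to whatever app.py actually uses.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.fake import FakeEmbeddings  # stand-in embedding, just for this sketch
from langchain.vectorstores import FAISS

# Around line 59: smaller chunks keep the assembled prompt under the
# model's 2048-token context window (n_ctx in the logs).
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
pdf_text = "..."  # text extracted from the PDF
chunks = text_splitter.split_text(pdf_text)

# Around line 78: retrieve fewer chunks, so that
# prompt template + k chunks + question still fits in the context window.
vectorstore = FAISS.from_texts(chunks, FakeEmbeddings(size=384))
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # was k=4
```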
From those logs I'm also assuming you are working with Chinese text. I haven't tested whether that even works; I expect the model to be even slower than usual with it, and the quality of the results will probably be poor.
Perhaps another LLM with more emphasis on multilingual or specifically Chinese text would be better suited for this.
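If you want to see how close you are to the limit, you can count tokens with the same model's tokenizer before the prompt is assembled. A small sketch follows; the model path and the sample chunk text are taken from your logs, everything else is illustrative.

```python
from llama_cpp import Llama

# Same model and context size as in the logs.
llm = Llama(model_path="./ggml-vicuna-13b-1.1-q4_2.bin", n_ctx=2048)

# Count the tokens of one retrieved chunk (example text from the traceback).
chunk = "横店集团东磁股份有限公司"
n_tokens = len(llm.tokenize(chunk.encode("utf-8")))
print(f"chunk uses {n_tokens} tokens")

# The prompt template + k retrieved chunks + the question must all fit in
# n_ctx (2048 here). Chinese text can take several tokens per character,
# so chunks are often much larger in tokens than they look on screen.
```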
File "/home/ubuntu/.local/lib/python3.8/site-packages/llama_cpp/llama.py", line 506, in _create_completion
prompt_tokens: List[llama_cpp.llama_token] = self.tokenize(
File "/home/ubuntu/.local/lib/python3.8/site-packages/llama_cpp/llama.py", line 189, in tokenize
raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')
RuntimeError: Failed to tokenize: text="b" ### Human:Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n\xe6\xa8\xaa\xe5\xba\x97\xe9\x9b\x86\xe5\x9b\xa2\xe4\xb8\x9c\xe7\xa3\x81\xe8\x82\xa1\xe4\xbb\xbd\xe6\x9c\x89\xe9\x99\x90\xe5\x85\xac\xe5\x8f\xb8 \n \n \n \n1
When I use your PDF, it still cannot generate an answer, or it is too slow to generate one:
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from ./ggml-vicuna-13b-1.1-q4_2.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 5 (mostly Q4_2)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 85.08 KB
llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Token indices sequence length is longer than the specified maximum sequence length for this model (1104 > 1024). Running this sequence through the model will result in indexing errors
It is always stuck waiting here.