WARNING - qlinear_old.py:16 - CUDA extension not installed. - gptq models #745
mhaustria2 started this conversation in General
Hello guys,
first of all, I really like localGPT and have already been working with it for some time to analyse log files.
So far I have only used GGUF models with CUDA enabled. Ingest is fast, but prompting could be faster compared to when I use other tools with GPTQ models.
But whenever I try to load a GPTQ model, I get a strange "CUDA extension not installed" error.
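As far as I understand it, the warning at qlinear_old.py:16 just means auto-gptq could not import its compiled CUDA kernels, so the quantized layers fall back to a slower (and apparently fragile) path. A quick check I would run first (the extension module names autogptq_cuda_256 / autogptq_cuda_64 are an assumption based on the auto-gptq 0.x CUDA wheels):

```python
# Sanity-check sketch: does torch see the GPU, and did the auto-gptq wheel ship compiled CUDA kernels?
# The extension module names below are an assumption (auto-gptq 0.x CUDA wheels).
import torch

print("GPU available:", torch.cuda.is_available(), "| torch CUDA build:", torch.version.cuda)

try:
    import autogptq_cuda_256  # noqa: F401 - imported only to check whether the extension exists
    import autogptq_cuda_64   # noqa: F401
    print("auto-gptq CUDA extension found")
except ImportError as err:
    print("auto-gptq CUDA extension NOT installed:", err)
```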
One example:
2024-02-11 00:35:03,695 - INFO - run_localGPT.py:244 - Running on: cuda
2024-02-11 00:35:03,695 - INFO - run_localGPT.py:245 - Display Source Documents set to: False
2024-02-11 00:35:03,695 - INFO - run_localGPT.py:246 - Use history set to: False
2024-02-11 00:35:03,955 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
max_seq_length 512
2024-02-11 00:35:04,761 - INFO - run_localGPT.py:132 - Loaded embeddings from hkunlp/instructor-large
2024-02-11 00:35:04,823 - INFO - run_localGPT.py:60 - Loading Model: TheBloke/Llama-2-7B-Chat-GPTQ, on: cuda
2024-02-11 00:35:04,823 - INFO - run_localGPT.py:61 - This action can take a few minutes!
2024-02-11 00:35:04,823 - INFO - load_models.py:94 - Using AutoGPTQForCausalLM for quantized models
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████| 727/727 [00:00<?, ?B/s]
D:\manualai\anaconda\envs\localgptq\lib\site-packages\huggingface_hub\file_download.py:149: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\max\.cache\huggingface\hub\models--TheBloke--Llama-2-7B-Chat-GPTQ. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer.model: 100%|██████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 9.42MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.45MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████| 411/411 [00:00<?, ?B/s]
2024-02-11 00:35:07,210 - INFO - load_models.py:101 - Tokenizer loaded
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 789/789 [00:00<?, ?B/s]
quantize_config.json: 100%|███████████████████████████████████████████████████████████████████| 188/188 [00:00<?, ?B/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████| 3.90G/3.90G [01:43<00:00, 37.8MB/s]
2024-02-11 00:36:51,728 - INFO - _base.py:727 - lm_head not been quantized, will be ignored when make_quant.
2024-02-11 00:36:51,737 - WARNING - qlinear_old.py:16 - CUDA extension not installed.
2024-02-11 00:36:52,789 - INFO - modeling.py:940 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2024-02-11 00:36:54,726 - WARNING - fused_llama_mlp.py:306 - skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Another example:
2024-02-11 00:49:58,074 - INFO - run_localGPT.py:244 - Running on: cuda
2024-02-11 00:49:58,075 - INFO - run_localGPT.py:245 - Display Source Documents set to: False
2024-02-11 00:49:58,075 - INFO - run_localGPT.py:246 - Use history set to: False
2024-02-11 00:49:58,397 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
max_seq_length 512
2024-02-11 00:49:59,462 - INFO - run_localGPT.py:132 - Loaded embeddings from hkunlp/instructor-large
2024-02-11 00:49:59,522 - INFO - run_localGPT.py:60 - Loading Model: TheBloke/gpt4-alpaca-lora_mlp-65B-GPTQ, on: cuda
2024-02-11 00:49:59,522 - INFO - run_localGPT.py:61 - This action can take a few minutes!
2024-02-11 00:49:59,522 - INFO - load_models.py:94 - Using AutoGPTQForCausalLM for quantized models
2024-02-11 00:50:00,019 - INFO - load_models.py:101 - Tokenizer loaded
model.safetensors: 100%|██████████████████████████████████████████████████████████| 33.5G/33.5G [14:50<00:00, 37.6MB/s]
D:\manualai\anaconda\envs\localgptq\lib\site-packages\huggingface_hub\file_download.py:149: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\max\.cache\huggingface\hub\models--TheBloke--gpt4-alpaca-lora_mlp-65B-GPTQ. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
2024-02-11 01:04:52,604 - INFO - _base.py:727 - lm_head not been quantized, will be ignored when make_quant.
2024-02-11 01:04:52,604 - WARNING - qlinear_old.py:16 - CUDA extension not installed.
2024-02-11 01:04:58,960 - INFO - modeling.py:940 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2024-02-11 01:05:40,462 - WARNING - fused_llama_mlp.py:306 - skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 132/132 [00:00<?, ?B/s]
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'Qwen2ForCausalLM', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2024-02-11 01:05:40,827 - INFO - run_localGPT.py:95 - Local LLM Loaded
Enter a query: write a short summary please
D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\generation\configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
Traceback (most recent call last):
File "D:\localgpt\localGPT\run_localGPT.py", line 285, in
main()
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\click\core.py", line 1157, in call
return self.main(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "D:\localgpt\localGPT\run_localGPT.py", line 259, in main
res = qa(query)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 282, in call
raise e
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\retrieval_qa\base.py", line 139, in _call
answer = self.combine_documents_chain.run(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 480, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 282, in call
raise e
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\combine_documents\base.py", line 105, in _call
output, extra_return_dict = self.combine_docs(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\combine_documents\stuff.py", line 171, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\llm.py", line 255, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 282, in call
raise e
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\llm.py", line 91, in _call
response = self.generate([inputs], run_manager=run_manager)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\chains\llm.py", line 101, in generate
return self.llm.generate_prompt(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\llms\base.py", line 467, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\llms\base.py", line 598, in generate
output = self._generate_helper(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\llms\base.py", line 504, in _generate_helper
raise e
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\llms\base.py", line 491, in _generate_helper
self._generate(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\llms\base.py", line 977, in _generate
self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\langchain\llms\huggingface_pipeline.py", line 167, in _call
response = self.pipeline(prompt)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\pipelines\text_generation.py", line 219, in call
return super().call(text_inputs, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\pipelines\base.py", line 1162, in call
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\pipelines\base.py", line 1169, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\pipelines\base.py", line 1068, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\pipelines\text_generation.py", line 295, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\auto_gptq\modeling_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\generation\utils.py", line 1479, in generate
return self.greedy_search(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\generation\utils.py", line 2340, in greedy_search
outputs = self(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1183, in forward
outputs = self.model(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1070, in forward
layer_outputs = decoder_layer(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\accelerate\hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\models\llama\modeling_llama.py", line 798, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\auto_gptq\nn_modules\fused_llama_attn.py", line 62, in forward
kv_seq_len += past_key_value[0].shape[-2]
File "D:\manualai\anaconda\envs\localgptq\lib\site-packages\transformers\cache_utils.py", line 78, in getitem
raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
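If I read the last frames correctly, the KeyError comes from auto_gptq's fused_llama_attn.py indexing past_key_value like a tuple, while newer transformers versions pass a Cache object, so on top of the missing CUDA extension there might also be a transformers / auto-gptq version mismatch. This is roughly what I would try next to bypass the fused modules (a sketch only; the use_triton, inject_fused_attention and inject_fused_mlp parameters are assumed from the auto-gptq 0.x from_quantized API):

```python
# Workaround sketch (untested): load the GPTQ model with auto-gptq's fused modules turned off,
# so the fused_llama_attn.py code path that raises the KeyError is never injected.
# Parameter names are assumed from the auto-gptq 0.x API.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,              # triton is not supported on Windows anyway
    inject_fused_attention=False,  # skip fused_llama_attn.py
    inject_fused_mlp=False,        # skip fused_llama_mlp.py
)
```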
NVIDIA information (nvcc and nvidia-smi output):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.17 Driver Version: 546.17 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 Off | N/A |
| 0% 33C P8 4W / 165W | 713MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1780 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 1936 C+G ...n\121.0.2277.106\msedgewebview2.exe N/A |
| 0 N/A N/A 3364 C+G ...n\121.0.2277.106\msedgewebview2.exe N/A |
| 0 N/A N/A 5944 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 6456 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 6916 C+G ...nt.CBS_cw5n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 6940 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 10300 C+G ...tionsPlus\logioptionsplus_agent.exe N/A |
| 0 N/A N/A 12088 C+G ...on\HEX\Creative Cloud UI Helper.exe N/A |
| 0 N/A N/A 14856 C+G ...__8wekyb3d8bbwe\Notepad\Notepad.exe N/A |
| 0 N/A N/A 15644 C+G ...ejd91yc\AdobeNotificationClient.exe N/A |
| 0 N/A N/A 16052 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 18352 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
| 0 N/A N/A 20588 C+G ...Brave-Browser\Application\brave.exe N/A |
| 0 N/A N/A 32380 C+G ...Programs\Microsoft VS Code\Code.exe N/A |
+---------------------------------------------------------------------------------------+
I found solutions in other discussions about uninstalling and reinstalling auto-gptq, but that did not help.
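For reference, these are the versions I would compare after reinstalling, since the traceback makes me suspect transformers is newer than what this auto-gptq build expects (assuming auto_gptq exposes __version__ like the other packages):

```python
# Version-overview sketch: print the installed versions so a torch / transformers / auto-gptq
# mismatch becomes visible. Assumes auto_gptq exposes __version__.
import torch
import transformers
import auto_gptq

print("torch       :", torch.__version__, "| CUDA build:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("auto-gptq   :", auto_gptq.__version__)
```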
Any idea what I am doing wrong?
Thanks a lot