Google gemma-2-9b cuBLAS error running on 48G A6000 (Ampere) under CUDA 12.4 and dual 24G 4090 under CUDA 12.5 #28

Open
obriensystems opened this issue Jun 29, 2024 · 2 comments

obriensystems commented Jun 29, 2024

A6000
NVIDIA-SMI 551.86 Driver Version: 551.86 CUDA Version: 12.4

4090
NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5

A4500


import os, torch
# default dual GPU - either PCIe bus or NVLink - slowdowns
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# specific GPU - model must fit entirely in memory: RTX 3500 Ada = 12G, A4000 = 16G, A4500 = 20G, A6000 = 48G, RTX 4000 Ada = 20G, RTX 5000 Ada = 32G, RTX 6000 Ada = 48G
#os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from transformers import AutoTokenizer, AutoModelForCausalLM
from datetime import datetime

access_token='hf....CQqH'

model = "google/gemma-2-27b"#7b"
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# GPU
model = AutoModelForCausalLM.from_pretrained(model, device_map="auto", 
                                             torch_dtype=torch.bfloat16,
                                             token=access_token)
# CPU
#model = AutoModelForCausalLM.from_pretrained(model,token=access_token)

input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
time_start = datetime.now().strftime("%H:%M:%S")
print("genarate start: ", datetime.now().strftime("%H:%M:%S"))

# GPU
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# CPU
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, 
                         max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))

print("end", datetime.now().strftime("%H:%M:%S"))
time_end = datetime.now().strftime("%H:%M:%S")


michael@14900c MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|##########| 8/8 [00:08<00:00,  1.09s/it]
C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [1,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [0,0,0], thread: [62,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [0,0,0], thread: [63,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
generate start:  10:11:41
Traceback (most recent call last):
  File "C:\wse_github\obrienlabsdev\machine-learning\environments\windows\src\google-gemma\gemma-gpu.py", line 30, in <module>
    outputs = model.generate(**input_ids,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\generation\utils.py", line 1914, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\generation\utils.py", line 2651, in _sample
    outputs = self(
              ^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 1058, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 898, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 654, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 164, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                                                           ^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
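
A hedged sketch of a retry under stricter settings, not verified against this trace: the warning above shows the SDPA attention path was used, and the Gemma-2 model card recommends eager attention because early transformers releases did not support Gemma-2's attention logit soft-capping in SDPA/flash attention. Pinning to a single GPU also avoids the dual-GPU sharding path. The token and prompt below are placeholders.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # pin to a single GPU before CUDA is initialized

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

access_token = "hf_..."                     # placeholder token
model_id = "google/gemma-2-9b"              # checkpoint from the issue title

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",            # Gemma-2 model card guidance for older transformers
    token=access_token,
)

inputs = tokenizer("test prompt", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Upgrading transformers to a release with full Gemma-2 support is another common first step; neither change is confirmed here as the fix for the cuBLAS failure above.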

@obriensystems obriensystems self-assigned this Jun 29, 2024

obriensystems commented Dec 5, 2024

Switching to the transformers pipeline API works.
~11 GB GPU memory for https://huggingface.co/google/gemma-2-2b
~36 GB for gemma-2-9b

import os, torch
# default dual GPU - either PCIe bus or NVLink - slowdowns
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# specific GPU - model must fit entirely in memory: RTX 3500 Ada = 12G, A4000 = 16G, A4500 = 20G, A6000 = 48G, RTX 4000 Ada = 20G, RTX 5000 Ada = 32G, RTX 6000 Ada = 48G
#os.environ["CUDA_VISIBLE_DEVICES"] = "0"

#from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline
from datetime import datetime

access_token='hf....CQqH'

#https://huggingface.co/google/gemma-2-2b
#11g
amodel = "google/gemma-2-2b"#7b"
#https://huggingface.co/google/gemma-2-9b
#36g
#amodel = "google/gemma-2-9b"#7b"
#https://huggingface.co/google/gemma-2-27b
#amodel = "google/gemma-2-27b"#7b"

#tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# GPU
pipe = pipeline(
    "text-generation",
    model=amodel,
    device="cuda",  # replace with "mps" to run on a Mac device
)
#model = AutoModelForCausalLM.from_pretrained(amodel, device_map="auto", 
#                                             torch_dtype=torch.float16,
#                                             token=access_token)
# CPU
#model = AutoModelForCausalLM.from_pretrained(amodel,token=access_token)

input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
time_start = datetime.now().strftime("%H:%M:%S")
print("genarate start: ", datetime.now().strftime("%H:%M:%S"))

# GPU
#input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# CPU
#input_ids = tokenizer(input_text, return_tensors="pt")
#outputs = model.generate(**input_ids, max_new_tokens=10000)
#print(tokenizer.decode(outputs[0]))

outputs = pipe(input_text, max_new_tokens=256)
response = outputs[0]["generated_text"]
print(response)

print("end", datetime.now().strftime("%H:%M:%S"))
time_end = datetime.now().strftime("%H:%M:%S")
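
For reference on the memory numbers above: the pipeline loads weights in float32 by default, which lines up with roughly 11 GB for the 2B and 36 GB for the 9B checkpoint. A hypothetical variant, not run here, that passes torch_dtype through the pipeline should roughly halve that footprint.

import torch
from transformers import pipeline

# Hypothetical variant of the working pipeline script: identical call,
# but loading the weights in bfloat16 to cut GPU memory roughly in half.
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b",
    device="cuda",
    torch_dtype=torch.bfloat16,   # pipeline defaults to float32 otherwise
)

outputs = pipe("how is gold made in collapsing neutron stars", max_new_tokens=64)
print(outputs[0]["generated_text"])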


@obriensystems

michael@14900c MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma2-pipeline-gpu.py
Loading checkpoint shards: 100%|##########| 8/8 [00:08<00:00,  1.10s/it]
C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
generate start:  22:26:40
how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.

What is the ratio in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in
end 22:26:58
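
The output above is the usual greedy-decoding repetition loop from a base (pretraining-only) checkpoint. Two things worth trying, neither verified here: the instruction-tuned variant (e.g. google/gemma-2-2b-it) with its chat template, or sampling parameters passed through the pipeline call. A minimal sketch reusing the pipe object and input_text from the script above; the values are illustrative only.

# Sketch: sampling plus a repetition penalty to break the greedy
# "in the r-process" loop; generation kwargs are forwarded to generate().
outputs = pipe(
    input_text,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(outputs[0]["generated_text"])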

obriensystems added a commit that referenced this issue Dec 5, 2024