Google gemma-2-9b cuBLAS error running on 48G A6000 (Ampere) under CUDA 12.4 and dual 24G 4090 under CUDA 12.5 #28

Open
obriensystems opened this issue Jun 29, 2024 · 2 comments

obriensystems commented Jun 29, 2024

A6000
NVIDIA-SMI 551.86 Driver Version: 551.86 CUDA Version: 12.4

4090
NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5

A4500


import os, torch
# default dual GPU - either PCIe bus or NVLink - slowdowns
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# specific GPU - model must fit entirely in memory: RTX 3500 Ada = 12G, A4000 = 16G, A4500 = 20G, A6000 = 48G, RTX 4000 Ada = 20G, RTX 5000 Ada = 32G, RTX 6000 Ada = 48G
#os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from transformers import AutoTokenizer, AutoModelForCausalLM
from datetime import datetime

access_token='hf....CQqH'

model = "google/gemma-2-27b"#7b"
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# GPU
model = AutoModelForCausalLM.from_pretrained(model, device_map="auto", 
                                             torch_dtype=torch.bfloat16,
                                             token=access_token)
# CPU
#model = AutoModelForCausalLM.from_pretrained(model,token=access_token)

input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
time_start = datetime.now().strftime("%H:%M:%S")
print("genarate start: ", datetime.now().strftime("%H:%M:%S"))

# GPU
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# CPU
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, 
                         max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))

print("end", datetime.now().strftime("%H:%M:%S"))
time_end = datetime.now().strftime("%H:%M:%S")


michael@14900c MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma-gpu.py
Loading checkpoint shards: 100%|##########| 8/8 [00:08<00:00,  1.09s/it]
C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [1,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [0,0,0], thread: [62,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [0,0,0], thread: [63,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
generate start:  10:11:41
Traceback (most recent call last):
  File "C:\wse_github\obrienlabsdev\machine-learning\environments\windows\src\google-gemma\gemma-gpu.py", line 30, in <module>
    outputs = model.generate(**input_ids,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\generation\utils.py", line 1914, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\generation\utils.py", line 2651, in _sample
    outputs = self(
              ^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 1058, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 898, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 654, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 164, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                                                           ^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\opt\Python312\Lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
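
A hedged sketch of a retry under stricter settings, not verified against this trace: the warning above shows the SDPA attention path was used, and the Gemma-2 model card recommends eager attention because early transformers releases did not support Gemma-2's attention logit soft-capping in SDPA/flash attention. Pinning to a single GPU also avoids the dual-GPU sharding path. The token and prompt below are placeholders.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # pin to a single GPU before CUDA is initialized

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

access_token = "hf_..."                     # placeholder token
model_id = "google/gemma-2-9b"              # checkpoint from the issue title

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",            # Gemma-2 model card guidance for older transformers
    token=access_token,
)

inputs = tokenizer("test prompt", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Upgrading transformers to a release with full Gemma-2 support is another common first step; neither change is confirmed here as the fix for the cuBLAS failure above.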

@obriensystems obriensystems self-assigned this Jun 29, 2024

obriensystems commented Dec 5, 2024

Switching to the transformers pipeline API works.
~11 GB GPU memory for https://huggingface.co/google/gemma-2-2b
~36 GB for gemma-2-9b

import os, torch
# default dual GPU - either PCIe bus or NVLink - slowdowns
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# specific GPU - model must fit entirely in memory: RTX 3500 Ada = 12G, A4000 = 16G, A4500 = 20G, A6000 = 48G, RTX 4000 Ada = 20G, RTX 5000 Ada = 32G, RTX 6000 Ada = 48G
#os.environ["CUDA_VISIBLE_DEVICES"] = "0"

#from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline
from datetime import datetime

access_token='hf....CQqH'

#https://huggingface.co/google/gemma-2-2b
#11g
amodel = "google/gemma-2-2b"#7b"
#https://huggingface.co/google/gemma-2-9b
#36g
#amodel = "google/gemma-2-9b"#7b"
#https://huggingface.co/google/gemma-2-27b
#amodel = "google/gemma-2-27b"#7b"

#tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
# GPU
pipe = pipeline(
    "text-generation",
    model=amodel,
    device="cuda",  # replace with "mps" to run on a Mac device
)
#model = AutoModelForCausalLM.from_pretrained(amodel, device_map="auto", 
#                                             torch_dtype=torch.float16,
#                                             token=access_token)
# CPU
#model = AutoModelForCausalLM.from_pretrained(amodel,token=access_token)

input_text = "how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process."
time_start = datetime.now().strftime("%H:%M:%S")
print("genarate start: ", datetime.now().strftime("%H:%M:%S"))

# GPU
#input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# CPU
#input_ids = tokenizer(input_text, return_tensors="pt")
#outputs = model.generate(**input_ids, max_new_tokens=10000)
#print(tokenizer.decode(outputs[0]))

outputs = pipe(input_text, max_new_tokens=256)
response = outputs[0]["generated_text"]
print(response)

print("end", datetime.now().strftime("%H:%M:%S"))
time_end = datetime.now().strftime("%H:%M:%S")
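
For reference on the memory numbers above: the pipeline loads weights in float32 by default, which lines up with roughly 11 GB for the 2B and 36 GB for the 9B checkpoint. A hypothetical variant, not run here, that passes torch_dtype through the pipeline should roughly halve that footprint.

import torch
from transformers import pipeline

# Hypothetical variant of the working pipeline script: identical call,
# but loading the weights in bfloat16 to cut GPU memory roughly in half.
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b",
    device="cuda",
    torch_dtype=torch.bfloat16,   # pipeline defaults to float32 otherwise
)

outputs = pipe("how is gold made in collapsing neutron stars", max_new_tokens=64)
print(outputs[0]["generated_text"])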


@obriensystems

michael@14900c MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows/src/google-gemma (main)
$ python gemma2-pipeline-gpu.py
Loading checkpoint shards: 100%|##########| 8/8 [00:08<00:00,  1.10s/it]
C:\opt\Python312\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
generate start:  22:26:40
how is gold made in collapsing neutron stars - specifically what is the ratio created during the beta and r process.

What is the ratio in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in the r-process in
end 22:26:58
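
The output above is the usual greedy-decoding repetition loop from a base (pretraining-only) checkpoint. Two things worth trying, neither verified here: the instruction-tuned variant (e.g. google/gemma-2-2b-it) with its chat template, or sampling parameters passed through the pipeline call. A minimal sketch reusing the pipe object and input_text from the script above; the values are illustrative only.

# Sketch: sampling plus a repetition penalty to break the greedy
# "in the r-process" loop; generation kwargs are forwarded to generate().
outputs = pipe(
    input_text,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(outputs[0]["generated_text"])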

obriensystems added a commit that referenced this issue Dec 5, 2024