
Using a partitioned A100 GPU via MIG with device_index and faster_index causing ctranslate2 error #1788

Open
johnrisby opened this issue Sep 23, 2024 · 6 comments


@johnrisby

Hi,

I asked this a couple of days ago on the faster-whisper repo, but given it's a CTranslate2 error, I thought I'd better post it here too (apologies for the duplication; the original question is at SYSTRAN/faster-whisper#1018).

I've previously used faster_whisper with device_index across multiple GPUs, but I'm currently using an A100 that I've partitioned with MIG into 7 instances (that may be too many, but I need to test that first).

Using device_index doesn't seem to work. I'm getting a ctranslate2 error.

nvidia-smi shows:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000001:00:00.0 Off |                   On |
| N/A   32C    P0             42W /  300W |      88MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    7   0   0  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    8   0   1  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   10   0   3  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   11   0   4  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   12   0   5  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   6  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I was hoping I'd be able to pass device_index=[0,1,2,...], but that doesn't work, and neither does using 7,8,9,... (the GI IDs from the table above), i.e.

model = WhisperModel(model_size, device="cuda", device_index=[0,1], compute_type="bfloat16")

Is this possible?

The error is:

Traceback (most recent call last):
  File "/home/xxxxx/xxxxx/test.py", line 30, in <module>
    model = WhisperModel(model_size, device="cuda",
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/anaconda3/envs/xxxxx/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 145, in __init__
    self.model = ctranslate2.models.Whisper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA failed with error invalid device ordinal

Many thanks
John

@minhthuc2502
Collaborator

I can use device_index=[0,1] on my machine, so this is not normal, and it could be an issue with CUDA on your machine. Could you test multiple GPUs with other examples, apart from CTranslate2, to confirm?

Also, try running export CUDA_VISIBLE_DEVICES=0,1
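One caveat worth noting for MIG specifically: CUDA_VISIBLE_DEVICES takes MIG UUIDs rather than plain ordinals, and CUDA normally exposes at most one MIG instance per process, which would be consistent with the "invalid device ordinal" error. A minimal sketch of collecting those UUIDs from `nvidia-smi -L` output follows; the line format and the sample UUIDs are assumptions, not taken from this thread:

```python
import re

def list_mig_uuids(smi_output):
    """Extract MIG device UUIDs from the output of `nvidia-smi -L`."""
    # MIG lines look like (format assumed from typical driver output):
    #   MIG 1g.10gb Device 0: (UUID: MIG-11111111-...)
    # The pattern deliberately skips the parent GPU's "GPU-..." UUID.
    return re.findall(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)", smi_output)

# In a real session you would feed it live output, e.g.:
#   out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
sample = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-5c89852d-0f55-4c30-9274-2ba267bd6236)
  MIG 1g.10gb Device 0: (UUID: MIG-11111111-2222-3333-4444-555566667777)
  MIG 1g.10gb Device 1: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeffff0000)
"""
print(list_mig_uuids(sample))  # the two MIG UUIDs, parent GPU UUID excluded
```

Each process would then set CUDA_VISIBLE_DEVICES to exactly one of these UUIDs before loading the model.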

@johnrisby
Author

> I can use device_index=[0,1] on my machine, so this is not normal, and it could be an issue with CUDA on your machine. Could you test multiple GPUs with other examples, apart from CTranslate2, to confirm?
>
> Also, try running export CUDA_VISIBLE_DEVICES=0,1

Thanks for the reply, and sorry for the delay. I couldn't get it to work at all. Multiple GPUs work in general, but to make it work with faster-whisper I seem to have to run one GPU per script rather than use device_index with multiple MIG partitions. I'm currently using two V100s instead.

Did you manage to get it working with MIG or with multiple hardware GPUs? Thanks

@minhthuc2502
Collaborator

I tested on my server with two V100 GPUs and device_index=[0,1]; the model loads on both GPUs.

@johnrisby
Author

> I tested on my server with two V100 GPUs and device_index=[0,1]; the model loads on both GPUs.

Ah OK, thanks, but as I said in my original post, I've previously used it like this with multiple hardware GPUs, so I knew that worked.

It's when using it with MIG partitioning on an A100 (which allows up to 7 "virtual" GPUs) that I'm struggling.

@minhthuc2502
Collaborator

Ah sorry for the confusion. This case hasn't been tested before, so it's understandable that it doesn't work. Unfortunately, I don't have any immediate suggestions.

@johnrisby
Author

> Ah sorry for the confusion. This case hasn't been tested before, so it's understandable that it doesn't work. Unfortunately, I don't have any immediate suggestions.

No worries, I'm using V100s now instead. Hopefully this can be worked out at some point, though, as MIG is very useful (although it can obviously be worked around by running multiple instances of a script instead of using device_index). Thanks again.
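The multiple-instances workaround can be sketched as a small launcher: one process per MIG instance, each pinned to its partition via CUDA_VISIBLE_DEVICES before CUDA initializes. The UUIDs below are placeholders, and worker.py stands in for any single-GPU transcription script; neither comes from this thread:

```python
import os
import subprocess

# Placeholder MIG UUIDs -- substitute the real ones reported by `nvidia-smi -L`.
MIG_UUIDS = [
    "MIG-11111111-2222-3333-4444-555566667777",
    "MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeffff0000",
]

def env_for(mig_uuid):
    """Build a child environment exposing exactly one MIG instance to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env

if __name__ == "__main__" and os.path.exists("worker.py"):
    # worker.py is a hypothetical single-GPU script; inside each child
    # process its MIG instance appears as CUDA device 0, so it can call
    # WhisperModel(..., device="cuda", device_index=0) unchanged.
    procs = [subprocess.Popen(["python", "worker.py"], env=env_for(u))
             for u in MIG_UUIDS]
    for p in procs:
        p.wait()
```

Since each child sees only its own partition, no per-script code changes are needed beyond removing the multi-device device_index list.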
