
Using a partitioned A100 GPU via MIG with device_index and faster_index causing ctranslate2 error #1788

Open
johnrisby opened this issue Sep 23, 2024 · 6 comments


@johnrisby

Hi,

I asked this a couple of days ago on the faster-whisper repo, but given it's a CTranslate2 error, I thought I'd better post it here too (apologies for the duplication; the original question is at SYSTRAN/faster-whisper#1018).

I've previously used faster_whisper with device_index across multiple GPUs, but I'm currently using an A100 that I've partitioned with MIG into 7 instances (that may be too many, but I need to test that first).

Using device_index doesn't seem to work. I'm getting a ctranslate2 error.

nvidia-smi shows:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000001:00:00.0 Off |                   On |
| N/A   32C    P0             42W /  300W |      88MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    7   0   0  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    8   0   1  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   10   0   3  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   11   0   4  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   12   0   5  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   6  |              13MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I was hoping I'd be able to pass device_index=[0,1,2,...], but that doesn't work, and neither does using 7,8,9,... (the GI IDs from the table above), i.e.

model = WhisperModel(model_size, device="cuda", device_index=[0,1], compute_type="bfloat16")

Is this possible?

The error is:

Traceback (most recent call last):
  File "/home/xxxxx/xxxxx/test.py", line 30, in <module>
    model = WhisperModel(model_size, device="cuda",
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/anaconda3/envs/xxxxx/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 145, in __init__
    self.model = ctranslate2.models.Whisper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA failed with error invalid device ordinal

Many thanks
John

@minhthuc2502
Collaborator

I can use device_index=[0,1] on my machine, so this is not normal, and it could be an issue with CUDA on your machine. Could you test multiple GPUs with other examples, apart from CTranslate2, to confirm?

Also, try running export CUDA_VISIBLE_DEVICES=0,1
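One caveat worth noting for MIG specifically: CUDA_VISIBLE_DEVICES takes MIG UUIDs rather than plain ordinals, and CUDA normally exposes at most one MIG instance per process, which would be consistent with the "invalid device ordinal" error. A minimal sketch of collecting those UUIDs from `nvidia-smi -L` output follows; the line format and the sample UUIDs are assumptions, not taken from this thread:

```python
import re

def list_mig_uuids(smi_output):
    """Extract MIG device UUIDs from the output of `nvidia-smi -L`."""
    # MIG lines look like (format assumed from typical driver output):
    #   MIG 1g.10gb Device 0: (UUID: MIG-11111111-...)
    # The pattern deliberately skips the parent GPU's "GPU-..." UUID.
    return re.findall(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)", smi_output)

# In a real session you would feed it live output, e.g.:
#   out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
sample = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-5c89852d-0f55-4c30-9274-2ba267bd6236)
  MIG 1g.10gb Device 0: (UUID: MIG-11111111-2222-3333-4444-555566667777)
  MIG 1g.10gb Device 1: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeffff0000)
"""
print(list_mig_uuids(sample))  # the two MIG UUIDs, parent GPU UUID excluded
```

Each process would then set CUDA_VISIBLE_DEVICES to exactly one of these UUIDs before loading the model.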

@johnrisby
Author

> I can use device_index=[0,1] on my machine, so this is not normal, and it could be an issue with CUDA on your machine. Could you test multiple GPUs with other examples, apart from CTranslate2, to confirm?
>
> Also, try running export CUDA_VISIBLE_DEVICES=0,1

Thanks for the reply, and sorry for the delay. I couldn't get it to work at all. Multiple GPUs work in general, but to make it work with faster-whisper I seem to have to run one GPU per script rather than use device_index with multiple MIG partitions. I'm currently using two V100s instead.

Did you manage to get it working with MIG or with multiple hardware GPUs? Thanks

@minhthuc2502
Collaborator

I tested on my server with two V100 GPUs and device_index=[0,1]; the model loads on both GPUs.

@johnrisby
Author

> I tested on my server with two V100 GPUs and device_index=[0,1]; the model loads on both GPUs.

Ah OK, thanks, but as I said in my original post, I've previously used it like this with multiple hardware GPUs, so I knew that worked.

It's when using it with MIG partitioning on an A100 (which allows up to 7 "virtual" GPUs) that I'm struggling.

@minhthuc2502
Collaborator

Ah sorry for the confusion. This case hasn't been tested before, so it's understandable that it doesn't work. Unfortunately, I don't have any immediate suggestions.

@johnrisby
Author

> Ah sorry for the confusion. This case hasn't been tested before, so it's understandable that it doesn't work. Unfortunately, I don't have any immediate suggestions.

No worries, I'm using V100s now instead. Hopefully this can be worked out at some point, though, as MIG is very useful (although it can obviously be worked around by running multiple instances of a script instead of using device_index). Thanks again.
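The multiple-instances workaround can be sketched as a small launcher: one process per MIG instance, each pinned to its partition via CUDA_VISIBLE_DEVICES before CUDA initializes. The UUIDs below are placeholders, and worker.py stands in for any single-GPU transcription script; neither comes from this thread:

```python
import os
import subprocess

# Placeholder MIG UUIDs -- substitute the real ones reported by `nvidia-smi -L`.
MIG_UUIDS = [
    "MIG-11111111-2222-3333-4444-555566667777",
    "MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeffff0000",
]

def env_for(mig_uuid):
    """Build a child environment exposing exactly one MIG instance to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env

if __name__ == "__main__" and os.path.exists("worker.py"):
    # worker.py is a hypothetical single-GPU script; inside each child
    # process its MIG instance appears as CUDA device 0, so it can call
    # WhisperModel(..., device="cuda", device_index=0) unchanged.
    procs = [subprocess.Popen(["python", "worker.py"], env=env_for(u))
             for u in MIG_UUIDS]
    for p in procs:
        p.wait()
```

Since each child sees only its own partition, no per-script code changes are needed beyond removing the multi-device device_index list.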
