CUDA Out of Memory Error when Running get_embedding.py on Small Dataset #33

00dylan00 opened this issue Jun 25, 2024 · 3 comments

@00dylan00

I encountered a CUDA Out of Memory error when running the script get_embedding.py with a small dataset containing 2 rows.
Below are the details of the error and the command used to run the script.

Also, what is the suggested environment for running scFoundation, and how much GPU memory is recommended?

Command Used:

sbatch test.3.sh /home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py --task_name SCAD_bulk_Etoposide --input_type bulk --output_type cell --pool_type all --tgthighres f1 --data_path X_df_sample.csv --save_path ./ --pre_normalized F --version ce --demo
X_df_sample.csv contains the same data as X_df.csv but with only 2 rows.
Error Log:

Traceback (most recent call last):
  File "/home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py", line 305, in <module>
    main()
  File "/home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py", line 232, in main
    geneemb = pretrainmodel.encoder(x,x_padding)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sbnb/ddalton/projects/scFoundation/model/pretrainmodels/transformer.py", line 42, in forward
    x = mod(x, src_key_padding_mask=padding_mask) # , src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/transformer.py", line 506, in forward
    return torch._transformer_encoder_layer_fwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.45 GiB (GPU 0; 23.69 GiB total capacity; 21.57 GiB already allocated; 980.06 MiB free; 21.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
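
(Side note: the max_split_size_mb hint at the end of the message can be set through the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch is below, assuming it is applied before the first CUDA allocation in get_embedding.py. The 512 value is only an illustration.)

import os

# The caching allocator reads this variable when it initializes, so it must be
# set before the first tensor is placed on the GPU; 512 MB is an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"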

Memory Tracking
I also tracked memory usage with this function:

import torch

def get_cuda_info():
    # Report current, reserved, and peak-reserved memory on GPU 0 in GB
    mem_alloc = "%fGB" % (torch.cuda.memory_allocated(0) / 1024 / 1024 / 1024)
    mem_reserved = "%fGB" % (torch.cuda.memory_reserved(0) / 1024 / 1024 / 1024)
    max_memory_reserved = "%fGB" % (torch.cuda.max_memory_reserved(0) / 1024 / 1024 / 1024)
    return "GPU alloc: {}. Reserved: {}. MaxReserved: {}".format(mem_alloc, mem_reserved, max_memory_reserved)

At various steps in the get_embedding.py script, just before geneemb = pretrainmodel.encoder(x,x_padding):

            #Cell embedding
            if args.output_type=='cell':
                position_gene_ids, _ = gatherData(data_gene_ids, value_labels, pretrainconfig['pad_token_id'])

                print(get_cuda_info())

                x = pretrainmodel.token_emb(torch.unsqueeze(x, 2).float(), output_weight = 0)
                print(x.shape)
                
                print(get_cuda_info())

                position_emb = pretrainmodel.pos_emb(position_gene_ids)
                x += position_emb

                print(get_cuda_info())

With the following output:

  0%|          | 0/2 [00:03<?, ?it/s]
GPU alloc: 0.445247GB. Reserved: 0.494141GB. MaxReserved: 0.494141GB
torch.Size([1, 15291, 768])
GPU alloc: 0.488881GB. Reserved: 0.558594GB. MaxReserved: 0.558594GB
GPU alloc: 0.532629GB. Reserved: 0.603516GB. MaxReserved: 0.603516GB
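
(For what it's worth, the 10.45 GiB requested in the error is what the full self-attention score matrix would take for this sequence length in fp32, assuming 12 attention heads for the 768-dim model; the head count is an assumption on my part, not confirmed from the code.)

# Back-of-the-envelope check of the failed allocation:
n, heads, bytes_per_float = 15291, 12, 4      # 12 heads (768 / 64) is assumed
attn_bytes = n * n * heads * bytes_per_float  # full attention score matrix, fp32
print(attn_bytes / 2**30)                     # ~10.45 GiB, matching the error message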

Environment Details
PyTorch version: 1.13.1+cu117
CUDA version: 11.7
GPU: 24 GB total capacity

Thanks in advance!

@WhirlFirst
Collaborator

Hi,
The GPU memory required by scFoundation depends on the sparsity of the cell expression vector, not on the number of cells. We recommend using A100 40G or 80G for local inference.

@00dylan00
Author

00dylan00 commented Jul 4, 2024

Unfortunately I don't currently have access to GPUs that large. I tried running the code with float16 instead of float32 (i.e. using PyTorch's .half() option). This worked very well in most cases, and I was able to run the examples, although for some samples in my own dataset it again ran out of memory.
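
(For reference, a rough sketch of the fp16 change, reusing the lines from get_embedding.py quoted above; the exact placement and any remaining .float() casts may differ in the actual script.)

pretrainmodel = pretrainmodel.half()  # cast all weights to float16 after loading
x = pretrainmodel.token_emb(torch.unsqueeze(x, 2).half(), output_weight=0)  # was .float()
position_emb = pretrainmodel.pos_emb(position_gene_ids)
x += position_emb
geneemb = pretrainmodel.encoder(x, x_padding)  # boolean padding mask stays unchanged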

In light of these memory constraints, do you have any recommendations? And as a follow-up, is there any chance you will release the medium- and smaller-sized models?

@Prunoideae

.half() made the model output NaN values on several datasets I used. You can instead set a cutoff of 3, 5, or 10 TPM to increase the sparsity of the data and reduce memory usage when using the encoder.
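
(A minimal sketch of such a cutoff applied to the input CSV, assuming the values are already TPM and the layout matches X_df.csv; the file names here are illustrative.)

import pandas as pd

tpm_cutoff = 10  # or 3 / 5, as suggested above
X_df = pd.read_csv("X_df.csv", index_col=0)  # samples as rows, genes as columns (assumed)
X_df[X_df < tpm_cutoff] = 0                  # zero out low-TPM entries to increase sparsity
X_df.to_csv("X_df_sparse.csv")               # then pass this file via --data_path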

Also, it requires ~80 GB of GPU memory on a typical bulk RNA dataset, which is quite extreme. I modified the encoder layers a bit to offload them to different GPUs, and that worked well. However, it requires enough GPUs to add up to a large total amount of memory; in my case, 4x 3090s.

To do so, modify the transformer.py to:

import torch
import torch.nn as nn
import torch.nn.functional as F

visible_gpus = torch.cuda.device_count()


class pytorchTransformerModule(nn.Module):
    def __init__(
        self,
        max_seq_len,
        dim,
        depth,
        heads,
        ff_mult=4,
        norm_first=False,
    ):
        super(pytorchTransformerModule, self).__init__()

        self.max_seq_len = max_seq_len
        self.depth = depth
        layers = []
        for i in range(depth):
            device_index = i % visible_gpus
            layers.append(
                nn.TransformerEncoderLayer(
                    d_model=dim,
                    nhead=heads,
                    dim_feedforward=dim * ff_mult,
                    batch_first=True,
                    norm_first=norm_first,
                    # activation="gelu",
                ).to(f"cuda:{device_index}") # add layers in a round-robin manner, but maybe better to make it consistent so no so many swaps happen
            )

        self.transformer_encoder = nn.ModuleList(layers)
        self.norm = nn.LayerNorm(dim).to("cuda:0")

    def forward(self, x, padding_mask):
        b, n, _, device = *x.shape, x.device
        assert (
            n <= self.max_seq_len
        ), f"sequence length {n} must be less than the max sequence length {self.max_seq_len}"

        # x get encodings [B, N, D] , batch_first is True
        for index, mod in enumerate(self.transformer_encoder):
            device_index = (
                index % visible_gpus
            )  # get index of the device and copy x/mask to it
            x = x.to(f"cuda:{device_index}")
            padding_mask = padding_mask.to(f"cuda:{device_index}")
            x = mod(
                x, src_key_padding_mask=padding_mask
            )  # , src_mask=mask, src_key_padding_mask=src_key_padding_mask)
        # x = self.transformer_encoder(x)
        x = self.norm(x.to("cuda:0"))
        return x

Also, disable the .cuda() call in load_model_frommmf in load.py, and explicitly specify devices for the remaining modules, since .cuda() is no longer applied to the whole model.
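
(A rough sketch of that change, assuming load_model_frommmf currently moves the whole model with .cuda(); the module names follow the usage in get_embedding.py above, and any other heads in the checkpoint would need the same treatment.)

# Instead of model.cuda(), place the non-encoder modules explicitly;
# the encoder layers already pick their own devices in the modified transformer.py.
model.token_emb.to("cuda:0")
model.pos_emb.to("cuda:0")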

My NVTop output: [screenshot]
