CUDA Out of Memory Error when Running get_embedding.py on Small Dataset #33

00dylan00 opened this issue Jun 25, 2024 · 3 comments

@00dylan00

I encountered a CUDA Out of Memory error when running the script get_embedding.py with a small dataset containing 2 rows.
Below are the details of the error and the command used to run the script.

Also, what is the suggested environment for running scFoundation, and how much GPU memory is recommended?

Command Used:

sbatch test.3.sh /home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py --task_name SCAD_bulk_Etoposide --input_type bulk --output_type cell --pool_type all --tgthighres f1 --data_path X_df_sample.csv --save_path ./ --pre_normalized F --version ce --demo
X_df_sample.csv contains the same data as X_df.csv but with only 2 rows.
Error Log:

Traceback (most recent call last):
  File "/home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py", line 305, in <module>
    main()
  File "/home/sbnb/ddalton/projects/scFoundation/model/get_embedding.py", line 232, in main
    geneemb = pretrainmodel.encoder(x,x_padding)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sbnb/ddalton/projects/scFoundation/model/pretrainmodels/transformer.py", line 42, in forward
    x = mod(x, src_key_padding_mask=padding_mask) # , src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/transformer.py", line 506, in forward
    return torch._transformer_encoder_layer_fwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.45 GiB (GPU 0; 23.69 GiB total capacity; 21.57 GiB already allocated; 980.06 MiB free; 21.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
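
(Side note: the max_split_size_mb hint at the end of the message can be set through the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch is below, assuming it is applied before the first CUDA allocation in get_embedding.py. The 512 value is only an illustration.)

import os

# The caching allocator reads this variable when it initializes, so it must be
# set before the first tensor is placed on the GPU; 512 MB is an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"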

Memory Tracking
I also tracked memory usage with this function:

import torch

def get_cuda_info():
    # Report current, reserved, and peak-reserved memory on GPU 0 in GB
    mem_alloc = "%fGB" % (torch.cuda.memory_allocated(0) / 1024 / 1024 / 1024)
    mem_reserved = "%fGB" % (torch.cuda.memory_reserved(0) / 1024 / 1024 / 1024)
    max_memory_reserved = "%fGB" % (torch.cuda.max_memory_reserved(0) / 1024 / 1024 / 1024)
    return "GPU alloc: {}. Reserved: {}. MaxReserved: {}".format(mem_alloc, mem_reserved, max_memory_reserved)

At various steps in the get_embedding.py script, just before geneemb = pretrainmodel.encoder(x,x_padding):

            #Cell embedding
            if args.output_type=='cell':
                position_gene_ids, _ = gatherData(data_gene_ids, value_labels, pretrainconfig['pad_token_id'])

                print(get_cuda_info())

                x = pretrainmodel.token_emb(torch.unsqueeze(x, 2).float(), output_weight = 0)
                print(x.shape)
                
                print(get_cuda_info())

                position_emb = pretrainmodel.pos_emb(position_gene_ids)
                x += position_emb

                print(get_cuda_info())

With the following output:

  0%|          | 0/2 [00:03<?, ?it/s]
GPU alloc: 0.445247GB. Reserved: 0.494141GB. MaxReserved: 0.494141GB
torch.Size([1, 15291, 768])
GPU alloc: 0.488881GB. Reserved: 0.558594GB. MaxReserved: 0.558594GB
GPU alloc: 0.532629GB. Reserved: 0.603516GB. MaxReserved: 0.603516GB
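
(For what it's worth, the 10.45 GiB requested in the error is what the full self-attention score matrix would take for this sequence length in fp32, assuming 12 attention heads for the 768-dim model; the head count is an assumption on my part, not confirmed from the code.)

# Back-of-the-envelope check of the failed allocation:
n, heads, bytes_per_float = 15291, 12, 4      # 12 heads (768 / 64) is assumed
attn_bytes = n * n * heads * bytes_per_float  # full attention score matrix, fp32
print(attn_bytes / 2**30)                     # ~10.45 GiB, matching the error message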

Environment Details
PyTorch version: 1.13.1+cu117
CUDA version: 11.7
GPU: 24 GB total capacity

Thanks in advance!

@WhirlFirst
Collaborator

Hi,
The GPU memory required by scFoundation depends on the sparsity of the cell expression vector, not on the number of cells. We recommend using A100 40G or 80G for local inference.

@00dylan00
Author

00dylan00 commented Jul 4, 2024

Unfortunately I don't currently have access to GPUs that large. I tried running the code with float16 instead of float32 (i.e. using PyTorch's .half() option). This worked very well in most cases, and I was able to run the examples, although for some samples in my own dataset it again ran out of memory.
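
(For reference, a rough sketch of the fp16 change, reusing the lines from get_embedding.py quoted above; the exact placement and any remaining .float() casts may differ in the actual script.)

pretrainmodel = pretrainmodel.half()  # cast all weights to float16 after loading
x = pretrainmodel.token_emb(torch.unsqueeze(x, 2).half(), output_weight=0)  # was .float()
position_emb = pretrainmodel.pos_emb(position_gene_ids)
x += position_emb
geneemb = pretrainmodel.encoder(x, x_padding)  # boolean padding mask stays unchanged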

In light of these memory constraints, do you have any recommendations? And as a follow-up, is there any chance you will release the medium- and smaller-sized models?

@Prunoideae

.half() made the model output NaN values on several datasets I used. You can instead set a cutoff of 3, 5, or 10 TPM to increase the sparsity of the data and reduce memory usage when using the encoder.
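
(A minimal sketch of such a cutoff applied to the input CSV, assuming the values are already TPM and the layout matches X_df.csv; the file names here are illustrative.)

import pandas as pd

tpm_cutoff = 10  # or 3 / 5, as suggested above
X_df = pd.read_csv("X_df.csv", index_col=0)  # samples as rows, genes as columns (assumed)
X_df[X_df < tpm_cutoff] = 0                  # zero out low-TPM entries to increase sparsity
X_df.to_csv("X_df_sparse.csv")               # then pass this file via --data_path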

Also, it requires ~80 GB of GPU memory on a typical bulk RNA dataset, which is quite extreme. I modified the encoder layers a bit to offload them to different GPUs, and that worked well. However, it requires enough GPUs to add up to a large total amount of memory; in my case, 4x 3090s.

To do so, modify the transformer.py to:

import torch
import torch.nn as nn
import torch.nn.functional as F

visible_gpus = torch.cuda.device_count()


class pytorchTransformerModule(nn.Module):
    def __init__(
        self,
        max_seq_len,
        dim,
        depth,
        heads,
        ff_mult=4,
        norm_first=False,
    ):
        super(pytorchTransformerModule, self).__init__()

        self.max_seq_len = max_seq_len
        self.depth = depth
        layers = []
        for i in range(depth):
            device_index = i % visible_gpus
            layers.append(
                nn.TransformerEncoderLayer(
                    d_model=dim,
                    nhead=heads,
                    dim_feedforward=dim * ff_mult,
                    batch_first=True,
                    norm_first=norm_first,
                    # activation="gelu",
                ).to(f"cuda:{device_index}") # add layers in a round-robin manner, but maybe better to make it consistent so no so many swaps happen
            )

        self.transformer_encoder = nn.ModuleList(layers)
        self.norm = nn.LayerNorm(dim).to("cuda:0")

    def forward(self, x, padding_mask):
        b, n, _, device = *x.shape, x.device
        assert (
            n <= self.max_seq_len
        ), f"sequence length {n} must be less than the max sequence length {self.max_seq_len}"

        # x get encodings [B, N, D] , batch_first is True
        for index, mod in enumerate(self.transformer_encoder):
            device_index = (
                index % visible_gpus
            )  # get index of the device and copy x/mask to it
            x = x.to(f"cuda:{device_index}")
            padding_mask = padding_mask.to(f"cuda:{device_index}")
            x = mod(
                x, src_key_padding_mask=padding_mask
            )  # , src_mask=mask, src_key_padding_mask=src_key_padding_mask)
        # x = self.transformer_encoder(x)
        x = self.norm(x.to("cuda:0"))
        return x

Also, disable the .cuda() call in load_model_frommmf in load.py, and explicitly specify devices for the remaining modules, since .cuda() is no longer applied to the whole model.
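
(A rough sketch of that change, assuming load_model_frommmf currently moves the whole model with .cuda(); the module names follow the usage in get_embedding.py above, and any other heads in the checkpoint would need the same treatment.)

# Instead of model.cuda(), place the non-encoder modules explicitly;
# the encoder layers already pick their own devices in the modified transformer.py.
model.token_emb.to("cuda:0")
model.pos_emb.to("cuda:0")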

My NVTop output: [screenshot]
