batchConverter uses up a lot of RAM #2

Open
joelmeili opened this issue Jul 27, 2023 · 10 comments
Labels
help wanted, question

Comments

@joelmeili

joelmeili commented Jul 27, 2023

Is there a way to run batchConverter without using so much RAM? When I try to run it, it consumes all available RAM and then crashes. The issue seems to come from pad_sequence when there are many proteins.

@wangleiofficial
Member

Hi @joelmeili, can you share your example code and the error? In theory, batchConverter does not take up a lot of memory.

@joelmeili
Author

joelmeili commented Jul 27, 2023

Hi, thanks for responding! I would like to compute the embeddings for a list of amino acid sequences extracted from a .fasta file. The corresponding .fasta file can be downloaded from here. Attached is a Python file with the code used. The following error message is thrown:
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 40255618000 bytes. Error code 12 (Cannot allocate memory)
May I also ask you whether there's a pre-print available explaining the algorithm?

main.zip

@wangleiofficial
Member

wangleiofficial commented Jul 28, 2023

Hi,
I see that you have a large number of FASTA sequences, so the crash is likely caused by running out of memory. I recommend computing the protein sequence embeddings in small batches (e.g. 64 sequences at a time), or directly integrating ProtFlash into your PyTorch network (in our experiments this works best), so that you do not need to store all of the sequence embeddings at once; this also lets you fine-tune the ProtFlash model.
Regarding our paper: the manuscript is under review and is expected to be released shortly. If you need to cite it now, we will consider putting it on a preprint server.
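
For illustration, a minimal sketch of the small-batch approach suggested above; the file name, chunk size, and mean pooling are assumptions taken from the code shared later in this thread:

```python
import torch
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = load_prot_flash_base().to(device)
model.eval()

# Read all (id, sequence) pairs from the FASTA file
records = [(r.id, str(r.seq)) for r in SeqIO.parse("train_sequences.fasta", "fasta")]

embeddings = {}
chunk_size = 64  # illustrative; tune to the available memory
for start in range(0, len(records), chunk_size):
    chunk = records[start:start + chunk_size]
    ids, batch_token, lengths = batchConverter(chunk)
    with torch.no_grad():
        token_embedding = model(batch_token.to(device), lengths)
    # Mean-pool per sequence and move the result to the CPU to free GPU memory
    for i, (seq_id, seq) in enumerate(chunk):
        embeddings[seq_id] = token_embedding[i, 0:len(seq) + 1].mean(0).cpu()
```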

@wangleiofficial added the question and help wanted labels Jul 28, 2023
@joelmeili
Author

joelmeili commented Jul 28, 2023

Hi again,
Yeah, that makes sense. Where would you put the batchConverter pipeline? I've set it up as follows:

```python
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter
from torch.utils.data import Dataset, DataLoader
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: {}".format(device))

model = load_prot_flash_base().to(device)
fasta_file = "train_sequences.fasta"

class ProteinSequenceDataset(Dataset):
    def __init__(self, fasta_file):
        # Parse the FASTA file once and keep (id, sequence) pairs in memory
        fasta_parsed = SeqIO.parse(open(fasta_file), "fasta")
        data = [(entry.id, str(entry.seq)) for entry in fasta_parsed]
        self.id, self.seq = map(list, zip(*data))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        protein_id, seq = self.id[idx], self.seq[idx]

        # Embed one sequence at a time
        data = [(protein_id, seq)]
        ids, batch_token, lengths = batchConverter(data)

        with torch.no_grad():
            token_embedding = model(batch_token.to(device), lengths)

        # Mean-pool the token embeddings for the sequence
        seq_representation = [token_embedding[i, 0:len(seq) + 1].mean(0)
                              for i, (_, seq) in enumerate(data)]

        return ids[0], seq_representation[0]

train_set = ProteinSequenceDataset(fasta_file)
train_loader = DataLoader(train_set, batch_size=64, shuffle=False)

print(next(iter(train_loader)))
```

But I guess there might be a smarter way to go about it; for example, how can I apply the transformation from __getitem__ at the batch level instead of per individual entry?
Best,
Joël

@joelmeili
Author

Okay, I think I found a workaround; is this how you would write it as well?

```python
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter
from torch.utils.data import Dataset, DataLoader
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: {}".format(device))

model = load_prot_flash_base().to(device)
fasta_file = "/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta"

class ProteinSequenceDataset(Dataset):
    def __init__(self, fasta_file):
        # Parse the FASTA file once and keep (id, sequence) pairs in memory
        fasta_parsed = SeqIO.parse(open(fasta_file), "fasta")
        data = [(entry.id, str(entry.seq)) for entry in fasta_parsed]
        self.id, self.seq = map(list, zip(*data))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # Return the raw (id, sequence) pair; the embedding happens in collate_fn
        return self.id[idx], self.seq[idx]

def collate_fn(data):
    protein_ids, seqs = zip(*data)
    data = list(zip(protein_ids, seqs))

    # Tokenise and embed the whole batch at once
    ids, batch_token, lengths = batchConverter(data)

    with torch.no_grad():
        token_embedding = model(batch_token.to(device), lengths)

    # Mean-pool the token embeddings per sequence
    seq_representations = [token_embedding[i, 0:len(seq) + 1].mean(0)
                           for i, (_, seq) in enumerate(data)]

    # Return the embeddings instead of only printing them
    return ids, seq_representations

train_set = ProteinSequenceDataset(fasta_file)
train_loader = DataLoader(train_set, batch_size=2, shuffle=False, collate_fn=collate_fn)

print(next(iter(train_loader)))
```
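
For completeness, a minimal loop over the DataLoader above that collects every embedding batch by batch (assuming collate_fn returns the ids and pooled representations, as in the version shown here):

```python
all_ids, all_embeddings = [], []
for ids, seq_representations in train_loader:
    all_ids.extend(ids)
    # Move each pooled embedding to the CPU so GPU memory is released between batches
    all_embeddings.extend(rep.cpu() for rep in seq_representations)

print(len(all_ids), all_embeddings[0].shape)
```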

@wangleiofficial
Member

@joelmeili Yes, I think your code is reasonable, but I suggest fine-tuning the language model, which can bring large benefits.

Example:

```python
model = your_model()
flash_model = load_prot_flash_base()
# ...
optimizer = torch.optim.Adam(
    [{'params': model.parameters()},
     {'params': flash_model.parameters(), 'lr': 1e-5}],
    lr=your_model_lr,
)
```
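
To flesh out that suggestion, here is a minimal sketch of fine-tuning ProtFlash inside a downstream model; the classifier head, embedding dimension, labels, and loss below are illustrative assumptions, not part of ProtFlash:

```python
import torch
import torch.nn as nn
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

class FunctionClassifier(nn.Module):
    """Hypothetical downstream head: mean-pooled ProtFlash embeddings -> label logits."""

    def __init__(self, embed_dim, num_labels):
        super().__init__()
        self.flash_model = load_prot_flash_base()
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, batch_token, lengths, seqs):
        # No torch.no_grad() here: gradients must flow into ProtFlash for fine-tuning
        token_embedding = self.flash_model(batch_token, lengths)
        pooled = torch.stack([token_embedding[i, 0:len(seq) + 1].mean(0)
                              for i, seq in enumerate(seqs)])
        return self.head(pooled)

embed_dim = 768   # assumption: replace with the actual ProtFlash base hidden size
num_labels = 10   # illustrative
model = FunctionClassifier(embed_dim, num_labels)

# Small learning rate for the pretrained ProtFlash weights, a larger one for the new head
optimizer = torch.optim.Adam(
    [{'params': model.head.parameters()},
     {'params': model.flash_model.parameters(), 'lr': 1e-5}],
    lr=1e-3,
)

data = [("protein_1", "MKTAYIAKQR"), ("protein_2", "GSHMSLF")]
ids, batch_token, lengths = batchConverter(data)

logits = model(batch_token, lengths, [seq for _, seq in data])
loss = logits.sum()   # placeholder loss, just to demonstrate the backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
```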

@joelmeili
Author

@wangleiofficial Thanks! I'll look into it when I get to that point. Thanks so far!

@joelmeili
Author

joelmeili commented Jul 30, 2023

Hi @wangleiofficial,

I tried to embed the model inside another model for predicting protein functions. It works on the CPU, but when I switch to CUDA it fails with this error message: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm). I tried to find where a tensor might still be on the CPU, but could not figure it out. Could you have a quick look at my code and give me some ideas?

@wangleiofficial
Member

Hello, the PyTorch Lightning framework does not automatically place the batch_token produced by the batchConverter function on the GPU; you need to do it manually:

```python
batch_token = batch_token.to(self.device)
```

If you have any problems using ProtFlash, you can contact me and I will be happy to respond.
I hope ProtFlash can be helpful for protein research!
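
For illustration, a minimal sketch of where that call fits in a LightningModule; the module layout, dimensions, and head below are assumptions, not the actual model from this thread:

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

class ProteinFunctionModule(pl.LightningModule):
    def __init__(self, embed_dim=768, num_labels=10):  # dimensions are illustrative
        super().__init__()
        # Registering ProtFlash as a submodule lets Lightning move its weights
        # to the right device together with the rest of the module
        self.flash_model = load_prot_flash_base()
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, seqs):
        data = [(str(i), seq) for i, seq in enumerate(seqs)]
        _, batch_token, lengths = batchConverter(data)
        # The tokens come back on the CPU, so move them to the module's device manually
        batch_token = batch_token.to(self.device)
        token_embedding = self.flash_model(batch_token, lengths)
        pooled = torch.stack([token_embedding[i, 0:len(seq) + 1].mean(0)
                              for i, seq in enumerate(seqs)])
        return self.head(pooled)
```

The training_step, loss, and configure_optimizers are omitted; the point is only the device handling in forward.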

@joelmeili
Author

Cool, thanks! I did what you proposed, but I also had to move the flash model itself to self.device in the forward step.
