batchConverter uses up a lot of RAM #2

Open
joelmeili opened this issue Jul 27, 2023 · 10 comments
Labels
help wanted, question

Comments

@joelmeili

joelmeili commented Jul 27, 2023

Is there a way to run batchConverter without using so much RAM? When I try to run it, it consumes all available RAM and then crashes. The issue seems to come from pad_sequence when there are many proteins.

@wangleiofficial
Member

Hi @joelmeili, can you share your example code and the error? In theory, batchConverter does not take up a lot of memory.

@joelmeili
Author

joelmeili commented Jul 27, 2023

Hi, thanks for responding! I would like to compute the embeddings for a list of amino acid sequences extracted from a .fasta file. The corresponding .fasta file can be downloaded from here. Attached is a Python file with the code used. The following error message is thrown:
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 40255618000 bytes. Error code 12 (Cannot allocate memory)
May I also ask you whether there's a pre-print available explaining the algorithm?

main.zip

@wangleiofficial
Member

wangleiofficial commented Jul 28, 2023

Hi,
I see that you have a large number of FASTA sequences, so the crash is likely caused by running out of memory. I recommend computing the protein sequence embeddings in small batches (e.g. 64 sequences at a time), or directly integrating ProtFlash into your PyTorch network (in our experiments this works best), so that you do not need to store all of the sequence embeddings at once; this also lets you fine-tune the ProtFlash model.
Regarding our paper: the manuscript is under review and is expected to be released shortly. If you need to cite it now, we will consider putting it on a preprint server.
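
For illustration, a minimal sketch of the small-batch approach suggested above; the file name, chunk size, and mean pooling are assumptions taken from the code shared later in this thread:

```python
import torch
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = load_prot_flash_base().to(device)
model.eval()

# Read all (id, sequence) pairs from the FASTA file
records = [(r.id, str(r.seq)) for r in SeqIO.parse("train_sequences.fasta", "fasta")]

embeddings = {}
chunk_size = 64  # illustrative; tune to the available memory
for start in range(0, len(records), chunk_size):
    chunk = records[start:start + chunk_size]
    ids, batch_token, lengths = batchConverter(chunk)
    with torch.no_grad():
        token_embedding = model(batch_token.to(device), lengths)
    # Mean-pool per sequence and move the result to the CPU to free GPU memory
    for i, (seq_id, seq) in enumerate(chunk):
        embeddings[seq_id] = token_embedding[i, 0:len(seq) + 1].mean(0).cpu()
```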

@wangleiofficial added the question and help wanted labels Jul 28, 2023
@joelmeili
Author

joelmeili commented Jul 28, 2023

Hi again,
Yeah, that makes sense. Where would you put the batchConverter pipeline? I've set it up as follows:

```python
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter
from torch.utils.data import Dataset, DataLoader
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: {}".format(device))

model = load_prot_flash_base().to(device)
fasta_file = "train_sequences.fasta"

class ProteinSequenceDataset(Dataset):
    def __init__(self, fasta_file):
        # Parse the FASTA file once and keep (id, sequence) pairs in memory
        fasta_parsed = SeqIO.parse(open(fasta_file), "fasta")
        data = [(entry.id, str(entry.seq)) for entry in fasta_parsed]
        self.id, self.seq = map(list, zip(*data))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        protein_id, seq = self.id[idx], self.seq[idx]

        # Embed one sequence at a time
        data = [(protein_id, seq)]
        ids, batch_token, lengths = batchConverter(data)

        with torch.no_grad():
            token_embedding = model(batch_token.to(device), lengths)

        # Mean-pool the token embeddings for the sequence
        seq_representation = [token_embedding[i, 0:len(seq) + 1].mean(0)
                              for i, (_, seq) in enumerate(data)]

        return ids[0], seq_representation[0]

train_set = ProteinSequenceDataset(fasta_file)
train_loader = DataLoader(train_set, batch_size=64, shuffle=False)

print(next(iter(train_loader)))
```

But I guess there might be a smarter way to go about it; for example, how can I apply the transformation from __getitem__ at the batch level instead of per individual entry?
Best,
Joël

@joelmeili
Author

Okay, I think I found a workaround; is this how you would write it as well?

```python
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter
from torch.utils.data import Dataset, DataLoader
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: {}".format(device))

model = load_prot_flash_base().to(device)
fasta_file = "/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta"

class ProteinSequenceDataset(Dataset):
    def __init__(self, fasta_file):
        # Parse the FASTA file once and keep (id, sequence) pairs in memory
        fasta_parsed = SeqIO.parse(open(fasta_file), "fasta")
        data = [(entry.id, str(entry.seq)) for entry in fasta_parsed]
        self.id, self.seq = map(list, zip(*data))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # Return the raw (id, sequence) pair; the embedding happens in collate_fn
        return self.id[idx], self.seq[idx]

def collate_fn(data):
    protein_ids, seqs = zip(*data)
    data = list(zip(protein_ids, seqs))

    # Tokenise and embed the whole batch at once
    ids, batch_token, lengths = batchConverter(data)

    with torch.no_grad():
        token_embedding = model(batch_token.to(device), lengths)

    # Mean-pool the token embeddings per sequence
    seq_representations = [token_embedding[i, 0:len(seq) + 1].mean(0)
                           for i, (_, seq) in enumerate(data)]

    # Return the embeddings instead of only printing them
    return ids, seq_representations

train_set = ProteinSequenceDataset(fasta_file)
train_loader = DataLoader(train_set, batch_size=2, shuffle=False, collate_fn=collate_fn)

print(next(iter(train_loader)))
```
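
For completeness, a minimal loop over the DataLoader above that collects every embedding batch by batch (assuming collate_fn returns the ids and pooled representations, as in the version shown here):

```python
all_ids, all_embeddings = [], []
for ids, seq_representations in train_loader:
    all_ids.extend(ids)
    # Move each pooled embedding to the CPU so GPU memory is released between batches
    all_embeddings.extend(rep.cpu() for rep in seq_representations)

print(len(all_ids), all_embeddings[0].shape)
```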

@wangleiofficial
Member

@joelmeili Yes, I think your code is reasonable, but I suggest fine-tuning the language model, which can bring large benefits.

Example:

```python
model = your_model()
flash_model = load_prot_flash_base()
# ...
optimizer = torch.optim.Adam(
    [{'params': model.parameters()},
     {'params': flash_model.parameters(), 'lr': 1e-5}],
    lr=your_model_lr,
)
```
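
To flesh out that suggestion, here is a minimal sketch of fine-tuning ProtFlash inside a downstream model; the classifier head, embedding dimension, labels, and loss below are illustrative assumptions, not part of ProtFlash:

```python
import torch
import torch.nn as nn
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

class FunctionClassifier(nn.Module):
    """Hypothetical downstream head: mean-pooled ProtFlash embeddings -> label logits."""

    def __init__(self, embed_dim, num_labels):
        super().__init__()
        self.flash_model = load_prot_flash_base()
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, batch_token, lengths, seqs):
        # No torch.no_grad() here: gradients must flow into ProtFlash for fine-tuning
        token_embedding = self.flash_model(batch_token, lengths)
        pooled = torch.stack([token_embedding[i, 0:len(seq) + 1].mean(0)
                              for i, seq in enumerate(seqs)])
        return self.head(pooled)

embed_dim = 768   # assumption: replace with the actual ProtFlash base hidden size
num_labels = 10   # illustrative
model = FunctionClassifier(embed_dim, num_labels)

# Small learning rate for the pretrained ProtFlash weights, a larger one for the new head
optimizer = torch.optim.Adam(
    [{'params': model.head.parameters()},
     {'params': model.flash_model.parameters(), 'lr': 1e-5}],
    lr=1e-3,
)

data = [("protein_1", "MKTAYIAKQR"), ("protein_2", "GSHMSLF")]
ids, batch_token, lengths = batchConverter(data)

logits = model(batch_token, lengths, [seq for _, seq in data])
loss = logits.sum()   # placeholder loss, just to demonstrate the backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
```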

@joelmeili
Author

@wangleiofficial Thanks! I'll look into it when I get to that point. Thanks so far!

@joelmeili
Author

joelmeili commented Jul 30, 2023

Hi @wangleiofficial,

I tried to embed the model inside another model for predicting protein functions. It works on the CPU, but when I switch to CUDA it fails with this error message: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm). I tried to find where a tensor might still be on the CPU, but could not figure it out. Could you have a quick look at my code and give me some ideas?

@wangleiofficial
Member

Hello, the PyTorch Lightning framework does not automatically place the batch_token produced by the batchConverter function on the GPU; you need to do it manually:

```python
batch_token = batch_token.to(self.device)
```

If you have any problems using ProtFlash, you can contact me and I will be happy to respond.
I hope ProtFlash can be helpful for protein research!
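
For illustration, a minimal sketch of where that call fits in a LightningModule; the module layout, dimensions, and head below are assumptions, not the actual model from this thread:

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

class ProteinFunctionModule(pl.LightningModule):
    def __init__(self, embed_dim=768, num_labels=10):  # dimensions are illustrative
        super().__init__()
        # Registering ProtFlash as a submodule lets Lightning move its weights
        # to the right device together with the rest of the module
        self.flash_model = load_prot_flash_base()
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, seqs):
        data = [(str(i), seq) for i, seq in enumerate(seqs)]
        _, batch_token, lengths = batchConverter(data)
        # The tokens come back on the CPU, so move them to the module's device manually
        batch_token = batch_token.to(self.device)
        token_embedding = self.flash_model(batch_token, lengths)
        pooled = torch.stack([token_embedding[i, 0:len(seq) + 1].mean(0)
                              for i, seq in enumerate(seqs)])
        return self.head(pooled)
```

The training_step, loss, and configure_optimizers are omitted; the point is only the device handling in forward.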

@joelmeili
Author

Cool, thanks! I did what you proposed, but I also had to move the flash model itself to self.device in the forward step.
