
Fine Tuning the Model #21

Open
cv277 opened this issue Feb 10, 2023 · 9 comments

Comments

@cv277 commented Feb 10, 2023

I want to fine-tune ProGen2-small on my own dataset.
See this Google Colab notebook for an annotated version of the code and the error:
https://colab.research.google.com/drive/1_R0xgf6Kw0K88PYF7-ZOCIh9WRSmXN8C?usp=sharing

First I load the model like this:

import torch
from tokenizers import Tokenizer
from progen.progen2.models.progen.modeling_progen import ProGenForCausalLM

# run on GPU if available, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ProGenForCausalLM.from_pretrained('/content/drive/MyDrive/progen2-small', torch_dtype=torch.float16, low_cpu_mem_usage=True).to(device)

I am using the Hugging Face Trainer with DataCollatorForLanguageModeling to fine-tune the model. I load the tokenizer like this:

def create_tokenizer_custom(file):
    with open(file, 'r') as f:
        return Tokenizer.from_str(f.read())

tokenizer = create_tokenizer_custom(file='/content/progen/progen2/tokenizer.json')

I then convert it to a PreTrainedTokenizerFast, as suggested in huggingface/tokenizers#325:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer.save("my-tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")
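
For reference, a minimal sketch of the Trainer + DataCollatorForLanguageModeling setup (the dataset variable and the training-argument values are placeholders, and using '<|pad|>' as the pad token is an assumption based on ProGen2's tokenizer.json):

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# The collator pads batches, so the tokenizer needs a pad token
# (assumption: '<|pad|>' with id 0 is the pad token in ProGen2's tokenizer.json)
fast_tokenizer.pad_token = '<|pad|>'

# mlm=False gives the plain causal-LM objective (labels are the input ids)
data_collator = DataCollatorForLanguageModeling(tokenizer=fast_tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir='progen2-small-finetuned',
    num_train_epochs=3,               # illustrative values
    per_device_train_batch_size=4,
    learning_rate=1e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,  # placeholder: a Dataset of tokenized sequences
    data_collator=data_collator,
)

trainer.train()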

During fine-tuning, the training loss becomes 0.0000. After training, I attempt to produce new samples:

with torch.no_grad():
  input_ids = torch.tensor(fast_tokenizer.encode("1GRGL")).view([1, -1]).to(device)
  tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.7, max_length=50, top_p=10, num_return_sequences=1, pad_token_id=0)
  as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
  print(tokenizer.decode_batch(as_lists(tokens_batch))[0])

However, I get this error: RuntimeError: probability tensor contains either inf, nan or element < 0
Please see the Google Colab notebook above for the entire code.

@Seaxingzhou

top_p=10 might be the problem; shouldn't it be < 1?

@cv277 (Author) commented Feb 11, 2023

top_p=10 might be the problem; shouldn't it be < 1?

Unfortunately I still get the error after setting top_p to a value less than one. Thank you though!
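
For reference, top_p is a nucleus-sampling threshold that is expected to lie in (0, 1]; a generate call with a conventional value would look roughly like this (0.9 is only an illustrative choice):

with torch.no_grad():
  input_ids = torch.tensor(fast_tokenizer.encode("1GRGL")).view([1, -1]).to(device)
  # top_p is a cumulative-probability cutoff, so it must be <= 1
  tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.7, max_length=50, top_p=0.9, num_return_sequences=1, pad_token_id=0)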

@tanuj2212

I am getting a warning and an error, as follows:
Warning: You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Error: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'.

@Geraldene

@cv277 were you able to resolve the issue?

@msparsa commented May 4, 2023

To fix this, you should use torch_dtype=torch.float32 instead.
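
In other words, load the checkpoint in full precision when running on CPU; roughly (the path is the one from the original post):

model = ProGenForCausalLM.from_pretrained(
    '/content/drive/MyDrive/progen2-small',
    torch_dtype=torch.float32,   # CPU LayerNorm kernels are not implemented for float16
    low_cpu_mem_usage=True,
).to(device)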

@TC2022lxf

I'd like to know what the dataset format looks like. Could you provide it?

@oliverfleetwood

I've switched to torch_dtype=torch.float32 but am still getting this issue for progen-base and the larger models, though not for progen-small, when I call:
model = ProGenForCausalLM.from_pretrained('/content/drive/MyDrive/progen2-small', torch_dtype=torch.float32, low_cpu_mem_usage=True).to(device)

Has anyone experienced similar issues or is there somewhere else I need to change the dtype?
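
One way to double-check (a diagnostic sketch, not something from this thread): print the dtype of the loaded parameters and cast the whole model explicitly if anything is still in half precision:

print(next(model.parameters()).dtype)  # expect torch.float32
model = model.float()                  # explicit cast in case any weights were left in float16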

@Geraldene

@oliverfleetwood that works for me. I tried loading the progen2-large model and it loads fine; what error are you encountering?

@oliverfleetwood

At first I only ran on CPU. After upgrading CUDA and reinstalling torch, I was able to run the larger models on a GPU with the same setup.
I still get the same error when I try to run the larger models (i.e. all except progen-small) on CPU.
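
A minimal sketch of that workaround, assuming float16 is only used when a GPU is available and full precision on CPU:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# CPU kernels such as LayerNorm don't support float16, so only use it on GPU
dtype = torch.float16 if device.type == 'cuda' else torch.float32

model = ProGenForCausalLM.from_pretrained(
    '/content/drive/MyDrive/progen2-small',
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
).to(device)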
