Detect use of an invalid tokenizer cache #614

Open
rjpower opened this issue Jun 5, 2024 · 6 comments

rjpower (Collaborator) commented Jun 5, 2024

Originally reported as a training NaN for LoRA, this turned out to be caused by an invalid tokenizer cache. We should detect this by:

  1. Storing metadata in the cache when it's built and checking it when the cache is loaded

  2. Checking whether any token id exceeds the vocab size during loss computation
    (using equinox's debug assert; see the sketch after this list)
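A minimal sketch of check (2), assuming a plain next-token cross-entropy loss; `next_token_loss` and its arguments are illustrative stand-ins, not Levanter's actual API:

```python
import equinox as eqx
import optax


def next_token_loss(logits, token_ids, vocab_size: int):
    """Cross-entropy that fails loudly if any token id is out of range."""
    # eqx.error_if passes `token_ids` through unchanged, but raises a runtime
    # error (even under jit) if the predicate is true for any element.
    token_ids = eqx.error_if(
        token_ids,
        (token_ids < 0) | (token_ids >= vocab_size),
        "token id out of vocab range -- the tokenizer cache is likely stale",
    )
    return optax.softmax_cross_entropy_with_integer_labels(logits, token_ids).mean()
```

A stale cache built with a larger-vocab tokenizer would then fail on the first batch with a clear error instead of silently producing NaNs.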

rjpower (Collaborator, Author) commented Jun 5, 2024

A spot check of the logs indicates we're handling the LoRA params correctly:

2024-06-05T21:51:39 - 0 - __main__ - gsm8k_lora.py:185 - INFO :: Total parameter count: 6758692864
2024-06-05T21:51:39 - 0 - __main__ - gsm8k_lora.py:186 - INFO :: Trainable parameter count: 20277248
2024-06-05T21:51:39 - 0 - __main__ - gsm8k_lora.py:187 - INFO :: Fraction of parameters that are trainable: 0.0030001730228053753

But the loss is NaN from the first steps:

train:   2% 11/550 [05:00<48:31,  5.40s/it, loss=nan]   

rjpower (Collaborator, Author) commented Jun 5, 2024

Hrm, I see the NaN even if I strip out the loraize logic, so it must be related to how the model is being loaded, or to something in the dataset.

dlwh (Member) commented Jun 5, 2024 via email

rjpower (Collaborator, Author) commented Jun 6, 2024

No hurry. That's a good idea; checking with a fresh cache dir now, just in case it was corrupted. (I was doing some experiments with the lora_lm entrypoint, so it's possible I got e.g. the Llama 2 70B or Llama 3 tokenizer in there...)

rjpower (Collaborator, Author) commented Jun 6, 2024

Crap, that was it! I suspected a token out of range but couldn't put my finger on how that would be happening. I wonder if there's a cheap test we can do to detect this in the future.
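One cheap option for check (1) above: fingerprint the tokenizer when the cache is built and verify the fingerprint on load. A sketch, assuming a Hugging Face tokenizer; the metadata filename and helper names are hypothetical, not Levanter's actual cache layout:

```python
import json
import os


def write_cache_metadata(cache_dir: str, tokenizer) -> None:
    # Record which tokenizer produced this cache (hypothetical file name).
    meta = {"tokenizer": tokenizer.name_or_path, "vocab_size": len(tokenizer)}
    with open(os.path.join(cache_dir, "tokenizer_meta.json"), "w") as f:
        json.dump(meta, f)


def check_cache_metadata(cache_dir: str, tokenizer) -> None:
    # Refuse to load a cache built with a different tokenizer.
    path = os.path.join(cache_dir, "tokenizer_meta.json")
    if not os.path.exists(path):
        return  # older cache without metadata; nothing we can check
    with open(path) as f:
        meta = json.load(f)
    if meta["vocab_size"] != len(tokenizer) or meta["tokenizer"] != tokenizer.name_or_path:
        raise ValueError(
            f"cache at {cache_dir} was built with {meta['tokenizer']} "
            f"(vocab size {meta['vocab_size']}), but the current tokenizer is "
            f"{tokenizer.name_or_path} (vocab size {len(tokenizer)})"
        )
```

Comparing vocab sizes alone would already have caught the Llama 2 vs. Llama 3 mixup here, since their vocabularies differ (32k vs. 128k).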

dlwh (Member) commented Jun 6, 2024 via email

rjpower changed the title from "GSM8K Lora NaNs on initial step with tpuv5-16" to "Detect use of an invalid tokenizer cache" on Jun 6, 2024