Detect use of an invalid tokenizer cache #614

Open
rjpower opened this issue Jun 5, 2024 · 6 comments

rjpower (Collaborator) commented Jun 5, 2024

Originally reported as a training NaN for LoRA, this turned out to be caused by an invalid tokenizer cache. We should detect this by:

  1. Storing metadata in the cache when it's built and checking it when the cache is loaded

  2. Checking whether any token id exceeds the vocab size during loss computation
    (using equinox's debug assert; see the sketch after this list)
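A minimal sketch of check (2), assuming a plain next-token cross-entropy loss; `next_token_loss` and its arguments are illustrative stand-ins, not Levanter's actual API:

```python
import equinox as eqx
import optax


def next_token_loss(logits, token_ids, vocab_size: int):
    """Cross-entropy that fails loudly if any token id is out of range."""
    # eqx.error_if passes `token_ids` through unchanged, but raises a runtime
    # error (even under jit) if the predicate is true for any element.
    token_ids = eqx.error_if(
        token_ids,
        (token_ids < 0) | (token_ids >= vocab_size),
        "token id out of vocab range -- the tokenizer cache is likely stale",
    )
    return optax.softmax_cross_entropy_with_integer_labels(logits, token_ids).mean()
```

A stale cache built with a larger-vocab tokenizer would then fail on the first batch with a clear error instead of silently producing NaNs.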

rjpower (Collaborator, Author) commented Jun 5, 2024

A spot check of the logs indicates we're handling the LoRA params correctly:

2024-06-05T21:51:39 - 0 - __main__ - gsm8k_lora.py:185 - INFO :: Total parameter count: 6758692864
2024-06-05T21:51:39 - 0 - __main__ - gsm8k_lora.py:186 - INFO :: Trainable parameter count: 20277248
2024-06-05T21:51:39 - 0 - __main__ - gsm8k_lora.py:187 - INFO :: Fraction of parameters that are trainable: 0.0030001730228053753

But the loss is NaN from the first steps:

train:   2% 11/550 [05:00<48:31,  5.40s/it, loss=nan]   

rjpower (Collaborator, Author) commented Jun 5, 2024

Hrm, I see the NaN even if I strip out the loraize logic, so it must be related to how the model is being loaded, or to something in the dataset.

dlwh (Member) commented Jun 5, 2024 via email

rjpower (Collaborator, Author) commented Jun 6, 2024

No hurry. That's a good idea; checking with a fresh cache dir now, just in case it was corrupted. (I was doing some experiments with the lora_lm entrypoint, so it's possible I got e.g. the Llama 2 70B or Llama 3 tokenizer in there...)

rjpower (Collaborator, Author) commented Jun 6, 2024

Crap, that was it! I suspected a token out of range but couldn't put my finger on how that would be happening. I wonder if there's a cheap test we can do to detect this in the future.
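One cheap option for check (1) above: fingerprint the tokenizer when the cache is built and verify the fingerprint on load. A sketch, assuming a Hugging Face tokenizer; the metadata filename and helper names are hypothetical, not Levanter's actual cache layout:

```python
import json
import os


def write_cache_metadata(cache_dir: str, tokenizer) -> None:
    # Record which tokenizer produced this cache (hypothetical file name).
    meta = {"tokenizer": tokenizer.name_or_path, "vocab_size": len(tokenizer)}
    with open(os.path.join(cache_dir, "tokenizer_meta.json"), "w") as f:
        json.dump(meta, f)


def check_cache_metadata(cache_dir: str, tokenizer) -> None:
    # Refuse to load a cache built with a different tokenizer.
    path = os.path.join(cache_dir, "tokenizer_meta.json")
    if not os.path.exists(path):
        return  # older cache without metadata; nothing we can check
    with open(path) as f:
        meta = json.load(f)
    if meta["vocab_size"] != len(tokenizer) or meta["tokenizer"] != tokenizer.name_or_path:
        raise ValueError(
            f"cache at {cache_dir} was built with {meta['tokenizer']} "
            f"(vocab size {meta['vocab_size']}), but the current tokenizer is "
            f"{tokenizer.name_or_path} (vocab size {len(tokenizer)})"
        )
```

Comparing vocab sizes alone would already have caught the Llama 2 vs. Llama 3 mixup here, since their vocabularies differ (32k vs. 128k).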

dlwh (Member) commented Jun 6, 2024 via email

rjpower changed the title from "GSM8K Lora NaNs on initial step with tpuv5-16" to "Detect use of an invalid tokenizer cache" on Jun 6, 2024