Detect use of an invalid tokenizer cache #614
Random logs indicate we're handling the LoRA params correctly, but they show NaN on the first step.
Hrm, I see the NaN even if I strip out the loraize logic, so it must be related to how the model is being loaded or something with the dataset.
This is a regression somehow (we don't run this config regularly). I'll try to look once I come up for air.
Double-check, though: is your cache_dir set for this tokenizer?
No hurry. That's a good idea; checking with a fresh cache dir now just in case it was corrupted. (I was doing some experiments with the …)
Crap, that was it! I suspected a token out of range but couldn't put my finger on how that would be happening. I wonder if there's a cheap test we can do to detect this in the future.
Yeah, this isn't the first time this has happened. I want to solve this two ways:
1. Store metadata in the cache when it's built and check it when it's loaded.
2. Check whether any token id exceeds the vocab size during loss computation (using equinox's debug assert); see the sketch after this list.
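A minimal sketch of what check (2) might look like, assuming a JAX/Equinox setup where the loss function sees the batch of token ids. The function names and the vocab size are illustrative, not the project's actual API:

```python
import jax
import jax.numpy as jnp
import equinox as eqx

def check_token_ids(token_ids: jnp.ndarray, vocab_size: int) -> jnp.ndarray:
    # eqx.error_if returns token_ids unchanged, but raises at runtime
    # (even under jit) if any id falls outside [0, vocab_size),
    # pointing at a stale cache instead of silently producing NaNs.
    return eqx.error_if(
        token_ids,
        (token_ids < 0) | (token_ids >= vocab_size),
        "token id out of range for the model's vocab -- stale tokenizer cache?",
    )

@jax.jit
def loss_fn(token_ids):
    ids = check_token_ids(token_ids, vocab_size=32_000)
    # ...the real loss computation would go here; a sum stands in for it.
    return jnp.sum(ids)

loss_fn(jnp.array([1, 5, 31_999]))    # fine
# loss_fn(jnp.array([1, 5, 40_000]))  # raises: token id out of range
```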
Originally reported as a training NaN for LoRA, this was due to an invalid tokenizer cache. We should detect this by:
1. Storing metadata in the cache when it's built and checking it when it's loaded (see the sketch after this list).
2. Checking whether any token id exceeds the vocab size during loss computation (using equinox's debug assert).
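A hedged sketch of what the metadata check in item (1) could look like. The sidecar file name, its fields, and the helper functions are assumptions for illustration, not the project's actual cache format:

```python
import json
from pathlib import Path

METADATA_FILE = "tokenizer_metadata.json"  # hypothetical sidecar file name

def write_cache_metadata(cache_dir: str, tokenizer_name: str, vocab_size: int) -> None:
    # Record which tokenizer built this cache so a mismatch is caught on load.
    meta = {"tokenizer": tokenizer_name, "vocab_size": vocab_size}
    Path(cache_dir, METADATA_FILE).write_text(json.dumps(meta))

def check_cache_metadata(cache_dir: str, tokenizer_name: str, vocab_size: int) -> None:
    # Refuse to load a cache built with a different (or unknown) tokenizer.
    path = Path(cache_dir, METADATA_FILE)
    if not path.exists():
        raise ValueError(f"{cache_dir} has no tokenizer metadata; rebuild the cache")
    meta = json.loads(path.read_text())
    expected = {"tokenizer": tokenizer_name, "vocab_size": vocab_size}
    if meta != expected:
        raise ValueError(
            f"cache in {cache_dir} was built with {meta}, "
            f"but the current config expects {expected}; rebuild the cache"
        )
```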