"Resume" option for tokenizers #23

Open
ClashLuke opened this issue Apr 30, 2022 · 0 comments
Labels
downstream: Changes code wrapping the core model
engineering: Software-engineering problems that don't require ML-Expertise

Comments

@ClashLuke
Member

Currently, our tokenisers are long-running tasks that cannot be interrupted. If the process is stopped even for a minute (for example, because GPU or CPU resources are needed elsewhere), the tokenisation has to be restarted from scratch. Instead of requiring a process that can take multiple weeks to run in one go, we should implement an option to "resume" from an earlier checkpoint. This could be done, for example, by skipping the documents or videos that were already processed.
This issue tracks the progress of implementing such a scheme.
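
One possible shape for such a resume option is sketched below. It is a minimal illustration, not the project's actual tokeniser code: the function `tokenize_with_resume`, the JSON state file, and the JSON-lines output format are all hypothetical names and choices made for the example. It also assumes the documents are always iterated in the same order, so that "skip the first N" is well defined.

```python
import itertools
import json
import os
from typing import Callable, Iterable, List


def _load_state(state_path: str, out_path: str) -> dict:
    """Read the last checkpoint, or start fresh if state or output is missing."""
    if os.path.exists(state_path) and os.path.exists(out_path):
        with open(state_path) as f:
            return json.load(f)
    return {"docs_done": 0, "out_bytes": 0}


def _save_state(state_path: str, state: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, state_path)


def tokenize_with_resume(
    documents: Iterable[str],
    tokenize: Callable[[str], List],  # hypothetical: any per-document tokeniser
    out_path: str,
    state_path: str,
    checkpoint_every: int = 1000,
) -> int:
    """Tokenise documents, skipping those covered by an earlier checkpoint."""
    state = _load_state(state_path, out_path)
    mode = "r+b" if os.path.exists(out_path) else "w+b"
    with open(out_path, mode) as out:
        # Drop any output written after the last checkpoint to avoid duplicates.
        out.truncate(state["out_bytes"])
        out.seek(state["out_bytes"])
        # Skip the documents that were already processed.
        remaining = itertools.islice(documents, state["docs_done"], None)
        for doc in remaining:
            line = json.dumps(tokenize(doc)) + "\n"
            out.write(line.encode("utf-8"))
            state["docs_done"] += 1
            if state["docs_done"] % checkpoint_every == 0:
                out.flush()
                os.fsync(out.fileno())
                state["out_bytes"] = out.tell()
                _save_state(state_path, state)
        # Final checkpoint once the input is exhausted.
        out.flush()
        os.fsync(out.fileno())
        state["out_bytes"] = out.tell()
        _save_state(state_path, state)
    return state["docs_done"]
```

Rerunning the same call after an interruption continues from the last checkpoint, so at most `checkpoint_every` documents are tokenised twice. Note that `itertools.islice` still reads the skipped documents from the iterable; for video data, a real implementation would likely want to skip without decoding.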

@ClashLuke ClashLuke added the engineering Software-engineering problems that don't require ML-Expertise label Apr 30, 2022
@ClashLuke ClashLuke added the downstream Changes code wrapping the core model label May 8, 2022