"Resume" option for tokenizers #23
Labels
- downstream — Changes code wrapping the core model
- engineering — Software-engineering problems that don't require ML expertise
Currently, our tokenisers are long-running tasks that cannot be interrupted. If the process is stopped for even just a minute (for example, because GPU or CPU resources are needed elsewhere), the tokenisation has to be restarted from scratch. Instead of forcing a process that can take multiple weeks to run in one go, we should implement an option to "resume" the state from an earlier checkpoint. This could be done, for example, by periodically recording how far we got and, on restart, skipping the documents or videos that were already processed.
This issue tracks the progress of implementing such a scheme.
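A minimal sketch of what the resume logic could look like, assuming a flat list of documents processed in a fixed order; the file name, function names, and `checkpoint_every` parameter are placeholders, not part of the existing codebase:

```python
import json
from pathlib import Path

CHECKPOINT_FILE = Path("tokenizer_checkpoint.json")  # hypothetical checkpoint location

def load_checkpoint() -> int:
    """Return the index of the next document to tokenise (0 if starting fresh)."""
    if CHECKPOINT_FILE.exists():
        return json.loads(CHECKPOINT_FILE.read_text())["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    """Persist how far we got, so a restarted run can skip finished documents."""
    CHECKPOINT_FILE.write_text(json.dumps({"next_index": next_index}))

def tokenise_all(documents, tokenise_one, checkpoint_every: int = 100):
    """Tokenise `documents` in order, resuming from the last saved checkpoint."""
    start = load_checkpoint()
    for i, doc in enumerate(documents):
        if i < start:
            continue  # already processed in an earlier run
        tokenise_one(doc)
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(i + 1)
    save_checkpoint(len(documents))
```

The same idea would need a stable document ordering (or stable IDs) so that "skip the first N" means the same thing across runs.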