"Resume" option for tokenizers #23
Labels
- downstream — Changes code wrapping the core model
- engineering — Software-engineering problems that don't require ML expertise
Currently, our tokenisers are long-running tasks that cannot be interrupted. If the process is stopped for even just a minute (for example, because GPU or CPU resources are needed elsewhere), the tokenisation has to be restarted from scratch. Instead of forcing a process that can take multiple weeks to run in one go, we should implement an option to "resume" the state from an earlier checkpoint. This could be done, for example, by periodically recording how far we got and, on restart, skipping the documents or videos that were already processed.
This issue tracks the progress of implementing such a scheme.
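A minimal sketch of what the resume logic could look like, assuming a flat list of documents processed in a fixed order; the file name, function names, and `checkpoint_every` parameter are placeholders, not part of the existing codebase:

```python
import json
from pathlib import Path

CHECKPOINT_FILE = Path("tokenizer_checkpoint.json")  # hypothetical checkpoint location

def load_checkpoint() -> int:
    """Return the index of the next document to tokenise (0 if starting fresh)."""
    if CHECKPOINT_FILE.exists():
        return json.loads(CHECKPOINT_FILE.read_text())["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    """Persist how far we got, so a restarted run can skip finished documents."""
    CHECKPOINT_FILE.write_text(json.dumps({"next_index": next_index}))

def tokenise_all(documents, tokenise_one, checkpoint_every: int = 100):
    """Tokenise `documents` in order, resuming from the last saved checkpoint."""
    start = load_checkpoint()
    for i, doc in enumerate(documents):
        if i < start:
            continue  # already processed in an earlier run
        tokenise_one(doc)
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(i + 1)
    save_checkpoint(len(documents))
```

The same idea would need a stable document ordering (or stable IDs) so that "skip the first N" means the same thing across runs.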