
v2.1.3 New tokenization workflow, speedups, time signature and PyTorch data loading module

@Natooz Natooz released this 17 Aug 11:17
· 175 commits to main since this release

This big update brings a few important changes and improvements.

A new common tokenization workflow for all tokenizers.

We now distinguish three types of tokens:

  1. Global MIDI tokens, which represent attributes and events affecting the music globally, such as the tempo or time signature;
  2. Track tokens, representing values of distinct tracks such as the notes, chords or effects;
  3. Time tokens, which serve to structure and place the previous categories of tokens in time.

All tokenizations now follow the same pattern:

  1. Preprocess the MIDI;
  2. Gather global MIDI events (tempo...);
  3. Gather track events (notes, chords);
  4. If "one token stream", concatenate all global and track events and sort them by time of occurrence. Else, concatenate the global events to each sequence of track events;
  5. Deduce the time events for all the sequences of events (only one if "one token stream");
  6. Return the tokens, as a combination of list of strings and list of integers (token ids).

This considerably cleans up the code (DRY, fewer redundant methods) and brings speedups, as the number of calls to sorting methods has been reduced.
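Steps 4 to 6 of the workflow above can be illustrated with a minimal, self-contained sketch. It does not use MidiTok: the `Event` class, the `add_time_events` and `tokenize` functions and the `TimeShift` token format are hypothetical names chosen for illustration, and only string tokens are returned (no token ids).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event container, not MidiTok's own class."""
    type_: str  # e.g. "Tempo" or "Pitch"
    value: int
    time: int   # onset, in ticks


def add_time_events(events):
    """Step 5: insert a TimeShift event wherever the time advances."""
    out, previous_time = [], 0
    for event in events:
        if event.time > previous_time:
            shift = event.time - previous_time
            out.append(Event("TimeShift", shift, previous_time))
            previous_time = event.time
        out.append(event)
    return out


def tokenize(global_events, tracks_events, one_token_stream):
    """Steps 4 to 6, applied to events that were already gathered."""
    if one_token_stream:
        merged = list(global_events)
        for track in tracks_events:
            merged.extend(track)
        # sorted() is stable: events sharing an onset keep their gathering order.
        sequences = [sorted(merged, key=lambda e: e.time)]
    else:
        # The global events are concatenated to each track's events.
        sequences = [
            sorted(list(global_events) + list(track), key=lambda e: e.time)
            for track in tracks_events
        ]
    sequences = [add_time_events(seq) for seq in sequences]
    return [[f"{e.type_}_{e.value}" for e in seq] for seq in sequences]


global_events = [Event("Tempo", 120, 0), Event("Tempo", 90, 480)]
tracks_events = [
    [Event("Pitch", 60, 0), Event("Pitch", 64, 480)],
    [Event("Pitch", 36, 240)],
]
single_stream = tokenize(global_events, tracks_events, one_token_stream=True)
per_track = tokenize(global_events, tracks_events, one_token_stream=False)
```

In this sketch, each sequence is sorted exactly once, after all events have been gathered, which is the kind of saving the refactor aims for.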

TL;DR: other changes

  • New submodule pytorch_data offering PyTorch Dataset objects and a data collator, to be used when training a PyTorch model. Learn more in the documentation of the module;
  • MIDILike, CPWord and Structured now natively handle Program tokens in a multitrack / one_token_stream way;
  • Time signature changes are now handled by TSD, MIDILike and CPWord;
  • The time_signature_range config option is now more flexible / convenient.
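As a rough illustration of the native Program token handling, the changelog entry for #62 states that a Program token is added before each Pitch/NoteOn/NoteOff token. The sketch below mimics that behaviour on plain strings; `interleave_programs` and the `Type_value` token format are assumptions made for the example, not MidiTok's API.

```python
NOTE_TOKEN_TYPES = {"Pitch", "NoteOn", "NoteOff"}


def interleave_programs(track_tokens, program):
    """Prefix every note token with a Program token (hypothetical helper)."""
    out = []
    for token in track_tokens:
        token_type = token.split("_")[0]
        if token_type in NOTE_TOKEN_TYPES:
            out.append(f"Program_{program}")
        out.append(token)
    return out


tokens = interleave_programs(
    ["Pitch_60", "Velocity_100", "Duration_1.0", "TimeShift_1.0", "Pitch_64"],
    program=0,
)
```

Because every note carries its own Program token, the tokens of several tracks can share a single stream without losing the instrument of each note.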

Changelog

  • #61 new pytorch_data submodule, with DatasetTok and DatasetJsonIO classes. This module is only loaded if torch is installed in the Python environment;
  • #61 tokenize_midi_dataset() method now has a tokenizer_config_file_name argument, which allows saving the tokenizer config under a custom file name;
  • #61 "all-in-one" DataCollator object to be used with PyTorch DataLoaders;
  • #62 Structured and MIDILike now natively handle Program tokens. When config.use_programs is set to True, a Program token is added before each Pitch/NoteOn/NoteOff token to indicate its instrument. In this case, MIDIs are also treated as a single stream of tokens, whereas otherwise each track is converted into an independent token sequence;
  • #62 miditok.utils.remove_duplicated_notes method can now remove notes with the same pitch and onset time, regardless of their offset time / duration;
  • #62 miditok.utils.merge_same_program_tracks is now called in preprocess_midi when config.use_programs is True;
  • #62 Big refactor of the REMI codebase, which now has all the features of REMIPlus, along with code cleanup and speedups (fewer calls to sorting). The REMIPlus class is now basically only a wrapper around REMI with programs and time signatures enabled;
  • #62 TSD and MIDILike now encode and decode time signature changes;
  • #63 @ilya16 Tempos can now be created on a logarithmic scale, instead of the default linear scale;
  • c53a008 and 5d1c12e track_to_tokens and tokens_to_track methods are now partially removed. They are now protected in the classes that still rely on them, and removed from the others. These methods were made for internal calls and were not recommended for direct use. The midi_to_tokens method is recommended instead;
  • #65 @ilya16 changes time_signature_range into a dictionary {denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)};
  • #65 @ilya16 fix in the formula computing the number of ticks per bar;
  • #66 Adds an option to TokenizerConfig to delete the successive tempo / time signature changes carrying the same value during MIDI preprocessing;
  • #66 now using xdist for tests, big speedup on GitHub Actions (ty @ilya16 !);
  • #66 CPWord and Octuple now follow the common tokenization workflow;
  • #66 As a consequence of the previous point, OctupleMono is removed, as there were no records of its use. It is now equivalent to Octuple without config.use_programs;
  • #66 CPWord now handles time signature changes;
  • #66 tests for tempo and time signature changes are now more robust, exceptions were removed and fixed;
  • 5a6378b save_tokens now by default doesn't save programs if config.use_programs is False.
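To make the difference between the linear and logarithmic tempo scales from #63 concrete, here is a minimal sketch that computes both sets of tempo values. The `tempo_bins` function and its parameters are hypothetical and are not MidiTok's implementation; the logarithmic variant simply spaces values by a constant ratio.

```python
def tempo_bins(min_tempo, max_tempo, num_tempos, log=False):
    """Return num_tempos values between min_tempo and max_tempo (inclusive).

    Hypothetical helper: linear spacing uses a constant difference,
    logarithmic spacing uses a constant ratio between neighbours.
    """
    if num_tempos < 2:
        raise ValueError("num_tempos must be at least 2")
    if log:
        ratio = (max_tempo / min_tempo) ** (1 / (num_tempos - 1))
        return [round(min_tempo * ratio**i, 2) for i in range(num_tempos)]
    step = (max_tempo - min_tempo) / (num_tempos - 1)
    return [round(min_tempo + step * i, 2) for i in range(num_tempos)]


linear_tempos = tempo_bins(40, 160, 3)
log_tempos = tempo_bins(40, 160, 3, log=True)
```

With these inputs, the middle value is 100 on the linear scale and 80 on the logarithmic one: a logarithmic scale gives finer resolution to slow tempos, where a given difference in BPM is a larger relative change.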

Compatibility

  • Calls to track_to_tokens and tokens_to_track methods are not supported anymore. If you used these methods, you may replace them with midi_to_tokens and tokens_to_midi (or just call the tokenizer) while selecting the appropriate token sequences / tracks;
  • time_signature_range now needs to be given as a dictionary;
  • Due to changes in the order of the vocabularies of Octuple (as programs are now optional), tokenizers and tokens made with previous versions will not be compatible unless the vocabulary order is swapped (index 3 moved to index 5).
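For the new time_signature_range format, the changelog describes a dictionary mapping each denominator to either a list of numerators or a (min, max) tuple. The sketch below expands such a dictionary into explicit (numerator, denominator) pairs. `expand_time_signatures` is a hypothetical helper, and treating the tuple bounds as inclusive is an assumption of this example.

```python
def expand_time_signatures(time_signature_range):
    """Expand {denominator: numerators} into (numerator, denominator) pairs.

    Hypothetical helper. A list is taken as explicit numerators; a tuple is
    taken as a (min, max) range, assumed here to be inclusive on both ends.
    """
    signatures = []
    for denominator, numerators in time_signature_range.items():
        if isinstance(numerators, tuple):
            low, high = numerators
            numerators = range(low, high + 1)
        signatures.extend((numerator, denominator) for numerator in numerators)
    return signatures


time_signature_range = {8: [3, 6, 12], 4: (2, 4)}
signatures = expand_time_signatures(time_signature_range)
```

Here the dictionary covers 3/8, 6/8 and 12/8 through an explicit list, and 2/4 through 4/4 through a range.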