Update Readme for V3 (#122)
* update readme and contribution guide

* text update
Natooz authored Jan 17, 2024
1 parent 3cb6b97 commit 3f2c372
Showing 2 changed files with 33 additions and 41 deletions.
20 changes: 6 additions & 14 deletions CONTRIBUTING.md
- Proposing new features.
- Becoming a maintainer.

## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests

Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
5. Make sure your code lints.
6. Issue that pull request!
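The steps above can be sketched as a typical Github Flow command sequence (a sketch; the fork URL and branch name are placeholders):

```shell
# Fork the repository on GitHub first ("your-user" and "my-feature" are placeholders)
git clone https://github.com/your-user/MidiTok
cd MidiTok
git checkout -b my-feature      # create a branch from main
# ... edit code, add tests and update the documentation ...
git commit -am "Add my feature"
git push -u origin my-feature   # then open a pull request on GitHub
```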

## Report bugs using GitHub's [issues](https://github.com/Natooz/MidiTok/issues)

We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/Natooz/MidiTok/issues/new).


### Tests

We use `pytest`/`pytest-xdist` for testing and `pytest-cov` for measuring coverage. Running the full test suite can take between 10 and 30 minutes depending on your hardware. You don't need to run all of the tests, but try to run those affected by your changes.

```bash
pip install pytest-cov "pytest-xdist[psutil]"
pytest --cov=./ --cov-report=xml -n auto --durations=0 -v tests/
```
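To run only the tests affected by your changes, `pytest` can select tests by file or by name expression (the module name and keywords below are hypothetical examples):

```shell
pytest -n auto -v tests/test_tokenize.py        # run a single test module (hypothetical name)
pytest -n auto -v -k "remi and not bpe" tests/  # select tests by name expression
```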

### Use a Consistent Coding Style

We use the [ruff](https://github.com/astral-sh/ruff) formatter for Python in this project. Ruff analyzes the code and can automatically format it according to the configured rules. This is handled with pre-commit (see the following section).

### Pre-commit Lints

Linting is configured via [pre-commit](https://www.pre-commit.com/). You can set up pre-commit by running:

54 changes: 27 additions & 27 deletions README.md
# MidiTok

Python package to tokenize MIDI music files, presented at the ISMIR 2021 LBDs.

![MidiTok Logo](docs/assets/logo.png?raw=true "")

[![Downloads](https://static.pepy.tech/badge/miditok)](https://pepy.tech/project/MidiTok)
[![Code style](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

MidiTok can tokenize MIDI files, i.e. convert them into sequences of tokens ready to be fed to models such as Transformers, for any generation, transcription or MIR task.
MidiTok features most known [MIDI tokenizations](https://miditok.readthedocs.io/en/latest/tokenizations.html) (e.g. [REMI](https://arxiv.org/abs/2002.00212), [Compound Word](https://arxiv.org/abs/2101.02402)...), and is built around the idea that they all share common parameters and methods. It supports [Byte Pair Encoding (BPE)](https://arxiv.org/abs/2301.11975) and data augmentation.

**Documentation:** [miditok.readthedocs.com](https://miditok.readthedocs.io/en/latest/index.html)
```shell
pip install miditok
```
MidiTok uses [Symusic](https://github.com/Yikai-Liao/symusic) to read and write MIDI files, and BPE is backed by [Hugging Face 🤗tokenizers](https://github.com/huggingface/tokenizers) for super-fast encoding.

## Usage example

Below is a complete yet concise example of how you can use MidiTok. And [here](colab-notebooks/Full_Example_HuggingFace_GPT2_Transformer.ipynb) is a simple notebook example showing how to use Hugging Face models to generate music, with MidiTok taking care of tokenizing MIDIs.

```python
from miditok import REMI, TokenizerConfig
from miditok.pytorch_data import DatasetTok, DataCollator
from pathlib import Path
from symusic import Score
from torch.utils.data import DataLoader

# Creating a multitrack tokenizer configuration, read the doc to explore other parameters
config = TokenizerConfig(num_velocities=16, use_chords=True, use_programs=True)
tokenizer = REMI(config)

# Loads a MIDI, converts it to tokens, and back to a MIDI
midi = Score("path/to/your_midi.mid")
tokens = tokenizer(midi)  # calling the tokenizer will automatically detect MIDIs, paths and tokens
converted_back_midi = tokenizer(tokens)  # PyTorch / Tensorflow / Numpy tensors supported

# Trains the tokenizer with BPE, and saves it to load it back later
midi_paths = list(Path("path", "to", "midis").glob("**/*.mid"))
tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_paths)
tokenizer.save_params(Path("path", "to", "save", "tokenizer.json"))
# And pushes it to the Hugging Face hub (you can download it back with .from_pretrained)
tokenizer.push_to_hub("username/model-name", private=True, token="your_hf_token")

# Creates a Dataset and a collator to be used with a PyTorch DataLoader to train a model
dataset = DatasetTok(
    files_paths=midi_paths,
    min_seq_len=100,
    max_seq_len=1024,
    tokenizer=tokenizer,
)
collator = DataCollator(
    tokenizer["PAD_None"], tokenizer["BOS_None"], tokenizer["EOS_None"]
)
data_loader = DataLoader(dataset=dataset, collate_fn=collator)
for batch in data_loader:
    print("Train your model on this batch...")
```

## Tokenizations
Contributions are gratefully welcomed, feel free to open an issue or send a PR.

* Extend unimplemented additional tokens to all compatible tokenizations;
* Control Change messages;
* Speeding up the MIDI preprocessing and global/track event parsing with a Rust or C++ binding.

## Citation

