Update Readme for V3 (#122)
* update readme and contribution guide

* text update
Natooz authored Jan 17, 2024
1 parent 3cb6b97 commit 3f2c372
Showing 2 changed files with 33 additions and 41 deletions.
20 changes: 6 additions & 14 deletions CONTRIBUTING.md
- Proposing new features.
- Becoming a maintainer.

## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests

Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
5. Make sure your code lints.
6. Issue that pull request!
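The steps above can be sketched as a typical Github Flow command sequence (a sketch; the fork URL and branch name are placeholders):

```shell
# Fork the repository on GitHub first ("your-user" and "my-feature" are placeholders)
git clone https://github.com/your-user/MidiTok
cd MidiTok
git checkout -b my-feature      # create a branch from main
# ... edit code, add tests and update the documentation ...
git commit -am "Add my feature"
git push -u origin my-feature   # then open a pull request on GitHub
```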

## Report bugs using GitHub's [issues](https://github.com/Natooz/MidiTok/issues)

We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/Natooz/MidiTok/issues/new).


### Tests

We use `pytest`/`pytest-xdist` for testing and `pytest-cov` for measuring coverage. Running the full test suite can take between 10 and 30 minutes depending on your hardware. You don't need to run all of the tests, but try to run those affected by your changes.

```bash
pip install pytest-cov "pytest-xdist[psutil]"
pytest --cov=./ --cov-report=xml -n auto --durations=0 -v tests/
```
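To run only the tests affected by your changes, `pytest` can select tests by file or by name expression (the module name and keywords below are hypothetical examples):

```shell
pytest -n auto -v tests/test_tokenize.py        # run a single test module (hypothetical name)
pytest -n auto -v -k "remi and not bpe" tests/  # select tests by name expression
```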

### Use a Consistent Coding Style

We use the [ruff](https://github.com/astral-sh/ruff) formatter for Python in this project. Ruff analyzes the code and can automatically format it according to the configured rules. This is handled with pre-commit (see the following section).

### Pre-commit Lints

Linting is configured via [pre-commit](https://www.pre-commit.com/). You can set up pre-commit by running:

54 changes: 27 additions & 27 deletions README.md
# MidiTok

Python package to tokenize MIDI music files, presented at the ISMIR 2021 LBDs.

![MidiTok Logo](docs/assets/logo.png?raw=true "")

[![Downloads](https://static.pepy.tech/badge/miditok)](https://pepy.tech/project/MidiTok)
[![Code style](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

MidiTok can tokenize MIDI files, i.e. convert them into sequences of tokens ready to be fed to models such as Transformers, for any generation, transcription or MIR task.
MidiTok features most known [MIDI tokenizations](https://miditok.readthedocs.io/en/latest/tokenizations.html) (e.g. [REMI](https://arxiv.org/abs/2002.00212), [Compound Word](https://arxiv.org/abs/2101.02402)...), and is built around the idea that they all share common parameters and methods. It supports [Byte Pair Encoding (BPE)](https://arxiv.org/abs/2301.11975) and data augmentation.

**Documentation:** [miditok.readthedocs.com](https://miditok.readthedocs.io/en/latest/index.html)
```shell
pip install miditok
```
MidiTok uses [Symusic](https://github.com/Yikai-Liao/symusic) to read and write MIDI files, and BPE is backed by [Hugging Face 🤗tokenizers](https://github.com/huggingface/tokenizers) for super-fast encoding.

## Usage example

Below is a complete yet concise example of how you can use MidiTok. And [here](colab-notebooks/Full_Example_HuggingFace_GPT2_Transformer.ipynb) is a simple notebook example showing how to use Hugging Face models to generate music, with MidiTok taking care of tokenizing MIDIs.

```python
from miditok import REMI, TokenizerConfig
from miditok.pytorch_data import DatasetTok, DataCollator
from pathlib import Path
from symusic import Score
from torch.utils.data import DataLoader

# Creating a multitrack tokenizer configuration, read the doc to explore other parameters
config = TokenizerConfig(num_velocities=16, use_chords=True, use_programs=True)
tokenizer = REMI(config)

# Loads a MIDI, converts it to tokens, and back to a MIDI
midi = Score("path/to/your_midi.mid")
tokens = tokenizer(midi)  # calling the tokenizer will automatically detect MIDIs, paths and tokens
converted_back_midi = tokenizer(tokens)  # PyTorch / Tensorflow / Numpy tensors supported

# Trains the tokenizer with BPE, and saves it to load it back later
midi_paths = list(Path("path", "to", "midis").glob("**/*.mid"))
tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_paths)
tokenizer.save_params(Path("path", "to", "save", "tokenizer.json"))
# And pushes it to the Hugging Face hub (you can download it back with .from_pretrained)
tokenizer.push_to_hub("username/model-name", private=True, token="your_hf_token")

# Creates a Dataset and a collator to be used with a PyTorch DataLoader to train a model
dataset = DatasetTok(
    files_paths=midi_paths,
    min_seq_len=100,
    max_seq_len=1024,
    tokenizer=tokenizer,
)
collator = DataCollator(
    tokenizer["PAD_None"], tokenizer["BOS_None"], tokenizer["EOS_None"]
)
data_loader = DataLoader(dataset=dataset, collate_fn=collator)
for batch in data_loader:
    print("Train your model on this batch...")
```

## Tokenizations
Contributions are gratefully welcomed, feel free to open an issue or send a PR.

* Extend unimplemented additional tokens to all compatible tokenizations;
* Control Change messages;
* Speeding up the MIDI preprocessing and global/track event parsing with a Rust or C++ binding.

## Citation

