-
-
Notifications
You must be signed in to change notification settings - Fork 792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Push to Hub functionnality to Model and Pipeline #1699
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General functionality looks good! Left some design thoughts on how we can improve ease-of-use (e.g. saving the embedding/segmentation models when we push the pipeline)
pyannote/audio/core/pipeline.py
Outdated
segmentation_model = self.segmentation_model | ||
|
||
# Modify the config with new segmentation and embedding models: | ||
config["pipeline"]["params"]["embedding"] = embedding_model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline: an elegant solution would be to save both the embedding and segmentation models to subfloders in the repo (embedding
and segmentation
respectively), and then load the weights from these subfolders when we call .from_pretrained
Your repo structure for kamilakesbi/spd_pipeline_test
could look something like the following:
├── config.yaml <- Top-level pipeline config
├── embedding <- Subfolder for the embedding model
| ├── config.yaml
| ├── pytorch_model.bin
├── segmentation <- Subfolder for the segmentation model
| ├── config.yaml
| ├── pytorch_model.bin
And your top-level yaml file could have an extra entry:
embedding: kamilakesbi/spd_pipeline_test
embedding_subfolder: embedding
...
segmentation: kamilakesbi/spd_pipeline_test
segmentation_subfolder: segmentation
Note that this would require updating .from_pretrained
to handle this extra subfolder logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I handled this differently:
- If we want to save the checkpoints, we add a
save_checkpoints=True
parameter topipeline.push_to_hub
. We would then get a repo structure like the one you proposed @sanchit-gandhi, but the top yaml file would look like this:
checkpoints: True
params:
- embedding: 'kamilakesbi/speaker-embedding-test'
- embedding: 'kamilakesbi/speaker-segmentation-test'
- If we don't want to store checkpoints on the hub, then we need pointers to the segmentation and embedding models on the hub. In this case the config file would look like this:
checkpoints: False
params:
+ embedding: 'kamilakesbi/speaker-embedding-test'
+ embedding: 'kamilakesbi/speaker-segmentation-test'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, saving all sub-models means the model repo on the Hub is fully portable -> users can clone the repository and have all sub-models available to them locally
This is the design that was adopted for diffusers pipelines and has worked very well
Thus, we'll assume that any new checkpoints being pushed will follow this new repo structure, with an exception for current pipelines on the Hub that leverage components from multiple repositories
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a few suggestions regarding the save structure. Can we add relevant tests as well? (both for the model and the pipeline)
pyannote/audio/core/model.py
Outdated
repo_type="model", | ||
) | ||
|
||
model_type = str(type(self)).split("'")[1].split(".")[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there not a model attribute or config param we can use to get this in a more robust way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I'm aware of... but it would be great!
pyannote/audio/core/pipeline.py
Outdated
segmentation_model = self.segmentation_model | ||
|
||
# Modify the config with new segmentation and embedding models: | ||
config["pipeline"]["params"]["embedding"] = embedding_model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, saving all sub-models means the model repo on the Hub is fully portable -> users can clone the repository and have all sub-models available to them locally
This is the design that was adopted for diffusers pipelines and has worked very well
Thus, we'll assume that any new checkpoints being pushed will follow this new repo structure, with an exception for current pipelines on the Hub that leverage components from multiple repositories
Hi @hbredin, It would be nice if you had time to do a review on this PR so that we can iterate on it :) Thank you! |
Apologies, I will eventually have a look it but I really don't have the bandwidth right now. |
Hey @hbredin! No rush on reviewing this PR, whenever you get the chance we'd love to hear your feedback on the proposed changes! Otherwise, is there another maintainer who could give a quick review in the meantime? |
Hey @sanchit-gandhi. I understand the frustration but I am actually the sole maintainer and also have many other hats. I am doing my best but have other priorities right now (like the upcoming 3.3.0 release with speech separation support). |
# If hub repo contains subfolders, load models and pipeline: | ||
embedding = Model.from_pretrained(model_id, subfolder="embedding") | ||
segmentation = Model.from_pretrained(model_id, subfolder="segmentation") | ||
pipeline = Klass(**params) | ||
pipeline.embedding = embedding | ||
pipeline.segmentation_model = segmentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seem way too specific to pyannote/speaker-diarization-3.1
.
I'd like to find a better (= more generic) way
We could do something like preprend subfolders by @model
(or anything that makes sense) to indicate to Pipeline.from_pretrained
that a model should be loaded from corresponding subfolders.
pipeline:
name: pyannote.audio.pipelines.SpeakerDiarization
params:
clustering: AgglomerativeClustering
embedding: @model/embedding
segmentation: @model/segmentation
Similarly, we could use @pipeline/
to load sub-pipelines, and later @whatever
if we ever want to add new pyannote stuff (I already have one in mind that I cannot really talk about right now).
Hi @hbredin,
I've started working on adding a
push_to_hub
method to both Model and Pipeline classes. It will hopefully help users push their custom pyannote speaker-segmentation and speaker-embedding models to the Hugging Face Hub, and use them within custom speaker-diarization pipelines.In this PR, I've added:
1. A
push_to_hub
method to the base Model class:The method is compatible with both the pyannote
PyanNet
segmentation model andWeSpeakerResNet34
speaker embedding model. It will:Save the state dict in a
pytorch_model.bin
file.Write a
config.yaml
file similar to the one of pyannote/segmentation-3.0 or pyannote/wespeaker-voxceleb-resnet34-LM.Write a minimal Readme file (which we can work on together), and add appropriate tags and licence.
I've tested the method using the following scripts:
Here is the result :)
Note: I've used the diarizers library here to first load a fine-tuned speaker segmentation model from the Hugging Face Hub, convert it to a pyannote format, and push it to the Hub.
Here is the result :)
2. A
push_to_hub
method to the base Pipeline class:The method can be used like this:
We can also push the model's checkpoints using:
Note that this is still a work in progress :) I can make changes to the code and adapt it to pyannote's needs!
Hope that this PR will be useful to pyannote.