serializing a model built with partial_fit #2196

Open
chadlillian opened this issue Oct 22, 2024 · 2 comments

Labels
bug Something isn't working

@chadlillian

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I have built a model with partial_fit (using the code found in the documentation). I then serialize the model with pickle and with safetensors. Then I load the model.

The loaded pickled model works only if I have executed partial_fit a few times (<5) and throws the first error below if I have executed partial_fit many times (>60). The loaded safetensors model always throws the second error below.

ERROR 1 ##############################
Traceback (most recent call last):
File "", line 1, in
File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/bertopic.py", line 1218, in approximate_distribution
similarity = cosine_similarity(c_tf_idf_doc, self.c_tf_idf
[self._outliers:])
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
return func(*args, **kwargs)
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 1657, in cosine_similarity
X, Y = check_pairwise_arrays(X, Y)
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 164, in check_pairwise_arrays
X = check_array(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 917, in check_array
array = _ensure_sparse_format(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 593, in _ensure_sparse_format
_assert_all_finite(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
_assert_all_finite_element_wise(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
raise ValueError(msg_err)
ValueError: Input contains infinity or a value too large for dtype('float64').

ERROR 2 #################################

Traceback (most recent call last):
File "", line 1, in
File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1216, in approximate_distribution
bow_doc = self.vectorizer_model.transform(all_sentences)
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1431, in transform
self._check_vocabulary()
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 508, in _check_vocabulary
raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

Reproduction

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
import pandas as pd
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic.vectorizers import OnlineCountVectorizer

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model, vectorizer_model=vectorizer_model,embedding_model=embedding_model)

frac = 0.01
nh = -1
topics = []
for i,hk in enumerate(hdfkeys[:nh]):
    df = pd.read_hdf('wiki_pages_s.hdf',key=hk)

    dfi = df.sample(frac=frac,weights=df['num_words'])
    docs = dfi['text'].tolist()
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

size = 10
topic_model.topics_ = topics
topic_model.save('model_p_%i.pkl'%size, serialization="pickle")
topic_model.save('model_p_%i'%size, serialization="safetensors",save_embedding_model=embedding_model,save_ctfidf=True)

######################
Loading Model:
from bertopic import BERTopic
import pandas as pd

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

df = pd.read_hdf('wiki_pages_s.hdf',key=hdfkeys[0])
docs = df['text'].iloc[:10]

#lm = BERTopic.load("model_p_10")
lm = BERTopic.load("model_p_10.pkl")
lm.approximate_distribution(docs)

BERTopic Version

0.16.1

chadlillian added the bug label Oct 22, 2024
@chadlillian (Author)

I added this to the end of my code before serialization; it runs with no errors, but I haven't validated the results yet.

from sklearn.feature_extraction.text import CountVectorizer

z = CountVectorizer()
z.vocabulary_ = topic_model.vectorizer_model.vocabulary_
topic_model.vectorizer_model = z
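
One way to validate that swap (a hypothetical check, not from the thread): before the final assignment, compare the rebuilt vectorizer's output against the fitted OnlineCountVectorizer on a few documents. transform only looks tokens up in vocabulary_, so the counts should match exactly.

# Hypothetical validation sketch: run before the final assignment above,
# while topic_model.vectorizer_model is still the fitted OnlineCountVectorizer.
sample = docs[:5]
bow_online = topic_model.vectorizer_model.transform(sample)
bow_plain = z.transform(sample)

# Both are scipy sparse matrices; an empty difference means identical counts.
assert (bow_online != bow_plain).nnz == 0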

@MaartenGr (Owner)

Thank you for sharing this. The model saved with safetensors will indeed not work, since that serialization method does not save the underlying dimensionality reduction and clustering algorithms.

I am, however, surprised that this does not work with pickle, since it should save the entire state. It seems you are using an older version of BERTopic; could you try a newer version?

Ah, it might just be that the decay parameter is set too high and that after too many iterations, entire rows get 0 values.
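
One hedged way to test that hypothesis (my sketch, not from the thread; lm is the loaded pickled model from the reproduction above): count non-finite entries and all-zero rows in the c-TF-IDF matrix, which is exactly what the cosine_similarity call in Error 1 trips over.

# Hypothetical diagnostic: inspect the loaded model's c-TF-IDF matrix for
# values that would fail sklearn's finiteness check.
import numpy as np

ctfidf = lm.c_tf_idf_  # scipy sparse matrix on a fitted/loaded BERTopic model
print("non-finite stored entries:", int((~np.isfinite(ctfidf.data)).sum()))
print("all-zero rows:", int((ctfidf.getnnz(axis=1) == 0).sum()))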
