serializing a model built with partial_fit #2196

Open
chadlillian opened this issue Oct 22, 2024 · 2 comments

Labels
bug Something isn't working

@chadlillian

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I have built a model with partial_fit (using the code found in the documentation). I then serialize the model with pickle and with safetensors. Then I load the model.

The loaded pickled model works only if I have executed partial_fit a few times (<5) and throws the first error below if I have executed partial_fit many times (>60). The loaded safetensors model always throws the second error below.

ERROR 1 ##############################
Traceback (most recent call last):
File "", line 1, in
File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/bertopic.py", line 1218, in approximate_distribution
similarity = cosine_similarity(c_tf_idf_doc, self.c_tf_idf
[self._outliers:])
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
return func(*args, **kwargs)
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 1657, in cosine_similarity
X, Y = check_pairwise_arrays(X, Y)
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 164, in check_pairwise_arrays
X = check_array(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 917, in check_array
array = _ensure_sparse_format(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 593, in _ensure_sparse_format
_assert_all_finite(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
_assert_all_finite_element_wise(
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
raise ValueError(msg_err)
ValueError: Input contains infinity or a value too large for dtype('float64').

ERROR 2 #################################

Traceback (most recent call last):
File "", line 1, in
File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1216, in approximate_distribution
bow_doc = self.vectorizer_model.transform(all_sentences)
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1431, in transform
self._check_vocabulary()
File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 508, in _check_vocabulary
raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

Reproduction

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
import pandas as pd
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic.vectorizers import OnlineCountVectorizer

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model, vectorizer_model=vectorizer_model,embedding_model=embedding_model)

frac = 0.01
nh = -1
topics = []
for i,hk in enumerate(hdfkeys[:nh]):
    df = pd.read_hdf('wiki_pages_s.hdf',key=hk)

    dfi = df.sample(frac=frac,weights=df['num_words'])
    docs = dfi['text'].tolist()
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

size = 10
topic_model.topics_ = topics
topic_model.save('model_p_%i.pkl'%size, serialization="pickle")
topic_model.save('model_p_%i'%size, serialization="safetensors",save_embedding_model=embedding_model,save_ctfidf=True)

######################
Loading Model:
from bertopic import BERTopic
import pandas as pd

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

df = pd.read_hdf('wiki_pages_s.hdf',key=hdfkeys[0])
docs = df['text'].iloc[:10]

#lm = BERTopic.load("model_p_10")
lm = BERTopic.load("model_p_10.pkl")
lm.approximate_distribution(docs)

BERTopic Version

0.16.1

chadlillian added the bug label Oct 22, 2024
@chadlillian (Author)

I added this to the end of my code before serialization; it runs with no errors, but I haven't validated the results yet.

from sklearn.feature_extraction.text import CountVectorizer

z = CountVectorizer()
z.vocabulary_ = topic_model.vectorizer_model.vocabulary_
topic_model.vectorizer_model = z
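
One way to validate that swap (a hypothetical check, not from the thread): before the final assignment, compare the rebuilt vectorizer's output against the fitted OnlineCountVectorizer on a few documents. transform only looks tokens up in vocabulary_, so the counts should match exactly.

# Hypothetical validation sketch: run before the final assignment above,
# while topic_model.vectorizer_model is still the fitted OnlineCountVectorizer.
sample = docs[:5]
bow_online = topic_model.vectorizer_model.transform(sample)
bow_plain = z.transform(sample)

# Both are scipy sparse matrices; an empty difference means identical counts.
assert (bow_online != bow_plain).nnz == 0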

@MaartenGr (Owner)

Thank you for sharing this. The model saved with safetensors will indeed not work, since that serialization method does not save the underlying dimensionality reduction and clustering algorithms.

I am, however, surprised that this does not work with pickle, since it should save the entire state. It seems you are using an older version of BERTopic; could you try a newer version?

Ah, it might just be that the decay parameter is set too high and that after too many iterations, entire rows get 0 values.
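
One hedged way to test that hypothesis (my sketch, not from the thread; lm is the loaded pickled model from the reproduction above): count non-finite entries and all-zero rows in the c-TF-IDF matrix, which is exactly what the cosine_similarity call in Error 1 trips over.

# Hypothetical diagnostic: inspect the loaded model's c-TF-IDF matrix for
# values that would fail sklearn's finiteness check.
import numpy as np

ctfidf = lm.c_tf_idf_  # scipy sparse matrix on a fitted/loaded BERTopic model
print("non-finite stored entries:", int((~np.isfinite(ctfidf.data)).sum()))
print("all-zero rows:", int((ctfidf.getnnz(axis=1) == 0).sum()))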
