Describe the bug

I have built a model with partial_fit (using the code found in the documentation). I then serialize the model with both pickle and safetensors, and load it back.

The loaded pickled model works only if I executed partial_fit a few times (<5), and throws the first error below if I executed partial_fit many times (>60). The loaded safetensors model always throws the second error below.
ERROR 1 ##############################
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1218, in approximate_distribution
    similarity = cosine_similarity(c_tf_idf_doc, self.c_tf_idf[self._outliers:])
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 1657, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/metrics/pairwise.py", line 164, in check_pairwise_arrays
    X = check_array(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 917, in check_array
    array = _ensure_sparse_format(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 593, in _ensure_sparse_format
    _assert_all_finite(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains infinity or a value too large for dtype('float64').
ERROR 2 #################################
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chad/anaconda/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1216, in approximate_distribution
    bow_doc = self.vectorizer_model.transform(all_sentences)
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1431, in transform
    self._check_vocabulary()
  File "/home/chad/anaconda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 508, in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided
Reproduction
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
import pandas as pd
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic.vectorizers import OnlineCountVectorizer

# Collect the HDF keys for the document batches
d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

# Incremental components, as in the online topic modeling documentation
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model, embedding_model=embedding_model)

frac = 0.01
nh = -1
topics = []
for i, hk in enumerate(hdfkeys[:nh]):
    df = pd.read_hdf('wiki_pages_s.hdf', key=hk)
    # (the middle of this loop was lost when the issue was rendered; presumably,
    # as in the documentation example, each batch was fed to the model here:
    # topic_model.partial_fit(df['text'].sample(frac=frac))
    # topics.extend(topic_model.topics_))

size = 10
topic_model.topics_ = topics
topic_model.save('model_p_%i.pkl' % size, serialization="pickle")
topic_model.save('model_p_%i' % size, serialization="safetensors",
                 save_embedding_model=embedding_model, save_ctfidf=True)

######################
Loading Model:

from bertopic import BERTopic
import pandas as pd

d = pd.HDFStore('wiki_edgelist_s.hdf')
hdfkeys = d.keys()
d.close()

df = pd.read_hdf('wiki_pages_s.hdf', key=hdfkeys[0])
docs = df['text'].iloc[:10]

#lm = BERTopic.load("model_p_10")  # safetensors variant -> ERROR 2
lm = BERTopic.load("model_p_10.pkl")  # pickle variant -> ERROR 1 after many partial_fit calls
lm.approximate_distribution(docs)

BERTopic Version
0.16.1
Thank you for sharing this. The model saved with safetensors will indeed not work since it does not save the underlying dimensionality reduction and clustering algorithms.
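For what it's worth, this difference is easy to see by loading both artifacts and inspecting the sub-models. A minimal sketch, assuming the files saved by the reproduction script above exist:

from bertopic import BERTopic

lm_pickle = BERTopic.load("model_p_10.pkl")  # pickle: full Python object state
lm_safe = BERTopic.load("model_p_10")        # safetensors: topic representations only

# The pickle round trip should return the original IncrementalPCA /
# MiniBatchKMeans / OnlineCountVectorizer instances, while the safetensors
# round trip is expected to come back without the fitted versions of these.
for name in ("umap_model", "hdbscan_model", "vectorizer_model"):
    print(name, type(getattr(lm_pickle, name, None)), type(getattr(lm_safe, name, None)))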
I am, however, surprised that this does not work with pickle since it should save the entire state. It seems you are using an older version of BERTopic, could you try using a newer version?
Ah, it might just be that the decay parameter is set too high and that after too many iterations, entire rows get 0 values.
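If that is what is happening, the arithmetic is easy to check. A small sketch, under my reading of the OnlineCountVectorizer docs (not verified against the implementation) that decay shrinks the existing bag-of-words counts by a factor of (1 - decay) at each partial_fit call:

# With decay=.01 from the reproduction, a term that stops appearing keeps
# only (1 - 0.01) ** k of its count after k further partial_fit calls.
decay = 0.01
for k in (5, 60, 500):
    print(k, (1 - decay) ** k)   # ~0.951, ~0.547, ~0.0066

# If entire rows of the counts matrix decay to (near) zero, the c-TF-IDF
# normalization can divide by zero, which would explain the inf values
# that cosine_similarity rejects in ERROR 1.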