In building a Model2Vec model, I've been exploring different parameter configurations.
With that, I've also looked at the post-training regularization. I explored a similar problem space years back (see this article).
Back then I did something similar, except that the process weighted fastText embeddings. I found that BM25 weighting worked pretty well.
Not sure if you've explored this, but I did a quick prototype with a model I'm training and found a performance gain: the Pearson correlation coefficient (PCC) increased from 90.37 to 91.99.
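For context, the weight applied per token in the prototype below is the mean BM25 score of that token over the documents containing it, where BM25 is the standard formula:

$$
\text{BM25}(t, d) = \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
$$

Here $f(t, d)$ is the term frequency, $|d|$ the document length, $\mathrm{avgdl}$ the average document length, and $k_1$, $b$ the usual free parameters. The exact smoothing txtai uses may differ slightly.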
The code I used is below if you'd like to try it. This can be called instead of `weight_model`.
```python
import numpy as np

from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import train_model
from txtai.scoring import ScoringFactory
from tqdm import tqdm


def tokenize(tokenizer, texts):
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            yield (None, e.ids, None)


def weight(model, texts, pca, method):
    tokenizer = model.tokenizer

    # Build scoring index
    scoring = ScoringFactory.create({"method": method, "terms": True})
    scoring.index(tokenize(tokenizer, texts))

    # Calculate weights
    scores = {}
    for token in scoring.idf:
        _, weights = scoring.terms.weights(token)
        scores[token] = np.mean(weights)

    # Get weights array
    f = np.zeros(tokenizer.get_vocab_size())
    for uid, score in scores.items():
        f[uid] += score

    # Get embeddings
    w = model.embedding
    w = np.nan_to_num(w)

    # Apply PCA
    p = PCA(n_components=pca)
    w = p.fit_transform(w)

    # Apply weights
    w *= f[:, None]

    # Save embeddings to model and normalize
    model.embedding = w
    model.normalize = True

    return model


# Train the model
model = train_model(name, texts, vectors)

# Weight using BM25
weight(model, texts, 256, "bm25")
```
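For what it's worth, the weighted model can then be used like any other static model. A minimal usage sketch, assuming `train_model` returns something that behaves like model2vec's `StaticModel` (the sentences and output path below are made up):

```python
# Usage sketch, assuming the trained model behaves like model2vec's
# StaticModel (encode / save_pretrained are its standard API)
model = weight(model, texts, 256, "bm25")

# Hypothetical example sentences
embeddings = model.encode(["The quick brown fox", "jumps over the lazy dog"])
print(embeddings.shape)  # (2, 256) after the 256-component PCA

# Hypothetical output path
model.save_pretrained("weighted-model")
```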
The code above uses BM25 scoring from txtai, but there are other Python libraries available for BM25 scoring, or you could roll your own.
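If you'd rather avoid the txtai dependency, here's a rough NumPy-only sketch of the same idea. The function name and the IDF smoothing (a Lucene-style formulation) are my own choices, not necessarily what txtai does:

```python
import numpy as np

def bm25_token_weights(docs, vocab_size, k1=1.2, b=0.75):
    """Mean BM25 score per token id over the documents containing it.

    docs is a list of token-id lists (e.g. the e.ids from the tokenize
    generator above); k1 and b are the usual BM25 free parameters.
    """
    df = np.zeros(vocab_size)           # document frequency per token
    norm_tf_sum = np.zeros(vocab_size)  # sum of normalized tf components

    lengths = np.array([len(d) for d in docs], dtype=float)
    avgdl = lengths.mean()

    for d, dl in zip(docs, lengths):
        ids, counts = np.unique(d, return_counts=True)
        df[ids] += 1
        # Length-normalized BM25 term frequency component
        norm_tf_sum[ids] += (counts * (k1 + 1)) / (counts + k1 * (1 - b + b * dl / avgdl))

    # Lucene-style smoothed IDF
    n = len(docs)
    idf = np.log(1 + (n - df + 0.5) / (df + 0.5))

    # Mean over documents containing the token; 0 for unseen tokens
    mean_tf = np.divide(norm_tf_sum, df, out=np.zeros(vocab_size), where=df > 0)
    return idf * mean_tf
```

The result can be dropped in as the `f` array in `weight` above.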