Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BM25 Weighting #14

Open
davidmezzetti opened this issue Dec 27, 2024 · 0 comments
Open

BM25 Weighting #14

davidmezzetti opened this issue Dec 27, 2024 · 0 comments

Comments

@davidmezzetti
Copy link

In building a Model2Vec model, I've been exploring different parameter configurations.

With that, I've also looked at the post training regularization. I explored a similar problem space years back (see this article).

Back at that time, I did something similar, except that process weighted fastText embeddings. I found that BM25 weighting worked pretty well.

Not sure if you've explored this but I did a quick prototype with a model I'm training and found a performance gain - the pearson correlation coefficient (PCC) increased from 90.37 to 91.99.

The code I used is below if you'd like to try. This can be called instead of weight_model.

import numpy as np

from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import train_model
from txtai.scoring import ScoringFactory
from tqdm import tqdm

def tokenize(tokenizer, texts):
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            yield (None, e.ids, None)

def weight(model, texts, pca, method):
    tokenizer = model.tokenizer

    # Build scoring index
    scoring = ScoringFactory.create({"method": method, "terms": True})
    scoring.index(tokenize(tokenizer, texts))

    # Calculate weights
    scores = {}
    for token in scoring.idf:
        _, weights = scoring.terms.weights(token)
        scores[token] = np.mean(weights)

    # Get weights array
    f = np.zeros(tokenizer.get_vocab_size())
    for uid, score in scores.items():
        f[uid] += score

    # Get embeddings
    w = model.embedding
    w = np.nan_to_num(w)

    # Apply PCA
    p = PCA(n_components=pca)
    w = p.fit_transform(w)

    # Apply weights
    w *= f[:, None]

    # Save embeddings to model and normalize
    model.embedding = w
    model.normalize = True

    return model

# Train the model
model = train_model(name, texts, vectors)

# Weight using BM25
weight(model, texts, 256, "bm25")

The code above uses BM25 scoring from txtai but there are other Python libraries available as well from BM25 scoring or you could roll your own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant