new: Added jina embedding v3 #428

hh-space-invader · 2024-12-19T10:58:19Z

To compute canonical vectors:

import numpy as np
import onnxruntime
from transformers import AutoTokenizer


def mean_pooling(model_output: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Average the token embeddings, ignoring padded positions.
    token_embeddings = model_output
    input_mask_expanded = np.expand_dims(attention_mask, axis=-1)
    input_mask_expanded = np.broadcast_to(input_mask_expanded, token_embeddings.shape)
    sum_embeddings = np.sum(token_embeddings * input_mask_expanded, axis=1)
    # Clip the token count to avoid division by zero for empty sequences.
    sum_mask = np.clip(np.sum(input_mask_expanded, axis=1), a_min=1e-9, a_max=None)
    return sum_embeddings / sum_mask


def onnx_embed(session: onnxruntime.InferenceSession, input_text, task_id: int) -> np.ndarray:
    # The tokenized batch is passed in explicitly instead of being read from a global.
    inputs = {
        'input_ids': input_text['input_ids'],
        'attention_mask': input_text['attention_mask'],
        'task_id': np.array(task_id, dtype=np.int64),
    }
    outputs = session.run(None, inputs)[0]
    embeddings = mean_pooling(outputs, input_text['attention_mask'])
    # L2-normalize each pooled embedding.
    embeddings = embeddings / np.linalg.norm(embeddings, ord=2, axis=1, keepdims=True)
    return embeddings


docs = [
    "Hello World",
    "Follow the white rabbit."
]
model_name = 'jinaai/jina-embeddings-v3'
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = tokenizer(docs, return_tensors='np', padding=True, truncation=True)

model_path = 'models/models--jinaai--jina-embeddings-v3/snapshots/62a81741b58448ed8f691764cec7aa5d3c045e4c/onnx/model.onnx'
session = onnxruntime.InferenceSession(model_path)

# Print the first 5 dimensions of each embedding for every task id (0 to 4).
for task_id in range(5):
    embeddings = onnx_embed(session, input_text, task_id)
    print([[round(value, 4) for value in e.tolist()[:5]] for e in embeddings])
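The pooling and normalization steps in the snippet above can be sanity-checked without downloading the model. The following is a minimal sketch on synthetic data; `masked_mean_pool` and `l2_normalize` are hypothetical helper names introduced here for illustration, and the random array only stands in for real model output.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over positions where attention_mask == 1."""
    mask = np.expand_dims(attention_mask, axis=-1).astype(token_embeddings.dtype)
    summed = np.sum(token_embeddings * mask, axis=1)
    # Clip the token count so an all-padding row cannot divide by zero.
    counts = np.clip(np.sum(mask, axis=1), a_min=1e-9, a_max=None)
    return summed / counts


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    return vectors / np.linalg.norm(vectors, ord=2, axis=1, keepdims=True)


# Synthetic stand-in for model output: 2 sequences, 3 tokens, 4 dimensions.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 3, 4)).astype(np.float32)
# The first sequence has one padded position at the end.
attention_mask = np.array([[1, 1, 0], [1, 1, 1]], dtype=np.int64)

pooled = masked_mean_pool(token_embeddings, attention_mask)
normalized = l2_normalize(pooled)

# The padded token of the first sequence must not influence its pooled vector.
expected_first = token_embeddings[0, :2].mean(axis=0)
assert np.allclose(pooled[0], expected_first, atol=1e-6)
# Every normalized embedding should have unit length.
assert np.allclose(np.linalg.norm(normalized, axis=1), 1.0, atol=1e-6)
```

If both assertions hold, padding does not leak into the pooled vectors and the printed canonical values come from unit-length embeddings.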

All Submissions:

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass the existing tests?
Have you added tests for your feature?
Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

Have you added an explanation of why it's important to include this model?
Have you added tests for the new model? Were canonical values for tests computed via the original model?
Have you added the code snippet for how canonical values were computed?
Have you successfully run tests with your changes locally?

joein

tests
license
Users can't change the task id in this setting: since we are not propagating kwargs to onnx_embed, the task ids are always hard-coded.
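One way to address the kwargs point could look like the sketch below. It is only an illustration: `SketchMultitaskEmbedder`, `default_task_id`, and the dictionary returned by `onnx_embed` are hypothetical and are not the actual code in fastembed/text/multitask_embedding.py. The idea is that `embed` forwards its keyword arguments, so a caller-supplied `task_id` overrides the default instead of being dropped.

```python
from typing import Any, Dict, Iterable, List, Optional

import numpy as np


class SketchMultitaskEmbedder:
    """Hypothetical stand-in used only to illustrate kwargs propagation."""

    def __init__(self, default_task_id: int = 0) -> None:
        self.default_task_id = default_task_id

    def onnx_embed(
        self, documents: List[str], task_id: Optional[int] = None, **kwargs: Any
    ) -> Dict[str, Any]:
        # Fall back to the default only when the caller did not pass a task id.
        effective = self.default_task_id if task_id is None else task_id
        # A real implementation would run the ONNX session here; this sketch
        # just returns the inputs that would be fed to it.
        return {
            "task_id": np.array(effective, dtype=np.int64),
            "n_docs": len(documents),
        }

    def embed(self, documents: Iterable[str], **kwargs: Any) -> Dict[str, Any]:
        # Forward kwargs so that task_id actually reaches onnx_embed.
        return self.onnx_embed(list(documents), **kwargs)


embedder = SketchMultitaskEmbedder(default_task_id=0)
default_run = embedder.embed(["Hello World"])
override_run = embedder.embed(["Hello World"], task_id=3)
```

Here `default_run["task_id"]` holds the default, while `override_run["task_id"]` holds the value passed by the caller.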

fastembed/text/multitask_embedding.py

tests/test_text_multitask_embeddings.py

joein

This lacks tests for task id propagation in parallel processing.
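A test along these lines might take the shape sketched below. Everything in it is an assumption made for illustration: `embed_batch` and `parallel_embed` are hypothetical, and a thread pool stands in for whatever worker mechanism the library actually uses. The sketch only shows the property being checked, namely that every worker sees the task id the caller passed.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import List, Tuple


def embed_batch(batch: List[str], task_id: int) -> List[Tuple[str, int]]:
    """Hypothetical worker: tags each document with the task id it received."""
    return [(doc, task_id) for doc in batch]


def parallel_embed(
    batches: List[List[str]], task_id: int, workers: int = 2
) -> List[Tuple[str, int]]:
    # Bind task_id to the worker function so every batch gets the same value.
    worker = partial(embed_batch, task_id=task_id)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input batches.
        results = list(pool.map(worker, batches))
    return [item for batch in results for item in batch]


def test_task_id_reaches_every_worker() -> None:
    batches = [["a", "b"], ["c"], ["d", "e"]]
    for task_id in range(5):
        out = parallel_embed(batches, task_id)
        # Document order is preserved across batches.
        assert [doc for doc, _ in out] == ["a", "b", "c", "d", "e"]
        # No worker fell back to a hard-coded task id.
        assert all(seen == task_id for _, seen in out)


test_task_id_reaches_every_worker()
```

In the real test suite, the same assertion would more likely compare embeddings produced with parallel workers against the canonical vectors for each task id.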

fastembed/text/multitask_embedding.py

tests/test_text_multitask_embeddings.py

fastembed/text/multitask_embedding.py

joein

Could you please upload the scripts for computing canonical vectors to the dedicated repo?

fastembed/text/multitask_embedding.py

Co-authored-by: George <[email protected]>

hh-space-invader requested review from I8dNLo and joein December 19, 2024 10:58

joein requested changes Dec 20, 2024
