Refactoring of internal structure #105

generall · 2024-02-01T23:33:35Z

Why?

Provide a structure for further expansion: new types of models, new modalities, sparse/ColBERT, etc.
Provide a single entry point for all types of text embedding models that share the same interface
Remove obscure logic: exclude param of the model list, multiple sources for the same model (how it was supposed to work?), remove the gcp fallback wrapper
Decompose utilities, file management, and model inference into dedicated modules, which improves code navigation
Increase code reuse

Anush008

Great 👍
The pre-commit found some left-out formatting changes.

Anush008 · 2024-02-02T03:02:31Z

multiple sources for the same model (how it was supposed to work?), remove the gcp fallback wrapper

HF sources were tried first. If they failed or were not present, it fell back to GCS.
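The fallback order described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was removed: `download_with_fallback`, the two fetcher callables, and the `"gcs"` key are hypothetical (only the `"hf"` key appears in the diff in this thread).

```python
from typing import Callable, Dict, Optional

# Hypothetical fetchers: each takes a source name and returns a local path,
# raising an exception on failure. They stand in for real download code.
Fetcher = Callable[[str], str]


def download_with_fallback(
    sources: Dict[str, str],
    fetch_hf: Fetcher,
    fetch_gcs: Fetcher,
) -> str:
    """Try the "hf" source first; fall back to "gcs" if it is missing or fails."""
    hf_source: Optional[str] = sources.get("hf")
    if hf_source is not None:
        try:
            return fetch_hf(hf_source)
        except Exception as exc:  # broad on purpose: any HF failure triggers the fallback
            print(f"HF download failed ({exc}); falling back to GCS")

    gcs_source = sources.get("gcs")  # assumed key name, not taken from the PR
    if gcs_source is None:
        raise ValueError("no usable source: HF failed or absent, and no GCS source given")
    return fetch_gcs(gcs_source)
```

Note that this shape only works with at most one source per backend, so it does not by itself answer how several HF or several GCS entries under different names would be ordered.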

fastembed/text/e5_onnx_embedding.py

fastembed/text/text_embedding.py

fastembed/text/onnx_models.py

NirantK · 2024-02-02T04:33:58Z

fastembed/embedding.py

+DefaultEmbedding = TextEmbedding
+FlagEmbedding = TextEmbedding


I think this retains enough backward compatibility for most users.

That was our thought as well.
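For readers unfamiliar with the pattern, a minimal sketch of what the two alias lines do. The `TextEmbedding` body below is a stand-in with an assumed constructor, not the real class:

```python
class TextEmbedding:
    """Stand-in for the unified text embedding class (body is illustrative only)."""

    def __init__(self, model_name: str = "example/model") -> None:
        self.model_name = model_name


# Old public names kept as plain aliases, as in the diff above.
DefaultEmbedding = TextEmbedding
FlagEmbedding = TextEmbedding

legacy = FlagEmbedding("example/model")
print(type(legacy).__name__)                 # TextEmbedding
print(isinstance(legacy, DefaultEmbedding))  # True
```

Aliases keep the old names importable, but they do not preserve any behaviour that differed between the old classes, which is presumably why the comment says "most users" rather than all.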

NirantK · 2024-02-02T04:34:59Z

fastembed/text/text_embedding.py

+        JinaOnnxEmbedding,
+    ]
+
+    @classmethod


How do we handle the case of models which we use for both dense and sparse? E.g. BGE-M3

We would need to implement both TextEmbedding and SparseEmbedding (TBA) classes for those models. I am strictly against changing the function output format depending on the model name.
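A rough sketch of that design, assuming a hypothetical `SparseVector` container and placeholder arithmetic in place of real inference. `SparseEmbedding` did not exist at the time of this PR, so everything about it here is an assumption:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SparseVector:
    """Hypothetical sparse output: parallel index/value lists."""
    indices: List[int]
    values: List[float]


class TextEmbedding:
    """Always returns dense vectors, whatever the model name."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def embed(self, documents: List[str]) -> List[List[float]]:
        # Placeholder math instead of real inference: one 2-d vector per document.
        return [[float(len(doc)), float(doc.count(" "))] for doc in documents]


class SparseEmbedding:
    """Always returns sparse vectors, whatever the model name."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def embed(self, documents: List[str]) -> List[SparseVector]:
        out: List[SparseVector] = []
        for doc in documents:
            # Placeholder: word lengths act as "token ids", counts as weights.
            counts: Dict[int, float] = {}
            for word in doc.split():
                counts[len(word)] = counts.get(len(word), 0.0) + 1.0
            idx = sorted(counts)
            out.append(SparseVector(indices=idx, values=[counts[i] for i in idx]))
        return out


# The same model name goes through two classes; the return type depends
# on the class chosen by the caller, never on the model name.
docs = ["hello sparse world"]
dense = TextEmbedding("BAAI/bge-m3").embed(docs)
sparse = SparseEmbedding("BAAI/bge-m3").embed(docs)
```

The point of the split is that a caller can tell the output type from the class alone, so type hints and downstream code stay stable as models are added.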

NirantK · 2024-02-02T04:37:12Z

fastembed/text/onnx_models.py

@@ -0,0 +1,133 @@
+supported_flag_models = [


I'd prefer to have a flat models.json or similar config-like file, instead of 3 different lists.

If we want to separate the models into different classes and files, we should move this completely to those corresponding files — that way the places to update in any new model addition are:

New class in a file

Embedding Registry

Tests

That's it

I am OK with moving the lists to the corresponding implementations. But having a single list doesn't work: it leads to very ugly hacks, like the "exclude" param of the list-models method.
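One way the "lists live with their implementation, registry aggregates them" idea could look. This is a sketch only: the class names, the `EMBEDDINGS_REGISTRY` attribute, the `resolve` helper, and the `example/...` model names are all hypothetical.

```python
from typing import Any, Dict, List, Type


class OnnxTextEmbedding:
    @classmethod
    def list_supported_models(cls) -> List[Dict[str, Any]]:
        # Each implementation owns the list of models it can run.
        return [{"model": "example/onnx-small", "dim": 384}]


class E5OnnxEmbedding:
    @classmethod
    def list_supported_models(cls) -> List[Dict[str, Any]]:
        return [{"model": "example/e5-large", "dim": 1024}]


class TextEmbedding:
    # The registry is the only place that knows about every implementation.
    EMBEDDINGS_REGISTRY: List[Type[Any]] = [OnnxTextEmbedding, E5OnnxEmbedding]

    @classmethod
    def list_supported_models(cls) -> List[Dict[str, Any]]:
        result: List[Dict[str, Any]] = []
        for impl in cls.EMBEDDINGS_REGISTRY:
            result.extend(impl.list_supported_models())
        return result

    @classmethod
    def resolve(cls, model_name: str) -> Type[Any]:
        """Return the implementation class that declares `model_name`."""
        for impl in cls.EMBEDDINGS_REGISTRY:
            if any(m["model"] == model_name for m in impl.list_supported_models()):
                return impl
        raise ValueError(f"model {model_name!r} is not supported")
```

Because each class filters nothing and only reports its own models, no "exclude" argument is needed, and adding a model touches the three places listed above: the class file, the registry, and the tests.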

generall · 2024-02-02T08:55:18Z

multiple sources for the same model (how it was supposed to work?), remove the gcp fallback wrapper

HF sources were tried first. If they failed or were not present, it fell back to GCS.

But what about multiple GCS and multiple HF sources with different names?

generall · 2024-02-02T08:56:26Z

The pre-commit found some left-out formatting changes.

~~how to run those?~~

I did a manual run, but it would be nice to document the hook setup in the dev docs.

fastembed/text/onnx_embedding.py

NirantK · 2024-02-02T09:47:14Z

Added 2 new models to main @generall — let's have those here as well, and then we're good to merge this from my end!

Co-authored-by: George Panchuk <[email protected]>

NirantK · 2024-02-02T14:47:27Z

fastembed/text/e5_onnx_embedding.py

+        }
+    },
+    {
+        "model": "xenova/paraphrase-multilingual-mpnet-base-v2",


Suggested change

-        "model": "xenova/paraphrase-multilingual-mpnet-base-v2",
+        "model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",

This is the model name. I missed this in the PR from Anush. Apologies for the confusion.

NirantK · 2024-02-02T14:48:15Z

fastembed/text/e5_onnx_embedding.py

+    {
+        "model": "xenova/multilingual-e5-large-quantized",
+        "dim": 1024,
+        "description": "Multilingual model. Recommended for non-English languages",
+        "size_in_GB": 2.24,
+        "sources": {
+            "hf": "xenova/multilingual-e5-large",
+        }
+    },


Suggested change

-    {
-        "model": "xenova/multilingual-e5-large-quantized",
-        "dim": 1024,
-        "description": "Multilingual model. Recommended for non-English languages",
-        "size_in_GB": 2.24,
-        "sources": {
-            "hf": "xenova/multilingual-e5-large",
-        }
-    },

We can delete this completely, since the model is covered by Qdrant now.

generall requested review from NirantK and Anush008 February 1, 2024 23:33

Anush008 approved these changes Feb 2, 2024

NirantK reviewed Feb 2, 2024

fastembed/text/onnx_embedding.py (outdated, resolved)

NirantK mentioned this pull request Feb 2, 2024

Add pre-commit instructions to README or Dev Docs #106

Closed

generall and others added 5 commits February 2, 2024 15:12

refactoring

4813b18

Co-authored-by: George Panchuk <[email protected]>

review fixes

3e6f69e

ruff

8b800da

rename flag -> onnx

5dbd007

new multilingual models

7883fa3

generall force-pushed the refactoring-off-everything branch from c67ebb9 to 7883fa3 February 2, 2024 14:32

NirantK reviewed Feb 2, 2024

rename models

fcdc569

NirantK approved these changes Feb 2, 2024

generall merged commit a3bc73c into main Feb 2, 2024
14 checks passed

generall deleted the refactoring-off-everything branch February 2, 2024 15:48

Anush008 mentioned this pull request Feb 2, 2024

Single cache_dir determination #69

Closed
