Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
potter-potter authored Aug 27, 2024
1 parent affd997 commit ddba928
Showing 14 changed files with 13,948 additions and 0 deletions.
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -362,6 +362,7 @@ jobs:
        PINECONE_API_KEY: ${{secrets.PINECONE_API_KEY}}
        ASTRA_DB_APPLICATION_TOKEN: ${{secrets.ASTRA_DB_TOKEN}}
        ASTRA_DB_API_ENDPOINT: ${{secrets.ASTRA_DB_ENDPOINT}}
        MXBAI_API_KEY: ${{secrets.MXBAI_API_KEY}}
        OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
        CI: "true"
      run: |
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,8 @@

### Features

* **Add MixedbreadAI embedder** Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

### Fixes

* **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -31,6 +31,7 @@ include requirements/ingest/dropbox.in
include requirements/ingest/elasticsearch.in
include requirements/ingest/embed-aws-bedrock.in
include requirements/ingest/embed-huggingface.in
include requirements/ingest/embed-mixedbreadai.in
include requirements/ingest/embed-openai.in
include requirements/ingest/gcs.in
include requirements/ingest/github.in
3 changes: 3 additions & 0 deletions requirements/ingest/embed-mixedbreadai.in
@@ -0,0 +1,3 @@
-c ../deps/constraints.txt
-c ../base.txt
mixedbread-ai
57 changes: 57 additions & 0 deletions requirements/ingest/embed-mixedbreadai.txt
@@ -0,0 +1,57 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
# pip-compile ./ingest/embed-mixedbreadai.in
#
annotated-types==0.7.0
    # via pydantic
anyio==4.4.0
    # via
    #   -c ./ingest/../base.txt
    #   httpx
certifi==2024.7.4
    # via
    #   -c ./ingest/../base.txt
    #   -c ./ingest/../deps/constraints.txt
    #   httpcore
    #   httpx
exceptiongroup==1.2.2
    # via
    #   -c ./ingest/../base.txt
    #   anyio
h11==0.14.0
    # via
    #   -c ./ingest/../base.txt
    #   httpcore
httpcore==1.0.5
    # via
    #   -c ./ingest/../base.txt
    #   httpx
httpx==0.27.0
    # via
    #   -c ./ingest/../base.txt
    #   mixedbread-ai
idna==3.8
    # via
    #   -c ./ingest/../base.txt
    #   anyio
    #   httpx
mixedbread-ai==2.2.6
    # via -r ./ingest/embed-mixedbreadai.in
pydantic==2.8.2
    # via mixedbread-ai
pydantic-core==2.20.1
    # via pydantic
sniffio==1.3.1
    # via
    #   -c ./ingest/../base.txt
    #   anyio
    #   httpx
typing-extensions==4.12.2
    # via
    #   -c ./ingest/../base.txt
    #   anyio
    #   mixedbread-ai
    #   pydantic
    #   pydantic-core
1 change: 1 addition & 0 deletions setup.py
@@ -171,6 +171,7 @@ def load_requirements(file_list: Optional[Union[str, List[str]]] = None) -> List
"local-inference": all_doc_reqs,
"paddleocr": load_requirements("requirements/extra-paddleocr.in"),
"embed-huggingface": load_requirements("requirements/ingest/embed-huggingface.in"),
"embed-mixedbreadai": load_requirements("requirements/ingest/embed-mixedbreadai.in"),
"embed-octoai": load_requirements("requirements/ingest/embed-octoai.in"),
"embed-vertexai": load_requirements("requirements/ingest/embed-vertexai.in"),
"embed-voyageai": load_requirements("requirements/ingest/embed-voyageai.in"),
41 changes: 41 additions & 0 deletions test_unstructured/embed/test_mixedbreadai.py
@@ -0,0 +1,41 @@
from unstructured.documents.elements import Text
from unstructured.embed.mixedbreadai import (
    MixedbreadAIEmbeddingConfig,
    MixedbreadAIEmbeddingEncoder,
)


def test_embed_documents_does_not_break_element_to_dict(mocker):
    mock_client = mocker.MagicMock()

    def mock_embeddings(
        model,
        normalized,
        encoding_format,
        truncation_strategy,
        request_options,
        input,
    ):
        mock_response = mocker.MagicMock()
        mock_response.data = [mocker.MagicMock(embedding=[i, i + 1]) for i in range(len(input))]
        return mock_response

    mock_client.embeddings.side_effect = mock_embeddings

    # Mock create_client to return our mock_client
    mocker.patch.object(MixedbreadAIEmbeddingEncoder, "create_client", return_value=mock_client)

    encoder = MixedbreadAIEmbeddingEncoder(
        config=MixedbreadAIEmbeddingConfig(
            api_key="api_key", model_name="mixedbread-ai/mxbai-embed-large-v1"
        )
    )

    elements = encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )
    assert len(elements) == 2
    assert elements[0].to_dict()["text"] == "This is sentence 1"
    assert elements[1].to_dict()["text"] == "This is sentence 2"
    assert elements[0].embeddings is not None
    assert elements[1].embeddings is not None
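
The test above relies on a `side_effect` function that fabricates one embedding per input text, so no network call or API key is needed. A minimal standalone sketch of that mocking pattern, using only the standard library's `unittest.mock`, might look like the following. The names `fake_embeddings` and `client` and the model string are illustrative assumptions and are not part of this commit or of the Mixedbread AI client.

```python
from unittest.mock import MagicMock


def fake_embeddings(*, input, **_ignored):
    """Build a fake response whose .data holds one embedding per input text.

    Hypothetical helper: it mirrors the shape used by the test in this commit
    (response.data[i].embedding), not the real service's behaviour.
    """
    response = MagicMock()
    response.data = [MagicMock(embedding=[i, i + 1]) for i in range(len(input))]
    return response


# Stand-in for an API client; calling client.embeddings(...) runs fake_embeddings.
client = MagicMock()
client.embeddings.side_effect = fake_embeddings

result = client.embeddings(
    model="hypothetical-model",  # placeholder value, ignored by the fake
    input=["first text", "second text"],
)
vectors = [item.embedding for item in result.data]
```

Because the fake accepts arbitrary extra keyword arguments, it keeps working if the caller passes additional options; the test in this commit instead lists every keyword explicitly, which has the advantage of failing if the encoder stops passing one of them.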