fix: update HuggingFaceEmbeddingEncoder to use `langchain_huggingface` instead of `langchain-community` (#3436)

Similar to #3433.

### Summary
This PR updates `HuggingFaceEmbeddingEncoder` to use
`HuggingFaceEmbeddings` from the `langchain_huggingface` package instead
of the deprecated version in `langchain-community`. This resolves the
deprecation warning and keeps the encoder compatible with future
versions of LangChain.

### Testing
```
from unstructured.documents.elements import Text
from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder

embedding_encoder = HuggingFaceEmbeddingEncoder(
    config=HuggingFaceEmbeddingConfig()
)
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

# Print each element with its embedding (a plain loop, since only the side effect matters)
for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
```
**Expected behavior**
No deprecation warning should be displayed. The code should use the
updated `HuggingFaceEmbeddings` class from the `langchain_huggingface`
package.
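
The "no deprecation warning" expectation can also be checked programmatically rather than by reading console output. The following is a minimal sketch using only the standard library; `collect_deprecations`, `old_style_factory` and `new_style_factory` are hypothetical names introduced for illustration, and the check assumes the warning emitted by LangChain derives from `DeprecationWarning`.

```python
import warnings


def collect_deprecations(fn, *args, **kwargs):
    """Call fn and return (result, list of recorded DeprecationWarning entries)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones Python would normally show only once.
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
    return result, deprecations


# Hypothetical stand-ins for a client factory before and after the change.
def old_style_factory():
    warnings.warn("HuggingFaceEmbeddings has moved", DeprecationWarning)
    return "old client"


def new_style_factory():
    return "new client"


_, old_warnings = collect_deprecations(old_style_factory)
_, new_warnings = collect_deprecations(new_style_factory)
print(len(old_warnings), len(new_warnings))
```

With the real encoder, the same helper could wrap a call from the snippet above, for example `collect_deprecations(embedding_encoder.embed_query, query=query)`, and the returned list would be expected to be empty after this change.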
christinestraub authored Jul 24, 2024
1 parent 798dcc0 commit 560cc0e
Showing 5 changed files with 21 additions and 86 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,11 +1,12 @@
-## 0.15.1-dev3
+## 0.15.1-dev4
 
 ### Enhancements
 
 ### Features
 
 ### Fixes
 
+* **Update `HuggingFaceEmbeddingEncoder` to use `HuggingFaceEmbeddings` from `langchain_huggingface` package instead of the deprecated version from `langchain-community`.** This resolves the deprecation warning and ensures compatibility with future versions of langchain.
 * **Update `OpenAIEmbeddingEncoder` to use `OpenAIEmbeddings` from `langchain-openai` package instead of the deprecated version from `langchain-community`.** This resolves the deprecation warning and ensures compatibility with future versions of langchain.
 * **Update import of Pinecone exception** Adds compatibility for pinecone-client>=5.0.0
 * **File-type detection catches non-existent file-path.** `detect_filetype()` no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead `FileNotFoundError` is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
5 changes: 2 additions & 3 deletions requirements/ingest/embed-huggingface.in
@@ -1,5 +1,4 @@
 -c ../deps/constraints.txt
 -c ../base.txt
-huggingface
-langchain-community
-sentence_transformers
+
+langchain-huggingface
91 changes: 13 additions & 78 deletions requirements/ingest/embed-huggingface.txt
@@ -4,20 +4,8 @@
 #
 #    pip-compile ./ingest/embed-huggingface.in
 #
-aiohttp==3.9.5
-    # via
-    #   langchain
-    #   langchain-community
-aiosignal==1.3.1
-    # via aiohttp
 annotated-types==0.7.0
     # via pydantic
-async-timeout==4.0.3
-    # via
-    #   aiohttp
-    #   langchain
-attrs==23.2.0
-    # via aiohttp
 certifi==2024.7.4
     # via
     #   -c ./ingest/../base.txt
@@ -27,36 +15,26 @@ charset-normalizer==3.3.2
     # via
     #   -c ./ingest/../base.txt
     #   requests
-dataclasses-json==0.6.7
-    # via
-    #   -c ./ingest/../base.txt
-    #   langchain-community
 filelock==3.15.4
     # via
     #   huggingface-hub
     #   torch
     #   transformers
-frozenlist==1.4.1
-    # via
-    #   aiohttp
-    #   aiosignal
 fsspec==2024.5.0
     # via
     #   -c ./ingest/../deps/constraints.txt
     #   huggingface-hub
     #   torch
-huggingface==0.0.1
-    # via -r ./ingest/embed-huggingface.in
 huggingface-hub==0.24.1
     # via
+    #   langchain-huggingface
     #   sentence-transformers
     #   tokenizers
     #   transformers
 idna==3.7
     # via
     #   -c ./ingest/../base.txt
     #   requests
-    #   yarl
 jinja2==3.1.4
     # via torch
 joblib==1.4.2
@@ -67,47 +45,21 @@ jsonpatch==1.33
     # via langchain-core
 jsonpointer==3.0.0
     # via jsonpatch
-langchain==0.2.11
-    # via langchain-community
-langchain-community==0.2.10
-    # via
-    #   -c ./ingest/../deps/constraints.txt
-    #   -r ./ingest/embed-huggingface.in
 langchain-core==0.2.23
-    # via
-    #   langchain
-    #   langchain-community
-    #   langchain-text-splitters
-langchain-text-splitters==0.2.2
-    # via langchain
+    # via langchain-huggingface
+langchain-huggingface==0.0.3
+    # via -r ./ingest/embed-huggingface.in
 langsmith==0.1.93
-    # via
-    #   langchain
-    #   langchain-community
-    #   langchain-core
+    # via langchain-core
 markupsafe==2.1.5
     # via jinja2
-marshmallow==3.21.3
-    # via
-    #   -c ./ingest/../base.txt
-    #   dataclasses-json
 mpmath==1.3.0
     # via sympy
-multidict==6.0.5
-    # via
-    #   aiohttp
-    #   yarl
-mypy-extensions==1.0.0
-    # via
-    #   -c ./ingest/../base.txt
-    #   typing-inspect
 networkx==3.2.1
     # via torch
 numpy==1.26.4
     # via
     #   -c ./ingest/../base.txt
-    #   langchain
-    #   langchain-community
     #   scikit-learn
     #   scipy
     #   sentence-transformers
@@ -120,22 +72,18 @@ packaging==23.2
     #   -c ./ingest/../deps/constraints.txt
     #   huggingface-hub
     #   langchain-core
-    #   marshmallow
     #   transformers
 pillow==10.4.0
     # via sentence-transformers
 pydantic==2.8.2
     # via
-    #   langchain
     #   langchain-core
     #   langsmith
 pydantic-core==2.20.1
     # via pydantic
 pyyaml==6.0.1
     # via
     #   huggingface-hub
-    #   langchain
-    #   langchain-community
     #   langchain-core
     #   transformers
 regex==2024.5.15
@@ -146,8 +94,6 @@ requests==2.32.3
     # via
     #   -c ./ingest/../base.txt
     #   huggingface-hub
-    #   langchain
-    #   langchain-community
     #   langsmith
     #   transformers
 safetensors==0.4.3
@@ -160,22 +106,17 @@ scipy==1.11.3
     #   scikit-learn
     #   sentence-transformers
 sentence-transformers==3.0.1
-    # via -r ./ingest/embed-huggingface.in
-sqlalchemy==2.0.31
-    # via
-    #   langchain
-    #   langchain-community
+    # via langchain-huggingface
 sympy==1.13.1
     # via torch
 tenacity==8.5.0
-    # via
-    #   langchain
-    #   langchain-community
-    #   langchain-core
+    # via langchain-core
 threadpoolctl==3.5.0
     # via scikit-learn
 tokenizers==0.19.1
-    # via transformers
+    # via
+    #   langchain-huggingface
+    #   transformers
 torch==2.3.1
     # via
     #   -c ./ingest/../deps/constraints.txt
@@ -187,24 +128,18 @@ tqdm==4.66.4
     #   sentence-transformers
     #   transformers
 transformers==4.43.1
-    # via sentence-transformers
+    # via
+    #   langchain-huggingface
+    #   sentence-transformers
 typing-extensions==4.12.2
     # via
     #   -c ./ingest/../base.txt
     #   huggingface-hub
     #   pydantic
     #   pydantic-core
-    #   sqlalchemy
     #   torch
-    #   typing-inspect
-typing-inspect==0.9.0
-    # via
-    #   -c ./ingest/../base.txt
-    #   dataclasses-json
 urllib3==1.26.19
     # via
     #   -c ./ingest/../base.txt
     #   -c ./ingest/../deps/constraints.txt
     #   requests
-yarl==1.9.4
-    # via aiohttp
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.15.1-dev3"  # pragma: no cover
+__version__ = "0.15.1-dev4"  # pragma: no cover
6 changes: 3 additions & 3 deletions unstructured/embed/huggingface.py
@@ -11,7 +11,7 @@
 from unstructured.utils import requires_dependencies
 
 if TYPE_CHECKING:
-    from langchain_community.embeddings import HuggingFaceEmbeddings
+    from langchain_huggingface.embeddings import HuggingFaceEmbeddings
 
 
 @dataclass
@@ -69,12 +69,12 @@ def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]:
 
     @EmbeddingEncoderConnectionError.wrap
     @requires_dependencies(
-        ["langchain_community", "sentence_transformers"],
+        ["langchain_huggingface"],
         extras="embed-huggingface",
     )
     def create_client(self) -> "HuggingFaceEmbeddings":
         """Creates a langchain Huggingface python client to embed elements."""
-        from langchain_community.embeddings import HuggingFaceEmbeddings
+        from langchain_huggingface.embeddings import HuggingFaceEmbeddings
 
         client = HuggingFaceEmbeddings(**self.config.to_dict())
         return client
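
For readers unfamiliar with the pattern in `create_client`, the sketch below illustrates how a decorator that guards a lazily imported dependency might work. It is not the `requires_dependencies` implementation from `unstructured.utils`; `requires_modules`, `uses_stdlib` and `needs_missing` are hypothetical names, and the error message format is an assumption.

```python
import importlib.util
from functools import wraps


def requires_modules(modules, extras=None):
    """Hypothetical decorator: fail early if any top-level module is not importable."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [m for m in modules if importlib.util.find_spec(m) is None]
            if missing:
                hint = f" (provided by the '{extras}' extra)" if extras else ""
                raise ImportError(f"Missing dependencies: {', '.join(missing)}{hint}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_modules(["json"])
def uses_stdlib():
    # The guarded import happens inside the function, after the check has passed.
    import json

    return json.dumps("ok").strip('"')


@requires_modules(["surely_not_an_installed_module_xyz"], extras="embed-huggingface")
def needs_missing():
    return "unreachable"
```

Under this reading, shrinking the list to `["langchain_huggingface"]` means only that one package has to be importable before the client is created; the lockfile diff above suggests `sentence-transformers` now arrives as a dependency of `langchain-huggingface`.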
