Commit
build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561)
### Summary
- Bump `unstructured.paddleocr` to 2.8.1.0
- Remove `opencv-python` and `opencv-contrib-python` constraint pins
- Fix `0.15.7` changelog
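
The effect of a bump like this on the compiled requirements files can be summarized by comparing the pins before and after. The following is a minimal sketch, assuming pip-compile style `name==version` lines; `parse_pins` and `changed_pins` are hypothetical helper names for illustration and are not part of this repository. Names are normalized in the PEP 503 manner, so `unstructured.paddleocr` (as written in the `.in` file) and `unstructured-paddleocr` (as written in the compiled `.txt` file) compare as the same package.

```python
import re
from typing import Dict, Optional, Tuple

# Matches pins like "idna==3.8" or "boxsdk[jwt]==3.13.0"; extras are ignored.
_PIN = re.compile(r"^([A-Za-z0-9._-]+)(?:\[[^\]]*\])?==(\S+)$")


def parse_pins(text: str) -> Dict[str, str]:
    """Return {normalized_name: version} for every `name==version` line."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        # Drop comments such as "# via requests" before matching.
        line = raw.split("#", 1)[0].strip()
        match = _PIN.match(line)
        if match:
            # PEP 503 style: runs of ".", "-", "_" become "-", then lowercase.
            name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
            pins[name] = match.group(2)
    return pins


def changed_pins(old: str, new: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each package whose pin differs to (old_version, new_version)."""
    before, after = parse_pins(old), parse_pins(new)
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }


# Illustrative excerpts only, using versions that appear in this commit.
OLD = """\
idna==3.7
    # via requests
opencv-python==4.8.0.76
premailer==3.10.0
unstructured.paddleocr==2.8.0.1
"""
NEW = """\
idna==3.8
    # via requests
opencv-python==4.10.0.84
unstructured-paddleocr==2.8.1.0
"""

for name, (was, now) in changed_pins(OLD, NEW).items():
    print(f"{name}: {was} -> {now}")
```

A package that disappears from the compiled file, such as `premailer` here, is reported with `None` as its new version.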
christinestraub authored Aug 23, 2024
1 parent 32bb77a commit ac10ba4
Showing 45 changed files with 97 additions and 115 deletions.
15 changes: 13 additions & 2 deletions CHANGELOG.md
@@ -1,15 +1,26 @@
-## 0.15.8-dev3
+## 0.15.8-dev4

### Enhancements

+* **Bump unstructured.paddleocr to 2.8.1.0.**
+
### Features

### Fixes

-* **Fix NLTK data download path to prevent nested directories**. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
* **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
* **Fall back to filename extension-based file-type detection for unidentified OLE files.** Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.

+## 0.15.7
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+* **Fix NLTK data download path to prevent nested directories**. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
+
## 0.15.6

### Enhancements
2 changes: 1 addition & 1 deletion requirements/base.txt
@@ -43,7 +43,7 @@ httpcore==1.0.5
# via httpx
httpx==0.27.0
# via unstructured-client
-idna==3.7
+idna==3.8
# via
# anyio
# httpx
3 changes: 0 additions & 3 deletions requirements/deps/constraints.txt
@@ -27,9 +27,6 @@ tokenizers>=0.19,<0.20
pycocotools>=2.0.7
# NOTE(crag): python3.8-python3.11 compat (if it ends up being required)
torch>2
-# pinned in unstructured paddleocr
-opencv-python==4.8.0.76
-opencv-contrib-python==4.8.0.76
platformdirs==3.10.0

# TODO: Constaint due to boto, with python before 3.10 not requiring openssl 1.1.1, remove when that gets
10 changes: 5 additions & 5 deletions requirements/dev.txt
@@ -68,7 +68,7 @@ filelock==3.15.4
# via virtualenv
identify==2.6.0
# via pre-commit
-idna==3.7
+idna==3.8
# via
# -c ./base.txt
# -c ./test.txt
@@ -94,7 +94,7 @@ ipython-genutils==0.2.0
# via
# nbclassic
# notebook
-ipywidgets==8.1.3
+ipywidgets==8.1.5
# via jupyter
jedi==0.19.1
# via ipython
@@ -140,7 +140,7 @@ jupyter-server-terminals==0.5.3
# via jupyter-server
jupyterlab-pygments==0.3.0
# via nbconvert
-jupyterlab-widgets==3.0.11
+jupyterlab-widgets==3.0.13
# via ipywidgets
markupsafe==2.1.5
# via
@@ -256,7 +256,7 @@ pyyaml==6.0.2
# -c ./test.txt
# jupyter-events
# pre-commit
-pyzmq==26.1.1
+pyzmq==26.2.0
# via
# ipykernel
# jupyter-client
@@ -352,7 +352,7 @@ wheel==0.44.0
# via
# -c ././deps/constraints.txt
# pip-tools
-widgetsnbextension==4.0.11
+widgetsnbextension==4.0.13
# via ipywidgets
zipp==3.20.0
# via importlib-metadata
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.in
@@ -2,4 +2,4 @@
-c base.txt

paddlepaddle==3.0.0b1
-unstructured.paddleocr==2.8.0.1
+unstructured.paddleocr==2.8.1.0
40 changes: 6 additions & 34 deletions requirements/extra-paddleocr.txt
@@ -10,10 +10,6 @@ anyio==4.4.0
# httpx
astor==0.8.1
# via paddlepaddle
-attrdict==2.0.1
-# via unstructured-paddleocr
-cachetools==5.5.0
-# via premailer
certifi==2024.7.4
# via
# -c ././deps/constraints.txt
@@ -27,18 +23,12 @@ charset-normalizer==3.3.2
# requests
contourpy==1.2.1
# via matplotlib
-cssselect==1.2.0
-# via premailer
-cssutils==2.11.1
-# via premailer
cycler==0.12.1
# via matplotlib
cython==3.0.11
# via unstructured-paddleocr
decorator==5.1.1
# via paddlepaddle
-et-xmlfile==1.1.0
-# via openpyxl
exceptiongroup==1.2.2
# via
# -c ./base.txt
@@ -57,7 +47,7 @@ httpx==0.27.0
# via
# -c ./base.txt
# paddlepaddle
-idna==3.7
+idna==3.8
# via
# -c ./base.txt
# anyio
@@ -69,23 +59,14 @@ imageio==2.35.1
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
-importlib-resources==6.4.3
+importlib-resources==6.4.4
# via matplotlib
kiwisolver==1.4.5
# via matplotlib
-lanms-neo==1.0.2
-# via unstructured-paddleocr
lazy-loader==0.4
# via scikit-image
-lxml==5.3.0
-# via
-# -c ./base.txt
-# premailer
-# unstructured-paddleocr
matplotlib==3.9.2
# via imgaug
-more-itertools==10.4.0
-# via cssutils
networkx==3.2.1
# via
# paddlepaddle
@@ -106,17 +87,12 @@ numpy==1.26.4
# shapely
# tifffile
# unstructured-paddleocr
-opencv-contrib-python==4.8.0.76
-# via
-# -c ././deps/constraints.txt
-# unstructured-paddleocr
-opencv-python==4.8.0.76
+opencv-contrib-python==4.10.0.84
+# via unstructured-paddleocr
+opencv-python==4.10.0.84
# via
-# -c ././deps/constraints.txt
# imgaug
# unstructured-paddleocr
-openpyxl==3.1.5
-# via unstructured-paddleocr
opt-einsum==3.3.0
# via paddlepaddle
packaging==24.1
@@ -138,8 +114,6 @@ pillow==10.4.0
# pdf2image
# scikit-image
# unstructured-paddleocr
-premailer==3.10.0
-# via unstructured-paddleocr
protobuf==4.23.4
# via
# -c ././deps/constraints.txt
@@ -161,7 +135,6 @@ rapidfuzz==3.9.6
requests==2.32.3
# via
# -c ./base.txt
-# premailer
# unstructured-paddleocr
scikit-image==0.24.0
# via
@@ -178,7 +151,6 @@ shapely==2.0.6
six==1.16.0
# via
# -c ./base.txt
-# attrdict
# imgaug
# python-dateutil
sniffio==1.3.1
@@ -197,7 +169,7 @@ typing-extensions==4.12.2
# -c ./base.txt
# anyio
# paddlepaddle
-unstructured-paddleocr==2.8.0.1
+unstructured-paddleocr==2.8.1.0
# via -r ./extra-paddleocr.in
urllib3==1.26.19
# via
11 changes: 5 additions & 6 deletions requirements/extra-pdf-image.txt
@@ -57,7 +57,7 @@ googleapis-common-protos==1.63.2
# via
# google-api-core
# grpcio-status
-grpcio==1.65.5
+grpcio==1.66.0
# via
# -c ././deps/constraints.txt
# google-api-core
@@ -72,11 +72,11 @@ huggingface-hub==0.24.6
# unstructured-inference
humanfriendly==10.0
# via coloredlogs
-idna==3.7
+idna==3.8
# via
# -c ./base.txt
# requests
-importlib-resources==6.4.3
+importlib-resources==6.4.4
# via matplotlib
iopath==0.1.10
# via layoutparser
@@ -122,9 +122,8 @@ onnx==1.16.2
# unstructured-inference
onnxruntime==1.19.0
# via unstructured-inference
-opencv-python==4.8.0.76
+opencv-python==4.10.0.84
# via
-# -c ././deps/constraints.txt
# layoutparser
# unstructured-inference
packaging==24.1
@@ -269,7 +268,7 @@ tqdm==4.66.5
# huggingface-hub
# iopath
# transformers
-transformers==4.44.1
+transformers==4.44.2
# via unstructured-inference
typing-extensions==4.12.2
# via
4 changes: 2 additions & 2 deletions requirements/huggingface.txt
@@ -30,7 +30,7 @@ huggingface-hub==0.24.6
# via
# tokenizers
# transformers
-idna==3.7
+idna==3.8
# via
# -c ./base.txt
# requests
@@ -99,7 +99,7 @@ tqdm==4.66.5
# huggingface-hub
# sacremoses
# transformers
-transformers==4.44.1
+transformers==4.44.2
# via -r ./huggingface.in
typing-extensions==4.12.2
# via
2 changes: 1 addition & 1 deletion requirements/ingest/airtable.txt
@@ -15,7 +15,7 @@ charset-normalizer==3.3.2
# via
# -c ./ingest/../base.txt
# requests
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# requests
2 changes: 1 addition & 1 deletion requirements/ingest/astradb.txt
@@ -57,7 +57,7 @@ httpx[http2]==0.27.0
# astrapy
hyperframe==6.0.1
# via h2
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# anyio
2 changes: 1 addition & 1 deletion requirements/ingest/azure-cognitive-search.txt
@@ -19,7 +19,7 @@ charset-normalizer==3.3.2
# via
# -c ./ingest/../base.txt
# requests
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# requests
2 changes: 1 addition & 1 deletion requirements/ingest/azure.txt
@@ -54,7 +54,7 @@ fsspec==2024.6.1
# via
# -r ./ingest/azure.in
# adlfs
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# requests
4 changes: 2 additions & 2 deletions requirements/ingest/box.txt
@@ -8,7 +8,7 @@ attrs==24.2.0
# via boxsdk
boxfs==0.3.0
# via -r ./ingest/box.in
-boxsdk[jwt]==3.12.0
+boxsdk[jwt]==3.13.0
# via boxfs
certifi==2024.7.4
# via
@@ -27,7 +27,7 @@ fsspec==2024.6.1
# via
# -r ./ingest/box.in
# boxfs
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# requests
8 changes: 4 additions & 4 deletions requirements/ingest/chroma.txt
@@ -62,7 +62,7 @@ google-auth==2.34.0
# via kubernetes
googleapis-common-protos==1.63.2
# via opentelemetry-exporter-otlp-proto-grpc
-grpcio==1.65.5
+grpcio==1.66.0
# via
# -c ./ingest/../deps/constraints.txt
# chromadb
@@ -78,15 +78,15 @@ huggingface-hub==0.24.6
# via tokenizers
humanfriendly==10.0
# via coloredlogs
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# anyio
# httpx
# requests
importlib-metadata==8.4.0
# via -r ./ingest/chroma.in
-importlib-resources==6.4.3
+importlib-resources==6.4.4
# via chromadb
kubernetes==30.1.0
# via chromadb
@@ -129,7 +129,7 @@ packaging==24.1
# build
# huggingface-hub
# onnxruntime
-posthog==3.5.0
+posthog==3.5.2
# via chromadb
protobuf==4.23.4
# via
4 changes: 2 additions & 2 deletions requirements/ingest/clarifai.txt
@@ -21,11 +21,11 @@ contextlib2==21.6.0
# via schema
googleapis-common-protos==1.63.2
# via clarifai-grpc
-grpcio==1.65.5
+grpcio==1.66.0
# via
# -c ./ingest/../deps/constraints.txt
# clarifai-grpc
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# requests
2 changes: 1 addition & 1 deletion requirements/ingest/confluence.txt
@@ -21,7 +21,7 @@ charset-normalizer==3.3.2
# requests
deprecated==1.2.14
# via atlassian-python-api
-idna==3.7
+idna==3.8
# via
# -c ./ingest/../base.txt
# requests