Skip to content

Commit

Permalink
build: downgrade nltk version (#3527)
Browse files Browse the repository at this point in the history
This PR aims to roll back `nltk` to `3.8.1` which bumped to `3.8.2` in
#3512 because
`3.8.2` is no longer available in PyPI due to some
issues(nltk/nltk#3301)
  • Loading branch information
christinestraub authored Aug 15, 2024
1 parent 9b778e2 commit d0211cc
Show file tree
Hide file tree
Showing 5 changed files with 18 additions and 8 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,14 @@
## 0.15.5-dev0

### Enhancements

### Features

### Fixes

* **Downgrade NLTK dependency version for compatibility**. Due to the unavailability of `nltk==3.8.2` on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.


## 0.15.4

### Enhancements
Expand Down
3 changes: 1 addition & 2 deletions docker/rockylinux-9.2/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,7 @@ RUN python3.10 -m pip install pip==${PIP_VERSION} && \
dnf -y groupremove "Development Tools" && \
dnf clean all

RUN python3.10 -c "import nltk; nltk.download('punkt')" && \
python3.10 -c "import nltk; nltk.download('averaged_perceptron_tagger')"
RUN python3.10 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"

FROM deps as code

Expand Down
2 changes: 1 addition & 1 deletion requirements/base.txt
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ mypy-extensions==1.0.0
# unstructured-client
nest-asyncio==1.6.0
# via unstructured-client
nltk==3.8.2
nltk==3.8.1
# via -r ./base.in
numpy==1.26.4
# via -r ./base.in
Expand Down
2 changes: 1 addition & 1 deletion unstructured/__version__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "0.15.4" # pragma: no cover
__version__ = "0.15.5-dev0" # pragma: no cover
8 changes: 4 additions & 4 deletions unstructured/nlp/tokenize.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,9 @@

CACHE_MAX_SIZE: Final[int] = 128

NLTK_DATA_FILENAME = "nltk_data_3.8.2.tar.gz"
NLTK_DATA_FILENAME = "nltk_data.tgz"
NLTK_DATA_URL = f"https://utic-public-cf.s3.amazonaws.com/{NLTK_DATA_FILENAME}"
NLTK_DATA_SHA256 = "ba2ca627c8fb1f1458c15d5a476377a5b664c19deeb99fd088ebf83e140c1663"
NLTK_DATA_SHA256 = "126faf671cd255a062c436b3d0f2d311dfeefcd92ffa43f7c3ab677309404d61"


# NOTE(robinson) - mimic default dir logic from NLTK
Expand Down Expand Up @@ -114,10 +114,10 @@ def _download_nltk_packages_if_not_present():

tagger_available = check_for_nltk_package(
package_category="taggers",
package_name="averaged_perceptron_tagger_eng",
package_name="averaged_perceptron_tagger",
)
tokenizer_available = check_for_nltk_package(
package_category="tokenizers", package_name="punkt_tab"
package_category="tokenizers", package_name="punkt"
)

if not (tokenizer_available and tagger_available):
Expand Down

0 comments on commit d0211cc

Please sign in to comment.