Commit

fix nltk download
vangheem committed Nov 25, 2024
1 parent 626f73a commit 3809d97
Showing 2 changed files with 12 additions and 0 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
## 0.16.6

### Fixes
- **Fix NLTK Download** to not download from unstructured S3 Bucket

## 0.16.6

### Enhancements
- **Every <table> tag is considered to be ontology.Table** Added special handling for tables in HTML partitioning. This change is made to improve the accuracy of table extraction from HTML documents.
- **Every HTML has default ontology class assigned** When parsing HTML to ontology each defined HTML in the Ontology has assigned default ontology class. This way it is possible to assign ontology class instead of UncategorizedText when the HTML tag is predicted correctly without class assigned class
7 changes: 7 additions & 0 deletions unstructured/nlp/tokenize.py
@@ -19,6 +19,7 @@
NLTK_DATA_FILENAME = "nltk_data_3.8.2.tar.gz"
NLTK_DATA_URL = f"https://utic-public-cf.s3.amazonaws.com/{NLTK_DATA_FILENAME}"
NLTK_DATA_SHA256 = "ba2ca627c8fb1f1458c15d5a476377a5b664c19deeb99fd088ebf83e140c1663"
DOWNLOAD_S3_NLTK_DATA = os.getenv("DOWNLOAD_S3_NLTK_DATA", "false").lower() == "true"


# NOTE(robinson) - mimic default dir logic from NLTK
@@ -65,6 +66,12 @@ def get_nltk_data_dir() -> str | None:


def download_nltk_packages():

    if not DOWNLOAD_S3_NLTK_DATA:
        nltk.download("averaged_perceptron_tagger_eng")
        nltk.download("punkt_tab")
        return

    nltk_data_dir = get_nltk_data_dir()

    if nltk_data_dir is None:
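Note on the gate above: `DOWNLOAD_S3_NLTK_DATA` is read once, when `tokenize.py` is imported, and only the string "true" (in any letter case) enables the S3 path. Values such as "1" or "yes" leave it disabled. The following is a minimal sketch of that parsing rule, not the library's code. `env_flag` and `choose_nltk_source` are hypothetical helpers introduced here for illustration, and the returned labels are placeholders rather than real API values.

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    """Return True only when the variable equals "true", ignoring case.

    Mirrors the expression used in the diff:
    os.getenv(name, "false").lower() == "true"
    """
    return os.getenv(name, default).lower() == "true"


def choose_nltk_source(use_s3: bool) -> str:
    """Hypothetical helper: label the branch download_nltk_packages() would take."""
    # With the flag off (the default), the function calls nltk.download(...)
    # and returns early; with it on, it falls through to the S3 tarball path.
    return "s3-tarball" if use_s3 else "nltk.download"


if __name__ == "__main__":
    flag = env_flag("DOWNLOAD_S3_NLTK_DATA")
    print(f"DOWNLOAD_S3_NLTK_DATA={flag} -> {choose_nltk_source(flag)}")
```

Because the real constant is evaluated at import time, the variable presumably has to be set before `unstructured.nlp.tokenize` is first imported, for example by exporting it in the shell that launches the process.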
