added post_install in setup.py file #3883

Status: Open · wants to merge 15 commits into `main` (changes shown from all commits)
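The `setup.py` post-install hook named in the PR title does not appear in this file view. For orientation, a setuptools post-install step is commonly wired roughly like the sketch below; this is illustrative only (the `PostInstallCommand` name is invented here), not the PR's actual code:

```python
from setuptools import setup
from setuptools.command.install import install


class PostInstallCommand(install):
    """Run the standard install, then download NLTK data afterwards."""

    def run(self) -> None:
        install.run(self)
        # Deferred import so setup.py itself does not require nltk at parse time.
        from unstructured.nlp.tokenize import download_nltk_packages

        download_nltk_packages()


setup(cmdclass={"install": PostInstallCommand})
```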
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -6,12 +6,13 @@
- **Vectorize layout (inferred, extracted, and OCR) data structure** Using `np.ndarray` to store a group of layout elements or text regions instead of using a list of objects. This improves the memory efficiency and compute speed around layout merging and deduplication.
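As a rough illustration of the idea (assuming layout elements carry bounding-box coordinates; the array layout here is invented for the example, not the library's actual schema):

```python
import numpy as np

# One bounding box per row as (x1, y1, x2, y2). Duplicate regions can then
# be removed in a single vectorized pass instead of a Python-level loop
# over a list of element objects.
boxes = np.array(
    [
        [0.0, 0.0, 10.0, 10.0],
        [0.0, 0.0, 10.0, 10.0],  # exact duplicate of the first box
        [5.0, 5.0, 20.0, 20.0],
    ]
)
deduped = np.unique(boxes, axis=0)
print(deduped.shape)  # (2, 4)
```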

### Fixes
- **Add auto-download of NLTK data for Python environments** When a user installs the Python library without the Docker image, `tokenize.py` now downloads the required NLTK data automatically (see the usage sketch after the `tokenize.py` diff below).
- **Correctly patch pdfminer to avoid PDF repair**. The patch applied to pdfminer's parser caused it to occasionally split tokens in content streams, throwing `PDFSyntaxError`. Repairing these PDFs sometimes failed (since they were not actually invalid) resulting in unnecessary OCR fallback.


> **Contributor:** extra newline
* **Drop usage of ndjson dependency**
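Newline-delimited JSON needs no dedicated package, which is presumably what made the `ndjson` dependency droppable. A minimal stdlib sketch (the `read_ndjson` helper is illustrative, not part of the library):

```python
import json
from typing import Any


def read_ndjson(path: str) -> list[dict[str, Any]]:
    """Parse a newline-delimited JSON file using only the standard library."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```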

## 0.16.15

### Enhancements

### Features
2 changes: 1 addition & 1 deletion requirements/base.txt
@@ -64,7 +64,7 @@ langdetect==1.0.9
# via -r ./base.in
lxml==5.3.0
# via -r ./base.in
marshmallow==3.25.1
marshmallow==3.26.0
# via
# dataclasses-json
# unstructured-client
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.txt
@@ -32,7 +32,7 @@ exceptiongroup==1.2.2
# via
# -c ./base.txt
# anyio
fonttools==4.55.4
fonttools==4.55.5
# via matplotlib
h11==0.14.0
# via
4 changes: 2 additions & 2 deletions requirements/extra-pdf-image.txt
@@ -42,15 +42,15 @@ filelock==3.17.0
# transformers
flatbuffers==25.1.21
# via onnxruntime
fonttools==4.55.4
fonttools==4.55.5
# via matplotlib
fsspec==2024.12.0
# via
# huggingface-hub
# torch
google-api-core[grpc]==2.24.0
# via google-cloud-vision
google-auth==2.37.0
google-auth==2.38.0
# via
# google-api-core
# google-cloud-vision
2 changes: 1 addition & 1 deletion requirements/extra-pptx.txt
@@ -12,5 +12,5 @@ python-pptx==1.0.2
# via -r ./extra-pptx.in
typing-extensions==4.12.2
# via python-pptx
xlsxwriter==3.2.0
xlsxwriter==3.2.1
# via python-pptx
2 changes: 1 addition & 1 deletion requirements/test.txt
@@ -54,7 +54,7 @@ exceptiongroup==1.2.2
# -c ./base.txt
# anyio
# pytest
faker==34.0.0
faker==35.0.0
# via jsf
flake8==7.1.1
# via
29 changes: 24 additions & 5 deletions unstructured/nlp/tokenize.py
@@ -12,11 +12,6 @@
CACHE_MAX_SIZE: Final[int] = 128


def download_nltk_packages():
nltk.download("averaged_perceptron_tagger_eng", quiet=True)
nltk.download("punkt_tab", quiet=True)


def check_for_nltk_package(package_name: str, package_category: str) -> bool:
"""Checks to see if the specified NLTK package exists on the image."""
paths: list[str] = []
@@ -32,6 +27,30 @@ def check_for_nltk_package(package_name: str, package_category: str) -> bool:
return False


# We cache this because we do not want to attempt
# downloading the packages multiple times
@lru_cache()
def download_nltk_packages():
"""If required NLTK packages are not available, download them."""

tagger_available = check_for_nltk_package(
package_category="taggers",
package_name="averaged_perceptron_tagger_eng",
)
tokenizer_available = check_for_nltk_package(
package_category="tokenizers", package_name="punkt_tab"
)

if (not tokenizer_available) or (not tagger_available):
nltk.download("averaged_perceptron_tagger_eng", quiet=True)
nltk.download("punkt_tab", quiet=True)


# auto download nltk packages if the environment variable is set
if os.getenv("AUTO_DOWNLOAD_NLTK", "True").lower() == "true":
download_nltk_packages()


@lru_cache(maxsize=CACHE_MAX_SIZE)
def sent_tokenize(text: str) -> List[str]:
"""A wrapper around the NLTK sentence tokenizer with LRU caching enabled."""
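With this change, the download runs at import time unless `AUTO_DOWNLOAD_NLTK` is set to something other than `true` (case-insensitive; the default is `True`). A minimal opt-out sketch, assuming the NLTK data is already available by other means:

```python
import os

# Must be set before unstructured.nlp.tokenize is first imported, because
# the environment variable is read at module import time.
os.environ["AUTO_DOWNLOAD_NLTK"] = "false"

from unstructured.nlp.tokenize import sent_tokenize

# Works only if the punkt_tab and tagger data are already installed.
print(sent_tokenize("One sentence. Another one."))
```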