fix(Unstructured): Pin Client version before breaking changes #1309

Open
wants to merge 2 commits into
base: main

Conversation

lambda-science
Contributor

Related Issues

There are breaking changes in unstructured-client 0.26 that are quite annoying.
See: https://docs.unstructured.io/api-reference/api-services/sdk-python#migration-guide
Mainly we get this error: TypeError: General.partition() takes 1 positional argument but 2 were given
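
For context, this message is what Python raises when a method that accepts its argument only by keyword is called positionally. The sketch below reproduces the same error with a hypothetical stand-in class; it is not the real unstructured-client code, and the keyword name `request` is an assumption used for illustration.

```python
class General:
    """Hypothetical stand-in, NOT the real unstructured-client class."""

    def partition(self, *, request):
        # The bare `*` makes `request` keyword-only (assumed signature).
        return {"received": request}


general = General()
request = {"files": "sample4.docx"}  # placeholder payload for illustration

# Old calling style: the request is passed positionally and is rejected.
try:
    general.partition(request)
except TypeError as exc:
    positional_error = str(exc)
    print(positional_error)

# New calling style: the same request passed by keyword is accepted.
keyword_result = general.partition(request=request)
```

If the newer client follows this pattern, any caller that still passes the request positionally would fail in exactly this way.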

Proposed Changes:

Solved by pinning the client version to the latest compatible version.
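
As a rough illustration of what such a pin means, the sketch below checks whether a version string falls before the 0.26 release. `is_before_breaking_release` is a hypothetical helper, and it assumes plain numeric versions of the form X.Y.Z.

```python
def is_before_breaking_release(version, breaking=(0, 26)):
    """Return True if `version` is older than the assumed breaking release.

    Hypothetical helper: only handles plain numeric "X.Y.Z" strings.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < breaking


for candidate in ("0.25.9", "0.26.0", "0.29.0"):
    print(candidate, is_before_breaking_release(candidate))
```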

How did you test it?

In my own project

Notes for the reviewer

I don't know why the tests don't catch it, or whether it only happens on my local setup with the latest self-hosted unstructured-api (0.0.82, Dec 2024).

Checklist

@lambda-science lambda-science requested a review from a team as a code owner January 21, 2025 09:41
@lambda-science lambda-science requested review from davidsbatista and removed request for a team January 21, 2025 09:41
@anakin87
Member

Hello, @lambda-science!

Before pinning the version, I would like to reproduce the bug (so that we can also understand how to update the integration in the future)...

Could you share more details/a reproducible example?

@anakin87 anakin87 self-requested a review January 21, 2025 10:28
@lambda-science
Contributor Author

lambda-science commented Jan 21, 2025

Hi @anakin87

  1. Run the Unstructured API Docker container on port 8002 (image: quay.io/unstructured-io/unstructured-api:0.0.82)
  2. Have a Python env with:
dependencies = [
    "haystack-ai==2.9.0",
    "unstructured-fileconverter-haystack>=0.4.1",
    "unstructured-client>=0.26",
]

  3. Run this script next to sample4.docx:

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
UNSTRUCTURED_SETTINGS = {
    "skip_infer_table_types": "[]",
    "chunking_strategy": "by_title",
    "combine_under_n_chars": "1000",
    "new_after_n_chars": "1500",
    "max_characters": "2000",
    "pdf_infer_table_structure": "True",
    "languages": ["eng", "fra"],
    "strategy": "fast",
}

converter = UnstructuredFileConverter(
    api_url="http://localhost:8002/general/v0/general",
    document_creation_mode="one-doc-per-element",
    unstructured_kwargs=UNSTRUCTURED_SETTINGS,
)
documents = converter.run(paths=["sample4.docx"])
print(documents)

Results:

Converting files to Haystack Documents: 0it [00:00, ?it/s]WARNING: Unstructured could not process file sample4.docx. Error: 1 validation error for PartitionParameters
skip_infer_table_types
  Input should be a valid list [type=list_type, input_value='[]', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/list_type
Converting files to Haystack Documents: 1it [00:00,  1.67it/s]
{'documents': []}
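
The validation error above indicates that the newer client expects a real list for skip_infer_table_types rather than the string "[]". A minimal sketch of converting such string values before handing the settings over follows. `coerce_settings` is a hypothetical helper, not part of Haystack or unstructured-client, and by default it only touches the one key the error message names, since the thread does not show what the other parameters require.

```python
import ast


def coerce_settings(settings, keys=("skip_infer_table_types",)):
    """Parse string values such as "[]" into Python literals for selected keys.

    Hypothetical helper: values that are not valid Python literals are
    left unchanged.
    """
    coerced = dict(settings)
    for key in keys:
        value = coerced.get(key)
        if isinstance(value, str):
            try:
                coerced[key] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # not a literal, keep the original string
    return coerced


settings = {
    "skip_infer_table_types": "[]",
    "chunking_strategy": "by_title",
    "max_characters": "2000",
}
fixed = coerce_settings(settings)
print(fixed)
```

This only addresses the validation error; it does not change how the client method is called.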

If you remove the skip_infer_table_types param, you get:

Converting files to Haystack Documents: 0it [00:00, ?it/s]WARNING: Unstructured could not process file sample4.docx. Error: General.partition() takes 1 positional argument but 2 were given
Converting files to Haystack Documents: 1it [00:00,  1.78it/s]
{'documents': []}

instead of the list of documents you get if you pin unstructured-client<0.26.

@lambda-science
Contributor Author

Here is the sample docx
sample4.docx

@anakin87
Member

Thanks for the detailed report! I'll take a look...

@anakin87
Member

anakin87 commented Jan 23, 2025

@lambda-science
I tried both main and the latest version of the integration, with Python 3.9 and 3.10.

unstructured-client==0.29.0 is automatically installed, and I don't get the error (when removing skip_infer_table_types). Instead, I get:

{'documents': [Document(id=177226da7488ee08d3eb12b7453e3b0acf010a1d2008ae0dde0a7a704d4a3a63,
content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sed ex eget diam dictum varius ...',
meta: {'file_path': 'sample4.docx', 'element_index': 0, 'filename': 'sample4.docx', 'languages': ['cat', 'ita', 'fra'], 'page_number': 1, 'orig_elements': ...', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'CompositeElement'}),
...]}

Could you please check?

@lambda-science
Contributor Author

I'm on 3.11. I'll check whether it's 3.11-specific when I get a moment :)
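
Since the two setups differ in at least the Python version, a small report like the one below could make the environments easier to compare. It is only a sketch: `environment_report` is a hypothetical helper, and the package names are the ones listed in the reproduction steps.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect the Python version and the installed versions of `packages`."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


report = environment_report(
    ["haystack-ai", "unstructured-fileconverter-haystack", "unstructured-client"]
)
for name, version in report.items():
    print(f"{name}: {version}")
```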
