rfctr: prep for pluggable partitioners (Unstructured-IO#3806)
**Summary**
Prepare auto-partitioning for pluggable partitioners.

Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.

**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function e.g. `partition_pdf()` or `partition_docx()`.

In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when the argument is omitted, and
simply ignores any arguments it doesn't use. Note that achieving this
requires no changes to partitioner functions because they already do
precisely this.
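
A minimal sketch of the "self-selecting" behavior described above. The function name `partition_example` and its string return values are hypothetical stand-ins (real partitioners return `Element` objects); the point is only that named parameters carry their own defaults while `**kwargs` absorbs everything else.

```python
from typing import IO, Any, List, Optional


def partition_example(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    *,
    include_page_breaks: bool = False,  # default applied when the caller omits it
    **kwargs: Any,  # arguments meant for other partitioners are accepted and ignored
) -> List[str]:
    """Hypothetical partitioner; returns strings instead of `Element` objects."""
    elements = [f"Text from {filename or 'file-like object'}"]
    if include_page_breaks:
        elements.append("PageBreak")
    return elements


# The dispatcher can forward everything it received; unused arguments are harmless.
all_args = {"include_page_breaks": True, "strategy": "fast", "encoding": "utf-8"}
result = partition_example(filename="report.txt", **all_args)
```

Here `strategy` and `encoding` are silently dropped, so the caller never needs to know which arguments a given partitioner cares about.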

So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature `partition_x(filename, file, **kwargs) ->
list[Element]` then they can be dispatched to without customization.
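
The runtime registration this enables might look roughly like the sketch below. Everything here is an assumption for illustration: `PARTITIONERS`, `register_partitioner()`, `detect_file_type()`, and `dispatch()` are hypothetical names, not the library's API, and file-type detection is reduced to an extension check.

```python
from typing import IO, Any, Callable, Dict, List, Optional

Partitioner = Callable[..., List[str]]

# Hypothetical runtime registry: file-type name -> partitioner function.
PARTITIONERS: Dict[str, Partitioner] = {}


def register_partitioner(file_type: str, partitioner: Partitioner) -> None:
    """Add or override the partitioner used for `file_type`."""
    PARTITIONERS[file_type] = partitioner


def detect_file_type(filename: Optional[str]) -> str:
    """Stand-in for real file-type detection: uses the extension only."""
    if filename and "." in filename:
        return filename.rsplit(".", 1)[-1].lower()
    return "unk"


def dispatch(
    filename: Optional[str] = None, file: Optional[IO[bytes]] = None, **kwargs: Any
) -> List[str]:
    """Detect the file type and forward every remaining argument untouched."""
    file_type = detect_file_type(filename)
    partitioner = PARTITIONERS.get(file_type)
    if partitioner is None:
        raise ValueError(f"Partitioning is not supported for the {file_type!r} file type.")
    return partitioner(filename=filename, file=file, **kwargs)


def partition_txt(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    encoding: str = "utf-8",
    **kwargs: Any,
) -> List[str]:
    """Hypothetical text partitioner; uses `encoding`, ignores the rest."""
    return [f"{filename} decoded as {encoding}"]


register_partitioner("txt", partition_txt)
elements = dispatch(filename="notes.txt", encoding="latin-1", strategy="fast")
```

Because `dispatch()` never names a partitioner-specific argument, registering a new file type requires no change to the dispatch code itself.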
scanny authored Dec 10, 2024
1 parent b981d71 commit 3b718ec
Showing 11 changed files with 104 additions and 328 deletions.
12 changes: 10 additions & 2 deletions CHANGELOG.md
@@ -1,8 +1,14 @@
## 0.16.11
## 0.16.12-dev0

### Enhancements

- **Prepare auto-partitioning for pluggable partitioners**. Move toward a uniform partitioner call signature so a custom or override partitioner can be registered without code changes.

### Features

### Fixes

- Fix ipv4 regex to correctly include up to three digit octets.
## 0.16.11

### Enhancements

@@ -14,6 +20,8 @@

### Fixes

- Fix ipv4 regex to correctly include up to three digit octets.

## 0.16.10

### Enhancements
1 change: 1 addition & 0 deletions test_unstructured/metrics/test_element_type.py
@@ -29,6 +29,7 @@
("Title", 0): 4,
("Title", 1): 1,
("NarrativeText", 0): 3,
+("PageBreak", None): 3,
("ListItem", 0): 6,
("ListItem", 1): 6,
("ListItem", 2): 3,
11 changes: 0 additions & 11 deletions test_unstructured/partition/html/test_partition.py
@@ -1232,17 +1232,6 @@ def it_knows_the_caller_provided_detection_origin(

assert opts.detection_origin == detection_origin

-# -- .encoding -------------------------------
-
-@pytest.mark.parametrize("encoding", ["utf-8", None])
-def it_knows_the_caller_provided_encoding(
-self, encoding: str | None, opts_args: dict[str, Any]
-):
-opts_args["encoding"] = encoding
-opts = HtmlPartitionerOptions(**opts_args)
-
-assert opts.encoding == encoding
-
# -- .html_text ------------------------------

def it_gets_the_HTML_from_the_file_path_when_one_is_provided(self, opts_args: dict[str, Any]):
24 changes: 1 addition & 23 deletions test_unstructured/partition/test_auto.py
@@ -2,7 +2,6 @@

from __future__ import annotations

-import io
import json
import os
import pathlib
@@ -561,7 +560,6 @@ def test_auto_partition_pdf_with_fast_strategy(request: FixtureRequest):
strategy=PartitionStrategy.FAST,
languages=None,
metadata_filename=None,
-include_page_breaks=False,
infer_table_structure=False,
extract_images_in_pdf=False,
extract_image_block_types=None,
@@ -897,7 +895,7 @@ def test_auto_partition_raises_with_bad_type(request: FixtureRequest):

with pytest.raises(
UnsupportedFileFormatError,
-match="Invalid file made-up.fake. The FileType.UNK file type is not supported in partiti",
+match="Partitioning is not supported for the FileType.UNK file type.",
):
partition(filename="made-up.fake", strategy=PartitionStrategy.HI_RES)

@@ -1037,26 +1035,6 @@ def test_auto_partition_forwards_metadata_filename_via_kwargs():
assert all(e.metadata.filename == "much-more-interesting-name.txt" for e in elements)


-def test_auto_partition_warns_about_file_filename_deprecation(caplog: LogCaptureFixture):
-file_path = example_doc_path("fake-text.txt")
-
-with open(file_path, "rb") as f:
-elements = partition(file=f, file_filename=file_path)
-
-assert all(e.metadata.filename == "fake-text.txt" for e in elements)
-assert caplog.records[0].levelname == "WARNING"
-assert "The file_filename kwarg will be deprecated" in caplog.text
-
-
-def test_auto_partition_raises_when_both_file_filename_and_metadata_filename_args_are_used():
-file_path = example_doc_path("fake-text.txt")
-with open(file_path, "rb") as f:
-file = io.BytesIO(f.read())
-
-with pytest.raises(ValueError, match="Only one of metadata_filename and file_filename is spe"):
-partition(file=file, file_filename=file_path, metadata_filename=file_path)
-
-
# -- ocr_languages --------------------------------------------------------


2 changes: 1 addition & 1 deletion unstructured/__version__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
-__version__ = "0.16.11" # pragma: no cover
+__version__ = "0.16.12-dev0" # pragma: no cover