feat: add instruction extraction with sync and async example notebooks #51

jojortz · 2024-10-19T00:30:27Z

Feature

Add instruction extraction to any-parser sdk.
Refactor some repeated functionality into helper functions in any_parser.py
Add pdf_to_key_value.ipynb and async_pdf_to_key_value.ipynb
Add pre-commit packages to pyproject.toml and instructions in README.md

Testing

Ran automated unit tests, including test_sync_extract_key_value and test_async_extract_key_value_and_fetch
Ensured pdf_to_key_value.ipynb and async_pdf_to_key_value.ipynb ran without errors

goldmermaid · 2024-10-21T20:05:36Z

any_parser/any_parser.py

@@ -172,6 +227,8 @@ def async_extract(
            process_type = ProcessType.FILE_REFINED_QUICK
        elif model == ModelType.ULTRA:


can you remove ultra?

Removed in latest commit

goldmermaid · 2024-10-21T20:23:30Z

Can you add extract_json and the notebooks into our docs: https://docs.cambioml.com/api-reference/example?

Show users the differences between sync and async and how to use each properly.
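A minimal sketch of the sync/async difference the docs could illustrate. `FakeParser`, its `fetch` method, and `wait_for_result` are hypothetical stand-ins written for this example, not the SDK's actual API; only the method names `extract_key_value` and `async_extract_key_value` and their return shapes come from this PR. The sync call blocks and returns the result, while the async call returns a file id right away and the result is polled for later.

```python
import json
import time
from typing import Dict, Optional, Tuple


class FakeParser:
    """Hypothetical in-memory stand-in for the SDK client; makes no network calls."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Dict] = {}

    def extract_key_value(
        self, file_path: str, extract_instruction: Dict
    ) -> Tuple[str, str]:
        # Sync: block until the result is ready, then return (result, timing info).
        start = time.time()
        result = json.dumps({k: f"<value for {k}>" for k in extract_instruction})
        return result, f"Time Elapsed: {time.time() - start:.2f} seconds"

    def async_extract_key_value(
        self, file_path: str, extract_instruction: Dict
    ) -> str:
        # Async: register the job and return a file id immediately.
        file_id = f"job-{len(self._jobs) + 1}"
        self._jobs[file_id] = {"polls_left": 2, "keys": list(extract_instruction)}
        return file_id

    def fetch(self, file_id: str) -> Optional[str]:
        # Hypothetical fetch: returns None while the job is still "processing".
        job = self._jobs[file_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return None
        return json.dumps({k: f"<value for {k}>" for k in job["keys"]})


def wait_for_result(
    parser: FakeParser, file_id: str, poll_interval: float = 0.01, max_polls: int = 10
) -> str:
    """Poll until the async job finishes or the poll budget runs out."""
    for _ in range(max_polls):
        result = parser.fetch(file_id)
        if result is not None:
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"No result for {file_id} after {max_polls} polls")


parser = FakeParser()
instruction = {"invoice_number": "the invoice number"}

# Sync usage: one call, result in hand.
sync_result, elapsed = parser.extract_key_value("sample.pdf", instruction)

# Async usage: submit now, do other work, fetch later.
file_id = parser.async_extract_key_value("sample.pdf", instruction)
async_result = wait_for_result(parser, file_id)
```

Sync suits a single small file where the caller can wait; async suits batches or long documents, where the caller submits several files and collects results afterwards.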

CambioML · 2024-10-22T06:33:08Z

WAIT TO MERGE UNTIL THIS PR IS MERGED: CambioML/any-parser-realtime-cdk#76

Feature

* Add instruction extraction to any-parser sdk.

* Refactor some repeated functionality into helper functions in any_parser.py

* Add `pdf_to_json.ipynb` and `async_pdf_to_json.ipynb`

Testing

* Testing using notebooks, tested more extensively on backend

What does "tested more extensively on backend" mean? Let's be explicit about what is tested to help us gain confidence.

CambioML · 2024-10-22T06:29:57Z

any_parser/any_parser.py

+        file_path: str,
+        extract_instruction: Dict,
+    ) -> Tuple[str, str]:
+        """Extract json in real-time.


This looks like a low-quality GenAI docstring. You should not just trust GenAI autocomplete output. Extract what JSON, based on what?

Updated with latest commit
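For illustration, a docstring along these lines would answer "extract what, based on what". This is a sketch, not the text of the latest commit: it uses the name the method took later in this PR, is written as a free function rather than a method, and the description of `extract_instruction` as a mapping from key to value description is an assumption.

```python
from typing import Dict, Tuple


def extract_key_value(file_path: str, extract_instruction: Dict) -> Tuple[str, str]:
    """Extract key-value pairs from a file in real time.

    Args:
        file_path: Path to the file to parse.
        extract_instruction: Assumed shape: maps each key to extract to a
            description of the value to look for in the file, for example
            {"invoice_number": "the invoice number"}.

    Returns:
        A tuple of (extracted data or an error message, time-elapsed info).
    """
    # Signature and docstring sketch only; the body is intentionally omitted.
    raise NotImplementedError("Docstring sketch only.")
```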

CambioML · 2024-10-22T06:31:24Z

any_parser/any_parser.py

+        file_path: str,
+        extract_instruction: Dict,
+    ) -> str:
+        """Extract data asynchronously.


Same here. What does "extract data" mean in the docstring?

Updated with latest commit

CambioML · 2024-10-22T06:35:30Z

any_parser/any_parser.py

@@ -137,6 +136,62 @@ def extract(
        else:
            return f"Error: {response.status_code} {response.text}", None

+    def extract_json(


This looks like a very bad name to me. In particular, extract requires no prompt while extract_json requires one, and then it suddenly starts returning JSON. Logically, I do not see why adding an extract_instruction turns extract into extract_json. You should reconsider the naming.

updated to extract_key_value

CambioML · 2024-10-22T06:35:42Z

any_parser/any_parser.py

+        # If response successful, upload the file
+        return self._upload_file_to_presigned_url(file_path, response)
+
+    def async_extract_json(


updated to extract_key_value

CambioML · 2024-10-22T06:36:17Z

any_parser/any_parser.py

+    def _upload_file_to_presigned_url(
+        self, file_path: str, response: requests.Response
+    ) -> str:
+        if response.status_code == 200:
+            try:
+                file_id = response.json().get("fileId")
+                presigned_url = response.json().get("presignedUrl")
+                with open(file_path, "rb") as file_to_upload:
+                    files = {"file": (file_path, file_to_upload)}
+                    upload_resp = requests.post(
+                        presigned_url["url"],
+                        data=presigned_url["fields"],
+                        files=files,
+                        timeout=TIMEOUT,
+                    )
+                    if upload_resp.status_code != 204:
+                        return f"Error: {upload_resp.status_code} {upload_resp.text}"
+                return file_id
+            except json.JSONDecodeError:
+                return "Error: Invalid JSON response"
+        else:
+            return f"Error: {response.status_code} {response.text}"
+
    def _check_model(self, model: ModelType) -> None:
-        if model not in {ModelType.BASE, ModelType.PRO, ModelType.ULTRA}:
+        if model not in {ModelType.BASE, ModelType.PRO}:
            valid_models = ", ".join(["`" + model.value + "`" for model in ModelType])
-            raise ValueError(
-                f"Invalid model type: {model}. Supported `model` types include {valid_models}."
-            )
+            return f"Invalid model type: {model}. Supported `model` types include {valid_models}."
+
+    def _check_file_type_and_path(self, file_path, file_extension):
+        # Check if the file exists
+        if not Path(file_path).is_file():
+            return f"Error: File does not exist: {file_path}"
+
+        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
+            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
+            return f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}."
+
+    def _check_result_type(self, result_type: str) -> None:
+        if result_type not in RESULT_TYPES:
+            valid_result_types = ", ".join(RESULT_TYPES)
+            return f"Invalid result type: {result_type}. Supported `result_type` types include {valid_result_types}."


Refactor these helpers into a utils.py to improve readability; then there is no need for the leading underscore.

moved to utils.py in latest commit
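A rough sketch of what one of these helpers might look like as a module-level function in utils.py. The contents of `SUPPORTED_FILE_EXTENSIONS` are hypothetical (the real list is not shown in this diff), and the `Optional[str]` return annotation is a suggestion rather than what the commit contains.

```python
from pathlib import Path
from typing import Optional

# Hypothetical values for illustration; the SDK defines the real list elsewhere.
SUPPORTED_FILE_EXTENSIONS = ["pdf", "docx"]


def check_file_type_and_path(file_path: str, file_extension: str) -> Optional[str]:
    """Return an error message if the file is missing or unsupported, else None."""
    # Check that the file exists before looking at its type.
    if not Path(file_path).is_file():
        return f"Error: File does not exist: {file_path}"

    if file_extension not in SUPPORTED_FILE_EXTENSIONS:
        supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
        return (
            f"Error: Unsupported file type: {file_extension}. "
            f"Supported file types include {supported_types}."
        )

    return None
```

Since the helpers in the diff return an error string instead of raising, callers have to check the return value, for example `error = check_file_type_and_path(path, ext)` followed by `if error: return error`. Annotating the return as `Optional[str]` rather than `None` makes that contract visible.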

jojortz · 2024-10-22T18:26:51Z

WAIT TO MERGE UNTIL THIS PR IS MERGED: CambioML/any-parser-realtime-cdk#76
Feature
* Add instruction extraction to any-parser sdk.

* Refactor some repeated functionality into helper functions in any_parser.py

* Add `pdf_to_json.ipynb` and `async_pdf_to_json.ipynb`
Testing
* Testing using notebooks, tested more extensively on backend
What does "tested more extensively on backend" mean? Let's be explicit about what is tested to help us gain confidence.

Updated the PR description with details on the latest testing, which added unit tests to test.py for key-value extraction.

CambioML

LGTM -R

add instruction extraction with sync and async example notebooks

e9168a6

jojortz requested review from CambioML and Sdddell as code owners October 19, 2024 00:30

goldmermaid reviewed Oct 21, 2024

jojortz and others added 4 commits October 21, 2024 14:18

remove ULTRA ModelType

582ddf6

add sync and async extrqct-json tests, add README

6cb2fdb

add .env info to test README

9de5fd4

Merge branch 'main' into json-extract

8a62433

CambioML reviewed Oct 22, 2024

update naming to key_value, fix docstring, add utils.py

53002f3

add pre-commit packages to dev.dependencies and tests/README.md

6d5347c

CambioML approved these changes Oct 22, 2024

CambioML merged commit 556b5bc into main Oct 22, 2024
5 checks passed
