Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add instruction extraction with sync and async example notebooks #51

Merged
merged 7 commits into from
Oct 22, 2024

Conversation

jojortz
Copy link
Collaborator

@jojortz jojortz commented Oct 19, 2024

Feature

  • Add instruction extraction to any-parser sdk.
  • Refactor some repeated functionality into helper functions in any_parser.py
  • Add pdf_to_key_value.ipynb and async_pdf_to_key_value.ipynb
  • Add pre-commit packages to pyproject.toml and instructions in README.md

Testing

  • Ran automated unit tests, including test_sync_extract_key_value and test_async_extract_key_value_and_fetch
  • Ensured pdf_to_key_value.ipynb and async_pdf_to_key_value.ipynb ran without errors

@@ -172,6 +227,8 @@ def async_extract(
process_type = ProcessType.FILE_REFINED_QUICK
elif model == ModelType.ULTRA:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove ultra?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in latest commit

@goldmermaid
Copy link
Collaborator

Can you add extract_json and the notebooks into our docs: https://docs.cambioml.com/api-reference/example?

show users what the differences between sync and async and how to properly use them;

@CambioML
Copy link
Owner

WAIT TO MERGE UNTIL THIS PR IS MERGED: CambioML/any-parser-realtime-cdk#76

Feature

* Add instruction extraction to any-parser sdk.

* Refactor some repeated functionality into helper functions in any_parser.py

* Add `pdf_to_json.ipynb` and `async_pdf_to_json.ipynb`

Testing

* Testing using notebooks, tested more extensively on backend

what does tested more extensively on backend means? Let's be explicit regarding what is tested to help us gain confidence.

file_path: str,
extract_instruction: Dict,
) -> Tuple[str, str]:
"""Extract json in real-time.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a GenAI low quality dosstring, you should not just trust GenAI auto completely output, but given model models about extract what json based on what?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with latest commit

file_path: str,
extract_instruction: Dict,
) -> str:
"""Extract data asynchronously.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. This what does extract data means in the docstring.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with latest commit

@@ -137,6 +136,62 @@ def extract(
else:
return f"Error: {response.status_code} {response.text}", None

def extract_json(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to me a very bad name. Especially, extract required no prompt while extract_json requires prompt. Then, suddently, you start to return json. Logically, I do not know why an extract_instruction starts to get this extract to extract_json. You should re-consider on naming this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to extract_key_value

# If response successful, upload the file
return self._upload_file_to_presigned_url(file_path, response)

def async_extract_json(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to extract_key_value

Comment on lines 352 to 392
def _upload_file_to_presigned_url(
self, file_path: str, response: requests.Response
) -> str:
if response.status_code == 200:
try:
file_id = response.json().get("fileId")
presigned_url = response.json().get("presignedUrl")
with open(file_path, "rb") as file_to_upload:
files = {"file": (file_path, file_to_upload)}
upload_resp = requests.post(
presigned_url["url"],
data=presigned_url["fields"],
files=files,
timeout=TIMEOUT,
)
if upload_resp.status_code != 204:
return f"Error: {upload_resp.status_code} {upload_resp.text}"
return file_id
except json.JSONDecodeError:
return "Error: Invalid JSON response"
else:
return f"Error: {response.status_code} {response.text}"

def _check_model(self, model: ModelType) -> None:
if model not in {ModelType.BASE, ModelType.PRO, ModelType.ULTRA}:
if model not in {ModelType.BASE, ModelType.PRO}:
valid_models = ", ".join(["`" + model.value + "`" for model in ModelType])
raise ValueError(
f"Invalid model type: {model}. Supported `model` types include {valid_models}."
)
return f"Invalid model type: {model}. Supported `model` types include {valid_models}."

def _check_file_type_and_path(self, file_path, file_extension):
# Check if the file exists
if not Path(file_path).is_file():
return f"Error: File does not exist: {file_path}"

if file_extension not in SUPPORTED_FILE_EXTENSIONS:
supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
return f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}."

def _check_result_type(self, result_type: str) -> None:
if result_type not in RESULT_TYPES:
valid_result_types = ", ".join(RESULT_TYPES)
return f"Invalid result type: {result_type}. Supported `result_type` types include {valid_result_types}."
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor to a utils.py to improve readability and no need to use _ then.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to utils.py in latest commit

@jojortz
Copy link
Collaborator Author

jojortz commented Oct 22, 2024

WAIT TO MERGE UNTIL THIS PR IS MERGED: CambioML/any-parser-realtime-cdk#76
Feature

* Add instruction extraction to any-parser sdk.

* Refactor some repeated functionality into helper functions in any_parser.py

* Add `pdf_to_json.ipynb` and `async_pdf_to_json.ipynb`

Testing

* Testing using notebooks, tested more extensively on backend

what does tested more extensively on backend means? Let's be explicit regarding what is tested to help us gain confidence.

Updated comment with details on latest testing, which added unit tests to test.py for this key-value extract

Copy link
Owner

@CambioML CambioML left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -R

@CambioML CambioML merged commit 556b5bc into main Oct 22, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants