-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add instruction extraction with sync and async example notebooks #51
Conversation
any_parser/any_parser.py
Outdated
@@ -172,6 +227,8 @@ def async_extract( | |||
process_type = ProcessType.FILE_REFINED_QUICK | |||
elif model == ModelType.ULTRA: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove ultra?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in latest commit
Can you add extract_json and the notebooks into our docs: https://docs.cambioml.com/api-reference/example? show users what the differences between |
what does |
any_parser/any_parser.py
Outdated
file_path: str, | ||
extract_instruction: Dict, | ||
) -> Tuple[str, str]: | ||
"""Extract json in real-time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a GenAI low quality dosstring, you should not just trust GenAI auto completely output, but given model models about extract what json based on what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with latest commit
any_parser/any_parser.py
Outdated
file_path: str, | ||
extract_instruction: Dict, | ||
) -> str: | ||
"""Extract data asynchronously. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same. This what does extract data means in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with latest commit
any_parser/any_parser.py
Outdated
@@ -137,6 +136,62 @@ def extract( | |||
else: | |||
return f"Error: {response.status_code} {response.text}", None | |||
|
|||
def extract_json( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to me a very bad name. Especially, extract required no prompt while extract_json requires prompt. Then, suddently, you start to return json. Logically, I do not know why an extract_instruction starts to get this extract to extract_json. You should re-consider on naming this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to extract_key_value
any_parser/any_parser.py
Outdated
# If response successful, upload the file | ||
return self._upload_file_to_presigned_url(file_path, response) | ||
|
||
def async_extract_json( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to extract_key_value
any_parser/any_parser.py
Outdated
def _upload_file_to_presigned_url( | ||
self, file_path: str, response: requests.Response | ||
) -> str: | ||
if response.status_code == 200: | ||
try: | ||
file_id = response.json().get("fileId") | ||
presigned_url = response.json().get("presignedUrl") | ||
with open(file_path, "rb") as file_to_upload: | ||
files = {"file": (file_path, file_to_upload)} | ||
upload_resp = requests.post( | ||
presigned_url["url"], | ||
data=presigned_url["fields"], | ||
files=files, | ||
timeout=TIMEOUT, | ||
) | ||
if upload_resp.status_code != 204: | ||
return f"Error: {upload_resp.status_code} {upload_resp.text}" | ||
return file_id | ||
except json.JSONDecodeError: | ||
return "Error: Invalid JSON response" | ||
else: | ||
return f"Error: {response.status_code} {response.text}" | ||
|
||
def _check_model(self, model: ModelType) -> None: | ||
if model not in {ModelType.BASE, ModelType.PRO, ModelType.ULTRA}: | ||
if model not in {ModelType.BASE, ModelType.PRO}: | ||
valid_models = ", ".join(["`" + model.value + "`" for model in ModelType]) | ||
raise ValueError( | ||
f"Invalid model type: {model}. Supported `model` types include {valid_models}." | ||
) | ||
return f"Invalid model type: {model}. Supported `model` types include {valid_models}." | ||
|
||
def _check_file_type_and_path(self, file_path, file_extension): | ||
# Check if the file exists | ||
if not Path(file_path).is_file(): | ||
return f"Error: File does not exist: {file_path}" | ||
|
||
if file_extension not in SUPPORTED_FILE_EXTENSIONS: | ||
supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS) | ||
return f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}." | ||
|
||
def _check_result_type(self, result_type: str) -> None: | ||
if result_type not in RESULT_TYPES: | ||
valid_result_types = ", ".join(RESULT_TYPES) | ||
return f"Invalid result type: {result_type}. Supported `result_type` types include {valid_result_types}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor to a utils.py to improve readability and no need to use _ then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to utils.py in latest commit
Updated comment with details on latest testing, which added unit tests to test.py for this key-value extract |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -R
Feature
pdf_to_key_value.ipynb
andasync_pdf_to_key_value.ipynb
pre-commit
packages topyproject.toml
and instructions in README.mdTesting
test_sync_extract_key_value
andtest_async_extract_key_value_and_fetch
pdf_to_key_value.ipynb
andasync_pdf_to_key_value.ipynb
ran without errors