feat: Add input folder support for batch api #74
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
examples/parse_batch_fetch.py:14
- The variable name MAX_WORKER should be renamed to MAX_WORKERS for consistency.
MAX_WORKER = 10
README.md:87
- The example code is missing the step to read the request ID from the response. It should be updated for clarity.
markdown = ap.batches.retrieve(request_id)
examples/parse_batch_fetch.py
Outdated
    response["requestStatus"] = "COMPLETED"
    response["completionTime"] = markdown.completionTime
except Exception as e:
    print(f"Error processing {request_id}: {str(e)}")
Replace print statement with a proper logging mechanism for error handling.
Suggested change:
- print(f"Error processing {request_id}: {str(e)}")
+ logging.error(f"Error processing {request_id}: {str(e)}")
Copilot is powered by AI, so mistakes are possible. Review output carefully before use.
@boqiny please address this comment.
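A minimal sketch of what the suggested change could look like in the fetch script, assuming a module-level logger. The `process_response` helper, the `retrieve` callable, and the `requestId` key are hypothetical stand-ins for illustration; only `requestStatus`, `completionTime`, and `request_id` come from the diff above.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("parse_batch_fetch")


def process_response(response, retrieve):
    """Update one upload record in place; log failures instead of printing them.

    `retrieve` is a hypothetical callable standing in for the client call
    that fetches a result by request ID.
    """
    request_id = response["requestId"]  # assumed key name
    try:
        result = retrieve(request_id)
        response["requestStatus"] = "COMPLETED"
        response["completionTime"] = result.completionTime
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Error processing %s", request_id)
    return response
```

Passing the request ID as a logging argument rather than through an f-string defers formatting until the record is actually emitted.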
any_parser/batch_parser.py
Outdated
@@ -8,6 +13,8 @@
from any_parser.base_parser import BaseParser

TIMEOUT = 60
MAX_FILES = 1000
There is no need to restrict this. For the batch API, the logic is:
- a cron job runs every 2 hours
- a scan runs every 5 minutes and checks whether the batch queue size is more than 1000
Either condition triggers a batch run.
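The two trigger conditions described above can be modeled roughly as follows. This is only a sketch of the rule as stated in this comment, not the server's actual implementation; the function and constant names are hypothetical.

```python
CRON_INTERVAL_SECONDS = 2 * 60 * 60  # scheduled run every 2 hours
QUEUE_THRESHOLD = 1000               # checked by the scan that runs every 5 minutes


def should_trigger_batch(queue_size: int, seconds_since_last_run: float) -> bool:
    """Return True when either condition from the comment above holds."""
    cron_due = seconds_since_last_run >= CRON_INTERVAL_SECONDS
    queue_over_threshold = queue_size > QUEUE_THRESHOLD
    return cron_due or queue_over_threshold
```

Because the queue itself bounds how much work runs per batch, a client-side cap on folder size adds nothing under this model.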
any_parser/batch_parser.py
Outdated
if len(files) > MAX_FILES:
    raise ValueError(
        f"Found {len(files)} files. Maximum allowed is {MAX_FILES}"
    )
No need for this.
any_parser/batch_parser.py
Outdated
output_path = folder_path.parent / output_filename

with open(output_path, "w") as f:
    for response in responses:
        f.write(json.dumps(response) + "\n")

return str(output_path)
As a rule of thumb, it is better to return a list of UploadResponse objects than an awkward file path whose meaning callers have to know before they can read the file in.
Modified: this logic now lives in the upload Python script.
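One way the split discussed in this thread could look: the library function returns a list of response objects, and the example script decides whether to persist them as JSONL. This is a sketch under stated assumptions; the `UploadResponse` fields, `upload_folder`, `upload_one`, and `save_responses` are hypothetical and may not match the real class or method names.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class UploadResponse:
    # Field names are assumptions for illustration only.
    fileName: str
    requestId: str
    requestStatus: str = "UPLOADED"


def upload_folder(
    folder: Path, upload_one: Callable[[Path], UploadResponse]
) -> List[UploadResponse]:
    """Library side: upload every file in the folder and return the responses."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    return [upload_one(p) for p in files]


def save_responses(responses: List[UploadResponse], output_path: Path) -> None:
    """Script side: write one JSON object per line if the caller wants a file."""
    with open(output_path, "w") as f:
        for response in responses:
            f.write(json.dumps(asdict(response)) + "\n")
```

Returning typed objects keeps the file format a concern of the caller rather than the parser.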
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
PR Code Suggestions ✨
Explore these optional code suggestions:
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
examples/parse_batch_fetch.py
Outdated
    response["requestStatus"] = "COMPLETED"
    response["completionTime"] = markdown.completionTime
except Exception as e:
    print(f"Error processing {request_id}: {str(e)}")
Use logger.error instead of print for error handling.
Suggested change:
- print(f"Error processing {request_id}: {str(e)}")
+ logger.error(f"Error processing {request_id}: {str(e)}")
@boqiny please address this comment.
README.md
Outdated
# This will generate a jsonl with filename and requestID
response = ap.batches.create(WORKING_FOLDER)

# Fetch the extracted content using the request ID
markdown = ap.batches.retrieve(request_id)
nit: be a bit more specific regarding how to get a single request_id and then check its status.
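A possible way to make that step explicit, assuming the generated JSONL holds one record per uploaded file. The key names `fileName` and `requestId` and both helper functions are assumptions for illustration, not confirmed by the PR.

```python
import json


def load_records(jsonl_path):
    """Read the upload JSONL into a list of dicts, skipping blank lines."""
    records = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def request_id_for(records, file_name):
    """Look up the request ID recorded for one uploaded file."""
    for record in records:
        if record.get("fileName") == file_name:  # assumed key name
            return record["requestId"]           # assumed key name
    raise KeyError(f"No upload record for {file_name}")
```

The returned ID would then be passed to `ap.batches.retrieve(request_id)` as in the README snippet above, and the status read from the result.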
LGTM with minor comment
Please incorporate #72 if you have not already done so in this PR.
User description
Description
The previous batch API accepted only a single file as input, which made it behave like one very long async function.
Following OpenAI's batch API implementation (https://platform.openai.com/docs/guides/batch/getting-started?lang=node),
we should add folder support by:
For now, max workers is set to 10 and the maximum input folder size to 1000 files.
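The concurrent upload with a worker cap of 10 can be sketched with the standard library's thread pool. The `upload_all` helper and the `upload_one` callable are hypothetical; the PR's own `_upload_folder` method may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 10  # worker cap mentioned in the PR description


def upload_all(files, upload_one, max_workers=MAX_WORKERS):
    """Upload files concurrently and map each file to its result or its exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, f): f for f in files}
        for future in as_completed(futures):
            source = futures[future]
            try:
                results[source] = future.result()
            except Exception as exc:
                # Keep going so one failed upload does not abort the rest.
                results[source] = exc
    return results
```

Threads suit this workload because uploads are I/O bound, so the pool mostly waits on the network rather than competing for the interpreter.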
Related Issue
Type of Change
How Has This Been Tested?
Test scripts added to the examples folder: parse_batch_upload.py and parse_batch_fetch.py
Screenshots (if applicable)
Generated JSON after uploading the input folder
Fetch (markdown contents added after processing finishes)
Checklist
Additional Notes
PR Type
Enhancement, Documentation
Description
Added example scripts (parse_batch_upload.py and parse_batch_fetch.py) to demonstrate folder upload and response fetching.
Changes walkthrough 📝
batch_parser.py
Add folder upload support and concurrent processing in batch parser.
any_parser/batch_parser.py
- batch processing.
- _upload_folder method to handle concurrent file uploads using a thread pool.
- create method to handle both files and folders.
parse_batch_fetch.py
Add example script for fetching batch API responses.
examples/parse_batch_fetch.py
parse_batch_upload.py
Add example script for uploading folders to batch API.
examples/parse_batch_upload.py
README.md
Update README with batch API folder input usage.
README.md
- extraction.