-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add csv feature to extract_tables #79
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
@jdev01-del Please fix the build issue due to black format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
any_parser/any_parser.py
Outdated
@staticmethod | ||
def flatten_to_string(lst): | ||
result = [] | ||
for item in lst: | ||
if isinstance(item, list): | ||
result.append(AnyParser.flatten_to_string(item)) | ||
else: | ||
result.append(str(item)) | ||
return "".join(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flatten_to_string method has multiple critical flaws in handling nested lists:
It incorrectly flattens nested lists by converting them to string representations or appending list objects directly, which prevents true flattening.
The method fails to properly extend the result list with flattened items, causing type errors when attempting to join the result.
The implementation assumes all iterables are lists, which limits its flexibility with other iterable types like tuples or sets.
The method needs to be redesigned to recursively flatten all nested lists into a
single string, ensuring that each nested item is converted to a string and fully
expanded, while supporting various iterable types.
Also, why this is a staticmethod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed. I used static method because I think this function will only need to support parsing the parameters, so it's easier if I make it a static method.
any_parser/any_parser.py
Outdated
df_list = pd.read_html(extracted_html) | ||
csv_list = [] | ||
for df in df_list: | ||
csv_list.append(df.to_csv(index=False)) | ||
csv_output = "\n\n".join(csv_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extract_tables method has a CSV conversion issue when handling multiple tables. When converting HTML tables to CSV, the method incorrectly joins multiple tables using "\n\n".join(csv_list), which breaks the CSV format by inserting unnecessary newlines. Additionally, the method does not properly handle cases where extracted_html is a list, potentially causing type conversion errors when using pd.read_html().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure you add both html and csv example in the notebook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure all your github actions are passing before you request for review to save reviwer time.
@jdev01-del Make sure you reply to all my comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
examples/extract_tables.ipynb
Outdated
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ap = AnyParser(api_key=\"...\")" | ||
"ap = AnyParser(api_key=\"key\")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's not change this.
User description
Description
extract_tables
used to only supporthtml
return format. This commit makes it also supportcsv
return format.To change return format, find this line in
extract_tables.ipynb
:file_path="./sample_data/test_1figure_1table.png", return_type="csv"
change
return_type
to either csv or html based on needs.Input Table:
CSV output:
0,1,2
,latency,(ms)
participants,mean,99th percentile
1,17.0 +1.4,75.0 34.9
2,24.5 +2.5,87.6 35.9
5,31.5 +6.2,104.5 52.2
10,30.0 +3.7,95.6 25.4
25,35.5 +5.6,100.4 42.7
50,42.7 +4.1,93.7 22.9
100,71.4 +7.6,131.2 +17.6
200,150.5 +11.0,320.3 35.1
Related Issue
Type of Change
How Has This Been Tested?
Locally running extract_tables.ipynb
Screenshots (if applicable)
Checklist
Additional Notes
PR Type
Enhancement
Description
Added support for CSV output in
extract_tables
method.Introduced a utility function
flatten_to_string
for nested list handling.Updated example notebook to demonstrate CSV output functionality.
Improved code formatting and added error handling for missing dependencies.
Changes walkthrough 📝
any_parser.py
Add CSV output functionality and utility methods
any_parser/any_parser.py
return_type
parameter toextract_tables
method for CSV or HTMLoutput.
flatten_to_string
utility for handling nested lists.dependency.
extract_tables.ipynb
Update example notebook for CSV output demonstration
examples/extract_tables.ipynb
usage.
return_type="csv"
inextract_tables
.