feat: add csv feature to extract_tables #79

Merged
merged 3 commits into from
Jan 14, 2025

Conversation

jdev01-del
Collaborator

@jdev01-del jdev01-del commented Jan 9, 2025

User description

Description

extract_tables previously supported only the HTML return format. This commit adds CSV as a second return format.

To change the return format, find this line in extract_tables.ipynb:
file_path="./sample_data/test_1figure_1table.png", return_type="csv"

Set return_type to either "csv" or "html" as needed.

Input Table:
[image: input table]

CSV output:
0,1,2
,latency,(ms)
participants,mean,99th percentile
1,17.0 +1.4,75.0 34.9
2,24.5 +2.5,87.6 35.9
5,31.5 +6.2,104.5 52.2
10,30.0 +3.7,95.6 25.4
25,35.5 +5.6,100.4 42.7
50,42.7 +4.1,93.7 22.9
100,71.4 +7.6,131.2 +17.6
200,150.5 +11.0,320.3 35.1
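
As a rough illustration of how a caller might consume this output, the sketch below reads the CSV text back with Python's standard `csv` module. The `csv_output` variable is a stand-in for the string returned with return_type="csv"; the first row (0,1,2) looks like default integer column labels rather than data, which is an assumption based on the sample above.

```python
import csv
import io

# Stand-in for the string returned by extract_tables(..., return_type="csv").
csv_output = """0,1,2
,latency,(ms)
participants,mean,99th percentile
1,17.0 +1.4,75.0 34.9
2,24.5 +2.5,87.6 35.9
5,31.5 +6.2,104.5 52.2
10,30.0 +3.7,95.6 25.4
25,35.5 +5.6,100.4 42.7
50,42.7 +4.1,93.7 22.9
100,71.4 +7.6,131.2 +17.6
200,150.5 +11.0,320.3 35.1
"""

# Parse every line into a list of cell strings.
rows = list(csv.reader(io.StringIO(csv_output)))

# Assumption: the first row holds column labels, the rest is table content.
column_labels, *body = rows

for row in body:
    print(row)
```

The two header rows of the original table arrive as ordinary data rows, so a caller that needs typed columns would still have to merge or skip them.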

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

Ran extract_tables.ipynb locally.

Screenshots (if applicable)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes


PR Type

Enhancement


Description

  • Added support for CSV output in extract_tables method.

  • Introduced a utility function flatten_to_string for nested list handling.

  • Updated example notebook to demonstrate CSV output functionality.

  • Improved code formatting and added error handling for missing dependencies.


Changes walkthrough 📝

Relevant files
Enhancement
any_parser.py
Add CSV output functionality and utility methods                 

any_parser/any_parser.py

  • Added return_type parameter to extract_tables method for CSV or HTML
    output.
  • Implemented flatten_to_string utility for handling nested lists.
  • Added logic to convert HTML tables to CSV using pandas.
  • Improved formatting and added error handling for missing pandas
    dependency.
  +44/-5

extract_tables.ipynb
Update example notebook for CSV output demonstration

examples/extract_tables.ipynb

  • Updated notebook to demonstrate CSV output functionality.
  • Modified imports and added runtime warnings for deprecated pandas
    usage.
  • Adjusted example code to use return_type="csv" in extract_tables.
  • Enhanced output display logic for better readability.
  +54/-35


    github-actions bot commented Jan 9, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Issue

    The flatten_to_string method added in the PR may not handle edge cases effectively, such as deeply nested lists or non-stringifiable objects. This could lead to unexpected behavior or errors.

    @staticmethod
    def flatten_to_string(lst):
        result = []
        for item in lst:
            if isinstance(item, list):
                result.append(AnyParser.flatten_to_string(item))
            else:
                result.append(str(item))
        return "".join(result)

    CSV Conversion Warning

    The use of pd.read_html for converting HTML to CSV generates a FutureWarning. This indicates potential deprecation in future versions of pandas, which could break functionality.

    try:
        import pandas as pd
    except ImportError:
        raise ImportError(
            "Please install pandas to use CSV return_type"
        )
    
    df_list = pd.read_html(extracted_html)
    csv_list = []
    for df in df_list:
        csv_list.append(df.to_csv(index=False))
    csv_output = "\n\n".join(csv_list)
    
    return csv_output, time_elapsed

    Example Code Clarity

    The example notebook includes commented-out code and lacks clear documentation for the new return_type parameter. This could confuse users trying to understand or utilize the new feature.

     "cell_type": "code",
     "execution_count": 62,
     "metadata": {},
     "outputs": [],
     "source": [
      "from IPython.display import display, Markdown\n",
      "\n",
      "# from any_parser import AnyParser\n",
      "import sys\n",
      "import importlib\n",
      "\n",
      "\n",
      "sys.path.append(\"..\")\n",
      "import any_parser.any_parser\n",
      "\n",
      "importlib.reload(any_parser.any_parser)\n",
      "from any_parser.any_parser import AnyParser"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": 63,
     "metadata": {},
     "outputs": [],
     "source": [
      "ap = AnyParser(api_key=\"key\")"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": 64,
     "metadata": {},
     "outputs": [
      {
       "name": "stderr",
       "output_type": "stream",
       "text": [
        "/home/ubuntu/any-parser/examples/../any_parser/any_parser.py:232: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.\n",
        "  \n"
       ]
      }
     ],
     "source": [
      "html_output, time_info = ap.extract_tables(\n",
      "    file_path=\"./sample_data/test_1figure_1table.png\", return_type=\"csv\"\n",
      ")"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": 65,
     "metadata": {},
     "outputs": [
      {
       "name": "stdout",
       "output_type": "stream",
       "text": [
        "CPU times: user 3 μs, sys: 0 ns, total: 3 μs\n",
        "Wall time: 5.01 μs\n"
       ]
      }
     ],
     "source": [
      "time"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": 66,
     "metadata": {},
     "outputs": [
      {
       "data": {
        "text/markdown": [
         "0,1,2\n",
         ",latency,(ms)\n",
         "participants,mean,99th percentile\n",
         "1,17.0 +1.4,75.0 34.9\n",
         "2,24.5 +2.5,87.6 35.9\n",
         "5,31.5 +6.2,104.5 52.2\n",
         "10,30.0 +3.7,95.6 25.4\n",
         "25,35.5 +5.6,100.4 42.7\n",
         "50,42.7 +4.1,93.7 22.9\n",
         "100,71.4 +7.6,131.2 +17.6\n",
         "200,150.5 +11.0,320.3 35.1\n"
        ],
        "text/plain": [
         "<IPython.core.display.Markdown object>"
        ]
       },
       "metadata": {},
       "output_type": "display_data"
      }
     ],
     "source": [
      "if isinstance(html_output, list):\n",
      "    html_output_str = \"\\n\".join(html_output)\n",
      "else:\n",
      "    html_output_str = html_output\n",
      "\n",
      "display(Markdown(html_output_str))"
     ]
    


    github-actions bot commented Jan 9, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category | Suggestion | Score
    General
    Replace direct HTML string usage with StringIO to ensure compatibility with future versions of pandas

    Use StringIO when passing literal HTML strings to pd.read_html to avoid deprecation
    warnings and future compatibility issues.

    any_parser/any_parser.py [232]

    -df_list = pd.read_html(extracted_html)
    +from io import StringIO
    +df_list = pd.read_html(StringIO(extracted_html))
    Suggestion importance[1-10]: 10

    Why: This suggestion resolves a deprecation warning and ensures future compatibility with pandas by using StringIO for literal HTML strings. It is a necessary change to maintain functionality in upcoming versions of pandas.
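
A minimal sketch of the suggested change in context, assuming pandas and one of its HTML parser backends (such as lxml) are installed. The helper name `html_to_csv_list` is hypothetical and is not part of any_parser; only `pd.read_html`, `StringIO`, and `DataFrame.to_csv` are taken from the discussion above.

```python
from io import StringIO


def html_to_csv_list(extracted_html):
    """Convert each <table> in an HTML string to its own CSV string.

    Hypothetical helper for illustration only.
    """
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError("Please install pandas to use CSV return_type") from exc

    # Wrapping the literal HTML in StringIO avoids the FutureWarning that
    # pandas emits when read_html is given a raw HTML string.
    df_list = pd.read_html(StringIO(extracted_html))
    return [df.to_csv(index=False) for df in df_list]
```

Moving the `StringIO` import to module level, rather than next to the call as in the diff, keeps the conversion code free of inline imports.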

    10
    Add validation for the return_type parameter to prevent unexpected behavior from invalid inputs

    Validate the return_type parameter in extract_tables to ensure it only accepts
    "html" or "csv" and raise a clear error for invalid values.

    any_parser/any_parser.py [224]

    +if return_type.lower() not in ["html", "csv"]:
    +    raise ValueError("Invalid return_type. Expected 'html' or 'csv'.")
     if return_type.lower() == "csv":
    Suggestion importance[1-10]: 8

    Why: Adding validation for the return_type parameter improves code robustness by ensuring only valid inputs are processed. This prevents unexpected behavior and enhances error handling.
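
One way the validation could be factored out is sketched below. The function name `normalize_return_type` and the `ALLOWED_RETURN_TYPES` constant are hypothetical names chosen for this illustration, not identifiers from the PR.

```python
# Hypothetical constant: the two formats discussed in this PR.
ALLOWED_RETURN_TYPES = ("html", "csv")


def normalize_return_type(return_type):
    """Return the lower-cased return type, or raise if it is unsupported."""
    if not isinstance(return_type, str):
        raise TypeError("return_type must be a string")
    normalized = return_type.strip().lower()
    if normalized not in ALLOWED_RETURN_TYPES:
        raise ValueError("Invalid return_type. Expected 'html' or 'csv'.")
    return normalized


print(normalize_return_type("CSV"))
```

Normalizing once at the top of the method means later branches can compare against the plain lower-case value instead of calling `.lower()` repeatedly.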

    8
    Add type checking for html_output in the notebook example to handle unexpected data types

    Ensure the notebook example handles cases where html_output is not a valid list or
    string to prevent runtime errors.

    examples/extract_tables.ipynb [110-113]

     if isinstance(html_output, list):
         html_output_str = "\n".join(html_output)
    +elif isinstance(html_output, str):
    +    html_output_str = html_output
     else:
    -    html_output_str = html_output
    +    raise TypeError("html_output must be a list or a string")
    Suggestion importance[1-10]: 7

    Why: The suggestion improves the notebook example by adding type checking for html_output, which prevents runtime errors when the data type is unexpected. This enhances the reliability of the example code.

    7
    Possible issue
    Add handling for circular references in nested lists to prevent infinite recursion

    Ensure that the flatten_to_string method handles circular references in nested lists
    to prevent infinite recursion.

    any_parser/any_parser.py [190-197]

    -def flatten_to_string(lst):
    +def flatten_to_string(lst, seen=None):
    +    if seen is None:
    +        seen = set()
    +    if id(lst) in seen:
    +        raise ValueError("Circular reference detected in list")
    +    seen.add(id(lst))
         result = []
         for item in lst:
             if isinstance(item, list):
    -            result.append(AnyParser.flatten_to_string(item))
    +            result.append(AnyParser.flatten_to_string(item, seen))
             else:
                 result.append(str(item))
         return "".join(result)
    Suggestion importance[1-10]: 9

    Why: The suggestion effectively addresses a potential issue of infinite recursion in the flatten_to_string method by adding handling for circular references. This is a critical improvement for robustness and prevents runtime errors in edge cases.

    9

    @lingjiekong
    Member

    @jdev01-del Please fix the build failure caused by black formatting.

    @lingjiekong lingjiekong requested a review from Copilot January 9, 2025 18:38


    Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

    Comment on lines 189 to 197
    @staticmethod
    def flatten_to_string(lst):
        result = []
        for item in lst:
            if isinstance(item, list):
                result.append(AnyParser.flatten_to_string(item))
            else:
                result.append(str(item))
        return "".join(result)
    Member

    The flatten_to_string method has multiple critical flaws in handling nested lists:
      • It incorrectly flattens nested lists by converting them to string representations or appending list objects directly, which prevents true flattening.
      • The method fails to properly extend the result list with flattened items, causing type errors when attempting to join the result.
      • The implementation assumes all iterables are lists, which limits its flexibility with other iterable types like tuples or sets.

    The method needs to be redesigned to recursively flatten all nested lists into a single string, ensuring that each nested item is converted to a string and fully expanded, while supporting various iterable types.

    Also, why is this a staticmethod?
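
For reference, a standalone sketch of a flattener that also accepts tuples is shown below. It is written as a plain module-level function for illustration and is not the code that was merged. Strings are treated as atoms so they are not split into characters, and sets are left out because their iteration order is not stable.

```python
def flatten_to_string(items):
    """Recursively concatenate nested lists/tuples into a single string."""
    parts = []
    for item in items:
        if isinstance(item, (list, tuple)):
            # Recurse into nested sequences and keep their flattened text.
            parts.append(flatten_to_string(item))
        else:
            # Everything else, including str, is converted with str().
            parts.append(str(item))
    return "".join(parts)


print(flatten_to_string([1, [2, (3, ["a"])], "bc"]))
```

Because the function uses no instance or class state, it works equally well as a module-level helper or as a static method.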

    Collaborator Author

    Addressed. I made it a static method because I expect this function to be used only for parsing the parameters, which is simpler as a static method.

    Comment on lines 232 to 236
    df_list = pd.read_html(extracted_html)
    csv_list = []
    for df in df_list:
        csv_list.append(df.to_csv(index=False))
    csv_output = "\n\n".join(csv_list)
    Member

    The extract_tables method has a CSV conversion issue when handling multiple tables. When converting HTML tables to CSV, the method incorrectly joins multiple tables using "\n\n".join(csv_list), which breaks the CSV format by inserting unnecessary newlines. Additionally, the method does not properly handle cases where extracted_html is a list, potentially causing type conversion errors when using pd.read_html().
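
One possible way to address both points is to return one CSV string per table and to accept either a string or a list of HTML fragments. The sketch below is a dependency-free illustration that uses the standard library's `html.parser` and `csv` modules in place of pandas. The names `_TableCollector` and `html_tables_to_csv_list` are hypothetical, nested tables are not handled, and this is not necessarily how the PR resolved the comment.

```python
import csv
import io
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row and table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of cells
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def html_tables_to_csv_list(extracted_html):
    """Return one CSV string per <table> instead of a single joined blob."""
    # Accept either one HTML string or a list/tuple of HTML fragments.
    if isinstance(extracted_html, (list, tuple)):
        extracted_html = "".join(extracted_html)
    collector = _TableCollector()
    collector.feed(extracted_html)
    collector.close()

    csv_list = []
    for rows in collector.tables:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        csv_list.append(buf.getvalue())
    return csv_list
```

Returning a list leaves it to the caller to decide how to separate tables, for example by writing each one to its own file.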

    Collaborator Author

    Addressed.

    Member

    @lingjiekong lingjiekong left a comment

    Make sure you add both HTML and CSV examples to the notebook.

    Member

    @lingjiekong lingjiekong left a comment

    Make sure all your GitHub Actions are passing before you request a review, to save reviewer time.

    @lingjiekong
    Member

    @jdev01-del Make sure you reply to all my comments.

    Member

    @lingjiekong lingjiekong left a comment

    LGTM

    "metadata": {},
    "outputs": [],
    "source": [
    "ap = AnyParser(api_key=\"...\")"
    "ap = AnyParser(api_key=\"key\")"
    Member

    nit: let's not change this.

    @lingjiekong lingjiekong merged commit fc996f0 into main Jan 14, 2025
    1 check passed