Added MAE and accuracy to the GitHub workflow #34

Closed
wants to merge 1 commit
34 changes: 34 additions & 0 deletions .github/workflows/pmml.yml
@@ -41,3 +41,37 @@ jobs:
- name: Run All Steps (1-3)
run: |
python run_all_steps.py

- name: Run tests
run: |
python utils/metrics_test.py

shell: bash
run: |
accuracy=$(grep -Po 'accuracy_score: \K[0-9.]+' Metrics_output.txt)
mae=$(grep -Po 'MAE: \K[0-9.]+' Metrics_output.txt)
echo "Accuracy: $accuracy"
echo "MAE: $mae"
echo "::set-output name=accuracy::$accuracy"
echo "::set-output name=mae::$mae"
Comment on lines +49 to +56

⚠️ Potential issue

Update deprecated set-output command and add error handling.

The step correctly extracts the metrics, but there are a few improvements we can make:

  1. The ::set-output syntax is deprecated. Use the $GITHUB_OUTPUT environment file instead.
  2. Add error handling in case the grep commands don't find matches.
  3. Consider using awk instead of grep for more robust parsing.

Here's a suggested improvement:

 shell: bash
 run: |
-  accuracy=$(grep -Po 'accuracy_score: \K[0-9.]+' Metrics_output.txt)
-  mae=$(grep -Po 'MAE: \K[0-9.]+' Metrics_output.txt)
+  accuracy=$(awk '/accuracy_score:/ {print $2}' Metrics_output.txt)
+  mae=$(awk '/MAE:/ {print $2}' Metrics_output.txt)
+  if [ -z "$accuracy" ] || [ -z "$mae" ]; then
+    echo "::error::Failed to extract metrics from Metrics_output.txt"
+    exit 1
+  fi
   echo "Accuracy: $accuracy"
   echo "MAE: $mae"
-  echo "::set-output name=accuracy::$accuracy"
-  echo "::set-output name=mae::$mae"
+  echo "accuracy=$accuracy" >> $GITHUB_OUTPUT
+  echo "mae=$mae" >> $GITHUB_OUTPUT

This change:

  1. Uses awk for more robust parsing.
  2. Adds error checking if metrics are not found.
  3. Updates to the new $GITHUB_OUTPUT syntax for setting outputs.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (replacing the grep/set-output block above):

shell: bash
run: |
  accuracy=$(awk '/accuracy_score:/ {print $2}' Metrics_output.txt)
  mae=$(awk '/MAE:/ {print $2}' Metrics_output.txt)
  if [ -z "$accuracy" ] || [ -z "$mae" ]; then
    echo "::error::Failed to extract metrics from Metrics_output.txt"
    exit 1
  fi
  echo "Accuracy: $accuracy"
  echo "MAE: $mae"
  echo "accuracy=$accuracy" >> $GITHUB_OUTPUT
  echo "mae=$mae" >> $GITHUB_OUTPUT
🧰 Tools
🪛 yamllint

[error] 49-49: syntax error: expected , but found ''

(syntax)
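
The yamllint error is consistent with the shell/run block above appearing without its own step header, even though the later "Post results to PR" step reads steps.compute_metrics.outputs.*. As a minimal sketch, the complete step could look like this once the suggestions above are applied (the step name is an assumption; the id comes from the later steps.compute_metrics reference, and the indentation is illustrative):

      - name: Compute metrics          # step name is hypothetical; only the id is implied by later references
        id: compute_metrics            # matches steps.compute_metrics.outputs.* used below
        shell: bash
        run: |
          accuracy=$(awk '/accuracy_score:/ {print $2}' Metrics_output.txt)
          mae=$(awk '/MAE:/ {print $2}' Metrics_output.txt)
          if [ -z "$accuracy" ] || [ -z "$mae" ]; then
            echo "::error::Failed to extract metrics from Metrics_output.txt"
            exit 1
          fi
          echo "accuracy=$accuracy" >> "$GITHUB_OUTPUT"
          echo "mae=$mae" >> "$GITHUB_OUTPUT"

With an id in place, the existing ${{ steps.compute_metrics.outputs.accuracy }} and ${{ steps.compute_metrics.outputs.mae }} expressions in the next step resolve as expected.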


- name: Post results to PR
if: github.event_name == 'pull_request'
uses: actions/github-script@v6
with:
script: |
const accuracy = "${{ steps.compute_metrics.outputs.accuracy }}";
const mae = "${{ steps.compute_metrics.outputs.mae }}";
const body = `
### Metrics from latest run:
- **Accuracy**: ${accuracy}
- **MAE**: ${mae}
`;

github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});

2 changes: 2 additions & 0 deletions pmml/Metrics_output.txt
@@ -0,0 +1,2 @@
accuracy_score: 0.75
MAE: 0.3333333333333333
2 changes: 2 additions & 0 deletions pmml/utils/Metrics_output.txt
@@ -0,0 +1,2 @@
accuracy_score: 0.75
MAE: 0.3333333333333333
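
For scale, the two numbers are mutually consistent on an integer reading-level scale. One hypothetical test set that would produce them (the actual test data is not part of this diff) is 12 books with 9 levels predicted exactly and the remaining three off by 1, 1 and 2 levels:

    accuracy_score = 9 / 12 = 0.75
    MAE = (1 + 1 + 2) / 12 = 4 / 12 ≈ 0.3333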
22 changes: 22 additions & 0 deletions pmml/utils/metrics_test.py
@@ -0,0 +1,22 @@
from sklearn.metrics import accuracy_score, mean_absolute_error
import pandas
from os.path import basename

# Load the preprocessed test data CSV into a DataFrame
storybooks_csv_path = '../step1_prepare/step1_3_storybooks_test.csv'
storybooks_dataframe = pandas.read_csv(storybooks_csv_path)
val_y = storybooks_dataframe['reading_level']

# Load Predicted values from step3_2_predictions.csv
val_predictions = pandas.read_csv('../step3_predict/step3_2_predictions.csv')
Comment on lines +5 to +11

🛠️ Refactor suggestion

Enhance robustness of data loading process.

While the data loading process is straightforward, consider the following improvements:

  1. Use absolute paths or environment variables instead of relative paths to ensure the script works correctly regardless of where it's run from.
  2. Add error handling for file not found scenarios.
  3. Validate the structure of the loaded CSV files to ensure they contain the expected columns.

Here's a suggested improvement:

import os
from pathlib import Path

# Use environment variables or construct paths relative to the script location
BASE_DIR = Path(__file__).resolve().parent.parent
storybooks_csv_path = os.path.join(BASE_DIR, 'step1_prepare', 'step1_3_storybooks_test.csv')
predictions_csv_path = os.path.join(BASE_DIR, 'step3_predict', 'step3_2_predictions.csv')

try:
    storybooks_dataframe = pandas.read_csv(storybooks_csv_path)

    # Validate dataframe structure before using the columns
    assert 'reading_level' in storybooks_dataframe.columns, "Missing 'reading_level' column in test data"
    val_y = storybooks_dataframe['reading_level']

    val_predictions = pandas.read_csv(predictions_csv_path)
    assert val_predictions.shape[1] == 1, "Predictions should be a single column"

except FileNotFoundError as e:
    print(f"Error: {e}. Please ensure the CSV files exist in the correct location.")
    exit(1)
except AssertionError as e:
    print(f"Error: {e}. The structure of the CSV files is not as expected.")
    exit(1)

This modification improves the script's robustness and provides clearer error messages if issues arise.


accuracy = accuracy_score(val_y, val_predictions)
print(basename(__file__), f'accuracy_score: {accuracy}')

mae = mean_absolute_error(val_y, val_predictions)
print(basename(__file__), f'accuracy_score: {mae}')
Comment on lines +16 to +17

⚠️ Potential issue

Fix typo in MAE print statement.

The MAE calculation is correct, but there's a typo in the print statement.

Please apply the following fix:

- print(basename(__file__), f'accuracy_score: {mae}')
+ print(basename(__file__), f'MAE: {mae}')

This change ensures that the output correctly identifies the metric as MAE instead of accuracy_score.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (replacing the two lines above):

mae = mean_absolute_error(val_y, val_predictions)
print(basename(__file__), f'MAE: {mae}')


# Save the results to a file for the GitHub workflow to read
with open('Metrics_output.txt', 'w') as f:
f.write(f'accuracy_score: {accuracy}\n')
f.write(f'MAE: {mae}\n')
Comment on lines +19 to +22

🛠️ Refactor suggestion

Enhance error handling and file path management for results saving.

While the use of a context manager for file operations is good practice, consider the following improvements:

  1. Add error handling for potential IOErrors during file writing.
  2. Use an absolute path or environment variable for the output file location to ensure consistency regardless of where the script is run from.
  3. Consider appending to the file instead of overwriting, or include a timestamp in the filename to preserve historical results.

Here's a suggested improvement:

import os
from datetime import datetime

# Use an environment variable or a config file to set the output directory
output_dir = os.environ.get('METRICS_OUTPUT_DIR', '.')
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = os.path.join(output_dir, f'Metrics_output_{timestamp}.txt')

try:
    with open(output_file, 'w') as f:
        f.write(f'accuracy_score: {accuracy}\n')
        f.write(f'MAE: {mae}\n')
    print(f"Metrics successfully written to {output_file}")
except IOError as e:
    print(f"Error writing to file: {e}")

This modification improves error handling, uses a more robust file path, and includes a timestamp in the filename to preserve historical results.
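
If the timestamped filename is adopted, the metric-extraction step in the workflow would also need to locate the newest file instead of the fixed Metrics_output.txt it currently reads. A minimal sketch of that adjustment, assuming the files are written to the directory the workflow step runs in:

        shell: bash
        run: |
          # Pick the most recent timestamped metrics file (directory layout is assumed, not taken from this PR)
          latest=$(ls -t Metrics_output_*.txt 2>/dev/null | head -n 1)
          if [ -z "$latest" ]; then
            echo "::error::No Metrics_output_*.txt file found"
            exit 1
          fi
          accuracy=$(awk '/accuracy_score:/ {print $2}' "$latest")
          mae=$(awk '/MAE:/ {print $2}' "$latest")
          echo "accuracy=$accuracy" >> "$GITHUB_OUTPUT"
          echo "mae=$mae" >> "$GITHUB_OUTPUT"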