
fix: parse and clean archive badges and markdown links to URL #243

Merged: 15 commits merged into pyOpenSci:main on Jan 14, 2025

Conversation

@banesullivan (Contributor) commented Dec 15, 2024

This will parse markdown links/badges to consistently capture DOI URLs from the archive and JOSS DOI fields.

Note that this adds https://github.com/papis/python-doi as a dependency.

The solution I landed on is to always coerce the DOI to a single URL, since this is what the review templates expect:

https://github.com/pyOpenSci/pyopensci.github.io/blob/d0b561cc493f6e4691f171b4460b0f7d4793267a/_includes/package-grid.html#L42

The parser here now also validates the DOI links using the python-doi dependency. For testing/validation, I needed to use a real/valid DOI for the test data so I used PyVista's DOI for these 😄

I also noticed that some review issues had the JOSS archive key as `JOSS DOI` and others have it as `JOSS`. The changes here account for that and ensure the data are normalized to `JOSS`.
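
For illustration, here is a minimal sketch of what coercing those formats down to a single DOI URL could look like. This is not the code in this PR; the helper name and regex are assumptions, and the DOI is the placeholder example from the docstring below.

```python
import re

# Hypothetical helper (not the actual code in src/pyosmeta/utils_clean.py):
# pull a DOI URL out of the common formats seen in review metadata --
# a bare URL, a markdown link, or a markdown badge.
DOI_URL = re.compile(r"https?://doi\.org/\S+?(?=\)|\s|$)")


def extract_doi_url(value: str) -> str | None:
    """Return the first DOI URL found in ``value``, or None."""
    match = DOI_URL.search(value)
    if match is None:
        return None
    # Normalize to https so all stored DOI URLs share one scheme.
    return match.group(0).replace("http://", "https://")


# A bare URL, a markdown link, and a badge all reduce to the same DOI URL:
print(extract_doi_url("https://doi.org/10.1234/zenodo.12345678"))
print(extract_doi_url("[my archive](https://doi.org/10.1234/zenodo.12345678)"))
print(extract_doi_url(
    "[![DOI](https://zenodo.org/badge/DOI/10.1234/zenodo.12345678.svg)]"
    "(https://doi.org/10.1234/zenodo.12345678)"
))
```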



codecov bot commented Dec 15, 2024

Codecov Report

Attention: Patch coverage is 88.63636% with 5 lines in your changes missing coverage. Please review.

Project coverage is 75.67%. Comparing base (b6179f3) to head (4a56e10).
Report is 18 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/pyosmeta/utils_clean.py | 87.09% | 2 Missing and 2 partials ⚠️ |
| src/pyosmeta/models/base.py | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #243      +/-   ##
==========================================
+ Coverage   74.06%   75.67%   +1.60%     
==========================================
  Files          10       10              
  Lines         671      703      +32     
  Branches       82       90       +8     
==========================================
+ Hits          497      532      +35     
+ Misses        166      161       -5     
- Partials        8       10       +2     


@banesullivan changed the title from "feat: parse and clean archive badges and markdown links to URL" to "fix: parse and clean archive badges and markdown links to URL" on Dec 19, 2024
@lwasser (Member) commented Dec 19, 2024

Hey @banesullivan, I'll leave some more specific feedback here, but it looks like:

  1. This is erroring on a review (bibat), which means there is an outlier format in archive that isn't being handled as we would like.
update-reviews
/Users/leahawasser/Documents/GitHub/pyos/pyosMeta/src/pyosmeta/parse_issues.py:489: UserWarning: ## Community Partnerships not found in the list
  warnings.warn(f"{section_str} not found in the list")
Error in review at url: https://api.github.com/repos/pyOpenSci/software-submission/issues/83
Traceback (most recent call last):

  File "/Users/leahawasser/Documents/GitHub/pyos/pyosMeta/src/pyosmeta/parse_issues.py", line 310, in parse_issues
    review = self.parse_issue(issue)

  File "/Users/leahawasser/Documents/GitHub/pyos/pyosMeta/src/pyosmeta/parse_issues.py", line 284, in parse_issue
    return ReviewModel(**model)

  File "/Users/leahawasser/mambaforge/envs/pyosmeta/lib/python3.10/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)

pydantic_core._pydantic_core.ValidationError: 1 validation error for ReviewModel
joss
  Value error, Invalid archive URL:  [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

--------------------
http://www.sunpy.org 'http://' replacing w 'https://'
http://sourmash.readthedocs.io/en/latest/ 'http://' replacing w 'https://'
http://movingpandas.org 'http://' replacing w 'https://'
  2. When running it locally, it hangs. I suspect it's slower because adding python-doi means we are parsing and checking URLs/DOIs for each package. For now, we might consider adding more output so a user knows it's doing something. Maybe printing the name of the package being processed in the terminal would be good? I almost thought it was broken, and then I saw it was just processing, but more slowly than before.

@lwasser (Member) commented Dec 19, 2024

I'm also noticing some inconsistency in JOSS:

for joss / sleplet, there is a link to the paper.
issue_link: pyOpenSci/software-submission#149
joss: https://joss.theoj.org/papers/10.21105/joss.05221

For nbcompare, there is a link to the DOI, which resolves to the paper!
issue_link: pyOpenSci/software-submission#146
joss: https://doi.org/10.21105/joss.06490

Both will work! My question is, should we be consistent in how we save the doi and always link to the doi rather than the paper in terms of the data we store in our "database," aka YML file?

The archive value is inconsistent, too, but I think it's OK as is, especially because sometimes it's a GitHub link and others it's Zenodo, and sometimes the Zenodo link is to the "latest" rather than the actual archive VERSION that we approved. So let's not worry about archive and focus on JOSS DOI, as it hopefully isn't a huge amount of work. If it is, we are OK as is and we can open an issue about a future iteration that cleans it up just a bit more.

So, the takeaways here are:

  1. The one issue that is erroring: let's fix that (comment above)
  2. Let's fill in tests and use pytest.raises with match= for those try/except blocks
  3. Let's make the JOSS DOI consistent!
  4. Let's make sure there is some information in the terminal when it's processing reviews, so a user running it knows it's not stalled. A simple fix would be to write out the package name being processed, and to write out when it's failing on a DOI URL (potentially?). I'll leave that up to you.

Thank you so much. This looks really great!! 🚀 If you'd like, we could merge this as is today, and you could work on the comments above in another PR. That would allow us to update the website tomorrow.

If you'd like to complete all of the work in this PR, say the word, and we can hold off until January to get things merged.

@lwasser (Member) left a comment

Oops, it looks like I left blocks of feedback but not the line-by-line comments. I'm just making sure those are visible now.

@@ -7,6 +7,8 @@
from datetime import datetime
from typing import Any

import doi
@lwasser (Member):

Since we've added a new dep, we should make sure that it is noted in the changelog and also document why we added it.

@banesullivan (Contributor Author):

Further clarified this in e1246e0

CHANGELOG.md (outdated; resolved)
CHANGELOG.md (outdated; resolved)
This utility will attempt to parse the DOI link from the various formats
that are commonly present in review metadata. This utility will handle:

* Markdown links in the format `[label](URL)`, e.g., `[my archive](https://doi.org/10.1234/zenodo.12345678)`
@lwasser (Member):

It looks like the issue that failed when I ran it locally is using a Zenodo badge:

pyOpenSci/software-submission#83

Honestly, I may have updated and added that (it is possible). But it would be good to parse a markdown badge URL too.

Let me know if you don't see that error, but I saw it running locally.

@banesullivan (Contributor Author):

All good, it was the JOSS DOI field causing the failure, as it was left blank for pyOpenSci/software-submission#83. The Zenodo badge in this example is handled and is a recurring pattern in a lot of reviews.

archive = archive.replace("http://", "https://")
# Validate that the URL resolves
if not check_url(archive):
raise ValueError(f"Invalid archive URL: {archive}")
@lwasser (Member):

I think we want to use pytest.raises(ValueError, match=...) to hit these missing lines in our coverage.
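
For illustration, the kind of test this suggests might look roughly like the sketch below. The import path and `clean_archive` name are assumptions standing in for whichever helper raises the ValueError shown in the diff, not the actual test added later.

```python
import pytest

# Assumed import path; the real function/module names may differ.
from pyosmeta.utils_clean import clean_archive


def test_invalid_archive_url_raises():
    # A URL that does not resolve should raise with the message in the diff.
    with pytest.raises(ValueError, match="Invalid archive URL"):
        clean_archive("https://doi.org/this-does-not-resolve")
```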

@banesullivan (Contributor Author):

Added more validation testing in a4e9477

@lwasser (Member) commented Jan 7, 2025

Hey @banesullivan, I'm checking in on this PR. We are running into the issue that every PR on the website fails CI because of the DOIs! I think we have two options.

  1. We can merge this almost as is and push a new release; if it fixes things, then the website will at least be green.
  2. You can work on the tests and other smaller items in a separate PR.

The above approach is great if you are busy during this first full week back!!
Alternatively, we can leave this open for a bit longer and fix things here.

Please let me know what you prefer!

@banesullivan (Contributor Author):

> This is erroring on a review (bibat), which means there is an outlier format in archive that isn't being handled as we would like.

Whoops! Not sure how I missed this on the first go around. I tracked this down, and it was because the JOSS DOI field was left blank. I have updated the parser/validator to handle missing data like this in b6b9f27
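
For reference, a minimal sketch of the idea (the model and field names below are stand-ins, not the exact change in b6b9f27): treat a blank value as None before any DOI validation runs, so an empty JOSS DOI field no longer fails validation.

```python
from typing import Optional

from pydantic import BaseModel, field_validator


class ExampleModel(BaseModel):  # hypothetical stand-in for ReviewModel
    joss: Optional[str] = None

    @field_validator("joss", mode="before")
    @classmethod
    def empty_string_is_none(cls, value):
        # Blank or whitespace-only values mean "no JOSS DOI was provided".
        if value is None or not str(value).strip():
            return None
        return value


print(ExampleModel(joss="").joss)  # None, instead of a ValidationError
```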

> When running it locally, it hangs. I suspect it's slower because adding python-doi means we are parsing and checking URLs/DOIs for each package. For now, we might consider adding more output so a user knows it's doing something. Maybe printing the name of the package being processed in the terminal would be good? I almost thought it was broken, and then I saw it was just processing, but more slowly than before.

Ooof, I didn't realize how much slower it was until now. Indeed, it is the DOI validation step that is causing this slowdown. I can add a print statement, perhaps in this for-loop, to indicate which package it is currently processing.

With this dramatic a slowdown, I'm considering not doing the full validation of the DOI URLs to ensure they resolve correctly. Perhaps these could be gathered, and at the end of critical build steps (like for the documentation) we could check these DOI URLs to validate that they resolve. For now, I'm adding a simple print statement in b25ebbc to indicate progress, but I think we should come back to this later.
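
For reference, a rough sketch of what that progress output could look like (the data and helper below are stand-ins, not the actual code in b25ebbc):

```python
import time


def parse_issue(issue: dict) -> dict:
    """Stand-in for the slow parse/validation step (DOI and URL checks)."""
    time.sleep(0.1)
    return issue


issues = [{"package_name": "pyvista"}, {"package_name": "sleplet"}]

for issue in issues:
    # Print the package currently being processed so the user can see the
    # script is working, just slowly.
    print(f"Processing review: {issue['package_name']} ...")
    review = parse_issue(issue)
```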

@banesullivan (Contributor Author):

> I'm also noticing some inconsistency in JOSS ... Both will work! My question is, should we be consistent in how we save the doi and always link to the doi rather than the paper in terms of the data we store in our "database," aka YML file?

I'm not sure! This feels weird to me, as the DOIs themselves resolve to the paper, and if I use the python-doi package to validate a DOI like 10.21105/joss.05221, I get a link to the paper:

>>> import doi
>>> doi.validate_doi('10.21105/joss.05221')
'https://joss.theoj.org/papers/10.21105/joss.05221'

I'm not sure what's best here. This may be something we engage JOSS directly on and follow up.
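
If we did decide to standardize on the doi.org form, a minimal sketch might look like the following. This is a hypothetical helper for illustration only, not part of this PR.

```python
def normalize_joss_doi(value: str) -> str:
    """Rewrite a JOSS paper URL to its doi.org form; leave other values as-is."""
    paper_prefix = "https://joss.theoj.org/papers/"
    if value.startswith(paper_prefix):
        return "https://doi.org/" + value[len(paper_prefix):]
    return value


print(normalize_joss_doi("https://joss.theoj.org/papers/10.21105/joss.05221"))
# -> https://doi.org/10.21105/joss.05221
print(normalize_joss_doi("https://doi.org/10.21105/joss.06490"))
# -> unchanged
```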

@banesullivan (Contributor Author):

> We are running into the issue that every PR on the website fails CI because of the DOIs! I think we have two options.

Agh! Let's push to merge as-is at this point and document any lingering issues, like archive URL format inconsistencies, for future work.

@lwasser (Member) commented Jan 14, 2025

Thank you @banesullivan, this is running much better now and isn't erroring. Let's merge. I've opened a few issues (#248, #247) that we can work on independently. I'm going to merge this and push out a new patch release to see if we can get CI to be happy on the website side of things!! 🤞🏻

@lwasser lwasser merged commit 383e10f into pyOpenSci:main Jan 14, 2025
4 checks passed