Added content type check for getdatalink #607

d-giles · 2024-10-04T16:15:52Z

This PR seeks to address issue #328 and the associated discussion by checking whether the access_format property is present and indicates a datalink or a VOTable. Failing that, this update checks the headers of URLs in the record with the meta.ref.url UCD; if the content type is found to be a datalink or a VOTable, the link is provided as the datalink_url for the record.
Where no datalink is found, a DALServiceError is raised informing the user that no datalink was found for the record.
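
The lookup order described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `find_datalink_url`, `looks_like_datalink`, and `NoDatalinkError` are hypothetical names, the exception merely stands in for pyVO's `DALServiceError`, and `fetch_content_type` is an injected callable standing in for whatever header request the PR actually performs.

```python
from typing import Callable, Iterable, Optional


class NoDatalinkError(Exception):
    """Hypothetical stand-in for pyVO's DALServiceError."""


def looks_like_datalink(media_type: Optional[str]) -> bool:
    """Permissive substring test, as described in the PR text."""
    if not media_type:
        return False
    fmt = media_type.lower()
    return "datalink" in fmt or "votable" in fmt


def find_datalink_url(
    access_format: Optional[str],
    access_url: Optional[str],
    ref_urls: Iterable[str],
    fetch_content_type: Callable[[str], Optional[str]],
) -> str:
    # Step 1: trust access_format if it is present and looks right.
    if access_url and looks_like_datalink(access_format):
        return access_url
    # Step 2: otherwise probe each URL taken from a meta.ref.url column.
    for url in ref_urls:
        if looks_like_datalink(fetch_content_type(url)):
            return url
    # Step 3: nothing matched, so report that to the caller.
    raise NoDatalinkError("No datalink found for record.")
```

Passing the header lookup in as a callable keeps the sketch testable without network access; a dictionary's `get` method is enough to exercise it.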

Fixes #328

codecov · 2024-10-04T16:21:16Z

Codecov Report

Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.

Project coverage is 81.88%. Comparing base (029980e) to head (cda2eae).
Report is 132 commits behind head on main.

Files with missing lines	Patch %	Lines
pyvo/dal/adhoc.py	73.68%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   81.70%   81.88%   +0.18%     
==========================================
  Files          69       70       +1     
  Lines        7093     7178      +85     
==========================================
+ Hits         5795     5878      +83     
- Misses       1298     1300       +2


msdemlei

Hm... first,

       if ('datalink' in access_format) or ('votable' in access_format):

has far too many false positives – and possibly also false negatives, because RFC 2045 defines media types as case-insensitive.
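
To make the objection concrete, here is a small sketch that applies the quoted condition to a few media type strings. The helper name `substring_check` is hypothetical, and the sample strings are made up for illustration; none of them come from a real service.

```python
def substring_check(access_format: str) -> bool:
    """The condition quoted above, applied verbatim (no case folding)."""
    return ("datalink" in access_format) or ("votable" in access_format)


# False positives: both are accepted, yet neither is a datalink document.
assert substring_check("application/x-votable+xml")        # a plain VOTable
assert substring_check("text/html;note=no-datalink-here")  # unrelated type

# False negative: a string that differs from the datalink type only in
# letter case is rejected unless the caller lowercases it first.
assert not substring_check("application/x-VOTable+xml;content=DataLink")
```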

What I think ought to be done: Write a function is_datalink (or something like that) that receives a media type string and returns true if the media type compares equal to application/x-votable+xml;content=datalink according to RFC 2045 rules, that is, after parsing, case folding, and all. I'd say this should sit in pyvo.dal.adhoc (stupid name, another thing we ought to change). This function should then be used wherever there are comparisons against adhoc.DATALINK_MIME_TYPE in pyVO.

And then quite a bit of this duplicates functionality that's already in _guess_access_format and _guess_access_url. It's bad enough that it's in one place (as these kinds of heuristics always suck), but let's not have them in two places. Can this be refactored?

d-giles · 2024-10-14T21:19:34Z

@msdemlei

> has far too many false positives – and possibly also false negatives, because RFC 2045 defines media types as case-insensitive.

The access_format is converted to lowercase prior to the check, so that should, I believe, prevent false negatives with respect to case-insensitivity. As regards false positives, I agree that "votable" in access_format is incredibly broad, but I made the check as permissive as possible to replicate the original behavior of getdatalink while culling results which couldn't be parsed down the line. If it is the case that we can say for certain that every datalink will have content-type equals "application/x-votable+xml;content=datalink" then I'm happy to change the check to be much more restrictive and straightforward. Are we confident that "access_format" will be given as that exact value when the content is a datalink?

> And then quite a bit of this duplicates functionality that's already in _guess_access_format and _guess_access_url. It's bad enough that it's in one place (as these kinds of heuristics always suck), but let's not have them in two places. Can this be refactored?

I can switch from self.getdataformat() and self.getdataurl() to DatalinkResults._guess_access_format(...) and DatalinkResults._guess_access_url(...); they look for a few more keywords. Though, if the access_url isn't a datalink and there's a separate column for datalinks then we need to loop through other columns anyways. Both getdata[...] and _guess_access_[...] return the first best guess and not a full list of matching conditions. For example, in the query you gave on the original issue, `select top 1 dlurl from rosat.images`, the datalink url is in the column dlurl and this is not returned by either of the aforementioned methods if the full record is queried. Should _guess_access_url be modified to optionally return a list of all potential urls or should the loop be confined to getdatalinkurl in your opinion?

msdemlei · 2024-10-15T08:36:23Z

On Mon, Oct 14, 2024 at 02:19:55PM -0700, d-giles wrote:

> the line. If it is the case that we can say for certain that every datalink will have content-type equals "application/x-votable+xml;content=datalink" then I'm happy to change the check to be much more restrictive and straightforward.

The standard is clear on that: <https://ivoa.net/documents/DataLink/20231215/REC-DataLink-1.1.html#tth_sEc3.1> -- and given there is not much prior expectation here, we should not indulge in extra lenience.

However, we should be correct in properly parsing the media type. Regrettably, the media type parsing code (such as it was) in the cgi module is on the way out of the standard library, and the currently recommended replacement in the email package is fairly clunky. I have just written this function for DaCHS:

    def parseMediaType(mediaType: str) -> Tuple[str, dict]:
        """returns a pair of basic, lowercased media type and a parameter
        dictionary for an RFC 2045 media type.

        Except for lowercasing the type, normalisation follows whatever
        the mail package does, which is probably too little.

        Regrettably, EmailMessage's parsing code silently falls back to
        text/plain when it cannot make out anything.  We can't use that,
        so we have extra logic that doubts text/plain and raises a
        ValueError if it suspects that's a hallucination.

        >>> parseMediaType("application/x-VOTable+xml;serialisation=tabledata; content=datalink")
        ('application/x-votable+xml', {'serialisation': 'tabledata', 'content': 'datalink'})
        >>> parseMediaType("not a media type at all.")
        Traceback (most recent call last):
        ValueError: Cannot parse media type 'not a media type at all.'
        """
        # yeah, it's weird to construct an email message here, but it's what
        # the purgers of the cgi module wanted
        m = emailmessage.EmailMessage()
        m.add_header("content-type", mediaType)
        baseType = "{}/{}".format(m.get_content_maintype(), m.get_content_subtype())

        if baseType=="text/plain":
            if not mediaType.lower().startswith(baseType):
                raise ValueError(f"Cannot parse media type '{mediaType}'")

        return baseType, dict(m.get("content-type").params)

This probably is fairly slow, so it may not be ideal in a "tight" loop, but until we actually establish that that's a problem, I'd say we should use that (adapting the identifier style) and then have a function like:

    def is_datalink_type(content_type):
        base_type, params = parse_media_type(content_type)
        return (base_type=='application/x-votable+xml'
            and params["content"]=="datalink")

or so. That way, we survive if someone does

    application/x-votable+xml;content=datalink;serialization=tabledata;charset=iso-8859-1

-- which I'd argue would still be within what the datalink standard mandates (though it could be more explicit on that).
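
For reference, a self-contained variant of the two functions from this comment might look like the sketch below. It is an adaptation, not the DaCHS code: the identifiers are switched to snake_case as suggested, the `from email.message import EmailMessage` import is an assumption about what `emailmessage` refers to, and two behaviours are choices made for this sketch only: `params.get` is used so that a plain VOTable type without a `content` parameter yields False instead of a KeyError, and an unparseable string is treated as "not a datalink".

```python
from email.message import EmailMessage
from typing import Dict, Tuple

DATALINK_BASE_TYPE = "application/x-votable+xml"


def parse_media_type(media_type: str) -> Tuple[str, Dict[str, str]]:
    """Return (lowercased base type, parameter dict) for a media type."""
    msg = EmailMessage()
    msg["content-type"] = media_type
    base_type = msg.get_content_type()
    # The email package silently falls back to text/plain on garbage,
    # so only believe text/plain if the input really said so.
    if base_type == "text/plain":
        if not media_type.strip().lower().startswith(base_type):
            raise ValueError(f"Cannot parse media type '{media_type}'")
    return base_type, dict(msg["content-type"].params)


def is_datalink_type(content_type: str) -> bool:
    """True if content_type denotes a datalink VOTable."""
    try:
        base_type, params = parse_media_type(content_type)
    except ValueError:
        # Sketch-only choice: unparseable input is simply not a datalink.
        return False
    # Parameter values are compared exactly here; RFC 2045 treats them as
    # case sensitive unless a specific parameter says otherwise.
    return (base_type == DATALINK_BASE_TYPE
            and params.get("content") == "datalink")
```

With this, extra parameters such as `serialization` or `charset` do not affect the result, while a bare `application/x-votable+xml` is rejected.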

> keywords. Though, if the access_url isn't a datalink and there's a separate column for datalinks then we need to loop through other columns anyways. Both `getdata[...]` and `_guess_access_[...]`

I'm not too sure about what behaviour we want here, though I tend to agree that indeed we should pick up non-access_url datalinks. I am fairly sure, though, that that behaviour should be as consistent as possible between the various places in which we are guessing.

> column `dlurl` and this is not returned by either of the aforementioned methods. Should `_guess_access_url` be modified to optionally return a list of all potential urls or should the loop be confined to `getdatalinkurl` in your opinion?

I'm always for consistency, so let's modify _guess_access_url unless that breaks something big time (and if it does, I'd probably still like to have a long, hard look at why different behaviours appear to be necessary).
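
One way to keep a single heuristic while serving both needs is to have one generator that yields every candidate and let the "first best guess" be defined in terms of it. The sketch below only illustrates that design; it does not reflect how `_guess_access_url` is actually implemented, and `iter_url_columns`, `guess_url_column`, and the `(name, UCD)` field shape are hypothetical.

```python
from typing import Iterator, Optional, Sequence, Tuple

# (column name, UCD): a made-up field shape used only for this sketch.
Field = Tuple[str, str]


def iter_url_columns(fields: Sequence[Field]) -> Iterator[str]:
    """Yield the names of all columns whose UCD contains meta.ref.url."""
    for name, ucd in fields:
        # A UCD is a ';'-separated list of words; tolerate case differences.
        if "meta.ref.url" in ucd.lower().split(";"):
            yield name


def guess_url_column(fields: Sequence[Field]) -> Optional[str]:
    """First-match behaviour, defined via the same generator."""
    return next(iter_url_columns(fields), None)
```

A caller that needs to probe every link column would iterate over `iter_url_columns`, while existing callers keep the first-match result, so there is still only one place where the guessing logic lives.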

zoghbi-a · 2024-10-21T20:50:42Z

I think this issue may (or should) be split into two separate issues: one to handle this getdatalink error, and another to get the mime type.
For the first one, I think a similar check is already done in _iter_datalinks_from_product_rows [here](https://github.com/astropy/pyvo/blob/8f5e780102295ff4bff137af32ba61da3bc88579/pyvo/dal/adhoc.py#L266), so it may make sense to do a similar check in getdatalink.

msdemlei · 2024-10-22T08:25:34Z

On Mon, Oct 21, 2024 at 01:51:04PM -0700, Abdu Zoghbi wrote:

> I think this issue may (or should) be split into two separate issues: one to handle this `getdatalink` error, and another to get the mime type. For the first one, I think a similar check is already done in `_iter_datalinks_from_product_rows` [here](https://github.com/astropy/pyvo/blob/8f5e780102295ff4bff137af32ba61da3bc88579/pyvo/dal/adhoc.py#L266), so it may make sense to do a similar check in `getdatalink`.

I'm fine with anything as long as we don't create more code paths and more ways the datalink media type is checked. I'm happy to contribute robust media type checking later.

d-giles added 2 commits October 3, 2024 16:39

Added checks for content type for getdatalink

1c2019b

added sia and register_mocks fixtures to test

ff01bba

Added getdatalink() update to CHANGES

cda2eae

bsipocz added bug component: dal labels Oct 4, 2024

bsipocz added this to the v1.5.3 milestone Oct 4, 2024

msdemlei requested changes Oct 9, 2024


bsipocz modified the milestones: v1.5.3, v1.6 Oct 15, 2024

d-giles closed this Oct 25, 2024
