Added content type check for getdatalink #607
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   81.70%   81.88%   +0.18%
==========================================
  Files          69       70       +1
  Lines        7093     7178      +85
==========================================
+ Hits         5795     5878      +83
- Misses       1298     1300       +2
Hm... first,
if ('datalink' in access_format) or ('votable' in access_format):
has far too many false positives – and possibly also false negatives, because RFC 2045 defines media types as case-insensitive.
What I think ought to be done: Write a function is_datalink (or something like that) that receives a media type string and returns true if the media type compares equal to application/x-votable+xml;content=datalink according to RFC 2045 rules, that is, after parsing, case folding, and all. I'd say this should sit in pyvo.dal.adhoc (stupid name, another thing we ought to change). This function should then be used wherever there are comparisons against adhoc.DATALINK_MIME_TYPE in pyVO.
And then quite a bit of this duplicates functionality that's already in _guess_access_format and _guess_access_url. It's bad enough that it's in one place (as these kinds of heuristics always suck), but let's not have them in two places. Can this be refactored?
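To make the false-positive concern concrete, here is a minimal sketch of the substring test quoted in this comment. The name `loose_check` is hypothetical and only wraps the quoted condition (with the lowercasing applied first); it is not pyVO code.

```python
def loose_check(access_format: str) -> bool:
    """The substring test from the PR, applied to a lowercased string."""
    access_format = access_format.lower()
    return ('datalink' in access_format) or ('votable' in access_format)


# A plain VOTable without content=datalink is accepted...
print(loose_check("application/x-votable+xml"))       # True
# ...and so is any string that merely mentions one of the words.
print(loose_check("text/html;note=about-datalink"))   # True
```

Neither of these is the datalink media type, which is why a comparison on the parsed media type is preferable to substring matching.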
The access_format is converted to lowercase prior to the check, so that should, I believe, prevent false negatives with respect to case-insensitivity. As regards false positives, I agree that
I can switch from |
On Mon, Oct 14, 2024 at 02:19:55PM -0700, d-giles wrote:
> the line. If it is the case that we can say for certain that every
> datalink will have content-type equals
> "application/x-votable+xml;content=datalink" then I'm happy to
> change the check to be much more restrictive and straightforward.
The standard is clear on that:
<https://ivoa.net/documents/DataLink/20231215/REC-DataLink-1.1.html#tth_sEc3.1>
-- and given there is not much prior expectation here, we should not
indulge in extra lenience.
However, we should be correct in properly parsing the media type.
Regrettably, the media type parsing code (such as it was) in the cgi
module is on the way out of the standard library, and the currently
recommended replacement in the email package is fairly clunky. I
have just written this function for DaCHS:
from email import message as emailmessage
from typing import Tuple

def parseMediaType(mediaType: str) -> Tuple[str, dict]:
    """returns a pair of basic, lowercased media type and a parameter
    dictionary for an RFC 2045 media type.

    Except for lowercasing the type, normalisation follows whatever the mail
    package does, which is probably too little.

    Regrettably, EmailMessage's parsing code silently falls back to text/plain
    when it cannot make out anything. We can't use that, so we have extra
    logic that doubts text/plain and raises a ValueError if it suspects
    that's a hallucination.

    >>> parseMediaType("application/x-VOTable+xml;serialisation=tabledata; content=datalink")
    ('application/x-votable+xml', {'serialisation': 'tabledata', 'content': 'datalink'})
    >>> parseMediaType("not a media type at all.")
    Traceback (most recent call last):
    ValueError: Cannot parse media type 'not a media type at all.'
    """
    # yeah, it's weird to construct an email message here, but it's what
    # the purgers of the cgi module wanted
    m = emailmessage.EmailMessage()
    m.add_header("content-type", mediaType)
    baseType = "{}/{}".format(m.get_content_maintype(), m.get_content_subtype())

    # a text/plain result is only trusted if the input actually says so
    if baseType=="text/plain":
        if not mediaType.lower().startswith(baseType):
            raise ValueError(f"Cannot parse media type '{mediaType}'")

    return baseType, dict(m.get("content-type").params)
This probably is fairly slow, so it may not be ideal in a "tight"
loop, but until we actually establish that that's a problem, I'd say
we should use that (adapting the identifier style) and then have
a function like:
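def is_datalink_type(content_type):
    base_type, params = parse_media_type(content_type)
    # .get() so that a media type without a content parameter yields
    # False rather than a KeyError
    return (base_type=='application/x-votable+xml'
        and params.get("content")=="datalink")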
or so. That way, we survive if someone does
application/x-votable+xml;content=datalink;serialization=tabledata;charset=iso-8859-1
-- which I'd argue would still be within what the datalink standard
mandates (though it could be more explicit on that).
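For reference, the two functions above can be put together into one self-contained sketch with PEP 8 names. This is an adaptation for illustration, not code that exists in pyVO; it uses `params.get` so that a missing `content` parameter gives False, and the candidate strings in the loop are made up.

```python
from email.message import EmailMessage


def parse_media_type(media_type):
    """Return (lowercased base type, parameter dict) for a media type string."""
    m = EmailMessage()
    m.add_header("content-type", media_type)
    base_type = "{}/{}".format(
        m.get_content_maintype(), m.get_content_subtype())
    # EmailMessage falls back to text/plain on input it cannot parse,
    # so only trust text/plain when the input really starts with it.
    if base_type == "text/plain" and not media_type.lower().startswith(base_type):
        raise ValueError(f"Cannot parse media type '{media_type}'")
    return base_type, dict(m.get("content-type").params)


def is_datalink_type(content_type):
    """True if content_type is a VOTable with content=datalink."""
    base_type, params = parse_media_type(content_type)
    return (base_type == "application/x-votable+xml"
            and params.get("content") == "datalink")


for candidate in [
    "application/x-votable+xml;content=datalink",
    "Application/X-VOTable+XML; content=datalink; charset=iso-8859-1",
    "application/x-votable+xml",
    "text/html",
]:
    print(candidate, "->", is_datalink_type(candidate))
```

The first two candidates are meant to be accepted (mixed case and extra parameters do not matter), the last two rejected.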
> keywords. Though, if the access_url isn't a datalink and there's a
> separate column for datalinks then we need to loop through other
> columns anyways. Both `getdata[...]` and `_guess_access_[...]`
I'm not too sure about what behaviour we want here, though I tend to
agree that indeed we should pick up non-access_url datalinks. I am
fairly sure, though, that that behaviour should be as consistent as
possible between the various places in which we are guessing.
> column `dlurl` and this is not returned by either of the
> aforementioned methods. Should `_guess_access_url` be modified to
> optionally return a list of all potential urls or should the loop
> be confined to `getdatalinkurl` in your opinion?
I'm always for consistency, so let's modify _guess_access_url unless
that breaks something big time (and if it does, I'd probably still
like to have a long, hard look at why different behaviours appear to
be necessary).
|
I think this issue may (or should) be split into two separate issues: One is to handle this |
On Mon, Oct 21, 2024 at 01:51:04PM -0700, Abdu Zoghbi wrote:
> I think this issue may (or should) be split into two separate issues: One is to handle this `getdatalink` error, and another to get the mime type.
> For the first one, I think a similar check is already done in `_iter_datalinks_from_product_rows` [here](https://github.com/astropy/pyvo/blob/8f5e780102295ff4bff137af32ba61da3bc88579/pyvo/dal/adhoc.py#L266), so it may make sense to do a similar check in `getdatalink`.
I'm fine with anything as long as we don't create more code paths and
more ways the datalink media type is checked. I'm happy to
contribute robust media type checking later.
|
This PR seeks to address issue #328 and the associated discussion by checking whether the `access_format` property is present and is a datalink or a VOTable. Failing that, this update checks the headers of URLs in the record with the `meta.ref.url` UCD; if the content type is found to be a datalink or a VOTable, the link is provided as the `datalink_url` for the record.

Where no datalink is found, a `DALServiceError` is raised informing the user that no datalink was found for the record.

Fixes #328
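The lookup order described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `find_datalink_url`, `looks_like_datalink`, `NoDatalinkError` (a stand-in for `DALServiceError`) and the injected `fetch_content_type` callable are all hypothetical names, and the header lookup is faked with a dictionary so the example runs without network access.

```python
from typing import Callable, Iterable, Optional


class NoDatalinkError(Exception):
    """Hypothetical stand-in for pyVO's DALServiceError."""


def looks_like_datalink(content_type: str) -> bool:
    # Mirrors the loose "datalink or votable" test from the description.
    content_type = content_type.lower()
    return "datalink" in content_type or "votable" in content_type


def find_datalink_url(
    access_format: Optional[str],
    access_url: Optional[str],
    candidate_urls: Iterable[str],
    fetch_content_type: Callable[[str], str],
) -> str:
    # Step 1: use access_url when access_format is present and matches.
    if access_format and access_url and looks_like_datalink(access_format):
        return access_url
    # Step 2: otherwise check the content type of each candidate URL
    # (in the PR, the columns carrying the meta.ref.url UCD).
    for url in candidate_urls:
        if looks_like_datalink(fetch_content_type(url)):
            return url
    # Step 3: nothing matched, so report that no datalink was found.
    raise NoDatalinkError("No datalink found for record")


# Made-up header table replacing real HTTP HEAD requests.
fake_headers = {
    "http://example.org/preview": "image/png",
    "http://example.org/dl": "application/x-votable+xml;content=datalink",
}
print(find_datalink_url(
    "image/fits", "http://example.org/data.fits",
    list(fake_headers), fake_headers.__getitem__))
```

Passing the content-type lookup in as a callable keeps the decision logic testable on its own; whichever media type check the project settles on can be swapped in for `looks_like_datalink`.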