-
-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add global discovery #470
Add global discovery #470
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #470 +/- ##
==========================================
- Coverage 81.24% 79.13% -2.11%
==========================================
Files 69 54 -15
Lines 7129 6155 -974
==========================================
- Hits 5792 4871 -921
+ Misses 1337 1284 -53 ☔ View full report in Codecov by Sentry. |
This attempts to address astropy#470 (review)
service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.
4735346
to
159aa5c
Compare
service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.
2142430
to
e6ca7bc
Compare
So... I think this could now be worth a look. See also my blog post on this: https://blog.g-vo.org/global-dataset-discovery-in-pyvo.html. The CI failures look interesting – I half fear they are related to implementation details changing the request hashes that utils.testing generates. Perhaps we'll have to properly scaffold these things after all? Anyway, I'll look at this some other day; it shouldn't impact discovery as such. Meanwhile, on the usual codestyle failures I have to say I'm fairly unhappy that I'll now have to say foo / 2 rather than foo/2 – does that really help someone? And do people here in pyVO actually prefer the former? If we have to have these checks in place: Can someone recommend a formatter/pretty-printer that I can run halfway blindly on the sources? Frankly, I don't think I'd like to get used to typing foo / 2 or 1 + 2, so if this kind of thing can happen in a pre-commit hook, that'd make my life easier... |
e6ca7bc
to
bba06d9
Compare
even old basic tools could fix this, autopep8, or I have flymake set up on my emacs to check on the codestyle (and it should pick up the setting from the repo) |
So... I think modulo review points and perhaps a few style matters, this could be merged, with a caveat that the API is probably not stable yet. It does something interesting, and although data providers still can do quite a bit to make this work better, it's a lot more likely they will do that if there is merged code in pyVO. From my perspective, the main problem is testing. Since there is so much network activity going on here, my plan has been to ship frozen network answers as part of the source (see utils.testing for the general scheme). It turns out that that is nowhere near as simple as I had hoped. The test failures with old versions of python are almost certainly due to some subtle difference in the serialisation of query parameters; they might be fixable by normalising these a bit more aggressively, but perhaps they are not; I'd have to pull up systems with these old versions to look. But: Is this the right way to go about this? Perhaps the instrumentation should sit in TAPService, where the parameters are much less dependent on versions than where requests.requests has written its payload? Or should we, especially after the xz disaster, try harder to avoid what looks like binary blobs in the VCS in the first place? But then how do we do the scaffolding for these tests without going insane? Thoughts and/or ideas? |
Hi Markus, |
On Tue, Apr 16, 2024 at 06:46:34AM -0700, Manon Marchand wrote:
With which python version do the tests pass without access to
remote data? Maybe there was a change in the way hashs are
calculated that we could identify in python changelogs?
I'm rather sure it's not related to hash computations as such; these
are fairly stable.
In the current checks, things already fail for python 3.9. But good
thing you ask, because it made me realise that for python 3.8, things
already fail because there is no functools.cache there. At least
that I can fix. Perhaps we can learn something from the emerging
pattern.
|
Ah well... failures up and down. I'll dig a bit deeper tomorrow, but this does look as if we're going to need a giant scaffold. Oh dang. |
Oh yikes. I've tried this on two different Debian relases (so, I'm not reproducing github's environment). The first instance for differing hashes is because astropy includes something like "Produced with astropy.io.votable version 5.2.1" in its VOTables (which get uploaded in the course of the tests). Updating the cached files depending on the underlying astropy release obviously is not an option. I think the testing plan with intercepting and caching https requests and responses and recovering the responses based on hashes of the requests is fundamentally broken for something as complex as the global discovery. I suppose we'll have to mock at a higher level. If someone has good ideas how to go about this: I'd be most grateful. |
I don't have any suggestions without diving into this properly, but will try to find time to do so next week. |
service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.
5f89f41
to
d143384
Compare
service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.
d143384
to
a2f7d8a
Compare
service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.
Basically, you can now clear the queue of services to query. Once the currently running query finishes, that will also end query_services. I've thought about canceling the currently running query, too, but that seems to be hard.
Also, fixing a crash in relationship discovery when there are no records to discover relationships for.
This is relevant when, for one reason or other, multiple capabilities of a type are present in a registry record. For our purposes here (sia, ssap, tap...), it is highly unlikely that these different capabilities correspond to different data (that's as different from VizieR SCS, where different capabilities correspond to different tables; but that's a pain and we don't want anything like that here).
SIA1 has all the various bandpass definitions optional (whether we like it or not). This commit makes our result construction resilient against services that don't provide it.
Also, we're sneaking in a change in SIA1 bandpass behaviour: We now accept the standard strings for bandpass units and assume m if the bandpass declaration is missing (that's against the standard, but in line with acutal practice).
This is mainly about using 1=contains rather than distance in the obscore queries; we also had to update the RegTAP capability.xml in the registry test to bring it in line with what the discover tests do. A better isolation of these test suites would be desirable; but then we don't want forsake capabilities caching altogether either... Also, sneaking in a few formatting problems.
It turns out that hashing actual network requests is too fragile -- due to version strings in headers and payloads, it is probably impossible to pull that off across python versions. I'm also dropping utils.testing; I think it might be useful to retain this in pyvo's history -- perhaps it will be useful as a basis for something comparable. Meanwhile, I have made the discovery tests remote_data. More unit tests ought to mock on a higher level, presumably by monkeypatching registry.search and the dal classes involved (yikes!).
When resetting the services and hence emptying the queue, the iterator currently active would keep returning the queriables that were in there before. That's no longer the case now. Also, adding a few tests to improve coverage.
This is in response to astropy#470 (comment).
Also, fixing a few tests that failed because of external changes.
e676bee
to
8caab1f
Compare
Hm. I'd suggest to ignore the CI failures. From what I can discern in the ASCII torrent, this is MIVOT stuff doing behind-the-scenes network requests – and I'd much prefer to not get entangled with that code. |
Yes, mivot failings feels to be unrelated, so I'm fixing the docs build one more time and this is good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go ahead and merge this now, in agreement with Tom and Adrian.
If the mivot failures are persistent, we'll deal with them in a follow-up.
Thank you so much for your patience here, Markus! |
This PR should eventually add a facility for global dataset discovery to pyVO. Specifically, people should be able to say "give me all images in the VO showing M42 around 42nm on MJD 54125."
Realistically, we ought to allow intervals on input and probably multiple positions (though we should refuse SIA1 for multiple objects).
At the moment, none of this will work; I'm missing two very basic building
blocks that I won't have time to locate before I'm going on holiday, both related to the way pyvo treats query results:
If there is no better solution for both, we probably need to turn things into astropy tables, manipulate them, and then turn them back to VOTables. It'd be great if there were a less convoluted way to do this. Is there?
Even if that's there, this won't be ready for prime time; we will at least need defined timeouts on all services involved, because there's always some services in the VO hanging clients for almost arbitrary times.
Once we have the image case, a similar modules for spectrum discovery (in SSA and Obscore for now) should be simple.
We probably should talk to the IVOA to make throwing out capabilities leading to the same data possible/more robust; I think we ought to suggest that SIAv2 services over data that's also there in Obscore services should have an IsDerivedFrom set to the respective TAP service (or Obscore resource).
So, there's a metric ton of TODOs – I'm grateful for commits fixing any of them.
Ah – and before you ask: Testing code like this that talks to services all over the internet is of course a particular challenge; I think I have a plan, so don't worry about that particular issue too much at this point.