-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new xtype="json" #33
Comments
Would this be able to handle UTF8 strings embedded in the JSON ?
|
Unicode is a bit of a mess in VOTable anyway, but it's no worse for JSON than for other text content (all of the JSON syntax characters are representable in 7-bit ASCII, i.e. look the same in UTF8 and ASCII). So you could put a JSON string in either a |
On Tue, Nov 14, 2023 at 09:03:18AM -0800, Zarquan wrote:
Would this be able to handle UTF8 strings embedded in the JSON ?
https://www.json.org/json-en.html
> "A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. "
I'm not sure whether I'd prefer that to finally allowing utf-8 in
VOTable char, but sure, why not?
|
Generally, the presence of an xtype means "you can parse this primitive value into an ???" where ??? is the xtype value, e.g. point. Here, that comes across as "you can parse this string into a json (object)" which isn't really saying anything about the kind of content one has to grok or not. Consider xtype="moc" which in the details says "using ascii moc serialisation from document ...". That could have been and maybe someday will be "using json moc format from document ...". json is a serialisation format, not an extended datatype in it's own right, so I'm against this. It doesn't say enough about the type of the value. I am not against serialising with json if that's the most appropriate way to deliver the structure efficiently. |
On Mon, Nov 27, 2023 at 10:08:06AM -0800, Patrick Dowler wrote:
point. Here, that comes across as "you can parse this string into a
json (object)" which isn't really saying anything about the kind
of content one has to grok or not.
Well, it is saying "you can turn this string into something *more*
meaningful by passing it to a json parser". I could argue it's not
*much* different from "turn this array into a circle" without saying
whether that circle really is an error circle or an aperture or a
region of interest.
Turning the string into a parsed structure, in turn, could let people
directly say col[3]["measurements"][20] in a client -- of course,
presuming they know what to expect in that column.
I'd hence argue that a json xtype has a fairly clear operational
meaning, which is what I like about it. I readily admit that I, too,
dislike much else about it, in particular the temptation for data
providers to skip metadata declaration and proper relational design.
json is a serialisation format, not an extended datatype in it's
own right, so I'm against this. It doesn't say enough about the
type of the value. I am not against serialising with json if that's
I, too, would like it better if there were some machine-readable and
verifiable way to declare the structure of such a json column. I
haven't looked as JSON-schema, but assuming it's not dramatically
less powerful than XML schema -- would you be less opposed to this if
we required people to attach a JSON schema to a json column?
Full disclosure: The reason I'm after this is that DaCHS users
asked for ways to pull json/jsonb-valued columns into TAP responses,
and I'm already producing xtype="json" for those, under the reasoning
that it's better than betraying nothing about the inner structure of
these strings.
I can't say I like it, but there *are* use cases for that that I
cannot readily reject.
|
Is there a way to do that? I'd really like if we find a way to allow for a schema attached. |
@pdowler wrote:
I think you could apply all those arguments to URI and UUID, which have been introduced as xtypes in this draft. A URI doesn't say what it's the URI of. Just as you might one day want to represent a MOC in JSON, you might want to represent a MOC using its URI. So do we chuck out I would say that @msdemlei 's "you can turn this [string] into something more meaningful by [using its xtype]" is the most useful criterion here. |
My point is that parsing json into an programmatically accessible object is an intermediate representation, much like a DOM with xml, and then the intermediate representation gets pulled apart to create some structured object. That object has a type that would be part of a data model and the dm implementation would likely have a class corresponding to that type. eg Yes, you can parse a string into a JSONObject or a DOM... I just don't see that as "more meaningful" - it is still opaque unless you know what structure it is trying to convey. |
On Thu, Nov 30, 2023 at 09:28:11AM -0800, Patrick Dowler wrote:
Yes, you can parse a string into a JSONObject or a DOM... I just
don't see that as "more meaningful" - it is still opaque unless you
Well, being able to write col[3]["fluxes"][-1] in your code rather
than json.loads(col[3])["fluxes"][-1] very arguably is at least a
convenience. Whether that's already something in the way of
"meaning" is... a question of what the meaning of meaning is, and
I'd like to leave that question open.
More pragmatically, I have these people with jsonb columns in
their database, and they publish these things through TAP (or, let me
hedge here: at least they want to). So, what should I do?
If I just write char[*], I feel I'm concealing a *very* relevent fact
about the column -- that it's parseable -- from the client.
Should I say "this is json" using a UCD? That feels even more wrong
than an xtype.
Should I tell them to just not push out the jsonb stuff and unroll it
through a view? That's an option, but I think in this case that's
rather unattractive because none of what is in the JSON should be
available for SQL constraints and operators, and of course I'm a
coward and would like to avoid an ugly discussion when you have a
four-level nested structure in this column.
Should I do a <LINK content-type="application/json" href="."> in the
FIELD? Uh, I can already see the pitchforks in front of my office...
So... what would you do when someone came to you with a table with
jsonb columns?
|
I copy @msdemlei last question. How do I make a little more interoperable JSONB objects stores in a DB field? |
OK, that's fair. Right now we have xtype="moc" and the serialization is defined as "ascii moc from REC-MOC-2.0". Let's say a future version of the MOC standard defines "json moc" serialization and people want to put that into a VOTable. How would we handle that? xtype="jsonmoc"? xtype="jmoc"? (shudder) ... xtype="json:moc"? There are two pieces of info to convey and I kind of like the latter one. I kind of think of xtype as a URN anyway and if we thought about having something like xtype="json:moc" in future, we could then also say that xtype="json" is just a less specific base... not super expressive but more expressive than nothing. We could still encourage people to try to say more with xtype="json:foo". We would also be able to handle different serializations of types without having to make up arbitrary compound words. I wouldn't want this to turn into UCD with rules about primary and qualifiers and such... still just strings but with a style that trivially conveys multiple bits of info. In that context, to me "json:moc" looks better than "moc:json" because that's how I expect people to say/read it. thoughts? |
On Fri, Dec 01, 2023 at 09:05:44AM -0800, Patrick Dowler wrote:
Right now we have xtype="moc" and the serialization is defined as
"ascii moc from REC-MOC-2.0". Let's say a future version of the MOC
standard defines "json moc" serialization and people want to put
that into a VOTable. How would we handle that? xtype="jsonmoc"?
I'd first say let's not invent more ways to say the same thing.
encourage people to try to say more with xtype="json:foo". We would
also be able to handle different serializations of types without
Oh, please, let's just not have (more) things with more than one
serialisation in VOTable *at all*; if there's one way to say it,
that's enough. I think there is never (+/- epsilon) a benefit to
supporting multiple serialisations that outweighs the pains in
interoperability and implementability (tabledata vs. binary2 *may* be
in that epsilon).
As I said above, I do see a benefit in being able to somehow inform a
client about the inner structure of a json-serialised field (perhaps
JSON-schema, perhaps MIVOT, perhaps still something else) -- but I'd
argue very strongly against trying to shoehorn this into an xtype
string. These, I'd say, should be simple, atomic and hence
unparseable.
|
I'm not saying it's a good idea to have mutliple serializations (ascii and json, for example), just that xtype="json" by itself is not saying what kind of structure is there, just the syntax of the serialization. If we want to also use xtype to convey syntax then it opens up that can or worms... I'm against conveying syntax because to me it's very low (admittedly non-zero) value. If someone wants to convey a some json, they surely have a specific structure in mind. They should use an experimental xtype that denotes the structure, e.g. xtype="proto:foo" and document (tap_schema.columns.description) what the syntax is. That's what I would do in that case. |
On Tue, Dec 05, 2023 at 09:11:30AM -0800, Patrick Dowler wrote:
(ascii and json, for example), just that xtype="json" by itself is
not saying what kind of structure is there, just the syntax of the
I suppose our dissent basically is whether knowing something is JSON
is already useful enough by itself. I'm not sure what a good way to
settle that dissent is... can I argue that there is a media type for
json, too, and that doesn't have a defined way to convey the
structure, either?
If someone wants to convey a some json, they surely have a specific
structure in mind. They should use an experimental xtype that
denotes the structure, e.g. xtype="proto:foo" and document
(tap_schema.columns.description) what the syntax is. That's what I
would do in that case.
Oh, that may be another point of dissent: Frankly, I believe that the
overwhelming majority of what people will stuff into jsonb columns is
going to be fairly adhoc (which is why I'm not happy about the
concept either -- but since it's there, I think we need to do
*something* about it). This means that having two different json
columns with the same structure/schema will be the huge exception,
and hence we'll basically have an xtype string per json column.
If that assessment is about right, the plan to define the structure
in the xtype string in some way is a non-starter. No client will
ever bother to learn such one-shot xtypes.
*If* we really want a structure definition for json columns, it'll
have to be machine-readable and sit outside of the xtype.
Do we *need* such a mechanism to make a json xtype useful? Well,
that's back to dissent 1.
|
You could be right. I feel like I only know half of a use case, but maybe |
Hypothetical use case - a machine learning pipeline needs to add attributes to their data that contain the parameters used to train the algorithm. Their application is written in Python and their data scientists are familiar with reading and writing JSON fragments so saving the parameters as JSON strings makes sense for their project. One of the steps in their pipeline adds the training and validation parameters for their ML algorithm to the data. They then want to upload this data to a TAP service, cross match it with the remote dataset and then pass the results on to the training and validation steps of their pipeline. Markus is right, the descriptions for the columns will be specific to this pipeline and the project development team are As far as they are concerned, the only things that need to understand the JSON content of the columns are the pipeline stage that adds them and the pipeline stage that reads them. As far as the rest of the world is concerned, they are just strings. Pat's suggestion of Embedding a JSON schema for the column in the VOTable header would be limited to this VOTable instance; it would not survive the TAP upload, JOIN query and download. Equally, embedding a JSON schema for the column in TAP schema would be linked to an originating TAP service, so it couldn't be used to describe columns added by other software e.g. the ML pipeline. From Markus's comment:
In this example I don't see anything that would justify a machine readable description of the columns. The ML project itself just needs a unique string to identify the columns, which could either be column name or a simple To the outside world, the data type defines them as strings, which is enough to transport the VOTable and read the fields. No one outside the ML project is going to write code that reads the Equally, extending the IVOA data models to include things like 'machine learning training parameters' is way out scope. However, I would argue there is a case for providing some level of descriptive metadata for users outside the original project. Someone who received a VOTable of results that were generated by the ML project, doesn't necessarily know where the data originated from (result of a query on another service perhaps), and wants to understand what is in the extra fields. In which case Our standard could say :
In our ML pipline example, they could use URLs that point to pages on their own website to describe the training and validation parameters:
Everything else can treat the xtypes as opaque strings, no different to Using resolvable URLs to human readable descriptions provides a low cost way for the ML project to explain to external users what their special columns contain. Using URLs of pages on their own project website is an easy way to ensure they are globally unique, and the recipient doesn't need any special software to resolve them, just a normal web browser. |
On Wed, Dec 13, 2023 at 11:25:47AM -0800, Zarquan wrote:
> Do we *need* such a mechanism to make a json xtype useful? Well,
> that's back to dissent 1.
In this example I don't see anything that would justify a machine
readable description of the columns.
No, certainly not; in your example, a group communicates with itself,
using custom software. For them, the column name is plenty good, and
they clearly don't need any xtype at all -- which is why this example
perhaps isn't ideal to figure out what a useful feature for rather
different case of communication between people (and their machines)
that don't have any contract between them except for our standards.
However, I would argue there is a case for providing some level of
descriptive metadata for users outside the original project.
Perhaps -- but please let's not use xtype for that. xtype is
machine-readable: depending on what's there, a VOTable parser
produces polygon instances, or timestamps, or perhaps lists and
dictionaries (with a json xtype) from some less structured literal.
When instead xtype suddenly contains links to human-readable
documentation, as in...
In which case `xtype="http://www.project.org/types/foo"` might be a
useful pattern to promote.
...that will make the parsers' lives a lot more difficult.
Associating longer, human-readable documentation to a column,
meanwhile, is a use case I've been after for a long, long time (in
the form for "table footnotes") -- I think your use case could fit in
there, too.
Although:
Using URLs of pages on their own project website is an easy way to
ensure they are globally unique, and the recipient doesn't need any
special software to resolve them, just a normal web browser.
...from a curation standpoint (and in particular when I hear "ML",
which still has undertones of move fast and break things) I'd perhaps
tend to discourage *linking* such longer pieces of documentation,
because when people revisit such a VOTable a year later, chances are
the URL will dereference to a 404.
DaCHS has been using GROUP-s with FIELDref-s for table notes for
a long time now; see
https://dc.zah.uni-heidelberg.de/arihip/q/cone/scs.xml?RA=0&DEC=0&SR=1
for an example, look for "note-" in the result.
I'd be totally in on standardising that (plus perhaps an ad-hoc utype
to identify the GROUP-s as table footnotes) for this kind of use case.
But that's a different bug against (probably) a different standard, I
think.
|
This example is about communication between people that don't have any direct contract between them except for what is in this one VOTable.
... or application specific JSON encoded data like the ML parameters in this example. This doesn't exclude simply putting
Objectively, why is this more complex than Pat's suggestion of The client software does not need to parse the URL, it just treats it as an opaque string.
The URL is simply there to provide placeholder for the data scientist who added the extra columns to communicate some additional information to someone in the future. A client that doesn't know what a 'Foo' or 'Bar' is would just treat the column as a String.
Yep, I agree. It isn't ideal. Let's say it only has a 10% chance of resolving, but that is 10% more than
This is really useful stuff, but how much of it would survive a TAP upload, JOIN query and download via someone else's service ?
As your 'move fast and break things' comment suggests, some ML data scientists may indeed be less rigorous about engineering best practices. They may be in a hurry to get things done, so our advice needs to be as easy to use as possible:
It guarantees uniqueness and gives them a way to link to extra information if they want to. It just provides the mechanism. How well they use it is up to them. |
On Thu, Dec 14, 2023 at 06:56:05AM -0800, Zarquan wrote:
> When instead xtype suddenly contains links to human-readable documentation ... that will make the parsers' lives a lot more difficult.
Objectively, why is this more complex than Pat's suggestion of `prefix:foo`?
I give you it's not worse, but I didn't like Pat's prefix:foo any
better on the same grounds: you cannot put that into general-purpose
VOTable parsers. You...
The client software does not need to parse the URL, it just treats
it as an opaque string.
```
if (xtype == "point")
// Create an IVOA Point object
else if (xtype == "prefix:foo")
// Create an application specific Foo object
else if (xtype == "http://www.project.org/types/bar")
// Create an application specific Bar object
else
// Just treat it as a sring.
```
...the xtype processing is in the VOTable parser, and as a
library/program user you don't notice any of that. You just get a
(more) useful object, that's it; with xtype=json you'd just get a
parsed dict or list and wouldn't have to call your environment's
equivalent of json.loads yourself.
Which means that code would need to sit in, say, STIL or
astropy.votable, or gavo.votable. If we try to convince the
maintainers of these code bases to write some "else if", I think we
should be able to come up with a proper xtype string -- and the
chances that they'll take up some random group's convention are slim.
Let's not even imply a group could ask for having such a "custom"
xtype. I certainly don't want to have to discuss whether such code
shouldn't go into, say, astropy after all.
The URL is simply there to provide placeholder for the data
scientist who added the extra columns to communicate some
additional information to someone in the future.
Let's not mix human-readable documentation with machine-readable
metadata if we can help it.
> DaCHS has been using GROUP-s with FIELDref-s for table notes for
> a long time now;
This is really useful stuff, but how much of it would survive a TAP
upload, JOIN query and download via someone else's service ?
Let's first standardise it. Once that standard is there, it's not
terribly hard to pull a bit of extra column metadata through a TAP
upload at *at least* the level of an xtype (which means: if you write
4*foo*bar, it is likely that the engine will discard any footnotes
foo or bar may have had, but that's a reasonable thing, and you'd
almost certainly would not pull through xtypes of either foo or bar
in that situation either).
|
It would be simple to create an extendable parser that would accept handlers for additional xtypes. The idea is that the list of xtypes is not fixed, and projects can add their own custom types later.
Isn't your notes field human readable ? Getting back on subject. The initial suggestion was Yep, OK with that. Then we started asking, what if the JSON object represented an application specific complex Object? Could we represents that as Possibly not so good, because anything that could handle an application specific Foo object would probably already know Would it be better to recommend a URI structure of It works, but it misses an opportunity to say something about what the application specific Foo object is. If we use a URL rather than just a URI, it gives the user that adds the column the opportunity to point to more metadata. This doesn't exclude adding the notes in the It just says if we are going to have application specific URIs, we should recommend that they use a URL that points to something. |
If we are suggesting that We should also have
As part of the group looking at updating our standards to move away from XML, I suggest that whenever we add something to handle JSON, we should also add the same functionality to handle YAML. This doesn't add much more complexity, but it does make us think about making our standards serialization agnostic, rather than just moving from one fixed serialization to another. |
On Fri, Dec 15, 2023 at 04:10:21AM -0800, Zarquan wrote:
Isn't your notes field human readable ?
Sure -- as is DESCRIPTION, say. I'm not arguing against
human-readable metadata in VOTable at all, just against using
parts of metadata so far intended for machine consumption to somehow
sneak in human-readable parts that a machine simply can't handle.
Possibly not so good, because anything that could handle an
application specific Foo object would probably already know that it
should be serialised in JSON. So in reality, it doesn't gain us
that much more than `xtype="foo"`.
I'd look less at URIs here and more at media types that have a
similar thing, where, for instance, the media type
application/x-desise+json that we use for our custom vocabulary
format still indicate what's coming back is JSON even to clients that
don't know Desise.
I'd not struggle against such a thing very hard, but I'd say: as long
as we don't have anyone who *actually* wants to write code that would
profit from this kind of thing, let's postpone it. True, saying
"discard anything in front of a plus in xtype" would *slightly*
improve backwards compatibility for such changes, but not so
decisively that I'd find speculative features justified.
… If we use a URL rather than just a URI, it gives the user that adds
the column the opportunity to point to more metadata.
-- but then the client has no way to work out it's json in the first
place, and that's a loss, no?
|
On Fri, Dec 15, 2023 at 04:26:18AM -0800, Zarquan wrote:
If we are suggesting that `xtype="json"` is useful because it avoids the need to call `json.loads`.
We should also have `xtype="yaml"` that parses the field as YAML and returns a similar object without having to load the equivalent YAML parsing libraries.
Well, we could, but
(a) nobody's asking us for that,
(b) there's no database column type named "yaml" (though admittedly
you could serialise the content of jsonb columns as yaml -- but
why would you do that?)
(c) the benefits of another "complex data in a column" type seem
rather limited to me -- don't forget, parsers (more or less) have
to implement this, and I think they won't appreciate a stance of "we
don't want to lock in our data producers, so see how you
consumers cope".
(d) yaml support, while widespread, still is a lot more limited
that json support (e.g., no module in the Python standard
library), and
(e) yaml uses whitespace for grammar. In theory, VOTable preserves
whitespace (by virtue of TD being xs:string even in TABLEDATA).
In practice, given XML prettyprinters and all that, I'd rather
not bet on it.
So... I find json borderline already (as I said above: it's an
open invitation to skip sound data design and metadata declaration).
Yaml, however, at this point even lacks the pragmatic points that
made me decide json is somewhat on the "accept" side of the border.
|
Using a URL in the xtype allows the user to link to a schema, providing a method for including data design and metadata declaration ? |
I have heard a few people recently talking about putting JSON into VOTable fields. I can't immediately think of a reason why client code would need to know that the string data it's getting from a table is in JSON format, but it might do, and xtype would seem like the most appropriate place to include this information.
The text was updated successfully, but these errors were encountered: