new xtype="json" #33

Open
mbtaylor opened this issue Nov 14, 2023 · 25 comments

Comments

@mbtaylor
Member

I have heard a few people recently talking about putting JSON into VOTable fields. I can't immediately think of a reason why client code would need to know that the string data it's getting from a table is in JSON format, but it might do, and xtype would seem like the most appropriate place to include this information.

@Zarquan
Member

Zarquan commented Nov 14, 2023

Would this be able to handle UTF8 strings embedded in the JSON ?
https://www.json.org/json-en.html

"A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes."

@mbtaylor
Member Author

Unicode is a bit of a mess in VOTable anyway, but it's no worse for JSON than for other text content (all of the JSON syntax characters are representable in 7-bit ASCII, i.e. look the same in UTF8 and ASCII). So you could put a JSON string in either a char or unicodeChar column.
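As a minimal sketch of that point (the record below is made up for illustration), Python's standard json module can serialise non-ASCII text either as pure 7-bit ASCII using backslash escapes, which would fit a char column, or with the characters kept literally, which would need unicodeChar:

```python
import json

# Hypothetical cell content containing non-ASCII characters.
record = {"instrument": "Ångström spectrograph", "band": "λ"}

# json.dumps defaults to ensure_ascii=True, so non-ASCII characters are
# written as \uXXXX escapes and the result is pure 7-bit ASCII.
ascii_cell = json.dumps(record)

# With ensure_ascii=False the characters are kept literally, so the
# resulting string is no longer plain ASCII.
unicode_cell = json.dumps(record, ensure_ascii=False)

# Both forms decode back to the same Python object.
same = json.loads(ascii_cell) == json.loads(unicode_cell)
```

Either way the JSON syntax characters themselves stay within ASCII; only the string payload differs.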

@msdemlei
Contributor

msdemlei commented Nov 16, 2023 via email

@pdowler
Collaborator

pdowler commented Nov 27, 2023

Generally, the presence of an xtype means "you can parse this primitive value into an ???" where ??? is the xtype value, e.g. point. Here, that comes across as "you can parse this string into a json (object)" which isn't really saying anything about the kind of content one has to grok or not.

Consider xtype="moc" which in the details says "using ascii moc serialisation from document ...". That could have been and maybe someday will be "using json moc format from document ...".

json is a serialisation format, not an extended datatype in its own right, so I'm against this. It doesn't say enough about the type of the value. I am not against serialising with json if that's the most appropriate way to deliver the structure efficiently.

@msdemlei
Contributor

msdemlei commented Nov 29, 2023 via email

@molinaro-m
Member

-- would you be less opposed to this if we required people to attach a JSON schema to a json column? Full disclosure:

Is there a way to do that? I'd really like us to find a way to allow a schema to be attached.
I might have an additional constraint: a JSON schema attached to column AND row.
I know, it gets too complicated, but it would be good to keep it in mind while evolving this...

@mbtaylor
Member Author

@pdowler wrote:

Generally, the presence of an xtype means "you can parse this primitive value into an ???" where ??? is the xtype value, e.g. point. Here, that comes across as "you can parse this string into a json (object)" which isn't really saying anything about the kind of content one has to grok or not.

Consider xtype="moc" which in the details says "using ascii moc serialisation from document ...". That could have been and maybe someday will be "using json moc format from document ...".

json is a serialisation format, not an extended datatype in its own right, so I'm against this. It doesn't say enough about the type of the value. I am not against serialising with json if that's the most appropriate way to deliver the structure efficiently.

I think you could apply all those arguments to URI and UUID, which have been introduced as xtypes in this draft. A URI doesn't say what it's the URI of. Just as you might one day want to represent a MOC in JSON, you might want to represent a MOC using its URI. So do we chuck out xtype="uri" as well?

I would say that @msdemlei 's "you can turn this [string] into something more meaningful by [using its xtype]" is the most useful criterion here.

@pdowler
Collaborator

pdowler commented Nov 30, 2023

My point is that parsing json into a programmatically accessible object yields an intermediate representation, much like a DOM with xml, and then the intermediate representation gets pulled apart to create some structured object. That object has a type that would be part of a data model and the dm implementation would likely have a class corresponding to that type, e.g. xtype="uri" -> java.net.URI in my code. Sure, xtype="uri" allows for many different things (ivorn and url would be more restricted), but it's common in a DM to have a field of type "uri".

Yes, you can parse a string into a JSONObject or a DOM... I just don't see that as "more meaningful" - it is still opaque unless you know what structure it is trying to convey.
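The two stages described here can be sketched roughly as follows. The Observation class and its fields are hypothetical and stand in for whatever data-model type the JSON is meant to convey; nothing here comes from an IVOA data model:

```python
import json
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical data-model class, invented for this illustration."""
    target: str
    exposure_s: float


cell = '{"target": "M31", "exposure_s": 120.0}'

# Stage 1: what a bare xtype="json" could promise. The result is a
# generic dict, comparable to a DOM: parsed, but still untyped.
generic = json.loads(cell)

# Stage 2: building the typed object needs knowledge of the structure,
# which the serialisation name alone does not supply.
typed = Observation(**generic)
```

Stage 1 works for any JSON object; stage 2 only works when the reader already knows which keys to expect.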

@msdemlei
Contributor

msdemlei commented Dec 1, 2023 via email

@molinaro-m
Member

I second @msdemlei's last question: how do I make JSONB objects stored in a DB field a little more interoperable?

@pdowler
Collaborator

pdowler commented Dec 1, 2023

OK, that's fair. char and * are even less useful. Let me ask a leading question...

Right now we have xtype="moc" and the serialization is defined as "ascii moc from REC-MOC-2.0". Let's say a future version of the MOC standard defines "json moc" serialization and people want to put that into a VOTable. How would we handle that? xtype="jsonmoc"? xtype="jmoc"? (shudder) ... xtype="json:moc"? There are two pieces of info to convey and I kind of like the latter one.

I kind of think of xtype as a URN anyway and if we thought about having something like xtype="json:moc" in future, we could then also say that xtype="json" is just a less specific base... not super expressive but more expressive than nothing. We could still encourage people to try to say more with xtype="json:foo". We would also be able to handle different serializations of types without having to make up arbitrary compound words. I wouldn't want this to turn into UCD with rules about primary and qualifiers and such... still just strings but with a style that trivially conveys multiple bits of info. In that context, to me "json:moc" looks better than "moc:json" because that's how I expect people to say/read it.

thoughts?

@msdemlei
Contributor

msdemlei commented Dec 5, 2023 via email

@pdowler
Collaborator

pdowler commented Dec 5, 2023

I'm not saying it's a good idea to have multiple serializations (ascii and json, for example), just that xtype="json" by itself is not saying what kind of structure is there, just the syntax of the serialization. If we want to also use xtype to convey syntax then it opens up that can of worms... I'm against conveying syntax because to me it's very low (admittedly non-zero) value.

If someone wants to convey some json, they surely have a specific structure in mind. They should use an experimental xtype that denotes the structure, e.g. xtype="proto:foo", and document (in tap_schema.columns.description) what the syntax is. That's what I would do in that case.

@msdemlei
Contributor

msdemlei commented Dec 6, 2023 via email

@pdowler
Collaborator

pdowler commented Dec 6, 2023

You could be right. I feel like I only know half of a use case, but maybe it really is ad-hoc and that is the whole use case.

@Zarquan
Member

Zarquan commented Dec 13, 2023

Hypothetical use case - a machine learning pipeline needs to add attributes to their data that contain the parameters used to train the algorithm. Their application is written in Python and their data scientists are familiar with reading and writing JSON fragments so saving the parameters as JSON strings makes sense for their project.

One of the steps in their pipeline adds the training and validation parameters for their ML algorithm to the data. They then want to upload this data to a TAP service, cross match it with the remote dataset and then pass the results on to the training and validation steps of their pipeline.

Markus is right, the descriptions for the columns will be specific to this pipeline and the project development team are unlikely to want to add complicated code to their client to add metadata that describes something that their team already knows about.

As far as they are concerned, the only things that need to understand the JSON content of the columns are the pipeline stage that adds them and the pipeline stage that reads them. As far as the rest of the world is concerned, they are just strings.

Pat's suggestion of xtype="proto:foo" provides part of the solution, it distinguishes between IVOA core xtypes (no prefix) and application specific xtypes (prefix:), but it doesn't provide a mechanism for describing the type.

Embedding a JSON schema for the column in the VOTable header would be limited to this VOTable instance; it would not survive the TAP upload, JOIN query and download.

Equally, embedding a JSON schema for the column in TAP schema would be linked to an originating TAP service, so it couldn't be used to describe columns added by other software e.g. the ML pipeline.

From Markus's comment:

If that assessment is about right, the plan to define the structure
in the xtype string in some way is a non-starter. No client will
ever bother to learn such one-shot xtypes.

If we really want a structure definition for json columns, it'll
have to be machine-readable and sit outside of the xtype.

Do we need such a mechanism to make a json xtype useful? Well,
that's back to dissent 1.

In this example I don't see anything that would justify a machine readable description of the columns.

The ML project itself just needs a unique string to identify the columns, which could either be column name or a simple xtype.

To the outside world, the data type defines them as strings, which is enough to transport the VOTable and read the fields.

No one outside the ML project is going to write code that reads the xtype metadata for arbitrary JSON columns and dynamically generate something that unpacks the JSON and interprets it. To do what ? The only software that can realistically use the content of the JSON columns will be written by members of the ML project, who already know what is in them.

Equally, extending the IVOA data models to include things like 'machine learning training parameters' is way out of scope.

However, I would argue there is a case for providing some level of descriptive metadata for users outside the original project. Someone who received a VOTable of results that were generated by the ML project, doesn't necessarily know where the data originated from (result of a query on another service perhaps), and wants to understand what is in the extra fields.

In which case xtype="http://www.project.org/types/foo" might be a useful pattern to promote.

Our standard could say :

Identifiers for standard DALI xtypes do not have prefixes.
Identifiers for user defined xtypes should be a valid URI.
Preferably, a resolvable URL that points to a resource that describes the type.
The simple form, xtype="project:foo", is sufficient for a quick prototype, but deployed production systems should use a resolvable URL e.g. xtype="http://www.project.org/types/foo".
Where "http://www.project.org/types/foo" points to a resource that describes the data content and serialization format.

In our ML pipeline example, they could use URLs that point to pages on their own website to describe the training and validation parameters:

  • datatype="unicodeChar", arraysize="*", xtype="http://www.project.org/types/training-param"
  • datatype="unicodeChar", arraysize="*", xtype="http://www.project.org/types/validation-param"

Everything else can treat the xtypes as opaque strings, no different to project:training-param and project:validation-param.

Using resolvable URLs to human readable descriptions provides a low cost way for the ML project to explain to external users what their special columns contain.

Using URLs of pages on their own project website is an easy way to ensure they are globally unique, and the recipient doesn't need any special software to resolve them, just a normal web browser.
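A rough sketch of how a tool might tell the three forms apart, assuming the rules proposed above were adopted. The function name and the category labels are made up for this illustration and are not part of DALI:

```python
from urllib.parse import urlsplit


def classify_xtype(xtype: str) -> str:
    """Sort an xtype into 'standard', 'prefixed' or 'url' (labels are illustrative)."""
    # Standard DALI xtypes would carry no prefix at all.
    if ":" not in xtype:
        return "standard"
    # A user defined xtype that is an http(s) URL with a host part could
    # be shown to a human as a link to the type description.
    parts = urlsplit(xtype)
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"
    # Anything else with a prefix is the quick-prototype form.
    return "prefixed"


kinds = [
    classify_xtype("point"),
    classify_xtype("project:training-param"),
    classify_xtype("http://www.project.org/types/training-param"),
]
```

A client that only wants to decode values never needs this classification; it would matter only to a tool that wants to offer the user a link to follow.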

@msdemlei
Contributor

msdemlei commented Dec 14, 2023 via email

@Zarquan
Member

Zarquan commented Dec 14, 2023

different case of communication between people (and their machines) that don't have any contract between them except for our standards.

This example is about communication between people that don't have any direct contract between them except for what is in this one VOTable.

xtype is machine-readable: depending on what's there, a VOTable parser produces polygon instances, or timestamps, or ....

... or application specific JSON encoded data like the ML parameters in this example.

This doesn't exclude simply putting xtype="json". This is for where we want to provide a mechanism for telling the recipient a bit more about the application specific content than just the serialization.

When instead xtype suddenly contains links to human-readable documentation ... that will make the parsers' lives a lot more difficult.

Objectively, why is this more complex than Pat's suggestion of prefix:foo?

The client software does not need to parse the URL, it just treats it as an opaque string.

if (xtype == "point")
    // Create an IVOA Point object
else if (xtype == "prefix:foo")
    // Create an application specific Foo object
else if (xtype == "http://www.project.org/types/bar")
    // Create an application specific Bar object
else
    // Just treat it as a string.

The URL is simply there to provide a placeholder for the data scientist who added the extra columns to communicate some additional information to someone in the future.

A client that doesn't know what a 'Foo' or 'Bar' is would just treat the column as a String.

if (xtype == "point")
    // Create an IVOA Point object
else
    // Just treat it as a string.

.. I'd perhaps tend to discourage linking such longer pieces of documentation, because when people revisit such a VOTable a year later, chances are the URL will dereference to a 404.

Yep, I agree. It isn't ideal. Let's say it only has a 10% chance of resolving, but that is 10% more than prefix:foo.

DaCHS has been using GROUP-s with FIELDref-s for table notes for a long time now;

This is really useful stuff, but how much of it would survive a TAP upload, JOIN query and download via someone else's service ?

I'd be totally in on standardising that (plus perhaps an ad-hoc utype to identify the GROUP-s as table footnotes) for this kind of use case.

As your 'move fast and break things' comment suggests, some ML data scientists may indeed be less rigorous about engineering best practices.

They may be in a hurry to get things done, so our advice needs to be as easy to use as possible:

  • For a custom xtype, use a URL to a page on your website

It guarantees uniqueness and gives them a way to link to extra information if they want to.

It just provides the mechanism. How well they use it is up to them.

@msdemlei
Contributor

msdemlei commented Dec 15, 2023 via email

@Zarquan
Member

Zarquan commented Dec 15, 2023

Which means that code would need to sit in, say, STIL or astropy.votable, or gavo.votable ....

It would be simple to create an extendable parser that would accept handlers for additional xtypes.

The idea is that the list of xtypes is not fixed, and projects can add their own custom types later.
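A minimal sketch of what such an extendable parser could look like. The names register_xtype, decode_cell and XTYPE_HANDLERS are invented for this illustration and are not the API of STIL, astropy.votable or gavo.votable:

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical registry mapping xtype strings to decoder functions.
XTYPE_HANDLERS: Dict[str, Callable[[str], Any]] = {}


def register_xtype(xtype: str, handler: Callable[[str], Any]) -> None:
    """Add (or replace) the decoder used for one xtype."""
    XTYPE_HANDLERS[xtype] = handler


def decode_cell(value: str, xtype: Optional[str]) -> Any:
    """Decode a string cell; unknown or missing xtypes return the raw string."""
    handler = XTYPE_HANDLERS.get(xtype) if xtype else None
    return handler(value) if handler else value


# A generic handler, plus a project specific one keyed on an opaque URL.
register_xtype("json", json.loads)
register_xtype(
    "http://www.project.org/types/training-param",
    lambda text: {"training": json.loads(text)},
)
```

The xtype is only ever compared as a whole string, so a URL costs the parser nothing extra, and a client without a matching handler still gets the plain string.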

Let's not mix human-readable documentation with machine-readable metadata if we can help it.

Isn't your notes field human readable ?

Getting back on subject.

The initial suggestion was xtype="json" would return a parsed JSON object, avoiding the need to call json.loads yourself.

Yep, OK with that.

Then we started asking: what if the JSON object represented an application specific complex object? Could we represent that as xtype="json:foo", which mixes the serialization and the object class?

Possibly not so good, because anything that could handle an application specific Foo object would probably already know that it should be serialised in JSON. So in reality, it doesn't gain us that much more than xtype="foo".

Would it be better to recommend a URI structure of xtype="prefix:foo", where anything that matched the pattern [a-z]*:.* would be considered an application specific xtype?

It works, but it misses an opportunity to say something about what the application specific Foo object is.

If we use a URL rather than just a URI, it gives the user that adds the column the opportunity to point to more metadata.

This doesn't exclude adding the notes in the FIELD metadata as you suggest.

It just says if we are going to have application specific URIs, we should recommend that they use a URL that points to something.
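For reference, a small sketch of what the pattern [a-z]*:.* accepts when read literally as a regular expression over the whole xtype string (the helper name is made up for this illustration):

```python
import re

# The pattern from the discussion, applied to the complete xtype string.
APP_XTYPE = re.compile(r"[a-z]*:.*")


def is_app_specific(xtype: str) -> bool:
    """True if the whole xtype matches the application specific pattern."""
    return APP_XTYPE.fullmatch(xtype) is not None


results = {
    "point": is_app_specific("point"),            # no colon
    "prefix:foo": is_app_specific("prefix:foo"),  # simple prefixed form
    "http://www.project.org/types/foo": is_app_specific(
        "http://www.project.org/types/foo"        # URL form
    ),
    ":foo": is_app_specific(":foo"),              # empty prefix
}
```

Read this way, the URL form already matches the same pattern as prefix:foo, so a client would not need a separate rule for it. The pattern also accepts an empty prefix such as ":foo"; if that is unwanted, [a-z]+ would be the stricter choice.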

@Zarquan
Member

Zarquan commented Dec 15, 2023

If we are suggesting that xtype="json" is useful because it avoids the need to call json.loads, then we should also have xtype="yaml" that parses the field as YAML and returns a similar object without having to load the equivalent YAML parsing libraries.

  • Anything a machine can do with JSON can also be done with YAML.
  • YAML is easier for humans to read and edit.
  • YAML can include comments and multi-line text.

As part of the group looking at updating our standards to move away from XML, I suggest that whenever we add something to handle JSON, we should also add the same functionality to handle YAML.

This doesn't add much more complexity, but it does make us think about making our standards serialization agnostic, rather than just moving from one fixed serialization to another.

@msdemlei
Contributor

msdemlei commented Dec 15, 2023 via email

@msdemlei
Contributor

msdemlei commented Dec 15, 2023 via email

@Zarquan
Member

Zarquan commented Dec 15, 2023

To keep this issue on track, I've created a separate issue for the YAML discussion, see #35.

(a) nobody's asking us for that,

I'm asking for it, because it makes us think outside the box.

The rest of the discussion in #35.

@Zarquan
Member

Zarquan commented Dec 15, 2023

it's an open invitation to skip sound data design and metadata declaration

Using a URL in the xtype allows the user to link to a schema, providing a method for including data design and metadata declaration ?
