Create SpdxId type for spdxId property #407
base: main
Although this would be convenient for some serializations, it would add complexity to others. In RDF, you can always check the type of the subject to find out if it is an Element. Also note that the rules for "subclassing" data types are a bit different - see https://www.w3.org/TR/xmlschema11-2/#dt-derived - I'm not sure if our current schema tooling supports deriving one datatype from another. |
This is the same issue as #377 (comment). We already have SemVer, DateTime, and MediaType datatypes in the model, and contrary to the assertion that it can't be done, they work fine, as the xsd example shows. The issue is just a matter of wording - "subclassing" gives the impression that datatypes are subclassed from something. They aren't. They are defined as root types themselves, not subclassed from anything: CreationInfo, DictionaryEntry, ExternalIdentifier and other datatypes aren't a "SubclassOf" anything. ExternalMap, ExternalReference, PositiveIntegerRange and others are a "SubclassOf" "none", which means the same thing. SemVer, DateTime and MediaType say they are "SubclassOf" xsd:string, but they aren't. They are root datatypes like other datatypes, xsd calls them "facets" that have a "base" of xsd:string (as demonstrated in the XSD in #377) but the base is not a subclass relationship. The model uses "Vocabularies" files that define xsd:string types that can take on a restricted set of values. But the model doesn't claim that Vocabularies are "SubclassOf" xsd:string. Can Vocabularies be implemented in RDF? If so, then other string-based datatypes can do it the same way. Get rid of the "SubclassOf" metadata item from all datatypes, or replace it with "Base" for simple datatypes, along with pattern or value restrictions below like Vocabularies, and the mistaken impression of subclassing goes away. (Actually Python does allow subclassing built-in types like str and int to do almost anything. But what we are modeling is much simpler, the type is exactly a native string, but with a setter method that only allows certain values. And unlike SemVer that does have a pattern that can be checked, xsd:anyURI doesn't even do any checking. The URI label is purely a semantic marker for a string that can have any value; it's up to the application to ensure that the strings it labels as URIs are usable as URIs.) 
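The "native string that only allows certain values" idea can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the regular expression is a simplified stand-in for a real SemVer pattern, and because Python strings are immutable the check is done in `__new__` rather than in a setter.

```python
import re


class SemVer(str):
    """A plain string that only accepts values matching a simplified SemVer pattern."""

    # Simplified, illustrative pattern (assumption): MAJOR.MINOR.PATCH with
    # optional pre-release and build parts. Not the pattern from the model files.
    _PATTERN = re.compile(r"\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?")

    def __new__(cls, value):
        # Reject any value that does not match the whole pattern.
        if not cls._PATTERN.fullmatch(value):
            raise ValueError(f"not a valid {cls.__name__}: {value!r}")
        return super().__new__(cls, value)


class AnyUri(str):
    """A semantic label only: no checking is performed on the value."""


version = SemVer("1.2.3")
print(isinstance(version, str), version == "1.2.3")
```

Both classes behave exactly like `str` once constructed; the only difference is which values are accepted at construction time.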
Note: see #408 for an example of why types like SpdxId are useful. xsd:string by itself is implemented as a programming language string variable ( |
I'm running into other issues with the subclassing conventions used - for example, in RDF, there is no class known as This is complex enough we should probably have a general discussion on:
|
@goneall please keep in mind that the human-editable markdown files were not meant to be used for anything else besides the spec-parser. I'm now seeing more and more attempts to parse these files, which I think is a step in the wrong direction. |
@zvr - that is exactly my point - I didn't mean to imply we are to use the markdown files for anything beyond translating to other formats via the spec parser. Since the usage of the markdown files is restricted, we have some freedom to define the semantics and syntax to meet the needs of the output schemas generated from the markdown files. As it stands, I don't think creating subclasses for data types will generate the correct OWL / SHACL, and we don't currently support XSD definitions of data types, so I don't think merging this PR would result in anything useful. Let me know if my understanding is not correct. |
@goneall @zvr: I'd ask you both to look at Relationship. This is the logical type of a relationship element, which has start and end DateTimes. The logical type has nothing to do with any particular serialization; it is the answer to the question "what is the start time of this Relationship". Regardless of whether RDF or Tag or JSON is being used, the user must be able to look at the serialized data and get the answer to that question, and it doesn't make sense to NOT have a name for what a DateTime looks like. Given the model markdown files, is the logical type the exactly correct list of questions that can be asked about a Relationship instance, not more nor fewer? The question isn't whether tooling currently supports giving names to regex patterns, the question is whether it should. It is a tooling defect to not do so. |
I agree DateTime should be defined - I'm just confused about the use of When I look at the generated OWL/SHACL file, I see:
I'm not completely sure, but I don't believe this is correct. I don't think you want to have the type to be a subclass of For the purposes of generating schemas we can validate against, it looks like we should be using SHACL shapes and not RDFS Subclasses for this purpose. I have to admit, I find this area to be rather complex and I could well be wrong. Here's one reference to a discussion on a similar topic in Stack Overflow. |
I spent some time trying to get a definitive answer on whether you can subclass data types in RDF. It looks like I was wrong - you can subclass datatypes. Based on this, I withdraw my objection and it looks like we are treating the DateTime property correctly. I wasn't able to figure this out using Google searches, but with a bit of help from Google Bard, I found an example. Based on the dialog, I'm not too embarrassed that I didn't know this from the start - it seems Google Bard was also a bit confused ;) Mostly for entertainment value, here's my Google Dialog. Prompt:
Which prompted my second question Response:
My response: Bard:
My next prompt: Bard:
My next prompt: Bard:
Prompt: Bard:
|
Now I'm not so sure - Bard may have been hallucinating - I couldn't find any spec or examples directly from the internet. @davaya @zvr @sbarnum - can you provide any references to using |
I found this W3C note that provides some context on using data types. For reference, here's the W3C description of data types in the concepts document. My reading of the above is data types need to be extensions of the XML datatypes and not using owl:subclasses. I did find this reference to extensible datatypes in OWL2. Note that these extensible datatypes can not be used with literals - so we should probably avoid them. Back to the conclusion we should not be subclassing datatypes (and an additional conclusion that Bard is hallucinating on the topic). |
The almost correct:
combined with the almost-correct correction:
gets close to my mental model of the explanation:
XSD has two kinds of classes (simple and complex) while RDF, being a graph language, further distinguishes So as long as "SubclassOf" means a base from which something else can be derived, then RDF can "Subclass" all of I've never tried Bard, but I'm quite impressed with your exchanges - thanks for doing and posting them. https://www.reddit.com/r/quotes/comments/ds44r2/the_marvel_is_not_that_the_bear_dances_well_but/ As for how to restrict a datatype class in RDF, @sbarnum will have to answer that based on the fact that RDF has both Bard's confusion is another symptom of why "JSON" people (@nishakm, @tsteenbe) say that JSON/JSON-Schema (and to a lesser extent XML/XSD) are simple to understand and use while JSON-LD, RDF/XML and Turtle should be a separate topic of discussion, not harmonized as one-or-the-other equivalents of JSON and XML. I don't think Bard would have any trouble explaining XSD facets or JSON Schema types. And for SpdxId:
(I always use the Gettysburg test whenever I see "xsd:string" - if it's OK for the data to be "Fourscore and seven years ago our fathers brought forth, upon this continent, a new nation" then it's OK to use string. Otherwise it's necessary to ask why the Gettysburg address isn't an OK value for that property, and create a SubtypeOf string that is OK. The Gettysburg address is fine as a comment, but it's probably not fine as a name.) |
"The result of Generative AI is highly plausible fiction -- which oftentimes happens to be factually correct." |
I think for RDF, my current opinion is we "restrict" our use to data types to only be pre-defined XSD schema datatypes and use SHACL to further restrict the property. For It looks like implementing pattern restrictions is still a TODO in the spec-parser, so perhaps when we implement that we can switch from using |
Back to the original PR - in RDF, the It seems a bit odd that we would have a property that has to be the same as the object URI. We would want to restrict the object URI to be a real URI (e.g. no anonymous) - not sure if there is a way to do such a restriction. @sbarnum any thoughts? |
I'm not sure the idea of Logical types and values is coming through. For each Element you need to know:
Serialization determines how the answers to those questions are represented in data. In computing, matrices can be stored in memory in row-major or column-major order - it is the identical matrix regardless of how it is laid out (serialized) in memory. Similarly, Elements can be serialized into payloads in type-major order (as 2.3 does it, with elements indexed by type (Package, File, etc) first, then by id within a type) or id-major order (with elements indexed by id first, then the type of the element is designated individually for each id). There have been discussions of which is better, but the point is that it makes no difference to the logical model.
A logical "property" is a question to be answered about an element value. Regardless of how it is serialized, an element has to have an "object URI" (an id) and a type. In RDF serialization there would not be two copies of the same id (or worse, two ids with different values). Regardless of serialization, there is one answer to the id question - the spdxId "logical property" of that element, of type SpdxId. In an in-memory representation of elements as used by a Python or Java application, a reasonable implementation would be for the spdxId to be a dictionary key. The value of that key would be all the other "logical properties"; the value wouldn't have another copy of the key. In an application with type-major internal storage it would be a dictionary with type as the key and the value could be either a dictionary or a list. Either way, the logical model is agnostic to implementation variables, just as it is serialization-agnostic - the requirement levied by the logical model is for an application to be able to read RDF data and access the values of an element and know each element's id and type.
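The id-major and type-major layouts described above can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the element IDs, types, and `name` property are invented, and the two helper functions are hypothetical, not part of any SPDX tooling.

```python
# Hypothetical elements; the IDs, types, and properties are invented for illustration.
# Id-major: the spdxId is the dictionary key and is not repeated inside the value.
id_major = {
    "urn:example:pkg1": {"type": "Package", "name": "alpha"},
    "urn:example:file1": {"type": "File", "name": "a.txt"},
    "urn:example:pkg2": {"type": "Package", "name": "beta"},
}


def to_type_major(elements):
    """Index by type first, then by id within each type."""
    out = {}
    for spdx_id, props in elements.items():
        rest = {k: v for k, v in props.items() if k != "type"}
        out.setdefault(props["type"], {})[spdx_id] = rest
    return out


def to_id_major(by_type):
    """Index by id first; the type is recorded individually for each id."""
    out = {}
    for type_name, members in by_type.items():
        for spdx_id, props in members.items():
            out[spdx_id] = {"type": type_name, **props}
    return out


type_major = to_type_major(id_major)
round_trip = to_id_major(type_major)
print(round_trip == id_major)
```

Converting in either direction loses nothing, which is the sense in which the layout makes no difference to the logical model.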
That's fine - the end result is that every property with a DateTime type has to match the pattern. An implementer would accomplish that using SHACL for RDF, XSD for XML, and JSON Schema for JSON. The logical model doesn't define implementation, it defines results. |
@davaya - My comment above is rather specific to RDF and the Ontology and Schema generated by the spec parser. In its current form, we are duplicating the ID - I'm not sure that makes sense specifically for the RDF OWL and SHACL schemas. We're currently using the following transformations: Model markdown files -> model.ttl file (an RDF Turtle serialization of the OWL and SHACL for RDF) -> Java source files. When I look at what comes out in the Java files, I'm noticing some oddities - one of which is duplicate object IDs. |
I made some progress on the sandbox code that demonstrates datatype classes. In the model files, a Format section signals that the class is a simple datatype. A simple datatype has:
For example, the DateTime class has a pattern constraint under Format:
DateTime (Model: 828388b 2023-07-05T10:09:48Z)
String is the root string class with type xsd:string, AnyUri is a subtype of String with the xsd:anyURI schema, and SpdxId is a subtype of AnyUri that inherits the xsd:anyURI schema using the schema constraint under Format:
String (Model: 828388b 2023-07-05T10:09:48Z)
AnyUri (Model: 828388b 2023-07-05T10:09:48Z)
SpdxId (Model: 828388b 2023-07-05T10:09:48Z)
Implementing simple datatype classes is simple in Python; I'm not a Java programmer but assume the details would be similar. Test code for the DateTime class produces the expected output:
I hope you are able to find the transformation bugs that are causing duplicate Object IDs; the model markdown files are working fine with the simple datatype Core classes. |
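As a rough illustration of the datatype hierarchy described in this comment (not the sandbox code itself), simple datatype classes with inherited constraints might look like the sketch below. The attribute names `schema` and `pattern` and the DateTime regular expression are assumptions made for this example; the actual pattern in the model files is not reproduced here.

```python
import re


class String(str):
    """Root string datatype; subclasses may override `schema` and `pattern`."""

    schema = "xsd:string"  # assumed attribute name for the schema constraint
    pattern = None         # assumed attribute name; None means the value is unchecked

    def __new__(cls, value):
        # Enforce the pattern constraint, if the class (or a parent) defines one.
        if cls.pattern is not None and not re.fullmatch(cls.pattern, value):
            raise ValueError(f"{value!r} does not match the {cls.__name__} pattern")
        return super().__new__(cls, value)


class AnyUri(String):
    schema = "xsd:anyURI"


class SpdxId(AnyUri):
    pass  # inherits the xsd:anyURI schema from AnyUri


class DateTime(String):
    # Assumed pattern for illustration (UTC timestamp to the second);
    # not taken from the model files.
    pattern = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"


print(SpdxId.schema)
print(DateTime("2023-07-05T10:09:48Z"))
```

In this sketch a value such as `DateTime("yesterday")` raises `ValueError`, while `SpdxId` and `AnyUri` accept any string because no pattern is defined for them.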
As above, I am against introducing a type which does not add anything to what is already available. |
The reason for defining an SpdxId datatype for xsd:anyURI is to distinguish IRIs that are SpdxIds (e.g., declaredLicense, concludedLicense, originatedBy, suppliedBy) from IRIs that are not SpdxIds (e.g., downloadLocation, packageUrl, homepage). These are all of type xsd:anyURI but they are serialized differently - namespaceMap is not used to compress IRIs that are not SpdxIds. Look at Package for a concrete example, and compare downloadLocation with concludedLicense. Naming the SpdxId datatype, just like naming the DateTime datatype, obviously adds to what would be in the model without them. No rationale has been presented for how assigning names to simple Datatypes is harmful, and the names are helpful. |
This is a long and interesting discussion, including Bard's AI opinions, but the questions to be answered are simple:
We stipulate that names are not necessary - the model can be implemented without them. The benefits are discussed above: SpdxId is used in many places, many more than DateTime. And URIs that are not SpdxIds are also used in many places. One example is Package, where every property is shown along with its type. The
In my opinion the benefits of naming Datatypes in general and SpdxId in particular are obvious and substantial. We have yet to hear any specific harms, costs, or disadvantages.
Package (Model: 8dff2a3 2023-07-28T22:15:03Z)
NOTE: in the time since this PR was submitted, datatypes have been created and moved to a different directory, and SpdxId was added and then deleted. Once a decision is made, this PR can be closed and a new PR (if approved) will create SpdxId in the Datatypes directory. |
I'll register one disadvantage. It is specific to RDF serializations. We currently make the If we create our own definitions of BTW - I don't have a strong opinion on whether this disadvantage outweighs the advantages - but it is something to consider. |
Using a semantic type for Element identifiers enables software to distinguish between IRIs used as Element IDs and IRIs used for other purposes such as Package download location, package URL, and home page.
This allows NamespaceMap to be applied to serialized Element IDs such as Relationship to and from without affecting other IRI properties.
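A small sketch of how a distinct SpdxId type lets a serializer apply a namespace map selectively. Everything here is hypothetical: the `compress` helper, the `prefix:local` output form, the namespace, and the property values are invented for illustration and do not describe how any SPDX serialization actually works.

```python
class AnyUri(str):
    """An IRI used for some purpose other than identifying an Element."""


class SpdxId(AnyUri):
    """An IRI that identifies an Element."""


def compress(value, namespace_map):
    """Shorten SpdxId values using the namespace map; leave everything else unchanged."""
    if isinstance(value, SpdxId):
        for prefix, namespace in namespace_map.items():
            if value.startswith(namespace):
                return f"{prefix}:{value[len(namespace):]}"
    return str(value)


# Hypothetical namespace and property values, invented for illustration.
namespace_map = {"ex": "https://example.org/spdx/"}
package = {
    "concludedLicense": SpdxId("https://example.org/spdx/license-1"),
    "downloadLocation": AnyUri("https://example.org/spdx/pkg.tar.gz"),
}

serialized = {name: compress(value, namespace_map) for name, value in package.items()}
print(serialized)
```

Both values start with the same namespace, but only the one typed as `SpdxId` is shortened; with a single untyped `xsd:anyURI` the serializer would have no way to tell them apart.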