Replace TokenDatatype pattern #770
Comments
IIRC using the Unicode character categories here (

This being the status quo, the main problem with the proposal as given is that it breaks backward compatibility for any data sets that already have tokens with special characters (which are of course not 'special' to their users). A secondary problem is that they can't use such characters in the future.

Depending on your requirements and planned uses for your data (anything declared as a

This leads me to ask what an actual equivalent would be, which captures all the Unicode blocks matched by

This would be very useful information even if you are just patching a schema. Whether the released schemas can be altered (compatibly) depends on whether such an equivalent exists.

@RS-Credentive IIRC did you have info bearing on this?

Note also: you could make this change in a local schema variant and you would only face problems receiving tokens using accented characters or characters in many/most writing systems....
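The backward-compatibility concern above can be checked against real data before a schema is changed. The following is a minimal sketch, assuming a hypothetical ASCII-only pattern (`ASCII_TOKEN`, which is not the pattern proposed in this issue) and made-up sample tokens. It lists the existing tokens that would stop validating.

```python
import re

# Hypothetical ASCII-only token pattern, used purely for illustration.
ASCII_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def find_incompatible_tokens(tokens):
    """Return the tokens that an ASCII-only pattern would reject."""
    return [t for t in tokens if ASCII_TOKEN.fullmatch(t) is None]


# Made-up sample data; a real audit would read tokens from existing documents.
existing_tokens = ["ac-1", "contr\u00f4le-1", "\u7ba1\u7406_1"]
print(find_incompatible_tokens(existing_tokens))
```

Any token reported here is one that current data sets may already contain and that would be rejected after such a change.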
Thanks for the insights, much appreciated!
Thanks for the hint. I already did this but opened this issue to save others the trouble. However, I think it's not easily possible to support
@wendellpiez, thanks for tagging me on this. It was indeed a challenge for me to handle \p{L} in Python. I discovered a library called "elementpath", which is part of the "xmlschema" package on PyPI. I can process the patterns in Python like this (see here):

    xml_pattern = datatype.patterns.regexps[0]
    pcre_pattern = elementpath.regex.translate_pattern(xml_pattern)

The equivalent of \p{L} is approximately (may be garbled due to cut and paste from here):

I think that Python specifically may include a lot of the XML characters in the pattern [A-Za-z], but I'm not sure that is true of every language, or that every character in the list above would be matched by [A-Za-z] in Python. Regardless, the regex [A-Za-z0-9] would not include international character sets when used in XML, as Wendell says. In other languages YMMV.
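One way to approximate an explicit equivalent of \p{L} and \p{N} for engines without property escapes is to enumerate code points by general category. The sketch below uses only the standard library and assumes that Python's `unicodedata` tables (which follow the Unicode version bundled with the interpreter) are close enough to what an XML Schema processor uses. `category_ranges` and `char_class_body` are hypothetical helper names, and the token shape at the end is illustrative only, not the schema's pattern.

```python
import re
import sys
import unicodedata


def category_ranges(prefix):
    """Collect inclusive (first, last) code point ranges whose
    Unicode general category starts with the given prefix."""
    ranges, start = [], None
    for cp in range(sys.maxunicode + 1):
        inside = unicodedata.category(chr(cp)).startswith(prefix)
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def char_class_body(ranges):
    """Render ranges as the inside of a regex character class,
    using eight-digit hexadecimal code point escapes."""
    return "".join(
        f"\\U{lo:08x}" if lo == hi else f"\\U{lo:08x}-\\U{hi:08x}"
        for lo, hi in ranges
    )


letter_ranges = category_ranges("L")
letters = char_class_body(letter_ranges)
numbers = char_class_body(category_ranges("N"))

# Illustrative token shape only: one letter, then letters or numbers.
UNICODE_TOKEN = re.compile(f"[{letters}][{letters}{numbers}]*")

print(len(letter_ranges), "letter ranges")
print(UNICODE_TOKEN.fullmatch("caf\u00e91") is not None)
```

The resulting character class is long, which suggests why a short hand-written class such as [A-Za-z] cannot stand in for \p{L}.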
@Rojax you are quite welcome, thanks again for posting. "Hints", of course, are not only for you ... trying to spell it all out for the record and other readers also, who knows? 🤔 @RS-Credentive this is very helpful indeed, thanks to you as well.
User Story:
As a Metaschema user, I want to use the OSCAL catalog schema to validate my catalog files. I use https://github.com/python-jsonschema/check-jsonschema to validate my catalog against the schema https://github.com/usnistgov/OSCAL/releases/download/v1.1.2/oscal_catalog_schema.json.
However, Python's re module, for example, does not support \p{L} and \p{N} directly. Also all other patterns in the same file are using [a-zA-Z] and [0-9] instead of \p{L} and \p{N}. That's why I'm opening the issue here and not in https://github.com/python-jsonschema/check-jsonschema.
Goals:
metaschema/schema/json/metaschema-datatypes.json
Line 123 in 2673565
I'm suggesting to replace the above line with
This way it's more consistent with other patterns and more regex validators are supported.
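The incompatibility described in the user story can be reproduced with the standard library alone. This is a minimal sketch: `compiles_with_re` is a hypothetical helper, and the two patterns are illustrative rather than copied from the schema.

```python
import re


def compiles_with_re(pattern):
    """Report whether Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # re rejects the unknown escape \p instead of treating it
        # as a Unicode property.
        return False
    return True


print(compiles_with_re(r"^\p{L}+$"))
print(compiles_with_re(r"^[a-zA-Z]+$"))
```

The third-party `regex` package on PyPI does accept \p{L}, so whether a given validator handles the current pattern depends on which regex engine it uses.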
Dependencies:
Not sure about the dependencies, because this is my first issue here.
Acceptance Criteria