Replace TokenDatatype pattern #770

Rojax · 2024-10-11T10:43:29Z

User Story:

As a Metaschema user, I want to use the OSCAL catalog schema to validate my catalog files. I use https://github.com/python-jsonschema/check-jsonschema to validate my catalog against the schema https://github.com/usnistgov/OSCAL/releases/download/v1.1.2/oscal_catalog_schema.json.

However, Python's re module, for example, does not support \p{L} and \p{N} directly.

Error: schemafile was not valid: '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$' is not a 'regex'
Failed validating 'format' in metaschema['properties']['definitions']['additionalProperties']['properties']['pattern']:
    {'type': 'string', 'format': 'regex'}
On schema['definitions']['TokenDatatype']['pattern']:
    '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$'
SchemaError: '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$' is not a 'regex'
Failed validating 'format' in metaschema['properties']['definitions']['additionalProperties']['properties']['pattern']:
    {'type': 'string', 'format': 'regex'}
On schema['definitions']['TokenDatatype']['pattern']:
    '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$'

Also, all other patterns in the same file use [a-zA-Z] and [0-9] instead of \p{L} and \p{N}. That's why I'm opening the issue here and not in <https://github.com/python-jsonschema/check-jsonschema>.
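
The failure can be reproduced without check-jsonschema. The following is a minimal sketch, assuming the 'regex' format check ultimately relies on Python's built-in re module (which the error above suggests but does not prove). The helper name compile_error is made up for illustration.

```python
import re

# The TokenDatatype pattern from the schema, after JSON unescaping.
TOKEN_PATTERN = r"^(\p{L}|_)(\p{L}|\p{N}|[.\-_])*$"

# The ASCII-only replacement suggested in this issue.
ASCII_PATTERN = r"^([a-zA-Z_])([a-zA-Z0-9.\-_])*$"


def compile_error(pattern):
    """Return the re.error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


for name, pattern in [("schema", TOKEN_PATTERN), ("proposed", ASCII_PATTERN)]:
    print(name, "->", compile_error(pattern) or "compiles")
```

On the Python versions available when this issue was opened, re treats \p as an unknown escape, so the schema pattern is rejected at compile time, before any catalog data is examined.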

Goals:

metaschema/schema/json/metaschema-datatypes.json

Line 123 in 2673565

"pattern": "^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$"

I'm suggesting to replace the above line with

"pattern": "^([a-zA-Z_])([a-zA-Z0-9.\\-_])*$"

This way it is more consistent with the other patterns and is accepted by more regex validators.

Dependencies:

Not sure about the dependencies, because this is my first issue here.

Acceptance Criteria

All website and readme documentation affected by the changes in this issue have been updated. Changes to the website can be made in the docs/content directory of your branch.
A Pull Request (PR) is submitted that fully addresses the goals of this User Story. This issue is referenced in the PR.
The CI-CD build process runs without any reported errors on the PR. This can be confirmed by reviewing that all checks have passed in the PR.


wendellpiez · 2024-10-16T17:20:43Z

IIRC using the Unicode character categories here (\p{L} and \p{N}) was deliberate inasmuch as we wanted tokens (which are sometimes user-facing) to support all Unicode 'letters' and 'numbers', not just those matching [A-Za-z0-9]+ ('lower ASCII'). Otherwise tokens are as tight as we thought we could make them, to align with the XML 'name' construct (as being a more restricted value space than keys in JSON).

This being the status quo, the main problem with the proposal as given is that it breaks backward compatibility for any data sets that already have tokens with special characters (which are of course not 'special' to their users). A secondary problem is that they can't use such characters in the future. Depending on your requirements and planned uses for your data (anything declared as a token) this is or is not a real problem.
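
To make the compatibility concern concrete, here is a small sketch that runs the proposed ASCII-only pattern against a few token values. The sample tokens are hypothetical and chosen only for illustration; they are not taken from any published OSCAL data set.

```python
import re

# Pattern proposed in this issue, after JSON unescaping.
ASCII_TOKEN_RE = re.compile(r"^([a-zA-Z_])([a-zA-Z0-9.\-_])*$")

# Hypothetical token values, for illustration only.
samples = ["control-1", "_internal.id", "contrôle", "名前", "1st"]

# fullmatch avoids the special case where "$" matches before a trailing newline.
results = {s: ASCII_TOKEN_RE.fullmatch(s) is not None for s in samples}

for token, ok in results.items():
    print(f"{token!r}: {'accepted' if ok else 'rejected'}")
```

The two non-ASCII samples consist entirely of Unicode letters, so they satisfy the \p{L}-based pattern but fail the ASCII one. The last sample starts with a digit and is rejected by both.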

This leads me to ask what an actual equivalent would be, which captures all the Unicode blocks matched by \p{L} and \p{N}, and which more libraries (or any preferred library) would support?

This would be very useful information even if you are just patching a schema. Whether the released schemas can be altered (compatibly) depends on whether such an equivalent exists.

@RS-Credentive IIRC did you have info bearing on this?

Note also: you could make this change in a local schema variant and you would only face problems receiving tokens using accented characters or characters in many/most writing systems....

Rojax · 2024-10-17T07:05:04Z

Thanks for the insights, much appreciated!

> Note also: you could make this change in a local schema variant and you would only face problems receiving tokens using accented characters or characters in many/most writing systems....

Thanks for the hint. I already did this but opened this issue to save others the trouble.

However, I don't think \p{L} and \p{N} can easily be supported by replacing them with "custom ranges". It might therefore really be a better option to use another regex engine, such as https://pypi.org/project/regex/, which supports \p{L} and \p{N}. After more research, prompted by your reply, I also found a related issue: python-jsonschema/check-jsonschema#353.
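
As a rough sketch of that option, the third-party regex package accepts the schema's pattern unchanged. The helper is_token is a made-up name, and the import is guarded so the snippet still loads where the package is not installed. This only shows that the pattern itself is usable; how check-jsonschema could be made to use such an engine is what the linked issue is about.

```python
try:
    import regex  # third-party package, not the standard library's re
except ImportError:  # keep the sketch importable without the dependency
    regex = None

# The TokenDatatype pattern as it appears in the schema, after JSON unescaping.
TOKEN_PATTERN = r"^(\p{L}|_)(\p{L}|\p{N}|[.\-_])*$"


def is_token(value):
    """Return True if value matches the unmodified TokenDatatype pattern."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    return regex.fullmatch(TOKEN_PATTERN, value) is not None
```

With the package installed, is_token("contrôle") should return True, whereas the ASCII-only pattern would reject the same value.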

RS-Credentive · 2024-10-17T08:48:46Z

@wendellpiez, thanks for tagging me on this. It was indeed a challenge for me to handle \p{L} in Python. I discovered a library called "elementpath", which is used by the "xmlschema" package on PyPI.

I can process the patterns in Python like this (see here):

            xml_pattern = datatype.patterns.regexps[0]
            pcre_pattern = elementpath.regex.translate_pattern(xml_pattern)

The equivalent of \p{L} is approximately the following (it may be garbled due to cut and paste from here):

A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮٯٱ-ۓەۥۦۮۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࢽऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএঐও-নপ-রলশ-হঽৎড়ঢ়য়-ৡৰৱৼਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલળવ-હઽૐૠૡૹଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହଽଡ଼ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠೡೱೲഅ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาำเ-ๆກຂຄຆ-ຊຌ-ຣລວ-ະາຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々〆〱-〵〻〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆺㇰ-ㇿ㐀-䶵一-鿯ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-Ᶎꟷ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭧꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ﬀ-ﬆﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּסּףּפּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼ

I think that Python specifically may include a lot of these characters in the pattern [A-Za-z], but I'm not sure that is true of every language, or that every character in the list above would be matched by [A-Za-z] in Python. Regardless, the regex [A-Za-z0-9] would not include international character sets when used in XML, as Wendell says.

In other languages YMMV
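
For anyone who wants to regenerate a list like the one above rather than copy it, here is a sketch using only the Python standard library. It derives the ranges from unicodedata, so the result depends on the Unicode version bundled with the interpreter and may differ from what elementpath produces. The function names are made up for illustration, and the escapes it emits are in Python's syntax, so they would need converting before being placed in a JSON schema.

```python
import re
import sys
import unicodedata


def category_ranges(prefix):
    """Collect inclusive (first, last) code point ranges whose Unicode
    general category starts with prefix (for example "L" or "N")."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def char_class_body(ranges):
    """Render ranges as the body of a regex character class,
    writing every code point as an 8-digit escape."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(f"\\U{first:08x}")
        else:
            parts.append(f"\\U{first:08x}-\\U{last:08x}")
    return "".join(parts)


LETTER_RANGES = category_ranges("L")
LETTERS = char_class_body(LETTER_RANGES)
NUMBERS = char_class_body(category_ranges("N"))

# Same shape as the TokenDatatype pattern, with \p{L} and \p{N} expanded.
TOKEN_RE = re.compile(f"^([{LETTERS}_])([{LETTERS}{NUMBERS}.\\-_])*$")
```

The first two letter ranges come out as A-Z and a-z, matching the start of the pasted list. The expanded pattern is long, which is the practical cost of avoiding \p{L} and \p{N}.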

wendellpiez · 2024-10-17T13:55:49Z

@Rojax you are quite welcome, thanks again for posting.

"Hints", of course, are not only for you ... trying to spell it all out for the record and other readers also, who knows? 🤔

@RS-Credentive this is very helpful indeed, thanks to you as well.

Rojax added the enhancement New feature or request label Oct 11, 2024

Rojax mentioned this issue Oct 17, 2024

Support ECMAScript unicode-mode RegExp usage for 'pattern' and 'patternProperties' python-jsonschema/check-jsonschema#353

Open
