Semantic tokenizer makes mistakes when a syntax tree (with category) has syntax children #456
Comments
Related issue: #90 |
I remember one of the related problems: what if the nested syntax has its own categories? For example:
lexical Ints = @category="integer" [0-9]+ !>> [0-9];
lexical Reals = @category="real" Ints "." Ints?;
In this case, you might want to have the whole thing listed as |
Setting the stage
Davy and I have discussed this a bit more. There's a design decision to be made about the intended workings of categories. There seem to be two possible semantics:
Problem
Neither of the two possible semantics always works as intended:
Current implementation
The current implementation applies mostly inner-over-outer semantics, except that it applies "nothing-over-outer" semantics (i.e., the outer category is ignored, even if there is no inner category) when the parent is
Considerations
Questions for the audience
What to do? 🙂 |
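To make the semantics discussed above concrete, here is a minimal Java sketch of how a category could be resolved while walking a parse tree. It is not the code in SemanticTokenizer.java: `Node`, `Kind`, `Policy`, and `CategorySketch` are hypothetical names, the tree model is heavily simplified, and the `CURRENT` policy encodes one reading of this issue (the outer category is dropped when both parent and child are `syntax` nodes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified tree model; the real Rascal parse trees look different.
enum Kind { SYNTAX, LEXICAL, LEAF }

enum Policy { INNER_OVER_OUTER, OUTER_OVER_INNER, CURRENT }

class Node {
    final Kind kind;
    final String category; // value of the @category tag, or null if absent
    final String text;     // non-null only for leaves
    final List<Node> children;

    private Node(Kind kind, String category, String text, Node... children) {
        this.kind = kind;
        this.category = category;
        this.text = text;
        this.children = Arrays.asList(children);
    }

    static Node syntax(String category, Node... children) {
        return new Node(Kind.SYNTAX, category, null, children);
    }

    static Node lexical(String category, Node... children) {
        return new Node(Kind.LEXICAL, category, null, children);
    }

    static Node leaf(String text) {
        return new Node(Kind.LEAF, null, text);
    }
}

class CategorySketch {
    /** Returns one "text:category" entry per leaf ("-" means no category). */
    static List<String> tokenize(Node root, Policy policy) {
        List<String> out = new ArrayList<>();
        walk(root, null, null, policy, out);
        return out;
    }

    private static void walk(Node node, Node parent, String inherited,
                             Policy policy, List<String> out) {
        String effective = resolve(node, parent, inherited, policy);
        if (node.kind == Kind.LEAF) {
            out.add(node.text + ":" + (effective == null ? "-" : effective));
            return;
        }
        for (Node child : node.children) {
            walk(child, node, effective, policy, out);
        }
    }

    private static String resolve(Node node, Node parent, String inherited, Policy policy) {
        if (policy == Policy.OUTER_OVER_INNER) {
            // The outermost category wins; inner categories only fill gaps.
            return inherited != null ? inherited : node.category;
        }
        // Assumption: "nothing-over-outer" applies to syntax-in-syntax only.
        boolean syntaxInSyntax = parent != null
                && parent.kind == Kind.SYNTAX && node.kind == Kind.SYNTAX;
        if (policy == Policy.CURRENT && syntaxInSyntax) {
            return node.category; // outer category is ignored, even if this is null
        }
        // Inner-over-outer: the innermost category wins; absence means "inherit".
        return node.category != null ? node.category : inherited;
    }
}
```

Under these assumptions, a `syntax` node with category "number" that wraps an uncategorized `syntax` child loses its category under `CURRENT` but keeps it under `INNER_OVER_OUTER`, which mirrors the symptom reported in this issue.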
For now, I'd choose inner-over-outer if it were up to me alone 😉 |
Excellent overview of the problem. Thanks for the examples. It's always been inner-over-outer, also in Eclipse. That works better because it better simulates what a token is in a normal tokenizer. Tokens typically do not overlap there, so inner-over-outer aligns more naturally with that semantics. Example 1 is a common but annoying case, for which we ask the grammar author to rewrite the grammar. A category should only be put on tokens and not on things that may become sub-tokens. The "reuse" of the
Typically, there is a common non-terminal above:
lexical Exp
  = @category="integer" Int
  | @category="real" Real
  ;
However, a lot of structure in lexicals is not always appreciated (big trees), and the so-called "reuse" is a red herring anyway. So I'd write:
lexical Exp
  = @category="integer" [0-9]+
  | @category="real" [0-9]+ "." [0-9]*
  ;
If there is common functionality to write for |
Note that it should "work" even if inner-over-outer is in effect and there are nested categories. That can be very useful for things like javadoc, which has nested tokens inside of a broader comment context. The inner-over-outer token could be used to highlight everything as comments, and add additional styling for things like
With outer-over-inner, such a thing would be nearly impossible. The grammar author would have to remove recursion and "serialize" a grammar with a grammatical form of continuation-passing style. It's been done, but it's a headache. |
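The javadoc case can be illustrated with a small Java sketch that flattens nested categorized spans into non-overlapping tokens under inner-over-outer semantics. The `Span` and `FlattenSketch` names, the offsets, and the "keyword" category for the tag are made up for illustration; this is not how the language server represents trees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical span: [start, end) offsets, optional category, nested children in order.
class Span {
    final int start;
    final int end;
    final String category; // null means "no category on this node"
    final List<Span> children;

    Span(int start, int end, String category, Span... children) {
        this.start = start;
        this.end = end;
        this.category = category;
        this.children = Arrays.asList(children);
    }
}

class FlattenSketch {
    /** Returns non-overlapping tokens as "start..end:category" strings. */
    static List<String> flatten(Span root) {
        List<String> out = new ArrayList<>();
        flatten(root, null, out);
        return out;
    }

    private static void flatten(Span span, String inherited, List<String> out) {
        // Inner-over-outer: a node's own category overrides the inherited one.
        String effective = span.category != null ? span.category : inherited;
        int cursor = span.start;
        for (Span child : span.children) {
            emit(cursor, child.start, effective, out); // gap before the child
            flatten(child, effective, out);            // the child decides for its own range
            cursor = child.end;
        }
        emit(cursor, span.end, effective, out);        // remainder after the last child
    }

    private static void emit(int from, int to, String category, List<String> out) {
        if (from < to && category != null) {
            out.add(from + ".." + to + ":" + category);
        }
    }
}
```

For the 25-character input `/** @param x the input */`, a comment span `[0, 25)` with a nested tag span `[4, 10)` flattens to three tokens: comment, keyword, comment. The outer category still colors everything the inner one does not claim, without any token overlapping another.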
How about the case of nesting? Like example 2?
The example was contrived, but it came from a more complex case at one of our customers, where there exists a family of languages, and they share a common base, but sometimes (especially for identifiers) want to give the identifier a different category. So in this example, yes I would also inline the
What kind of alias do you refer to? Can we do aliases in grammars? |
If we cannot merge or overlap regions in the VS Code back-end, then we could consider serializing the categories. Say
Let's have a look here: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#:~:text=The%20client%20capability%20overlappingTokenSupport%20defines%20whether%20tokens%20can%20overlap%20each%20other.
I think that's the only meaningful solution, and it works similarly to what we did in Eclipse with overlapping tokens. |
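For context on what the client receives: LSP semantic tokens are sent as a flat integer array with five entries per token (delta line, delta start character, length, token type index, modifier bit set), each position relative to the previous token. The Java sketch below encodes single-line tokens that way and checks whether any two of them overlap, which is what a server would need to know when the client does not announce `overlappingTokenSupport`. `Tok` and `LspEncodingSketch` are hypothetical names, modifiers are always 0 here, and the input is assumed to be sorted by position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-line token: absolute line and start character, length, type index.
class Tok {
    final int line;
    final int startChar;
    final int length;
    final int type;

    Tok(int line, int startChar, int length, int type) {
        this.line = line;
        this.startChar = startChar;
        this.length = length;
        this.type = type;
    }
}

class LspEncodingSketch {
    /** Encodes tokens (sorted by line, then start) into the relative 5-integer format. */
    static List<Integer> encode(List<Tok> sortedTokens) {
        List<Integer> data = new ArrayList<>();
        int prevLine = 0;
        int prevStart = 0;
        for (Tok t : sortedTokens) {
            int deltaLine = t.line - prevLine;
            // The start is relative to the previous token only when on the same line.
            int deltaStart = deltaLine == 0 ? t.startChar - prevStart : t.startChar;
            data.add(deltaLine);
            data.add(deltaStart);
            data.add(t.length);
            data.add(t.type);
            data.add(0); // no modifiers in this sketch
            prevLine = t.line;
            prevStart = t.startChar;
        }
        return data;
    }

    /** True if two tokens share a character; checking neighbours suffices for sorted input. */
    static boolean hasOverlap(List<Tok> sortedTokens) {
        for (int i = 1; i < sortedTokens.size(); i++) {
            Tok prev = sortedTokens.get(i - 1);
            Tok next = sortedTokens.get(i);
            if (prev.line == next.line && prev.startChar + prev.length > next.startChar) {
                return true;
            }
        }
        return false;
    }
}
```

A nested comment and tag sent as-is would make `hasOverlap` return true, whereas the same region split into three consecutive tokens would not.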
Inner-over-outer feels the most intuitive for me as well.
I remember that the "nothing-over-outer" for "syntax-in-syntax" part was there to prevent grammar developers from being forced to provide (possibly) many category annotations. Consider a syntactic nonterminal of which only some productions have a category annotation. What does the absence of the category annotation then mean? Inherit (i.e., inner-over-outer)? Explicitly no category (i.e., nothing-over-outer)? |
When semantic tokenization was developed for |
Thank you @jurgenvinju, @DavyLandman, and @rodinaarssen for contributing to the discussion.
Summary
Jurgen:
Davy:
Jurgen:
Rodin:
Thoughts
Do you agree? |
Concrete proposal for next VS Code release:
|
Is there maybe a way for a user to annotate the grammar to get the old behavior, so that they aren't forced to rewrite the grammar at this moment? |
Yes, one simple option would be to have a tag on a production that indicates "old behavior" for any subtree produced by that production. So, just put it on the start production(s) of a grammar to have old behavior for everything. Alternatively, it could be a configuration param of a language server to toggle it for all grammars hosted by that server. Not sure if these are the best long-term solutions, but at least it's low-friction for the user right now. |
I would like it if we could add some kind of deprecation warning, or a way to get the old behavior back without much pain and without making our code harder to maintain. We could then note that it will be dropped in a coming release. |
Describe the bug
Imagine a grammar of the following shape:
The semantic tokenizer will not tokenize input "foo bar" to a token with category string.
To Reproduce
Define the following function:
Expected behavior
3, 3.14, and 3r14 are highlighted as numbers
3 isn't highlighted
Screenshots
Additional context
The current Rascal grammar doesn't include categories for literals, but this absence is "patched" inside the semantic tokenizer, so each int, real, and rat should be tokenized as a number. As a result, effectively, the semantic tokenizer works with the following grammar:
IntegerLiteral does fit the shape above ("Describe the bug"), while RealLiteral and RationalLiteral don't: IntegerLiteral is syntax-in-syntax, while RealLiteral/RationalLiteral are lexical-in-syntax. The precise place in the code that causes the difference in tokenization behavior is this:
rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java
Lines 468 to 469 in 6162e0c
rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java
Lines 473 to 475 in 6162e0c