Semantic tokenizer makes mistakes when a `syntax` tree (with category) has `syntax` children #456

sungshik · 2024-09-27T09:06:14Z

Describe the bug

Imagine a grammar of the following shape:

syntax FooBarWithCategory = @category="string" FooBar;
syntax FooBar             = "foo" "bar";

The semantic tokenizer will not tokenize input "foo bar" to a token with category string.

To Reproduce

Define the following function:

void f() {
    int  x = 3;
    real y = 3.14;
    rat  z = 3/14;
}

Expected behavior

Expected: 3,3.14, and 3r14 are highlighted as numbers
Actual: 3 isn't highlighted

Screenshots

Additional context

The current Rascal grammar doesn't include categories for literals, but this absence is "patched" inside the semantic tokenizer, so each int, real, and rat should be tokenized as a number. As a result, effectively, the semantic tokenizer works with the following grammar:

syntax Literal
    = @category="number" integer:  IntegerLiteral  integerLiteral 
    | @category="number" \real:    RealLiteral     realLiteral 
    | @category="number" rational: RationalLiteral rationalLiteral
    | ...
    ;

syntax  IntegerLiteral  = ...
lexical RealLiteral     = ...
lexical RationalLiteral = ...

IntegerLiteral does fit the shape above ("Describe the bug"), while RealLiteral and RationalLiteral don't: IntegerLiteral is syntax-in-syntax, while RealLiteral/RationalLiteral are lexical-in-syntax. The precise place in the code that causes the difference in tokenization behavior is this:

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 468 to 469 in 6162e0c

    
           //Propagate current category to child unless currently in a syntax nonterminal 
        
           //*AND* the current child is a syntax nonterminal too

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 473 to 475 in 6162e0c

    
           else if (!TreeAdapter.isChar((ITree) child) && ProductionAdapter.isSort(prod) && 
        
                   ProductionAdapter.isSort(TreeAdapter.getProduction((ITree) child))) { 
        
               collect((ITree) child, null);

The text was updated successfully, but these errors were encountered:

sungshik · 2024-09-27T09:25:58Z

Related issue: #90

DavyLandman · 2024-09-27T09:33:40Z

I remember one of the related problems is that what if the nested syntax has it's own categories?

For example:

lexical Ints = @category="integer" [0-9]+ !>> [0-9];
lexical Reals = @category="real" Ints "." Ints?;

Like in this case you might want to have the whole thing listed as real category. So I remember @rodinaarssen did some fixes around this code to support this nesting of categories.

sungshik · 2024-09-27T12:02:49Z

Setting the stage

Davy and I have discussed this a bit more. There's a design decision to be made about the intended workings of categories. There seem to be two possible semantics:

Outer-over-inner: A category defined by a parent takes precedence over a category defined by a child. To illustrate the outer-over-inner semantics, consider the following grammar:
```
// Example 1
lexical Int  = @category="integer" [0-9]+;
lexical Real = @category="real"    Int "." Int?;
```
With outer-over-inner semantics, 314 has category integer, while 3.14 has category real (i.e., the inner category is ignored), as intended.
Inner-over-outer: A category defined by a child takes precedence over a category defined by a parent. To illustrate the inner-over-outer semantics, consider the following grammar:
```
// Example 2
syntax  String     = @category="string" "\"" StringElem* "\"";
syntax  StringElem = [0-9 A-Z a-z] | "\<" Number "\>";
lexical Number     = @category="number" [0-9]+;
```
With inner-over-outer semantics, in "value: <123>", "value: < and >" have category string, while 123 has category number (i.e., the outer category is ignored), as intended.

Problem

Neither of the two possible semantics always works as intended:

Example 1 works as intended with outer-over-inner semantics, but not with inner-over-outer semantics.
Example 2 works as intended with inner-over-outer semantics, but not with outer-over-inner semantics.

Current implementation

The current implementation applies mostly inner-over-outer semantics, except that it applies "nothing-over-outer" semantics (i.e., the outer category is ignored, even if there is no inner category) when the parent is syntax, and the child is syntax, too. (E.g., as a result, integer literals in Rascal aren't highlighted.)

Considerations

When creating a new grammar, I believe it should be possible to express the intended categorization in one way or the other, regardless of whether inner-over-outer or outer-over-inner semantics are used (e.g., both Examples 1/2 can be rewritten to work as intended with either semantics).
When extending an existing grammar (without changing the existing one), though, outer-over-inner semantics seem to give more expressive power in terms of overwriting categories (in the existing grammar) that are considered unsuitable in the context of the extension.
However, inner-over-outer seems more "lexical" in spirt (hence, intuitive?).
...

Questions for the audience

What to do? 🙂

sungshik · 2024-09-27T12:19:12Z

What to do? 🙂

For now, I'd choose inner-over-outer if it were up to me alone 😉

jurgenvinju · 2024-09-27T15:21:13Z

Excellent overview of the problem. Thanks for the examples.

It's always been inner-over-outer, also in Eclipse. That works better because it simulates better what a token is in a normal tokenizer. Tokens typically do not overlap there and so inner-over-outer more naturally aligns with that semantics.

Example 1 is a common but annoying case for which we ask the grammar author to rewrite. A category should only be made on tokens and not on things that may become sub-tokens. The "reuse" of the Int non-terminal contradicts that idiom. And so if you want to "reuse" Int then you have to annotate higher in the tree.

Typicallye there is a common non-terminal about Int and Real where you could do that:

lexical Exp 
   = @category="integer" Int;
   |  @category="real"   Real;
   ;

However, a lot of structure in lexicals is not always appreciated (big trees), and the so-called "reuse" is a red herring anyway. So I'd write:

lexical Exp 
   = @category="integer" [0-9]+;
   |  @category="real"   [0-9]+ "." [0-9]*;
   ;

If there is common functionality to write for Int trees, you could just as well write it for [0-9]+. Or use an alias.

jurgenvinju · 2024-09-27T15:27:40Z

Note that it should "work" even if inner-over-outer is in effect and there are nested categories. That can be very useful for things like javadoc which has nested tokens inside of a broader comment context. The inner-over-outer token could be used to highlight everything as comments, and add additional styling for things like @see and @param inside.

With outer-over-inner such a thing would be nearly impossible. The grammar author would have to remove recursion and "serialize" a grammar with a grammatical form of continuation passing style. It's been done but it's a headache.

DavyLandman · 2024-09-30T10:49:19Z

It's always been inner-over-outer, also in Eclipse. That works better because it simulates better what a token is in a normal tokenizer. Tokens typically do not overlap there and so inner-over-outer more naturally aligns with that semantics.

How about the case of nesting? Like example 2?

However, a lot of structure in lexicals is not always appreciated (big trees), and the so-called "reuse" is a red herring anyway.

The example was contrived, but it came from a more complex case at one of our customers, where there exists a family of languages, and they share a common base, but sometimes (especially for identifiers) want to give the identifier a different category. So in this example, yes I would also inline the [0-9]+, but sometimes we found we only wanted to override an existing category on 1 place, and don't want to rewrite the original grammar (or add an extra intermediate lexical)

If there is common functionality to write for Int trees, you could just as well write it for [0-9]+. Or use an alias.

What kind of alias do you refer to? Can we do aliasses in grammars?

jurgenvinju · 2024-10-09T14:12:29Z

aliases in grammars are only supported by the compiler, afaik
inner-over-outer works by default if there is no overlap. If there is overlap, it depends on the back-end how this is mappable. In the Eclipse case, we merged the inner over the outer category. So for example if a background color was set by the outer category, and a foreground color was set by the inner, then the combined style was that background with that foreground. This works well with nested categories such as javadoc keywords like @see in comments like /** ... */

If we can not merge or overlap regions in the VScode back-end, then we could consider serializing the categories. Say outer overlaps with inner then we can create 3 regions: outer.start-inner.start, inner.start-inner.end, inner.end-outer.end.
I think this does not really lead to the desired effect; it's more complex to predict and the styles do not merge. It's probably better to ignore the outer style or find a way to merge them anyway in the back-end. The VScode tokenizer does understand nested styling...

Let's have a look here: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#:~:text=The%20client%20capability%20overlappingTokenSupport%20defines%20whether%20tokens%20can%20overlap%20each%20other.

I think that's the only meaningful solution; and it works closely to what we did in Eclipse with overlapping tokens.

rodinaarssen · 2024-10-14T13:41:26Z

Inner-over-outer feels the most intuitive for me as well.

Current implementation

The current implementation applies mostly inner-over-outer semantics, except that it applies "nothing-over-outer" semantics (i.e., the outer category is ignored, even if there is no inner category) when the parent is syntax, and the child is syntax, too. (E.g., as a result, integer literals in Rascal aren't highlighted.)

I remember that the "nothing-over-outer" for "syntax-in-syntax" part was there to prevent grammar developers from being forced to provide (possibly) many category annotations.

Consider a syntactic nonterminal of which only some productions have a category annotation. What does the absence of the category annotation then mean? Inherit (i.e., inner-over-outer)? Explicitly no category (i.e., nothing-over-outer)?

rodinaarssen · 2024-10-14T13:44:09Z

If we can not merge or overlap regions in the VScode back-end, then we could consider serializing the categories. Say outer overlaps with inner then we can create 3 regions: outer.start-inner.start, inner.start-inner.end, inner.end-outer.end. I think this does not really lead to the desired effect; it's more complex to predict and the styles do not merge. It's probably better to ignore the outer style or find a way to merge them anyway in the back-end. The VScode tokenizer does understand nested styling...

When semantic tokenization was developed for rascal-lsp, nested tokens were not supported. Therefore, what you describe about creating multiple regions is what is happening right now.

sungshik · 2024-10-15T10:00:44Z

Thank you @jurgenvinju, @DavyLandman, and @rodinaarssen for contributing to the discussion.

Summary

Jurgen:

It has been inner-over-outer in Eclipse
Principle: categories should be on tokens and not sub-tokens
Principle: hierarchical structure in lexicals is better avoided anyway
Rewrite grammars according to these principles
Inner-over-outer is more practical when categories should be nested

Davy:

Rewriting grammars to abide by Jurgen's principle is not always practical (e.g., in case of families of languages with a common base)

Jurgen:

If tokens overlap, then their categories should be merged (as was done in Eclipse, and which VSCode seems to support)

Rodin:

Inner-over-outer feels the most intuitive
The nothing-over-outer special case (for syntax in syntax) was created to avoid having to explicitly assign NoCategory to productions whose parent(s) have a category
If tokens overlap, then their categories are assigned per "segment"

Thoughts

In principle, inner-over-outer seems preferred/most intuitive
However, there are concerns about special cases:
1. Families of languages with a common base, where the categories in the base should never be overwritten by extensions
2. A production has a category, but the majority of it child productions should not have a category
Overlapping of tokens requires extra care: instead of splitting them into <prefix-parent, child, suffix-parent> separate segments, the child styling should ideally be superimposed on the parent styling (to retain elements of the parent style for the child that the child doesn't set itself)

Do you agree?

sungshik · 2024-10-16T10:02:39Z

Concrete proposal for next VS Code release:

For next VS Code release:
- We adopt inner-over-outer.
- We do not add new special code for special case i to the semantic tokenizer. This is fully backward-compatible.
- We remove the existing special code for special case ii from the semantic tokenizer.
  - Consequence: This fixes the bug for which this issue was originally opened.
  - Consequence: Rascal users that rely on this special support (I expect not many) might need to rewrite their grammar a bit. We should state this (and give an example of a rewrite) in the release notes.
For future VS Code release:
- We consider adding special support for cases i and ii in a way that Rascal users can explicitly control/configure. The principle would be: outer-over-inner by default for everything, except if the user explicitly indicates otherwise.

DavyLandman · 2024-10-16T10:29:37Z

is there maybe a way a user can annotate the grammar to get the old behavior? So that they aren't forced to rewrite the grammar at this moment?

sungshik · 2024-10-16T11:22:26Z

Yes, one simple option would be to have a tag on a production that indicates "old behavior" for any subtree produced by that production. So, just put it on the start production(s) of a grammar to have old behavior for everything. Alternatively, it could be a configuration param of a language server to toggle it for all grammars hosted by that server. Not sure if these are the best long-term solutions, but at least it's low-friction for the user right now.

DavyLandman · 2024-10-16T12:55:05Z

I would like it if we could do some kind of deprecation warning, or a way to get the old behavior back without much pain. Without making our code harder to maintain.

And then note that in a coming release it will be dropped.

sungshik added the bug Something isn't working label Sep 27, 2024

sungshik self-assigned this Sep 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Semantic tokenizer makes mistakes when a `syntax` tree (with category) has `syntax` children #456

Semantic tokenizer makes mistakes when a `syntax` tree (with category) has `syntax` children #456

sungshik commented Sep 27, 2024

sungshik commented Sep 27, 2024

DavyLandman commented Sep 27, 2024

sungshik commented Sep 27, 2024 •

edited

Loading

sungshik commented Sep 27, 2024

jurgenvinju commented Sep 27, 2024 •

edited

Loading

jurgenvinju commented Sep 27, 2024 •

edited

Loading

DavyLandman commented Sep 30, 2024

jurgenvinju commented Oct 9, 2024

rodinaarssen commented Oct 14, 2024

Current implementation

rodinaarssen commented Oct 14, 2024

sungshik commented Oct 15, 2024

sungshik commented Oct 16, 2024 •

edited

Loading

DavyLandman commented Oct 16, 2024

sungshik commented Oct 16, 2024

DavyLandman commented Oct 16, 2024

Semantic tokenizer makes mistakes when a syntax tree (with category) has syntax children #456

Semantic tokenizer makes mistakes when a syntax tree (with category) has syntax children #456

Comments

sungshik commented Sep 27, 2024

sungshik commented Sep 27, 2024

DavyLandman commented Sep 27, 2024

sungshik commented Sep 27, 2024 • edited Loading

Setting the stage

Problem

Current implementation

Considerations

Questions for the audience

sungshik commented Sep 27, 2024

jurgenvinju commented Sep 27, 2024 • edited Loading

jurgenvinju commented Sep 27, 2024 • edited Loading

DavyLandman commented Sep 30, 2024

jurgenvinju commented Oct 9, 2024

rodinaarssen commented Oct 14, 2024

Current implementation

rodinaarssen commented Oct 14, 2024

sungshik commented Oct 15, 2024

Summary

Thoughts

sungshik commented Oct 16, 2024 • edited Loading

DavyLandman commented Oct 16, 2024

sungshik commented Oct 16, 2024

DavyLandman commented Oct 16, 2024

Semantic tokenizer makes mistakes when a `syntax` tree (with category) has `syntax` children #456

Semantic tokenizer makes mistakes when a `syntax` tree (with category) has `syntax` children #456

sungshik commented Sep 27, 2024 •

edited

Loading

jurgenvinju commented Sep 27, 2024 •

edited

Loading

jurgenvinju commented Sep 27, 2024 •

edited

Loading

sungshik commented Oct 16, 2024 •

edited

Loading