Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic tokenizer makes mistakes when a syntax tree (with category) has syntax children #456

Open
sungshik opened this issue Sep 27, 2024 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@sungshik
Copy link
Contributor

Describe the bug

Imagine a grammar of the following shape:

syntax FooBarWithCategory = @category="string" FooBar;
syntax FooBar             = "foo" "bar";

The semantic tokenizer will not tokenize input "foo bar" to a token with category string.

To Reproduce

Define the following function:

void f() {
    int  x = 3;
    real y = 3.14;
    rat  z = 3/14;
}

Expected behavior

  • Expected: 3,3.14, and 3r14 are highlighted as numbers
  • Actual: 3 isn't highlighted

Screenshots

image

Additional context

The current Rascal grammar doesn't include categories for literals, but this absence is "patched" inside the semantic tokenizer, so each int, real, and rat should be tokenized as a number. As a result, effectively, the semantic tokenizer works with the following grammar:

syntax Literal
    = @category="number" integer:  IntegerLiteral  integerLiteral 
    | @category="number" \real:    RealLiteral     realLiteral 
    | @category="number" rational: RationalLiteral rationalLiteral
    | ...
    ;

syntax  IntegerLiteral  = ...
lexical RealLiteral     = ...
lexical RationalLiteral = ...

IntegerLiteral does fit the shape above ("Describe the bug"), while RealLiteral and RationalLiteral don't: IntegerLiteral is syntax-in-syntax, while RealLiteral/RationalLiteral are lexical-in-syntax. The precise place in the code that causes the difference in tokenization behavior is this:

//Propagate current category to child unless currently in a syntax nonterminal
//*AND* the current child is a syntax nonterminal too

else if (!TreeAdapter.isChar((ITree) child) && ProductionAdapter.isSort(prod) &&
ProductionAdapter.isSort(TreeAdapter.getProduction((ITree) child))) {
collect((ITree) child, null);

@sungshik sungshik added the bug Something isn't working label Sep 27, 2024
@sungshik sungshik self-assigned this Sep 27, 2024
@sungshik
Copy link
Contributor Author

Related issue: #90

@DavyLandman
Copy link
Member

I remember one of the related problems is that what if the nested syntax has it's own categories?

For example:

lexical Ints = @category="integer" [0-9]+ !>> [0-9];
lexical Reals = @category="real" Ints "." Ints?;

Like in this case you might want to have the whole thing listed as real category. So I remember @rodinaarssen did some fixes around this code to support this nesting of categories.

@sungshik
Copy link
Contributor Author

sungshik commented Sep 27, 2024

Setting the stage

Davy and I have discussed this a bit more. There's a design decision to be made about the intended workings of categories. There seem to be two possible semantics:

  • Outer-over-inner: A category defined by a parent takes precedence over a category defined by a child. To illustrate the outer-over-inner semantics, consider the following grammar:

    // Example 1
    lexical Int  = @category="integer" [0-9]+;
    lexical Real = @category="real"    Int "." Int?;

    With outer-over-inner semantics, 314 has category integer, while 3.14 has category real (i.e., the inner category is ignored), as intended.

  • Inner-over-outer: A category defined by a child takes precedence over a category defined by a parent. To illustrate the inner-over-outer semantics, consider the following grammar:

    // Example 2
    syntax  String     = @category="string" "\"" StringElem* "\"";
    syntax  StringElem = [0-9 A-Z a-z] | "\<" Number "\>";
    lexical Number     = @category="number" [0-9]+;

    With inner-over-outer semantics, in "value: <123>", "value: < and >" have category string, while 123 has category number (i.e., the outer category is ignored), as intended.

Problem

Neither of the two possible semantics always works as intended:

  • Example 1 works as intended with outer-over-inner semantics, but not with inner-over-outer semantics.
  • Example 2 works as intended with inner-over-outer semantics, but not with outer-over-inner semantics.

Current implementation

The current implementation applies mostly inner-over-outer semantics, except that it applies "nothing-over-outer" semantics (i.e., the outer category is ignored, even if there is no inner category) when the parent is syntax, and the child is syntax, too. (E.g., as a result, integer literals in Rascal aren't highlighted.)

Considerations

  • When creating a new grammar, I believe it should be possible to express the intended categorization in one way or the other, regardless of whether inner-over-outer or outer-over-inner semantics are used (e.g., both Examples 1/2 can be rewritten to work as intended with either semantics).
  • When extending an existing grammar (without changing the existing one), though, outer-over-inner semantics seem to give more expressive power in terms of overwriting categories (in the existing grammar) that are considered unsuitable in the context of the extension.
  • However, inner-over-outer seems more "lexical" in spirt (hence, intuitive?).
  • ...

Questions for the audience

What to do? 🙂

@sungshik
Copy link
Contributor Author

What to do? 🙂

For now, I'd choose inner-over-outer if it were up to me alone 😉

@jurgenvinju
Copy link
Member

jurgenvinju commented Sep 27, 2024

Excellent overview of the problem. Thanks for the examples.

It's always been inner-over-outer, also in Eclipse. That works better because it simulates better what a token is in a normal tokenizer. Tokens typically do not overlap there and so inner-over-outer more naturally aligns with that semantics.

Example 1 is a common but annoying case for which we ask the grammar author to rewrite. A category should only be made on tokens and not on things that may become sub-tokens. The "reuse" of the Int non-terminal contradicts that idiom. And so if you want to "reuse" Int then you have to annotate higher in the tree.

Typicallye there is a common non-terminal about Int and Real where you could do that:

lexical Exp 
   = @category="integer" Int;
   |  @category="real"   Real;
   ;

However, a lot of structure in lexicals is not always appreciated (big trees), and the so-called "reuse" is a red herring anyway. So I'd write:

lexical Exp 
   = @category="integer" [0-9]+;
   |  @category="real"   [0-9]+ "." [0-9]*;
   ;

If there is common functionality to write for Int trees, you could just as well write it for [0-9]+. Or use an alias.

@jurgenvinju
Copy link
Member

jurgenvinju commented Sep 27, 2024

Note that it should "work" even if inner-over-outer is in effect and there are nested categories. That can be very useful for things like javadoc which has nested tokens inside of a broader comment context. The inner-over-outer token could be used to highlight everything as comments, and add additional styling for things like @see and @param inside.

With outer-over-inner such a thing would be nearly impossible. The grammar author would have to remove recursion and "serialize" a grammar with a grammatical form of continuation passing style. It's been done but it's a headache.

@DavyLandman
Copy link
Member

It's always been inner-over-outer, also in Eclipse. That works better because it simulates better what a token is in a normal tokenizer. Tokens typically do not overlap there and so inner-over-outer more naturally aligns with that semantics.

How about the case of nesting? Like example 2?

However, a lot of structure in lexicals is not always appreciated (big trees), and the so-called "reuse" is a red herring anyway.

The example was contrived, but it came from a more complex case at one of our customers, where there exists a family of languages, and they share a common base, but sometimes (especially for identifiers) want to give the identifier a different category. So in this example, yes I would also inline the [0-9]+, but sometimes we found we only wanted to override an existing category on 1 place, and don't want to rewrite the original grammar (or add an extra intermediate lexical)

If there is common functionality to write for Int trees, you could just as well write it for [0-9]+. Or use an alias.

What kind of alias do you refer to? Can we do aliasses in grammars?

@jurgenvinju
Copy link
Member

  • aliases in grammars are only supported by the compiler, afaik
  • inner-over-outer works by default if there is no overlap. If there is overlap, it depends on the back-end how this is mappable. In the Eclipse case, we merged the inner over the outer category. So for example if a background color was set by the outer category, and a foreground color was set by the inner, then the combined style was that background with that foreground. This works well with nested categories such as javadoc keywords like @see in comments like /** ... */

If we can not merge or overlap regions in the VScode back-end, then we could consider serializing the categories. Say outer overlaps with inner then we can create 3 regions: outer.start-inner.start, inner.start-inner.end, inner.end-outer.end.
I think this does not really lead to the desired effect; it's more complex to predict and the styles do not merge. It's probably better to ignore the outer style or find a way to merge them anyway in the back-end. The VScode tokenizer does understand nested styling...

Let's have a look here: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#:~:text=The%20client%20capability%20overlappingTokenSupport%20defines%20whether%20tokens%20can%20overlap%20each%20other.

I think that's the only meaningful solution; and it works closely to what we did in Eclipse with overlapping tokens.

@rodinaarssen
Copy link
Member

Inner-over-outer feels the most intuitive for me as well.

Current implementation

The current implementation applies mostly inner-over-outer semantics, except that it applies "nothing-over-outer" semantics (i.e., the outer category is ignored, even if there is no inner category) when the parent is syntax, and the child is syntax, too. (E.g., as a result, integer literals in Rascal aren't highlighted.)

I remember that the "nothing-over-outer" for "syntax-in-syntax" part was there to prevent grammar developers from being forced to provide (possibly) many category annotations.

Consider a syntactic nonterminal of which only some productions have a category annotation. What does the absence of the category annotation then mean? Inherit (i.e., inner-over-outer)? Explicitly no category (i.e., nothing-over-outer)?

@rodinaarssen
Copy link
Member

If we can not merge or overlap regions in the VScode back-end, then we could consider serializing the categories. Say outer overlaps with inner then we can create 3 regions: outer.start-inner.start, inner.start-inner.end, inner.end-outer.end. I think this does not really lead to the desired effect; it's more complex to predict and the styles do not merge. It's probably better to ignore the outer style or find a way to merge them anyway in the back-end. The VScode tokenizer does understand nested styling...

When semantic tokenization was developed for rascal-lsp, nested tokens were not supported. Therefore, what you describe about creating multiple regions is what is happening right now.

@sungshik
Copy link
Contributor Author

Thank you @jurgenvinju, @DavyLandman, and @rodinaarssen for contributing to the discussion.

Summary

Jurgen:

  • It has been inner-over-outer in Eclipse
  • Principle: categories should be on tokens and not sub-tokens
  • Principle: hierarchical structure in lexicals is better avoided anyway
  • Rewrite grammars according to these principles
  • Inner-over-outer is more practical when categories should be nested

Davy:

  • Rewriting grammars to abide by Jurgen's principle is not always practical (e.g., in case of families of languages with a common base)

Jurgen:

  • If tokens overlap, then their categories should be merged (as was done in Eclipse, and which VSCode seems to support)

Rodin:

  • Inner-over-outer feels the most intuitive
  • The nothing-over-outer special case (for syntax in syntax) was created to avoid having to explicitly assign NoCategory to productions whose parent(s) have a category
  • If tokens overlap, then their categories are assigned per "segment"

Thoughts

  • In principle, inner-over-outer seems preferred/most intuitive
  • However, there are concerns about special cases:
    1. Families of languages with a common base, where the categories in the base should never be overwritten by extensions
    2. A production has a category, but the majority of it child productions should not have a category
  • Overlapping of tokens requires extra care: instead of splitting them into <prefix-parent, child, suffix-parent> separate segments, the child styling should ideally be superimposed on the parent styling (to retain elements of the parent style for the child that the child doesn't set itself)

Do you agree?

@sungshik
Copy link
Contributor Author

sungshik commented Oct 16, 2024

Concrete proposal for next VS Code release:

  • For next VS Code release:
    • We adopt inner-over-outer.
    • We do not add new special code for special case i to the semantic tokenizer. This is fully backward-compatible.
    • We remove the existing special code for special case ii from the semantic tokenizer.
      • Consequence: This fixes the bug for which this issue was originally opened.
      • Consequence: Rascal users that rely on this special support (I expect not many) might need to rewrite their grammar a bit. We should state this (and give an example of a rewrite) in the release notes.
  • For future VS Code release:
    • We consider adding special support for cases i and ii in a way that Rascal users can explicitly control/configure. The principle would be: outer-over-inner by default for everything, except if the user explicitly indicates otherwise.

@DavyLandman
Copy link
Member

is there maybe a way a user can annotate the grammar to get the old behavior? So that they aren't forced to rewrite the grammar at this moment?

@sungshik
Copy link
Contributor Author

Yes, one simple option would be to have a tag on a production that indicates "old behavior" for any subtree produced by that production. So, just put it on the start production(s) of a grammar to have old behavior for everything. Alternatively, it could be a configuration param of a language server to toggle it for all grammars hosted by that server. Not sure if these are the best long-term solutions, but at least it's low-friction for the user right now.

@DavyLandman
Copy link
Member

I would like it if we could do some kind of deprecation warning, or a way to get the old behavior back without much pain. Without making our code harder to maintain.

And then note that in a coming release it will be dropped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants