Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

Open
lsf37 opened this issue Feb 15, 2015 · 8 comments
Open

[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

lsf37 opened this issue Feb 15, 2015 · 8 comments
Labels
enhancement Feature requests
Milestone

Comments

@lsf37
Copy link
Member

lsf37 commented Feb 15, 2015

Reported by briansmith on 2006-10-29 06:30 UTC
My suggestion is simple:

  1. Everywhere that the code generate currently generates zzBuffer[xxxx], it should instead generate
    zzCharAt(xxxx).

  2. The standard skeletons define zzCharAt to be:

private char zzCharAt(int index) {
  return zzBuffer[index];
}

The benefit is that, once this is done, then user-defined skeletons can totally replace zzBuffer with any data structure they choose. In particular, they can replace it with a CharSequence.

The JetBrains team has already created such a modified version of JFlex which they recommend people to use to implement on-the-fly lexical analyzis for syntax highlighting and other in-editor uses. Please see the "Implementing a Lexer" section of http://www.jetbrains.com/idea/documentation/idea\_5.0.html
for information.

My suggested variation allows for the same functionality as theirs, while remaining compatible with older JDK versions (without CharSequence). I believe, but haven't verified, that modern JVm's should have no problems inlining zzCharAt() to result in minimal performance impact. At least in my application, performance wasn't impacted.

I will attach a patch in diff -u format.

@lsf37 lsf37 changed the title replace zzBuffer[x] with a zzChar(x) [Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] Feb 15, 2015
@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by briansmith on 2006-10-29 06:30 UTC
Patch in diff -u format

@lsf37 lsf37 added the enhancement Feature requests label Feb 15, 2015
@jvolkman
Copy link

It looks like the original suggestion has become more difficult now that Emitter is calling Character methods that assume a char array (i.e., Character.offsetByCodePoints, Character.codePointAt).

An alternative approach is to modify JFlex's Emitter to require a CharSequence, and update the default skeletons to wrap an incoming char array in a lightweight CharSequence implementation backed by the array. A quick benchmark shows minimal difference in performance between array[n] and sequence.charAt(n).

import java.util.Random;

public class Test {
  static Random random = new Random();
  private static char[] createCharacters() {
    char[] characters = new char[100000];
    for (int i = 0; i < characters.length; i++) {
      characters[i] = (char)('a' + (i % 26));
    }
    return characters;
  }
  private static CharSequence createCharSequence(final char[] array) {
    return createCharSequence(array, 0, array.length);
  }
  private static CharSequence createCharSequence(final char[] array, final int min, final int max) {
    return new CharSequence() {
      @Override
      public int length() {
        return max - min;
      }
      @Override
      public char charAt(int index) {
        return array[min + index];
      }
      @Override
      public CharSequence subSequence(int start, int end) {
        return createCharSequence(array, min + start, min + end);
      }
    };
  }
  private static char testCharArray(char[] array) {
    char c = '\0';
    for (int i = 0; i < 10000000; i++) {
      c = array[random.nextInt(array.length)];
    }
    return c;
  }
  private static char testCharSequence(CharSequence sequence) {
    char c = '\0';
    for (int i = 0; i < 10000000; i++) {
      c = sequence.charAt(random.nextInt(sequence.length()));
    }
    return c;
  }
  private static void runTest(char[] array) {
    CharSequence sequence = createCharSequence(array);
    long start = System.currentTimeMillis();
    testCharArray(array);
    System.out.printf("ARRAY took %d ms\n", System.currentTimeMillis() - start);
    start = System.currentTimeMillis();
    testCharSequence(sequence);
    System.out.printf("SEQUENCE took %d ms\n", System.currentTimeMillis() - start);
  }
  public static void main(String[] args) {
    char[] array = createCharacters();
    for (int i = 0; i < 100; i++) {
      runTest(array);
    }
  }
}

Output snippet:

ARRAY took 98 ms
SEQUENCE took 101 ms
ARRAY took 98 ms
SEQUENCE took 100 ms
ARRAY took 99 ms
SEQUENCE took 101 ms
ARRAY took 100 ms
SEQUENCE took 100 ms
ARRAY took 99 ms
SEQUENCE took 100 ms

@lsf37
Copy link
Member Author

lsf37 commented Oct 26, 2015

The numbers are encouraging. I think we still need to run it in the JFlex context (i.e. a full scanner loop) to make sure, because it might influence cache-locality etc, which can have surprising effects, but it looks much better than I feared it might.

@VladRassokhin
Copy link

It seems @jvolkman benchmark spends almost all time in Random#nextInt(int).
I've recreated similar benchmark using JMH, but with pre-computed random indices array and intermediate results are:

Benchmark                       (size)  Mode  Cnt        Score       Error  Units
MyBenchmark.testCharArray     10000000  avgt   50  3999587.111 ± 43656.101  ns/op
MyBenchmark.testCharSequence  10000000  avgt   50  4591648.902 ± 66154.991  ns/op

In ms: 4.09 vs 4.59. Will try to investigate more.
Maybe this numbers are ok (don't think so) and JFlex have performance issues in other places (as any other software) ;)

@regisd
Copy link
Member

regisd commented Apr 1, 2020

I think we should prioritize this, because most of the lexers I see are "generated by JFlex 1.7.0 tweaked for IntelliJ platform" because they are IntellJ plugins.

@lsf37
Copy link
Member Author

lsf37 commented Apr 2, 2020

Happy to put this at the top of the list, although I don't think makes sense as a standard setting. Going from 4.0 to 4.6 is significant slowdown for one specific application.

We could provide it as option, though, for when flexibility matters more than performance (like here).

I do not think that IntelliJ lexers are anywhere close to the majority of applications, they are just prominent, which is fine. Very happy to support them in any case, it shouldn't be necessary to tweak the generator to use the lexers.

@VladRassokhin
Copy link

I'd suggest not to rely on my benchmark because since then everything changed a lot, new JDKs released, they may behave differently and moreover it tests only simple completely out-of-the-real-world use case. Exactlyfor one specific application as you said.
My point was to show that another person's benchmark was completely wrong.

@lsf37
Copy link
Member Author

lsf37 commented Apr 13, 2020

Good point, we should definitely add this to the benchmark suite and see what it does in context. If it's provided as an option, I'm less worried about the performance impact -- developers can then make the trade-off themselves. It's just if it's an interface change for everyone that we should be more careful.

@regisd regisd removed the performance label Jan 2, 2021
@lsf37 lsf37 modified the milestones: 1.9.0, 1.10.0 Dec 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests
Projects
None yet
Development

No branches or pull requests

4 participants