simplify jflex grammars by using difference rather than negation #515

rmuir · 2021-12-04T16:32:01Z

This change uses a new jflex feature (jflex-de/jflex#654) to simplify emoji processing in the grammar. We can do a set difference rather than workaround it with complement + demorgan stuff.

It is cosmetic: doesn't change the resulting tokenizers (see diff), but makes the emoji parts easier to read.

bonus: major speed up to regenerating that huge UAX29UrlEmail DFA.

Before:

> Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
Aggregate task times (possibly running in parallel!):
 918.87 sec.  generateUAX29URLEmailTokenizerInternal

After:

> Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
Aggregate task times (possibly running in parallel!):
 285.26 sec.  generateUAX29URLEmailTokenizerInternal

This was suggested by jflex developers to help with the very-slow-regeneration on jflex-de/jflex#715 . It doesn't solve all of our problems there, but it makes things a lot less painful :)

rmuir · 2021-12-04T16:37:14Z

I'll take care of the precommit. There's some build wierdness in the regeneration where we pull the TLDs and make the new included-TLD.jflex after we recompile the grammar. Maybe dependencies are backwards. Anyay, I didn't really want to suck in today's new TLDs and trigger any changes to the parsers anyway, as I wanted to show this change "makes no difference".

Oh well, ill just force-regenerate everything (you really need to do it twice right now to really pull in the new TLDs) and push again.

rmuir · 2021-12-04T16:41:03Z

and don't worry about the gradle build, I think part of the issue is that this TLDs file annoyingly changed while I was iterating and working on this:

# Version 2021120400, Last Updated Sat Dec  4 07:07:01 2021 UTC

rmuir · 2021-12-04T17:15:16Z

I opened https://issues.apache.org/jira/browse/LUCENE-10285 about the TLD task dependency. For now, I just ran regenerate again and it brought in the TLD changes. It is a lot less painful at least now that it takes 1/3 of the time!

dweiss · 2021-12-04T17:23:28Z

That's a nice improvement!

rmuir · 2021-12-04T18:54:19Z

@dweiss i think it's good win just for getting a simpler grammar.

There is probably evil stuff we could do to speed up the monster. Yes, I am slightly tempted to import org.apache.lucene.util.automaton into GenerateJFlexTLDMacros.java... but I agree with your thoughts on the JFlex issue, it is better to just generate the simple "transparent" grammar of TLDs, despite how slow it makes things.

…che#515) Jflex grammars now avoid using complement operator twice as a demorgan-workaround for "macros in char classes". With the latest version of jflex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.

simplify jflex grammars by using difference rather than negation

7f3eeb3

rmuir requested a review from dweiss December 4, 2021 16:32

run regenerate twice

969ea26

rmuir merged commit 5c746db into apache:main Dec 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simplify jflex grammars by using difference rather than negation #515

simplify jflex grammars by using difference rather than negation #515

rmuir commented Dec 4, 2021

rmuir commented Dec 4, 2021

rmuir commented Dec 4, 2021

rmuir commented Dec 4, 2021

dweiss commented Dec 4, 2021

rmuir commented Dec 4, 2021

simplify jflex grammars by using difference rather than negation #515

simplify jflex grammars by using difference rather than negation #515

Conversation

rmuir commented Dec 4, 2021

rmuir commented Dec 4, 2021

rmuir commented Dec 4, 2021

rmuir commented Dec 4, 2021

dweiss commented Dec 4, 2021

rmuir commented Dec 4, 2021