Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify jflex grammars by using difference rather than negation #515

Merged
merged 2 commits into from
Dec 7, 2021

Conversation

rmuir
Copy link
Member

@rmuir rmuir commented Dec 4, 2021

This change uses a new jflex feature (jflex-de/jflex#654) to simplify emoji processing in the grammar. We can do a set difference rather than workaround it with complement + demorgan stuff.

It is cosmetic: doesn't change the resulting tokenizers (see diff), but makes the emoji parts easier to read.

bonus: major speed up to regenerating that huge UAX29UrlEmail DFA.

Before:

> Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
Aggregate task times (possibly running in parallel!):
 918.87 sec.  generateUAX29URLEmailTokenizerInternal

After:

> Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
Aggregate task times (possibly running in parallel!):
 285.26 sec.  generateUAX29URLEmailTokenizerInternal

This was suggested by jflex developers to help with the very-slow-regeneration on jflex-de/jflex#715 . It doesn't solve all of our problems there, but it makes things a lot less painful :)

@rmuir rmuir requested a review from dweiss December 4, 2021 16:32
@rmuir
Copy link
Member Author

rmuir commented Dec 4, 2021

I'll take care of the precommit. There's some build wierdness in the regeneration where we pull the TLDs and make the new included-TLD.jflex after we recompile the grammar. Maybe dependencies are backwards. Anyay, I didn't really want to suck in today's new TLDs and trigger any changes to the parsers anyway, as I wanted to show this change "makes no difference".

Oh well, ill just force-regenerate everything (you really need to do it twice right now to really pull in the new TLDs) and push again.

@rmuir
Copy link
Member Author

rmuir commented Dec 4, 2021

and don't worry about the gradle build, I think part of the issue is that this TLDs file annoyingly changed while I was iterating and working on this:

# Version 2021120400, Last Updated Sat Dec  4 07:07:01 2021 UTC

@rmuir
Copy link
Member Author

rmuir commented Dec 4, 2021

I opened https://issues.apache.org/jira/browse/LUCENE-10285 about the TLD task dependency. For now, I just ran regenerate again and it brought in the TLD changes. It is a lot less painful at least now that it takes 1/3 of the time!

@dweiss
Copy link
Contributor

dweiss commented Dec 4, 2021

That's a nice improvement!

@rmuir
Copy link
Member Author

rmuir commented Dec 4, 2021

@dweiss i think it's good win just for getting a simpler grammar.

There is probably evil stuff we could do to speed up the monster. Yes, I am slightly tempted to import org.apache.lucene.util.automaton into GenerateJFlexTLDMacros.java... but I agree with your thoughts on the JFlex issue, it is better to just generate the simple "transparent" grammar of TLDs, despite how slow it makes things.

@rmuir rmuir merged commit 5c746db into apache:main Dec 7, 2021
rmuir added a commit to rmuir/lucene that referenced this pull request Dec 7, 2021
…che#515)

Jflex grammars now avoid using complement operator twice as a demorgan-workaround for "macros in char classes". With the latest version of jflex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants