Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

dweiss · 2020-01-27T09:46:35Z

Apache Lucene has a jflex definition file (UAX29URLEmailTokenizerImpl.jflex) that is 21kb and includes some other jflex files that add ~40kb. The generated file (DFA) is 2.2MB so it's still relatively small. The generation process takes a whooping 10 minutes and requires 10 gigs of ram though – most of it is spent in constructing NFA and computing epsilon closures.

I though it'd be interesting to look into why this behaves this way. Perhaps it'd provide some clues for optimization.

You can reproduce construction on Lucene repository (https://github.com/apache/lucene-solr/)

java -jar c:\_tmp\jflex-1.7.0\lib\jflex-full-1.7.0.jar --verbose --encoding "UTF-8" lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex --skel lucene/core/src/data/jflex/skeleton.disable.buffer.expansion.txt

The text was updated successfully, but these errors were encountered:

dweiss · 2020-01-27T11:45:59Z

Seems like most of the problem is caused by this:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/ASCIITLD.jflex-macro

I don't know if there is any other replacement that wouldn't explode the NFA internally...

dweiss · 2020-01-30T14:53:11Z

Right... I think we're experiencing this exactly.

  /**
   * Constructs an NFA accepting the complement of the language of a given NFA.
   *
   * <p>Converts the NFA into a DFA, then negates that DFA. Exponential state blowup possible and
   * common.
...
  private IntPair complement(IntPair nfa) {

lsf37 · 2020-02-03T00:55:26Z

Sorry for the silence, was mostly offline last week. I didn't directly spot any negation operator (!) in ASCIITLD.jflex-macro, but I might have overlooked it. Or is it in a spec that includes this one?

Basically, the NFA->DFA transformation is always potentially exponential, and the negation operator includes one. There are cases that must be exponential, because even the minimal DFA contains exponentially more states than the NFA, and there are cases where it's "just" exponential in the construction. 2.2MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10GB it's tiny, so at least some part should be reducible.

One thing that is notable in ASCIITLD.jflex-macro is that it's a very large | of basically strings. The | constructs an NFA for each part and then connects them all to an initial state by epsilon transitions. This would explain why you are seeing a lot of time spent in epsilon closures, and it also explains the state explosion, because the result of that are basically DFA states that track which set of these alternatives are still possible. The size of these sets is potentially roughly exponential in the number of alternatives and maybe their length, depending on how soon the alternatives diverge from each other.

One thing to try would be to factor out common prefixes, e.g.

| [aA][bB][aA][rR][tT][hH]
| [aA][bB][bB]
| [aA][bB][bB][oO][tT][tT]
| [aA][bB][bB][vV][iI][eE]

=

 [aA][bB] ([aA][rR][tT][hH] | [bB] ([oO][tT][tT] | [vV][iI][eE])?

This should reduce the number of DFA states in initial construction, because the first character in the alternative already distinguishes which one it is. It basically does something that DFA minimisation would take care of later. The spec is big, of course, and manually transforming it will be error prone, so before even attempting it, it would be worth checking out if it helps enough.

The other question is if there is a negation operator on the big expression, if that operator could be somehow avoided.

lsf37 · 2020-02-03T06:13:19Z

Just had a look at UAX29URLEmailTokenizerImpl.jflex: it uses ! to work around the no-macros-in-charclasses restriction. This is now implemented in JFlex master (#654) and will be in the next release. That might well fix your problem.

Are you able to try out the master branch?

dweiss · 2020-02-03T16:46:22Z

I am on short holidays this week, thanks for looking into it. I will get back to this once I am back. D.

lsf37 · 2020-02-04T00:47:13Z

No worries. If the new feature helps, we can probably push out the JFlex 1.8.0 release next week, so you can have the build on an official version.

dweiss · 2020-02-19T10:03:29Z

Hi Gerwin. Nah, I observe the same behavior on master, sadly. Compilation and takes forever and requires huge amounts of ram.

2.2MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10GB it's tiny, so at least some part should be reducible.

Yes, that's my impression too. I bet something could be optimized along the way (maybe at the cost of algorithm clarity) to reduce the number of epsilon transitions before the conversion to a DFA takes place. I wish I had more time to help out!

Tweaking ASCIITLD.jflex-macro is certainly possible although it'll make the definition very complex and hard to understand. As it stands now you can tell what it's doing; I think the compiler should implement this state reduction internally - this shouldn't be too difficult even with a hash-based state deduplication? It's easy for me to say, of course. :)

lsf37 · 2020-02-20T01:24:44Z

Alight, I'll look into it more deeply, surely there must be something we can do. It'll be a bit, though, next week (hopefully) for is the 1.8.0 release first.

dweiss · 2020-02-20T06:58:35Z

No problem at all - this isn't a very pressing matter as we regenerate those automata very infrequently. I was just surprised to see such vast disproportion in memory consumption between generation and final automaton.

The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container where you can detect nodes with the same input/ output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily so it may not be a trivial change...

lsf37 · 2020-02-20T07:07:42Z

Ok, good to know that you're not stuck because of it.

The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container where you can detect nodes with the same input/ output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily so it may not be a trivial change...

Yes, that might be a bit of a project. I think I'll try extracting common factors on the regexp AST first, and see if that has an impact.

rmuir · 2021-12-04T17:20:04Z

Just had a look at UAX29URLEmailTokenizerImpl.jflex: it uses ! to work around the no-macros-in-charclasses restriction. This is now implemented in JFlex master (#654) and will be in the next release. That might well fix your problem.

Are you able to try out the master branch?

@lsf37 I finally tried out the new feature (macros in charclasses) and it's definitely a significant improvement! 285.26 seconds versus 918.87 sec.

So this NFA-complement that was happening due to the workaround was at least amplifying the problem significantly for us, progress.

Thanks!

lsf37 · 2023-01-06T05:52:28Z

Closing this as the maco-in-char-classes feature seems to have alleviated the problem and opened #1026 as a tracking issue for the potential common-prefix optimisation mentioned in this issue.

lsf37 self-assigned this Feb 3, 2020

lsf37 added the question More information is needed label Feb 3, 2020

lsf37 added the enhancement Feature requests label Feb 20, 2020

regisd added the performance label Mar 28, 2020

regisd removed the performance label Jan 2, 2021

rmuir mentioned this issue Dec 4, 2021

simplify jflex grammars by using difference rather than negation apache/lucene#515

Merged

lsf37 mentioned this issue Jan 6, 2023

optimisation: factor out common prefixes in large alternatives #1026

Open

3 tasks

lsf37 closed this as completed Jan 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

dweiss commented Jan 27, 2020 •

edited by regisd

Loading

dweiss commented Jan 27, 2020

dweiss commented Jan 30, 2020

lsf37 commented Feb 3, 2020 •

edited

Loading

lsf37 commented Feb 3, 2020 •

edited

Loading

dweiss commented Feb 3, 2020 via email •

edited by regisd

Loading

lsf37 commented Feb 4, 2020

dweiss commented Feb 19, 2020

lsf37 commented Feb 20, 2020

dweiss commented Feb 20, 2020

lsf37 commented Feb 20, 2020

rmuir commented Dec 4, 2021

lsf37 commented Jan 6, 2023

Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

Comments

dweiss commented Jan 27, 2020 • edited by regisd Loading

dweiss commented Jan 27, 2020

dweiss commented Jan 30, 2020

lsf37 commented Feb 3, 2020 • edited Loading

lsf37 commented Feb 3, 2020 • edited Loading

dweiss commented Feb 3, 2020 via email • edited by regisd Loading

lsf37 commented Feb 4, 2020

dweiss commented Feb 19, 2020

lsf37 commented Feb 20, 2020

dweiss commented Feb 20, 2020

lsf37 commented Feb 20, 2020

rmuir commented Dec 4, 2021

lsf37 commented Jan 6, 2023

dweiss commented Jan 27, 2020 •

edited by regisd

Loading

lsf37 commented Feb 3, 2020 •

edited

Loading

lsf37 commented Feb 3, 2020 •

edited

Loading

dweiss commented Feb 3, 2020 via email •

edited by regisd

Loading