loosen ASCII compatible rules + improve reverse suffix optimization #1105
Basically, patterns like `(?-u:☃)` are now allowed. Previously they were banned since `-u` disables Unicode mode. But since it's just a literal and patterns must be valid UTF-8, there is a simple and unambiguous interpretation: the UTF-8 encoding of the codepoint. Note though that Unicode character classes, including even `(?-u:[☃])`, are still banned. I think this restriction could probably be lifted, but it's not quite as obvious, since disabling Unicode mode is supposed to switch the atom of matching from the codepoint to the byte, and something like `[☃]` seems to require that the atom of matching is the codepoint.

This PR also contains a tweak to the reverse suffix optimization to make it a bit more broadly applicable. This actually brings it in line with the reverse inner optimization. Basically, instead of limiting its use to cases where there is a single, non-empty common suffix, we expand its use to whenever the prefilter built from the suffixes of the pattern is believed to be "fast."
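To make the "UTF-8 encoding of the codepoint" interpretation of `(?-u:☃)` concrete, here is a minimal standard-library sketch. It does not use or reproduce the regex crate's internals: `literal_bytes` and `find_bytes` are hypothetical helpers written only for illustration. They show which byte sequence such a literal stands for when the atom of matching is the byte.

```rust
/// Hypothetical helper: the bytes a literal codepoint stands for when
/// matching happens byte by byte, namely its UTF-8 encoding.
fn literal_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: naive search for a byte sequence in a haystack.
/// Returns the offset of the first occurrence, if any.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

For example, `literal_bytes('☃')` yields the three bytes `0xE2 0x98 0x83`, and `find_bytes` locates them even in a haystack that is not valid UTF-8 as a whole. A class like `[☃]` has no equally direct byte-level reading, which is the ambiguity the description mentions.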
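As a rough illustration of the general idea behind a reverse suffix search, the toy below handles only the fixed pattern shape `[a-z]+SUFFIX`. It is a sketch under stated assumptions, not the code in this PR: `reverse_suffix_start` is a hypothetical function, a plain forward scan stands in for the prefilter, and a hand-written backwards walk stands in for the reverse automaton a real engine would run.

```rust
/// Toy matcher for the fixed pattern shape `[a-z]+SUFFIX`, where `suffix`
/// is a non-empty literal. Returns the start offset of the leftmost match.
/// Hypothetical illustration only; not the regex crate's implementation.
fn reverse_suffix_start(haystack: &[u8], suffix: &[u8]) -> Option<usize> {
    if suffix.is_empty() {
        return None;
    }
    // Step 1 (stand-in for the prefilter): scan forward for candidate
    // positions where the literal suffix occurs.
    for (at, window) in haystack.windows(suffix.len()).enumerate() {
        if window != suffix {
            continue;
        }
        // Step 2: walk backwards from the candidate over `[a-z]` to find
        // where the match would begin.
        let mut start = at;
        while start > 0 && haystack[start - 1].is_ascii_lowercase() {
            start -= 1;
        }
        // `+` requires at least one byte before the suffix.
        if start < at {
            return Some(start);
        }
    }
    None
}
```

The approach only pays off when step 1 is cheap, which is why the description ties the optimization to a prefilter that is believed to be "fast" rather than to the exact shape of the suffix set.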