loosen ASCII compatible rules + improve reverse suffix optimization #1105

Merged 3 commits on Oct 14, 2023

BurntSushi
Member

Basically, patterns like (?-u:☃) are now allowed. Previously they were banned since -u disables Unicode mode. But since it's just a literal and patterns must be valid UTF-8, there is a simple and unambiguous interpretation: the UTF-8 encoding of the codepoint. Note though that Unicode character classes, including even (?-u:[☃]), are still banned. I think this restriction could probably be lifted, but it's not quite as obvious since disabling Unicode mode is supposed to switch the atom of matching from the codepoint to the byte, and something like [☃] seems to require that the atom of matching is the codepoint.
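
To make the "UTF-8 encoding of the codepoint" interpretation concrete, here is a minimal standalone sketch using only the Rust standard library. It does not use the regex crate's API; `literal_bytes` and `find_bytes` are hypothetical helpers invented for illustration. The idea is that a literal such as `☃` simply becomes the byte sequence `E2 98 83`, which a byte-oriented matcher can search for even in a haystack that is not valid UTF-8.

```rust
/// Hypothetical helper: encode a literal codepoint as the byte sequence a
/// byte-oriented matcher would search for (its UTF-8 encoding).
fn literal_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: find the first occurrence of `needle` in `haystack`,
/// comparing raw bytes only. The haystack need not be valid UTF-8.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // An empty needle matches at the start.
        return Some(0);
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}
```

For example, `find_bytes(b"\xFFa\xE2\x98\x83", &literal_bytes('☃'))` locates the snowman at byte offset 2, even though the leading `\xFF` makes the haystack invalid UTF-8. A class like `[☃]` has no equally direct byte-level reading, which is the ambiguity described above.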

This PR also contains a tweak to the reverse suffix optimization to make it a bit more broadly applicable. This actually brings it in line with the reverse inner optimization. Basically, instead of limiting its use to cases where there is a single, non-empty common suffix, we expand its use to whenever the prefilter built from the suffixes of the pattern is believed to be "fast."
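
The change in the decision rule can be sketched roughly as follows. This is an illustrative model only, not the crate's actual code: `SuffixPrefilter`, its `fast` flag, and the `old_rule`/`new_rule` functions are hypothetical names, and the real "is prefilter fast" heuristic is assumed here to be a precomputed boolean.

```rust
/// Hypothetical stand-in for a prefilter built from a pattern's suffix literals.
struct SuffixPrefilter {
    /// The suffix literals extracted from the pattern.
    literals: Vec<Vec<u8>>,
    /// Assumed result of the "is prefilter fast" heuristic.
    fast: bool,
}

impl SuffixPrefilter {
    fn is_fast(&self) -> bool {
        self.fast
    }

    /// Longest suffix shared by every literal (empty if there is none).
    fn longest_common_suffix(&self) -> &[u8] {
        let (first, rest) = match self.literals.split_first() {
            Some(pair) => pair,
            None => return &[],
        };
        let mut len = first.len();
        for lit in rest {
            // Count matching bytes walking backwards from the end.
            let common = first
                .iter()
                .rev()
                .zip(lit.iter().rev())
                .take_while(|(a, b)| a == b)
                .count();
            len = len.min(common);
        }
        &first[first.len() - len..]
    }
}

/// Old rule: require a non-empty common suffix *and* a fast prefilter.
fn old_rule(p: &SuffixPrefilter) -> bool {
    !p.longest_common_suffix().is_empty() && p.is_fast()
}

/// New rule: rely on the "is prefilter fast" heuristic alone.
fn new_rule(p: &SuffixPrefilter) -> bool {
    p.is_fast()
}
```

Under this model, a pattern whose suffix literals are `foo` and `bar` has an empty common suffix, so the old rule rejects it while the new rule accepts it whenever the prefilter is considered fast.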

In some ad hoc profiling, I noticed an extra function call that really
didn't need to be there.

Previously, patterns like `(?-u:☃)` were banned under the logic that
Unicode scalar values shouldn't be available unless Unicode mode is
enabled. But since patterns are required to be UTF-8, there really isn't
any difficulty in just interpreting Unicode literals as their
corresponding UTF-8 encoding.

Note though that Unicode character classes, even things like
`(?-u:[☃])`, remain banned. We probably could make character classes
work too, but it's unclear how that plays with ASCII compatible mode
requiring that a single byte is the fundamental atom of matching (whereas
Unicode mode requires that Unicode scalar values are the fundamental
atom of matching).

Previously, we only used the reverse suffix optimization if it found
a non-empty longest common suffix *and* if the prefilter considered itself
fast. This was a heuristic used in the old regex crate before we
grew the "is prefilter fast" heuristic. We change this optimization to
just use the "is prefilter fast" heuristic instead of requiring a
non-empty longest common suffix.

This is, after all, what the inner literal optimization does. And in the
inner literal case, one should probably be even more conservative
because of the extra work that needs to be done. So if things are going
okay with the inner literal optimization, then we should be fine with
the reverse suffix optimization doing essentially the same thing.
@BurntSushi BurntSushi merged commit 8a8d599 into master Oct 14, 2023
16 checks passed