Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for rewrite_repeat #250

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

hvdijk
Copy link
Collaborator

@hvdijk hvdijk commented Sep 10, 2020

rewrite_repeat, used for rewriting nested repeats contained a comment
that said repeats can only be combined if the range of the result is not
more than the sum of the ranges of the inputs, but the assertion
actually tested the opposite: it tested that the range of the result is
more than the sum of the ranges of the inputs. This assertion does not
universally hold either way: a{2}{2} can be rewritten to a{4}, but the
range as calculated of the latter (0) is equal to the sum of the ranges
of the inputs (also 0).

Separate from the assert, the rewrite was not performed in some cases
where it is valid to do so. It is valid whenever the inner repeat has
a lower bound of 0 or 1, whenever the inner repeat has no upper bound
and the outer repeat has a positive lower bound, and whenever the outer
range has a lower bound equal to the upper bound. The last case was
missing. An example is a{2,3}{2}, which would previously be preserved
in that form, but is now rewritten as a{4,6}.

@hvdijk
Copy link
Collaborator Author

hvdijk commented Sep 10, 2020

Thinking about this a bit more, it occurs to me that there are more cases that weren't handled, and aren't handled by my PR. a{2,3}{2,3} can be rewritten to a{4,9}. I am not sure just yet how to improve the check to handle this.

@hvdijk
Copy link
Collaborator Author

hvdijk commented Sep 10, 2020

Thinking about this a bit more, it occurs to me that there are more cases that weren't handled, and aren't handled by my PR. a{2,3}{2,3} can be rewritten to a{4,9}. I am not sure just yet how to improve the check to handle this.

I have updated the PR to handle this and included a comment in the code explaining when and why the transformation is valid.

@katef
Copy link
Owner

katef commented Sep 10, 2020

Hm thank you, I'll try to re-think some of this, too. Maybe we can find a few more situations to cover with tests, although I did think I had covered all combinations there :/

@sfstewman
Copy link
Collaborator

To be pedantic, the PCRE (v2) syntax does not allow a{2,3}{2,3}. pcregrep returns an error:

% echo 'aaaaa' | pcregrep 'a{2,3}{2,3}'
pcregrep: Error in command-line regex at offset 10: nothing to repeat

The PCRE spec doesn't clearly state the grammar, so it took a bit of looking, but the section on repetition is quite clear:

Repetition is specified by quantifiers, which can follow any of the following items:
a literal data character
the dot metacharacter
the \C escape sequence
the \R escape sequence
the \X escape sequence
an escape such as \d or \pL that matches a single character
a character class
a backreference
a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)

The question becomes... does libfsm want to support this as an addition?

@sfstewman
Copy link
Collaborator

To be pedantic, the PCRE (v2) syntax does not allow a{2,3}{2,3}. pcregrep returns an error:

% echo 'aaaaa' | pcregrep 'a{2,3}{2,3}'
pcregrep: Error in command-line regex at offset 10: nothing to repeat

The PCRE spec doesn't clearly state the grammar, so it took a bit of looking, but the section on repetition is quite clear:

Repetition is specified by quantifiers, which can follow any of the following items:
a literal data character
the dot metacharacter
the \C escape sequence
the \R escape sequence
the \X escape sequence
an escape such as \d or \pL that matches a single character
a character class
a backreference
a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)

The question becomes... does libfsm want to support this as an addition?

Not sure why I was thinking that this applied to PCRE. Using a{i,j}{k,l} to denote nested repeats in the AST tree makes perfect sense. Apologies.

@katef
Copy link
Owner

katef commented Sep 11, 2020

That's fine! Nothing to apologise about!!

@hvdijk
Copy link
Collaborator Author

hvdijk commented Sep 11, 2020

Yeah, that's fine, that's a good observation to make. It still needs to be handled in the AST, but perhaps all the existing a{...}{...} PCRE tests and the new one in this PR should be updated to (?:a{...}){...}, and the parser updated to not accept a{...}{...}?

PCRE does allow + and ? to follow another quantifier, but like in Perl, they behave specially, they do not mean the same thing as + or ? applied to the subexpression. In Perl/PCRE, a*+a can never match, because + makes the * greedy. In libfsm's PCRE mode, a*+a just means (?:a*)+a. Accepting the same syntax as PCRE but with different semantics seems risky.

@hvdijk
Copy link
Collaborator Author

hvdijk commented Sep 11, 2020

A brute force search found one edge case that would be mishandled if it hadn't already been handled by earlier code: it is invalid to rewrite a{0,0}{x,} or a{x,}{0,0} as a{0,}. This is not a problem as even if somehow this gets constructed in the AST as AST_EXPR_REPEAT nodes, the code already special cases upper bounds of zero. The problem is not actually in the logic I changed, it is in the multiplication that happens earlier:

if (i == AST_COUNT_UNBOUNDED || k == AST_COUNT_UNBOUNDED) {
	w = AST_COUNT_UNBOUNDED;
} else if (__builtin_umul_overflow(i, k, &w)) {
	return 1;
}

This effectively treats AST_COUNT_UNBOUNDED * 0 and 0 * AST_COUNT_UNBOUNDED as AST_COUNT_UNBOUNDED, but it should be treated as zero. I used the same logic in my brute force search.

This could be changed to

if (__builtin_umul_overflow(i, k, &w)) {
	if (i == AST_COUNT_UNBOUNDED || k == AST_COUNT_UNBOUNDED) {
		w = AST_COUNT_UNBOUNDED;
	} else {
		return 1;
	}
}

to fix this, but as it can never come up, and the rewritten logic is harder to understand, I do not know whether it is worth the effort.

The brute force search found no cases that my check fails to handle. I searched all possibilities for upper bounds up to 10, plus AST_COUNT_UNBOUNDED, like so:

for h in $(seq 0 10); do
for i in $(seq $h 10) ""; do
for j in $(seq 0 10); do
for k in $(seq $j 10) ""; do
    re1="a{$h,$i}{$j,$k}"
    re2="a{$((h*j)),${i:+${k:+$((i*k))}}}"
    ex=$((!(j == ${k:-99} || ${i:-99} == 99 && j > 0 || h * (j + 1) <= i * j + 1)))
    build/bin/re -pb -rpcre "$re1" >fsm1.fsm
    build/bin/re -pb -rpcre "$re2" >fsm2.fsm
    build/bin/fsm -t equal fsm1.fsm fsm2.fsm
    test $? -eq $ex || echo "$re1 $re2"
done
done
done
done

@katef
Copy link
Owner

katef commented Sep 17, 2020

Note to myself; I'm just waiting on thinking a bit about some of the AST behaviour here before I go ahead and merge this.

rewrite_repeat, used for rewriting nested repeats contained a comment
that said repeats can only be combined if the range of the result is not
more than the sum of the ranges of the inputs, but the assertion
actually tested the opposite: it tested that the range of the result is
more than the sum of the ranges of the inputs. This assertion does not
universally hold either way: a{2}{2} can be rewritten to a{4}, but the
range as calculated of the latter (0) is equal to the sum of the ranges
of the inputs (also 0).

Separate from the assert, the rewrite was not performed in some cases
where it is valid to do so. It is valid whenever the inner repeat has
a lower bound of 0 or 1, whenever the inner repeat has no upper bound
and the outer repeat has a positive lower bound, and whenever the outer
range has a lower bound equal to the upper bound. The last case was
missing. An example is a{2,3}{2}, which would previously be preserved
in that form, but is now rewritten as a{4,6}.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants