Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimisation should not combine end states with distinct sets of endids. #461

Merged
merged 5 commits into from
Apr 26, 2024

Conversation

silentbicycle
Copy link
Collaborator

This was adapted from the group capture integration branch. It isn't specific to group capture.

Currently, metadata on end states will not prohibit merging, which can lead to false positives when using endids to detect which of the original regexes are matching when several regexes are compiled and then unioned into a single DFA. This adds a pass before the main minimisation loop that splits the ECs of end states with distinct sets of end IDs, which is sufficient to prevent them from merging later.

I reworked the test tests/endids/endids2_union_many_endids.c somewhat, since it was previously checking that the end states WERE merged. Unfortunately the other checks are somewhat weakened -- I removed the part checking that endids[j] == j+1 because that ordering isn't guaranteed anymore: the patterns aren't anchored, so some examples will match more than one of them, and the endids collected can be nonconsecutive.

I added a separate test, endids10_minimise_partial_overlap.c, which checks the minimal case that when the two regexes /^abc$/ and /^ab*c$/ are combined and a distinct endid is set on each, matching "ac" "abc" "abbc" gets one or both endids as appropriate. Ideally, we'd have a property test that checks that any arbitrary set of regex patterns has the same set of inputs matching -> endids when they are run individually vs. when combined into a single FSM, determinised, and minimised.

This was adapted from the group capture integration branch. It isn't
specific to group capture.

Currently, metadata on end states will not prohibit merging, which can
lead to false positives when using endids to detect which of the
original regexes are matching when several regexes are compiled and then
unioned into a single DFA. This adds a pass before the main minimisation
loop that splits the ECs of end states with distinct sets of end IDs,
which is sufficient to prevent them from merging later.

I reworked the test tests/endids/endids2_union_many_endids.c somewhat,
since it was previously checking that the end states WERE merged.
Unfortunately the other checks are somewhat weakened -- I removed the
part checking that `endids[j] == j+1` because that ordering isn't
guaranteed anymore: the patterns aren't anchored, so some examples
will match more than one of them, and the endids collected can be
nonconsecutive.

I added a separate test, endids10_minimise_partial_overlap.c, which
checks the minimal case that when the two regexes /^abc$/ and /^ab*c$/
are combined and a distinct endid is set on each, matching "ac" "abc"
"abbc" gets one or both endids as appropriate. Ideally, we'd have a
property test that checks that any arbitrary set of regex patterns has
the same set of inputs matching -> endids when they are run individually
vs. when combined into a single FSM, determinised, and minimised.
These got missed.

Also, rename EXPENSIVE_INTEGRITY_CHECKS in minimise.c to
EXPENSIVE_CHECKS and move the <fsm/print.h> header inclusion.
@silentbicycle
Copy link
Collaborator Author

I fixed two things caught by the CI run:

  • there were a couple functions called in the EXPENSIVE_CHECKS build that didn't get integrated
  • the new test wasn't freeing the endid buffers

But I still need to update the minimization test oracle to prevent it from merging end states with distinct sets of endids.

If two end states have distinct sets of endids associated with
them, mark them as distinguishable before entering the main
fixpoint loop. This brings the results from the test oracle back
in line with fsm_minimise.
@katef katef merged commit d05126a into main Apr 26, 2024
322 checks passed
@katef katef deleted the sv/preserve_ecs_split_by_endids branch April 26, 2024 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants